Automated Extraction of Insurance Policy Information: Natural Language Processing techniques to automate the process of extracting information about the insurance coverage from unstructured insurance policy documents.
2023 (English)
Independent thesis Advanced level (professional degree), 20 credits / 30 HE credits
Student thesis
Abstract [en]
This thesis investigates Natural Language Processing (NLP) techniques for extracting relevant information from long, unstructured insurance policy documents. The goal is to reduce the time readers need to understand the coverage described in the documents. The study uses predefined insurance policy coverage parameters, created by industry experts, to represent what is covered in the policy documents. Three NLP approaches are used to classify text sequences into insurance parameter classes. The thesis shows that using SBERT to create vector representations of text, so that cosine similarity can be calculated between them, is an effective approach. The top-scoring sequences for each parameter are assigned that parameter class. This approach significantly reduces the number of sequences a user has to read, but it misclassifies some positive examples. To improve the model, the parameter definitions and training data were combined into a support set. Similarity scores were calculated between all sequences and the support set for each parameter using different pooling strategies. This few-shot classification approach performed well for the use case, improving the model’s performance significantly. In conclusion, this thesis demonstrates that NLP techniques can be applied to help understand unstructured insurance policy documents. The model developed in this study can be used to extract important information and reduce the time needed to understand the contents of an insurance policy document. A human expert would, however, still be required to interpret the extracted text. The balance between the amount of relevant information and the amount of text shown depends on how many of the top-scoring sequences are classified for each parameter. This study also identifies some limitations of the approach that depend on the available data.
Overall, this research provides insight into the potential implications of NLP techniques for information extraction and the insurance industry.
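The scoring step described in the abstract (cosine similarity between each sequence and a per-parameter support set, pooling, then keeping the top-scoring sequences) can be illustrated with a minimal sketch. This is not the thesis implementation: the function names, the parameter names "theft" and "fire", the choice of max and mean pooling, and the toy 3-dimensional vectors standing in for SBERT embeddings are all assumptions made for illustration.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def top_k_per_parameter(seq_emb, support_sets, k=2, pooling="max"):
    """Return, for each parameter, the indices of the k best-scoring sequences.

    seq_emb:      (n_sequences, dim) array of sequence embeddings
    support_sets: dict mapping a parameter name to an (n_support, dim) array
    pooling:      "max" or "mean" over the support-set similarities (assumed options)
    """
    result = {}
    for name, support in support_sets.items():
        sims = cosine_sim(seq_emb, support)  # shape (n_sequences, n_support)
        scores = sims.max(axis=1) if pooling == "max" else sims.mean(axis=1)
        order = np.argsort(-scores, kind="stable")[:k]  # highest score first
        result[name] = [int(i) for i in order]
    return result

# Toy vectors standing in for SBERT embeddings of four text sequences (hypothetical data).
sequences = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.5, 1.0],
])

# Hypothetical parameters; each support set would hold the embedded definition
# plus embedded training examples for that parameter.
support_sets = {
    "theft": np.array([[1.0, 0.0, 0.0]]),
    "fire": np.array([[0.0, 1.0, 0.0], [0.0, 0.9, 0.1]]),
}

top = top_k_per_parameter(sequences, support_sets, k=2, pooling="max")
```

In practice the embeddings would come from an SBERT model rather than hand-written vectors, and `k` controls the trade-off the abstract mentions between how much relevant text is captured and how much text the reader is shown.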
Place, publisher, year, edition, pages
2023. , p. 60
Series
UPTEC STS, ISSN 1650-8319 ; 23023
Keywords [en]
NLP, SBERT, AI, Insurance, Semantic similarity
National Category
Language Technology (Computational Linguistics) Computer Sciences
Identifiers
URN: urn:nbn:se:uu:diva-506167
OAI: oai:DiVA.org:uu-506167
DiVA, id: diva2:1774296
External cooperation
Insurely
Educational program
Systems in Technology and Society Programme
Presentation
2023-06-01, 22:36 (Swedish)
Supervisors
Examiners
2023-06-27 2023-06-25 2023-06-27 Bibliographically approved