Uppsala University Publications (uu.se)
Automatic de-identification of case narratives from spontaneous reports in VigiBase
Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Division of Computer Systems.
2015 (English). Independent thesis, Advanced level (professional degree), 20 credits / 30 HE credits. Student thesis
Abstract [en]

The use of patient data is essential in research, but the data is confidential and can only be used after approval from an ethical board and informed consent from the individual patient. A large amount of patient data is therefore difficult to obtain unless sensitive information, such as names, ID numbers and contact details, is removed from the data, a process known as de-identification. Uppsala Monitoring Centre maintains the world's largest database of individual case reports of suspected adverse drug reactions. As of today, there is no method for efficiently de-identifying the narrative text included in these reports, which causes countries such as the United States of America to exclude the narratives from their reports.

The aim of this thesis is to develop and evaluate a method for automatic de-identification of case narratives in reports from the WHO Global Individual Case Safety Report Database System, VigiBase. The thesis compares three models: regular expressions, used for text pattern matching, and the statistical models Support Vector Machine (SVM) and Conditional Random Fields (CRF). Their performance, advantages and disadvantages are discussed, as well as how identified sensitive information is handled to maintain the readability of the narrative text. The models developed in this thesis are also compared to existing solutions to the de-identification problem.

The 400 reports extracted from VigiBase were already well de-identified in terms of names, ID numbers and contact details, making it difficult to train statistical models on these categories. The reports did, however, contain plenty of dates and ages. For these categories, regular expressions would be sufficient to achieve good performance. Identifying entities in other categories requires more advanced methods, such as SVM and CRF, as well as more data. This became evident when the models were applied to the more information-rich i2b2 de-identification challenge benchmark data set, where the statistical models developed in this thesis performed competitively with existing models in the literature.
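
As an illustration of the regular-expression approach to dates and ages, the following is a minimal Python sketch. The patterns, the placeholder tags `[DATE]` and `[AGE]`, and the function name `deidentify` are assumptions made for this example; they are not the expressions or the replacement strategy used in the thesis, and they cover only a few common English formats.

```python
import re

# Hypothetical month-name pattern (full names and three-letter abbreviations).
MONTH = (
    r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|"
    r"Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)"
)

# Illustrative date formats: 2015-02-26, 26/02/2015, 26 February 2015.
DATE_RE = re.compile(
    r"\b(?:\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{2,4}|\d{1,2}\s+"
    + MONTH
    + r"\s+\d{4})\b",
    re.IGNORECASE,
)

# Illustrative age formats: "67-year-old", "45 years old", "67 y/o".
# The lookahead keeps the trailing words so the sentence stays readable.
AGE_RE = re.compile(
    r"\b\d{1,3}(?=(?:[\s-]?(?:years?|yrs?)[\s-]old\b|\s?y/?o\b))",
    re.IGNORECASE,
)


def deidentify(text: str) -> str:
    """Replace matched dates and ages with category tags."""
    text = DATE_RE.sub("[DATE]", text)  # dates first, so their digits are gone
    text = AGE_RE.sub("[AGE]", text)
    return text


if __name__ == "__main__":
    print(deidentify("A 67-year-old woman was admitted on 2015-02-26."))
```

Replacing each match with a category tag, rather than deleting it, is one simple way to keep the narrative readable; names, ID numbers and contact details generally need the statistical models discussed above, since they rarely follow a fixed surface pattern.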

Place, publisher, year, edition, pages
UPTEC F, ISSN 1401-5757 ; 15054
Keyword [en]
de-identification, svm, crf, regex, VigiBase, i2b2
National Category
Computer and Information Science
URN: urn:nbn:se:uu:diva-262158
OAI: oai:DiVA.org:uu-262158
DiVA: diva2:852410
External cooperation
Uppsala Monitoring Centre
Educational program
Master Programme in Engineering Physics
2015-02-26, 10:07 (Swedish)
Available from: 2015-09-24 Created: 2015-09-09 Last updated: 2015-09-24 Bibliographically approved
