Institute of Computer Science - Graduation Theses Registry


Extraction of Psychosis Prodromal Symptoms from Medical Texts for Training Dataset Creation

Name

Kristel Agu

Abstract

This master's thesis aimed to create three annotated training datasets for extracting psychosis prodromal symptoms from medical texts using semi-automatic methods. The source data were medical documents from a randomly selected 10% of the Estonian population, covering the years 2012-2019. These documents were filtered by the ICD-10 diagnoses that are evident during the psychosis prodrome (2780 texts) and split into sentences (31 009) to simplify the subsequent workflow. An initial dataset was created from sentences that were selected with a regular expression and annotated manually by the author, and it was used to train an initial logistic regression model. To build the features for the logistic regression model, a word embedding was looked up for each word in a sentence using a Word2Vec model pre-trained on the Estonian Reference Corpus, and the embeddings were averaged over the whole sentence. An iterative process followed: the model predicted further sentences containing the symptom from the remaining data, the author annotated them, and they were added to the existing dataset. This was repeated until the model found no new sentences. Using the logistic regression model to extract psychosis prodromal symptoms simplified dataset creation and reduced the manual work of searching for sentences. The thesis produced three annotated training datasets: 799 sentences for the psychosis prodrome symptom “odd behaviour”, 643 sentences for the symptoms “depersonalization” and/or “derealization”, and 1176 sentences for the symptoms “paranoid delusions” and/or “suspiciousness”.
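The workflow described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation, and everything in it is an assumption made for illustration: the toy embedding table TOY_EMBEDDINGS stands in for the pre-trained Word2Vec model, the small hand-written logistic regression stands in for whatever library was actually used, and the ORACLE dictionary stands in for the human annotator. The English example sentences, the seed pattern, and the 0.5 threshold are likewise hypothetical. The sketch also assumes the model is retrained in every round, which the abstract does not state explicitly.

```python
import math
import re

# Hypothetical 2-dimensional toy embeddings; a real run would load the
# pre-trained Word2Vec vectors instead.
TOY_EMBEDDINGS = {
    "patient": [0.1, 0.0], "behaves": [0.9, 0.1], "oddly": [1.0, 0.2],
    "strange": [0.9, 0.0], "behaviour": [0.8, 0.1], "sleeps": [0.0, 0.9],
    "well": [0.1, 1.0], "appetite": [0.0, 0.8], "normal": [0.1, 0.9],
}


def sentence_vector(sentence, embeddings=TOY_EMBEDDINGS, dim=2):
    """Average the embeddings of all known words in the sentence."""
    words = re.findall(r"\w+", sentence.lower())
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return [0.0] * dim  # no known word: fall back to a zero vector
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def predict_proba(model, x):
    """Probability that feature vector x belongs to the symptom class."""
    weights, bias = model
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, lr=0.5, epochs=500):
    """Fit a logistic regression with plain stochastic gradient descent."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            error = predict_proba((weights, bias), x) - target
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def build_dataset(sentences, seed_pattern, annotate, threshold=0.5):
    """Grow an annotated dataset until the model proposes nothing new."""
    seed = re.compile(seed_pattern, re.IGNORECASE)
    # Step 1: regex-filtered sentences are annotated to form the seed set.
    labelled = {s: annotate(s) for s in sentences if seed.search(s)}
    # Training needs both a positive and a negative example.
    while len(set(labelled.values())) == 2:
        X = [sentence_vector(s) for s in labelled]
        model = train_logreg(X, list(labelled.values()))
        # Step 2: predict symptom sentences among the remaining data.
        proposed = [
            s for s in sentences
            if s not in labelled
            and predict_proba(model, sentence_vector(s)) >= threshold
        ]
        if not proposed:
            break  # the model finds no new sentences
        # Step 3: annotate the proposals and add them to the dataset.
        for s in proposed:
            labelled[s] = annotate(s)
    return labelled


SENTENCES = [
    "Patient behaves oddly",
    "Nothing odd, patient sleeps well",
    "Strange behaviour",
    "Appetite normal",
]
# Stand-in for manual annotation: 1 = symptom present, 0 = absent.
ORACLE = dict(zip(SENTENCES, [1, 0, 1, 0]))

dataset = build_dataset(SENTENCES, r"odd", lambda s: ORACLE[s])
for sentence, label in dataset.items():
    print(label, sentence)
```

In this toy run the regular expression selects the first two sentences as the seed set, the model then proposes "Strange behaviour" for annotation, and "Appetite normal" is never shown to the annotator. That is the labour-saving effect the abstract describes.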

Graduation Thesis language

Estonian

Graduation Thesis type

Master - Conversion Master in IT

Supervisor(s)

Sulev Reisberg, Kairit Sirts

Defence year

2024
