Patient Treatment Trajectories Using Vector Embeddings

Name
Õie Renata Siimon
Abstract
In this thesis, data from the Estonian Health Insurance Fund (Haigekassa) covering 2010–2019 was used to construct vector representations of patient treatment trajectories with BERT and, for comparison, with word2vec. The goal was to see how well such natural language processing (NLP) models perform when sequences of medical services are used as input instead of sentences, and whether BERT performs better than word2vec. So far, research on how well NLP models work with non-natural-language sequences is limited, and this thesis contributes to filling that gap. Treatment trajectories were built as sequences of the service codes appearing on 41 million medical invoices. The models were constructed in two stages. First, service code embeddings were trained with BERT and word2vec. Then, classification models were built by fine-tuning BERT and by training KNN and SVM classifiers on top of the word2vec embeddings. Results showed that despite BERT's poor performance in the pre-training stage, it outperformed the models built on top of word2vec embeddings in all seven classification tasks. The highest accuracy (0.9918) was achieved in classifying treatment types (5 classes) and the lowest (0.4121) in classifying diagnoses (174 classes). It was concluded that BERT indeed proved useful with this type of non-natural-language input data, and that the contextual embeddings of BERT worked better than the non-contextual ones of word2vec. Of the four BERT models built in this thesis, the second largest was the overall best, suggesting that if the ‘language’ used is simpler than natural language, BERT models with reduced dimensions may work better.
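To make the two-stage setup concrete, the following is a minimal sketch (not the author's code) of the word2vec branch described in the abstract: treatment trajectories are treated as ‘sentences’ of service codes, word2vec learns code embeddings, and a KNN classifier is trained on averaged trajectory vectors. The service codes, labels, and hyperparameters below are illustrative assumptions, using gensim and scikit-learn.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical input: each trajectory is one patient's sequence of service
# codes taken from their medical invoices; labels stand in for one of the
# classification targets (e.g. treatment type). All values are made up.
trajectories = [
    ["3002", "7954", "66101", "3002"],
    ["9001", "3075", "66509"],
]
labels = ["outpatient", "inpatient"]

# Stage 1: train non-contextual service-code embeddings, treating each
# trajectory as a 'sentence' of code tokens.
w2v = Word2Vec(sentences=trajectories, vector_size=100, window=5,
               min_count=1, sg=1, epochs=10)

def trajectory_vector(codes):
    """Represent a trajectory as the mean of its service-code embeddings."""
    vecs = [w2v.wv[c] for c in codes if c in w2v.wv]
    return np.mean(vecs, axis=0)

# Stage 2: train a classifier on top of the fixed embeddings (the thesis
# uses KNN and SVM; the BERT branch instead fine-tunes the model end to end).
X = np.vstack([trajectory_vector(t) for t in trajectories])
knn = KNeighborsClassifier(n_neighbors=1).fit(X, labels)
print(knn.predict([trajectory_vector(["3002", "7954"])]))
```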
Graduation Thesis language
English
Graduation Thesis type
Master - Data Science
Supervisor(s)
Sven Laur
Defence year
2023