Framework for Neural Network Based Fact Extraction Workflows

Name
Hendrik Šuvalov
Abstract
Medical texts, such as diagnoses and epicrises, are by their nature often unstructured, sometimes in the form of free text. The common practice for extracting useful information from them, such as named entities (e.g. drug or disease names) and their semantic relations, is rule- or pattern-based extraction, typically regular expressions. In most cases, this is the fastest and most effective approach; however, in certain circumstances it can be difficult, for example when the text contains misspelled words, or when we do not know in advance which patterns to look for but could recognise them once seen. This is a task for which neural network language models could prove useful, as they are capable of inferring the meaning of words from the context in which they appear. The main result of this thesis is a pipeline for implementing fact extraction tasks on medical texts. It uses EstMedBERT, a Bidirectional Encoder Representations from Transformers (BERT) model pre-trained specifically on Estonian medical texts, which can be fine-tuned to classify tokens using labelled data provided by the user implementing the task. Having initially learned the task, the model continues labelling new data under the supervision of the user, who corrects any mistakes and, using active learning, retrains the model. This is a human-in-the-loop approach to training neural networks. This approach could be a more effective solution to some fact extraction tasks in the medical field, and implementing new tasks with this pipeline is technically easier for the user, making it more accessible to people in medical domains. Moreover, in addition to providing the pipeline, an example task has been implemented using this approach as part of this thesis, and both the process and the results have been analysed.
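The human-in-the-loop active learning step described above can be sketched in simplified form: after the model labels new data, the sequences it is least confident about are routed to the human annotator for correction before retraining. The following is a minimal illustration of uncertainty sampling only; the function names and the least-confidence heuristic are assumptions for this sketch, not details taken from the thesis.

```python
# Hypothetical sketch of the uncertainty-sampling step in a
# human-in-the-loop active learning loop. Function names are
# illustrative, not from the thesis pipeline.

def least_confidence(token_probs):
    """Uncertainty of one sequence: mean of (1 - max class probability)
    over its tokens. Higher means the model is less sure."""
    scores = [1.0 - max(dist) for dist in token_probs]
    return sum(scores) / len(scores)

def select_for_annotation(batch, k):
    """Pick the k sequences the model is least confident about,
    to be corrected by the human annotator before retraining.

    `batch` maps a sequence id to its per-token class probability
    distributions (one list of probabilities per token)."""
    ranked = sorted(batch, key=lambda sid: least_confidence(batch[sid]),
                    reverse=True)
    return ranked[:k]

batch = {
    "doc-1": [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]],   # confident
    "doc-2": [[0.4, 0.35, 0.25], [0.5, 0.3, 0.2]],   # uncertain
    "doc-3": [[0.6, 0.3, 0.1], [0.7, 0.2, 0.1]],     # in between
}
print(select_for_annotation(batch, 2))  # → ['doc-2', 'doc-3']
```

In a full pipeline, the selected sequences would be shown to the user, corrected, added to the training set, and the token classifier fine-tuned again; this sketch covers only the selection criterion.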
Graduation Thesis language
Estonian
Graduation Thesis type
Master - Data Science
Supervisor(s)
Dage Särg, Raivo Kolde, Sven Laur
Defence year
2022