Institute of Computer Science - Graduation Theses Registry


Context embeddings for natural language clustering

Name

Robert Roosalu

Abstract

Semantic awareness of natural language is an important step towards general artificial intelligence, and embedding words and documents into a vector space could be one part of it. We selected most of the common embedding methods and ran a wide range of clustering experiments on word contexts extracted from the Estonian reference corpus. After a total of 20 thousand different experiments, we found that the skip-gram word vector model combined with spectral clustering yields the best results. The word vectors can simply be averaged, or they can be used as input to recurrent autoencoders. The latter achieved the best results overall, which points towards future work employing more complex sequence-to-sequence recurrent models. The findings are implemented in our custom-built application, PatternExaminer, which is used in a pipeline for extracting factual data from medical records. This brings us closer to achievements such as advanced personalized medicine and automated clinical trials.
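The simpler of the two approaches mentioned in the abstract, averaging word vectors into one embedding per context, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the three-dimensional toy vectors, the English tokens, and the helper names `average_embedding`, `cosine_similarity`, and `affinity_matrix` are hypothetical and are not taken from the thesis or from PatternExaminer. A trained skip-gram model would supply the real vectors, and the resulting affinity matrix is the kind of input a spectral clustering implementation could consume.

```python
import math

# Hypothetical 3-dimensional word vectors. A trained skip-gram model would
# provide learned vectors with many more dimensions.
TOY_VECTORS = {
    "blood": [0.9, 0.1, 0.0],
    "pressure": [0.8, 0.2, 0.1],
    "high": [0.7, 0.3, 0.0],
    "took": [0.1, 0.9, 0.2],
    "tablet": [0.0, 0.8, 0.3],
    "daily": [0.2, 0.7, 0.1],
}


def average_embedding(tokens, vectors):
    """Average the vectors of the known tokens; unknown tokens are skipped."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("context contains no known tokens")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def affinity_matrix(contexts, vectors):
    """Pairwise cosine similarities between averaged context embeddings."""
    embeddings = [average_embedding(c, vectors) for c in contexts]
    return [[cosine_similarity(a, b) for b in embeddings] for a in embeddings]


# Three toy word contexts: two about blood pressure, one about medication.
contexts = [
    ["high", "blood", "pressure"],
    ["blood", "pressure"],
    ["took", "tablet", "daily"],
]
affinity = affinity_matrix(contexts, TOY_VECTORS)
```

In this toy setup the first two contexts end up far more similar to each other than either is to the third, which is the structure a clustering step would then pick up. Cosine similarity is only one possible affinity choice; the sketch does not claim it is the one used in the experiments.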

Graduation Thesis language

English

Graduation Thesis type

Master - Computer Science

Supervisor(s)

Sven Laur

Defence year

2017

PDF
