Context embeddings for natural language clustering

Robert Roosalu
Semantic awareness of natural language is an important step towards general artificial intelligence. One part of this is embedding words and documents into a vector space. We selected the most common methods for doing so and ran a broad set of clustering experiments on word contexts extracted from the Estonian Reference Corpus. After a total of 20,000 experiments, we found that the skip-gram word vector model combined with spectral clustering yields the best results. The word vectors can either be simply averaged or used as input to recurrent autoencoders. The latter achieved the best results overall and point towards future work employing more complex sequence-to-sequence recurrent models. These findings are implemented in our custom-built application, PatternExaminer, which is used in a pipeline for extracting factual data from medical records. This brings us closer to goals such as advanced personalized medicine and automated clinical trials.
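The core pipeline named in the abstract can be sketched in a few lines: embed the words of each context, average the word vectors into one fixed-size context vector, and group the contexts with spectral clustering. This is a minimal illustration only; the vocabulary, contexts, and random embeddings below are hypothetical stand-ins for the actual skip-gram vectors trained on the Estonian Reference Corpus.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)

# Hypothetical vocabulary and skip-gram embeddings (random stand-ins).
vocab = ["ravim", "arst", "patsient", "auto", "tee", "linn"]
dim = 16
embeddings = {w: rng.normal(size=dim) for w in vocab}

# Word contexts, each a short list of tokens.
contexts = [
    ["ravim", "arst", "patsient"],
    ["arst", "patsient"],
    ["auto", "tee", "linn"],
    ["tee", "linn"],
]

# Average the word vectors of a context into a single context vector.
X = np.array([np.mean([embeddings[w] for w in ctx], axis=0) for ctx in contexts])

# Cluster the context vectors with spectral clustering.
clustering = SpectralClustering(n_clusters=2, affinity="rbf", random_state=0)
labels = clustering.fit_predict(X)
print(labels)  # one cluster label per context
```

Averaging discards word order, which is exactly the limitation the recurrent autoencoders mentioned in the abstract address by encoding the context as a sequence.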
Graduation Thesis language
Graduation Thesis type: Master - Computer Science
Sven Laur
Defence year