Named Entity Recognition for the Estonian Language

Aleksandr Tkatšenko
In this thesis we study the applicability of recent statistical methods to the extraction of named entities from Estonian texts. In particular, we explore two fundamental design challenges: the choice of inference algorithm and the choice of text representation. We compare two state-of-the-art supervised learning methods, linear-chain Conditional Random Fields (CRF) and the Maximum Entropy Model (MaxEnt). In representing named entities, we consider three sources of information: 1) local features, which are based on the word itself, 2) global features, which are extracted from other occurrences of the same word in the whole document, and 3) external knowledge, represented by lists of entities extracted from the Web. To train and evaluate our NER systems, we assembled a text corpus of Estonian newspaper articles in which we manually annotated the names of locations, persons, organisations and facilities. Among the solutions we compared, the best result was an F1 score of 0.86, achieved by the CRF system using a combination of local features, global features and external knowledge.
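The three sources of information named in the abstract can be illustrated with a minimal Python sketch. It is not the feature set used in the thesis, which the abstract does not specify: the function names, the feature names and the toy gazetteer are all hypothetical, chosen only to show how local, global and external-knowledge features for one token might be combined into a single feature dictionary.

```python
def local_features(token):
    """Features based on the word itself (hypothetical examples)."""
    return {
        "lower=" + token.lower(): 1.0,
        "is_capitalized": float(token[:1].isupper()),
        "suffix3=" + token[-3:].lower(): 1.0,
    }


def global_features(tokens, index):
    """Features from other occurrences of the same word in the document."""
    word = tokens[index].lower()
    others = [t for i, t in enumerate(tokens) if i != index and t.lower() == word]
    return {
        # Assumption: capitalisation of another occurrence is a useful cue.
        "capitalized_elsewhere": float(any(t[:1].isupper() for t in others)),
    }


def gazetteer_features(token, gazetteers):
    """External knowledge: membership in entity lists (toy stand-in for Web lists)."""
    word = token.lower()
    return {
        "in_gazetteer=" + label: 1.0
        for label, entries in gazetteers.items()
        if word in entries
    }


def token_features(tokens, index, gazetteers):
    """Combine the three feature sources for the token at `index`."""
    features = {}
    features.update(local_features(tokens[index]))
    features.update(global_features(tokens, index))
    features.update(gazetteer_features(tokens[index], gazetteers))
    return features


if __name__ == "__main__":
    document = ["Tartu", "Ülikool", "asub", "Tartu", "linnas"]
    toy_gazetteers = {"LOC": {"tartu", "tallinn"}}  # hypothetical entity list
    for i, tok in enumerate(document):
        print(tok, token_features(document, i, toy_gazetteers))
```

A dictionary of this shape per token is a common input format for CRF and MaxEnt toolkits, so either learner could in principle be trained on the same representation.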
Graduation Thesis language
Graduation Thesis type
Master - Computer Science
Konstantin Tretjakov
Defence year