Institute of Computer Science - Graduation Theses Registry


Named Entity Recognition for the Estonian Language

Name

Aleksandr Tkatšenko

Abstract

In this thesis we study the applicability of recent statistical methods to the extraction of named entities from Estonian texts. In particular, we explore two fundamental design challenges: the choice of inference algorithm and the choice of text representation. We compare two state-of-the-art supervised learning methods, linear-chain Conditional Random Fields (CRF) and the Maximum Entropy model (MaxEnt). To represent named entities, we consider three sources of information: 1) local features, which are based on the word itself; 2) global features, which are extracted from other occurrences of the same word in the whole document; and 3) external knowledge, represented by lists of entities extracted from the Web. To train and evaluate our NER systems, we assembled a corpus of Estonian newspaper articles in which we manually annotated the names of locations, persons, organisations and facilities. Among the several solutions we compared, the CRF system using a combination of local features, global features and external knowledge achieved an F1 score of 0.86.
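The three sources of information named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's feature set: the function names (`build_document_index`, `token_features`), the individual features, and the tiny `GAZETTEER` are all hypothetical, chosen only to show one local, one global and one external-knowledge feature side by side.

```python
from collections import defaultdict

# Hypothetical gazetteer for illustration only; the thesis extracts
# its entity lists from the Web.
GAZETTEER = {
    "LOC": {"tartu", "tallinn"},
    "PER": {"aleksandr"},
}


def build_document_index(sentences):
    """Record which words occur capitalized away from a sentence start.

    `sentences` is a list of token lists covering one whole document.
    """
    capitalized_mid_sentence = defaultdict(bool)
    for sentence in sentences:
        for pos, word in enumerate(sentence):
            if pos > 0 and word[:1].isupper():
                capitalized_mid_sentence[word.lower()] = True
    return capitalized_mid_sentence


def token_features(sentences, s, i, index, gazetteer=GAZETTEER):
    """Build a feature dict for token i of sentence s."""
    word = sentences[s][i]
    lower = word.lower()
    features = {
        # 1) Local features: derived from the word itself.
        "lower": lower,
        "is_capitalized": word[:1].isupper(),
        "suffix3": lower[-3:],
        "is_sentence_initial": i == 0,
        # 2) Global feature: evidence from other occurrences of the
        #    same word anywhere in the document.
        "capitalized_mid_sentence_in_doc": index.get(lower, False),
    }
    # 3) External knowledge: membership in entity lists.
    for label, names in gazetteer.items():
        features["in_gazetteer_" + label] = lower in names
    return features
```

For a sentence-initial word such as "Tartu", capitalization alone says little, but the global feature can still fire if the same word appears capitalized in the middle of another sentence of the document. Dictionaries of this shape could then be passed to a CRF or MaxEnt trainer; that step is omitted here.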

Graduation Thesis language

English

Graduation Thesis type

Master - Computer Science

Supervisor(s)

Konstantin Tretjakov

Defence year

2010

