Institute of Computer Science - Graduation Theses Registry


Pattern based fact extraction from Estonian free-texts

Name

Timo Petmanson

Abstract

Natural language processing is one of the most difficult problems, since words and language constructions often have ambiguous meanings that cannot be resolved without extensive cultural background. However, some facts are easier to deduce than others. In this work, we consider unary, binary and ternary relations between words that can be deduced from a single sentence. The relations are represented by sets of patterns, which are combined with basic machine learning methods to train and deploy patterns for fact extraction. We also describe the process of active learning, which helps to speed up the annotation of relations in large corpora. Other contributions include a prototype implementation with a plain-text preprocessor, corpus annotator, pattern miner and fact extractor. Additionally, we provide an empirical study of the efficiency of the prototype implementation on several relations and corpora.

Graduation Thesis language

English

Graduation Thesis type

Master - Computer Science

Supervisor(s)

Sven Laur

Defence year

2012

