Institute of Computer Science - Graduation Theses Registry


A Rule-Based Disambiguator for Estonian

Name

Kristi Zirk

Abstract

Word-sense disambiguation (WSD) is an open problem in natural language processing: the task of identifying which sense of a word is used in a sentence when the word has multiple meanings. In this work, WSD uses TEKsaurus as the reference sense inventory for Estonian. The basic unit of a wordnet-type thesaurus is a synonym set (also called a synset), which contains all the synonymous words or multi-word units that express the same concept. WSD methods fall into two categories: rule-based and statistical. The theoretical part gives an overview of general topics in WSD and describes the processes of manual and automatic WSD. At present, the morphologically disambiguated corpus of Estonian texts consists of approximately 500 000 words, and at least two people have disambiguated it. The aim of the practical part was to formalize the existing word-sense disambiguation rules and to create a program that uses these formalized rules to tag words in the corpus. During the work, 75 noun rules and 5 verb rules were formalized. Until now, the WSD rules had been written down as Estonian sentences that helped lexicographers determine the proper meaning of a word.
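To illustrate what tagging with formalized rules could look like, here is a minimal sketch in Python. It is not the thesis's actual rule format or program: the `Rule` structure, the `tag_sentence` function, the sense labels, and the two example rules for the lemma "keel" are all assumptions made for illustration. The sketch assumes each rule names a target lemma, a set of context lemmas, and the sense to assign when any context lemma occurs in the same sentence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical formalized rule (not the format used in the thesis)."""
    lemma: str          # target lemma the rule applies to
    context: frozenset  # lemmas of which at least one must co-occur
    sense: str          # illustrative sense label, not a real TEKsaurus ID


# Illustrative rules only; the thesis's 75 noun and 5 verb rules are not shown.
RULES = [
    Rule("keel", frozenset({"rääkima", "õppima"}), "keel#language"),
    Rule("keel", frozenset({"suu", "hammas"}), "keel#tongue"),
]


def tag_sentence(lemmas, rules=RULES):
    """Return (lemma, sense or None) pairs; the first matching rule wins."""
    tagged = []
    for i, lemma in enumerate(lemmas):
        # Context is every other lemma in the sentence.
        others = set(lemmas[:i] + lemmas[i + 1:])
        sense = None
        for rule in rules:
            if rule.lemma == lemma and rule.context & others:
                sense = rule.sense
                break
        tagged.append((lemma, sense))
    return tagged
```

Under these assumptions, `tag_sentence(["mina", "rääkima", "eesti", "keel"])` tags "keel" with the language sense because "rääkima" appears in the same sentence, and leaves the other lemmas untagged. The input is assumed to be a list of lemmas from a morphologically disambiguated sentence.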

Graduation Thesis language

Estonian

Graduation Thesis type

Bachelor - Computer Science

Supervisor(s)

Neeme Kahusk

Defence year

2013


