A Rule-Based Disambiguator for Estonian

Kristi Zirk
Word-sense disambiguation (WSD) is an open problem in natural language processing: identifying which sense of a word is used in a sentence when the word has multiple meanings. In this work, WSD is performed using TEKsaurus as the reference sense inventory for Estonian. The basic unit of a wordnet-type thesaurus is the synonym set (also called a synset), a set containing all the synonymous words or multi-word units that express the same concept. WSD methods can be classified into two categories: rule-based and statistics-based. The theoretical part gives an overview of general topics in WSD and describes the processes of manual and automatic WSD. At present, the morphologically disambiguated corpus of Estonian texts consists of approximately 500,000 words, and it has been disambiguated by at least two people. The aim of the practical part was to formalize the existing word-sense disambiguation rules and to create a program that uses these formalized rules to tag words in the corpus. In the course of the work, 75 noun rules and 5 verb rules were formalized. Until now, the WSD rules had been written down as Estonian sentences that helped lexicographers determine the proper meaning of a word.
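To illustrate what formalizing a disambiguation rule and applying it to a corpus might look like, the following is a minimal sketch in Python. It is not the rule format or the program developed in this work. The `Rule` and `Token` structures, the context-window matching, the part-of-speech codes, the example rules for the noun "keel" (which can mean both "language" and "tongue"), and the sense labels are all assumptions invented for illustration; the labels are not TEKsaurus identifiers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    """Hypothetical formalized rule: if any context lemma occurs near the
    target lemma, assign the given sense."""
    lemma: str                 # target lemma the rule applies to
    pos: str                   # part of speech, e.g. "N" (noun) or "V" (verb)
    context_lemmas: frozenset  # lemmas that trigger the rule
    sense: str                 # illustrative sense label, not a real synset id


@dataclass
class Token:
    lemma: str
    pos: str
    sense: Optional[str] = None  # filled in by the tagger when a rule fires


# Invented example rules for the ambiguous noun "keel".
RULES = [
    Rule("keel", "N", frozenset({"rääkima", "õppima"}), "keel_language"),
    Rule("keel", "N", frozenset({"suu", "hammas"}), "keel_tongue"),
]


def tag_sentence(tokens, rules, window=3):
    """Tag each token with the sense of the first rule whose target matches
    and whose context lemmas occur within `window` tokens on either side."""
    for i, tok in enumerate(tokens):
        lo = max(0, i - window)
        # Lemmas of the neighbouring tokens, excluding the token itself.
        context = {
            t.lemma
            for j, t in enumerate(tokens[lo:i + window + 1], start=lo)
            if j != i
        }
        for rule in rules:
            if (rule.lemma == tok.lemma and rule.pos == tok.pos
                    and rule.context_lemmas & context):
                tok.sense = rule.sense
                break  # first matching rule wins (an assumed policy)
    return tokens


# Lemmatized, morphologically disambiguated input is assumed.
sentence = [Token("tema", "P"), Token("rääkima", "V"), Token("keel", "N")]
tagged = tag_sentence(sentence, RULES)
```

In this sketch, tokens for which no rule fires keep `sense` set to `None`, so they could be left for a lexicographer to resolve manually.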
