Institute of Computer Science - Graduation Theses Registry


Acronym extraction from texts written in Estonian

Name

Anti Torp

Abstract

The aim of this paper was to give an overview of acronym extraction in general and to apply that knowledge to texts written in Estonian. "Acronym" is a vague term, as there is no universal agreement on its definition. Here, an acronym is an abbreviation formed from the initial components of a phrase [2]. Examples are USA, meaning "United States of America", and Benelux, meaning "Belgium-Netherlands-Luxembourg". We distinguish between acronyms and their expansions: "United States of America" is the expansion of USA. These two acronyms are well known, so searching for their expansions is unnecessary, but long scientific texts often contain more specialised acronyms. In that case, it would be helpful to get an instant list of possible expansion candidates.
The simplest way to get expansion candidates is to search manually compiled databases. Beyond that there are automated extraction solutions, such as pattern- and rule-based ones. The general approach to automated acronym extraction is to identify the acronyms and recognise their expansions in the surrounding text. The problem gets more difficult when dealing with text written in another language (here, Estonian). The increased difficulty arises because many texts are translated from English, and some acronym expansions are translated while the acronyms themselves are not. It gets worse still because the Estonian translation of a regular English acronym may be a compound noun. Fortunately, not all cases are so extreme, and most acronyms are closely preceded or followed by their expansions.
Two metrics are used to describe acronym extractors: precision and recall. Precision measures how many of the extracted expansions are correct. Recall measures how many expansions were identified out of those that could have been identified.
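The two metrics can be illustrated with a minimal sketch. It assumes that an extraction counts as correct only when the (acronym, expansion) pair matches a hand-annotated reference set exactly. The function name `precision_recall` and the toy data are hypothetical and are not taken from the thesis.

```python
def precision_recall(extracted, gold):
    """Compute precision and recall for (acronym, expansion) pairs.

    extracted: pairs returned by an extractor (assumed input format)
    gold:      pairs a human annotator marked as correct (assumed input format)
    """
    extracted, gold = set(extracted), set(gold)
    correct = extracted & gold  # pairs that are both found and right
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


# Toy data for illustration only.
gold = {
    ("USA", "United States of America"),
    ("EL", "Euroopa Liit"),
    ("ÜRO", "Ühinenud Rahvaste Organisatsioon"),
    ("SKP", "sisemajanduse koguprodukt"),
}
extracted = {
    ("USA", "United States of America"),
    ("EL", "Euroopa Liit"),
    ("ÜRO", "Rahvaste Organisatsioon"),  # wrong expansion boundary
}

precision, recall = precision_recall(extracted, gold)
```

With this toy data, two of the three extracted pairs are correct (precision 2/3) and two of the four reference pairs are found (recall 1/2).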
Lastly, the paper describes an attempt to create a prototype extractor for Estonian that uses simple regular expressions to match and extract acronyms and their expansions from texts written in Estonian. The prototype was tested on about 30 short articles that contain acronyms. Because the main idea was for the prototype to match expansions without making too many mistakes, the patterns were written to achieve as high a precision as possible (the prototype scored 84.2%), leaving questionable expansions out. That is why the prototype's recall score was 66.6% (compared to the SVM's, which was 84.1%/83.4%).
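A precision-oriented regular-expression extractor of the kind described above might look like the following sketch. It is not the thesis's actual pattern set: the regular expression, the 2 to 6 letter limit, the initial-letter check, and the name `extract_pairs` are all assumptions made for illustration. It handles only the case where an expansion is immediately followed by its acronym in parentheses.

```python
import re

# Hypothetical pattern: 2 to 6 upper-case letters (including Estonian
# letters such as Õ, Ä, Ö, Ü) enclosed in parentheses.
ACRONYM_IN_PARENS = re.compile(r"\(([A-ZÕÄÖÜŠŽ]{2,6})\)")


def extract_pairs(text):
    """Return (acronym, expansion) pairs for the form 'expansion (ACRONYM)'.

    A pair is accepted only if the initials of the words directly before
    the parenthesis spell the acronym; anything else is left out, which
    favours precision over recall.
    """
    pairs = []
    for match in ACRONYM_IN_PARENS.finditer(text):
        acronym = match.group(1)
        words = text[:match.start()].split()
        candidate = words[-len(acronym):]  # one word per acronym letter
        if len(candidate) < len(acronym):
            continue
        initials = "".join(word[0] for word in candidate).upper()
        if initials == acronym:
            pairs.append((acronym, " ".join(candidate)))
    return pairs


found = extract_pairs("Euroopa Liit (EL) ja Ühinenud Rahvaste Organisatsioon (ÜRO)")
missed = extract_pairs("Riigi sisemajanduse koguprodukt (SKP) kasvas.")
```

Here `found` contains both pairs, while `missed` is empty: "sisemajanduse koguprodukt" consists of two compound words, so a one-initial-per-word check cannot match the three letters of SKP. This is the kind of case in which strict patterns give up recall to keep precision high.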

Graduation Thesis language

Estonian

Graduation Thesis type

Bachelor - Computer Science

Supervisor(s)

Mare Koit

Defence year

2011

