Adding UTF-8 Encoding Support to the Program Lingua::Ident
Name
Ando Paju
Abstract
The purpose of this paper is to analyze whether adding UTF-8 encoding support to
Lingua::Ident provides any benefit. Currently Lingua::Ident works on bytes internally
when it rates each candidate language.
The first part gives an overview of current language identification methods and
selects the algorithm developed by Ted Dunning, which uses Markov models, as the basis
for this paper.
The second part explains what a Markov model is and how Dunning's algorithm
works.
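To make the idea of Markov-model language identification concrete, the following is a minimal Python sketch and not the Lingua::Ident implementation (which is a Perl module). It assumes a fixed context of two characters, add-one smoothing, and tiny made-up training samples; the names train_models, log_prob and identify are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

ORDER = 2  # assumed context length: each character is predicted from the 2 before it


def train_models(samples, order=ORDER):
    """Build one character-level Markov model per language.

    samples maps a language code to a training string (illustrative data only).
    Returns (models, vocab_size); each model is (ngram_counts, context_counts).
    """
    # Shared alphabet over all training texts, plus one slot for unseen characters.
    vocab_size = len(set().union(*samples.values())) + 1
    models = {}
    for lang, text in samples.items():
        ngrams, contexts = Counter(), Counter()
        for i in range(len(text) - order):
            ctx, nxt = text[i:i + order], text[i + order]
            ngrams[(ctx, nxt)] += 1
            contexts[ctx] += 1
        models[lang] = (ngrams, contexts)
    return models, vocab_size


def log_prob(text, model, vocab_size, order=ORDER):
    """Log-probability of text under one model, with add-one smoothing."""
    ngrams, contexts = model
    total = 0.0
    for i in range(len(text) - order):
        ctx, nxt = text[i:i + order], text[i + order]
        total += math.log((ngrams[(ctx, nxt)] + 1) / (contexts[ctx] + vocab_size))
    return total


def identify(text, models, vocab_size):
    """Return the language whose model assigns the text the highest probability."""
    return max(models, key=lambda lang: log_prob(text, models[lang], vocab_size))


# Toy training data, far too small for real use.
samples = {
    "et": "see on väike eestikeelne lause ja veel üks lause",
    "en": "this is a small english sentence and another sentence",
}
models, vocab_size = train_models(samples)
guess_et = identify("eestikeelne lause", models, vocab_size)
guess_en = identify("english sentence", models, vocab_size)
```

The sketch operates on Python strings, so each step of the model is a character. Running the same code on text.encode("utf-8") instead would make each step a byte, which corresponds to the byte-based behaviour described for the original program.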
The third part lists possible disadvantages of Lingua::Ident for the Estonian
language and proposes the changes needed to use umlauts (and other characters not
present in the original ASCII encoding) for language identification in UTF-8
encoded documents.
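The difference between byte-based and character-based processing can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the example word and the helper name ngrams are hypothetical and do not come from Lingua::Ident.

```python
def ngrams(seq, n):
    """Return all contiguous n-grams of a string or bytes object."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


word = "müüa"                    # Estonian word containing two umlauts
as_bytes = word.encode("utf-8")  # each "ü" becomes the two bytes 0xC3 0xBC

char_bigrams = ngrams(word, 2)      # "mü", "üü", "üa"
byte_bigrams = ngrams(as_bytes, 2)  # several of these split an "ü" in the middle

print(len(word), len(as_bytes))              # 4 characters, 6 bytes
print(len(char_bigrams), len(byte_bigrams))  # 3 character bigrams, 5 byte bigrams
```

In the byte view, a two-byte context can consist of the second half of one "ü" and the first half of the next, so part of the model's context is spent on encoding structure rather than on letters. In the character view, every context element is a whole letter.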
The fourth part presents experiments with the modified Lingua::Ident to see whether
adding encoding support made any difference.
The experiments showed that adding UTF-8 encoding support to Lingua::Ident
provided a minor benefit in identifying the Estonian language. The benefit is
probably greater for languages that use more multi-byte UTF-8 characters.
Graduation Thesis language
Estonian
Graduation Thesis type
Bachelor - Computer Science
Supervisor(s)
Heiki-Jaan Kaalep
Defence year
2013