Evaluation of Automatic Sentence and Word Tokenization on the Corpus of New Media Language

Name
Kairit Peekman
Abstract
Many texts on the web do not follow standard orthography (e.g. forum posts, user comments, chat messages). This is so-called new media language, or Internet language. This Bachelor's thesis examines how well the tokenizers of three language processing tools (EstNLTK, UDPipe, StanfordNLP) handle new media language texts. EstNLTK's word tokenizer is rule-based and its sentence tokenizer is model-based with rule-based post-processing, whereas UDPipe and StanfordNLP use pre-trained Estonian language models. All three tools still have room for improvement in sentence tokenization of new media language texts, but EstNLTK and StanfordNLP performed better than UDPipe. Word tokenization results differed less between the tools and were generally high, with F-scores above 95%.
Graduation Thesis language
Estonian
Graduation Thesis type
Bachelor - Computer Science
Supervisor(s)
Kairit Sirts
Defence year
2020
 