Institute of Computer Science - Graduation Theses Registry


Evaluation of Automatic Sentence and Word Tokenization on the Corpus of New Media Language

Name

Kairit Peekman

Abstract

Many texts on the web are not orthographically correct (e.g. forum posts, user comments, chats). This is so-called new media language, or Internet language. This Bachelor's thesis examines how well the tokenizers of three language processing tools (EstNLTK, UDPipe, StanfordNLP) handle new media language texts. The EstNLTK word tokenizer is rule-based and its sentence tokenizer is model-based with rule-based post-processing, whereas UDPipe and StanfordNLP use pre-trained Estonian language models. All three still have room for improvement in sentence tokenization of new media language texts, but EstNLTK and StanfordNLP performed better than UDPipe. The word tokenization results differed less and were generally high, with F-scores above 95%.
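As an illustration of the F-score mentioned in the abstract, the sketch below shows one common way to score a word tokenizer against a gold standard: each token is mapped to its character span in the text, and a predicted token counts as correct only if a gold token has exactly the same span. The abstract does not state how the thesis computed its scores, so this matching scheme is an assumption, and the function names `token_spans` and `tokenization_f_score` as well as the example sentence are hypothetical.

```python
def token_spans(text, tokens):
    """Map each token to its (start, end) character offsets in text.

    Assumes the tokens appear in the text in order and unmodified.
    """
    spans = []
    pos = 0
    for tok in tokens:
        start = text.index(tok, pos)  # raises ValueError if a token is missing
        end = start + len(tok)
        spans.append((start, end))
        pos = end
    return spans


def tokenization_f_score(text, gold_tokens, predicted_tokens):
    """Return (precision, recall, F-score) based on exact span matches."""
    gold = set(token_spans(text, gold_tokens))
    pred = set(token_spans(text, predicted_tokens))
    correct = len(gold & pred)
    if correct == 0:
        return 0.0, 0.0, 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical new media example: repeated punctuation and an emoticon.
text = "tere!!! kuidas läheb :)"
gold = ["tere", "!!!", "kuidas", "läheb", ":)"]
predicted = ["tere", "!", "!", "!", "kuidas", "läheb", ":", ")"]

precision, recall, f_score = tokenization_f_score(text, gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} F={f_score:.3f}")
```

In this example the predicted tokenization splits "!!!" and ":)" into single characters, so only three of its eight tokens match gold spans. Sentence tokenization can be scored the same way by treating sentences as the units instead of words.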

Graduation Thesis language

Estonian

Graduation Thesis type

Bachelor - Computer Science

Supervisor(s)

Kairit Sirts

Defence year

2020

