Creation and Analysis of the Estonian Subreddit Corpus

Name
Tauno Tamm
Abstract
Reddit is the world's largest forum, visited by about 1.2 billion users monthly. The largest Estonian subreddit is r/Eesti. In this master's thesis, a language corpus was created from r/Eesti data and then analyzed. The analysis addressed how and when posts are made and what they discuss. To answer these research questions, several transformer-based models were fine-tuned for sentiment analysis, the Python library Lingua was used for language detection, and BERTopic was employed for topic analysis. The results revealed that r/Eesti can be considered bilingual, as a significant portion of posts and comments are in English. The sentiment analysis indicated that users posting and commenting in Estonian are mostly negative, while those who write in English tend to be neutral, with a slight lean towards positivity. In both languages, “Education” is the most common topic.
Graduation Thesis language
Estonian
Graduation Thesis type
Master - Data Science
Supervisor(s)
Siim Orasmaa
Defence year
2024
 