Training small language models on high-quality or limited-vocabulary datasets for Estonian

Organization
TartuNLP
Abstract
While large language models are knowledgeable in almost any domain one can think of, recent work shows that small language models can also generate meaningful text if the training data has a limited vocabulary or domain, or is of very high quality (see https://arxiv.org/pdf/2305.07759.pdf and https://arxiv.org/pdf/2306.11644.pdf). The goal of this thesis is to investigate this question for Estonian: the student will generate or collect new dataset(s), train language models of different (small) sizes on them, and evaluate the resulting models on various aspects.
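To illustrate the intuition behind the cited work (not a method prescribed by this topic description): when the vocabulary and domain are very narrow, even an extremely small model can produce fluent-looking text. The sketch below uses a toy bigram model over a hypothetical few-word corpus; the corpus and function names are invented for illustration only.

```python
from collections import defaultdict, Counter

# Illustrative toy corpus with a tiny, repetitive vocabulary
# (hypothetical example, not part of the thesis description).
corpus = (
    "the cat sat on the mat . the dog sat on the rug . "
    "the cat saw the dog . the dog saw the cat ."
).split()

# Count bigram transitions: counts[w1][w2] = how often w2 follows w1.
counts = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    counts[w1][w2] += 1

def generate(start, n_words):
    """Greedy generation: always pick the most frequent successor."""
    out = [start]
    for _ in range(n_words):
        successors = counts[out[-1]]
        if not successors:
            break
        out.append(successors.most_common(1)[0][0])
    return " ".join(out)
```

With such a restricted vocabulary, `generate("the", 4)` already yields the grammatical string "the cat sat on the"; the thesis would of course use neural models and real Estonian data, but the same limited-vocabulary effect is what makes small models viable.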
Graduation thesis defence year
2023-2024
Supervisor
Hele-Andra Kuulmets
Spoken language(s)
Estonian, English
Requirements for candidates
Level
Masters
Keywords

Contact for applications

Name
Hele-Andra Kuulmets
Phone
E-mail
hele-andra.kuulmets@ut.ee