Training small language models on high-quality or limited-vocabulary datasets for Estonian
Although large language models are knowledgeable in almost any domain, recent work shows that small language models can also generate meaningful text when the training data has a limited vocabulary or domain, or is of very high quality (see https://arxiv.org/pdf/2305.07759.pdf and https://arxiv.org/pdf/2306.11644.pdf). The goal of this thesis is to investigate this for Estonian: the student will generate or collect one or more new datasets, use them to train language models of several (small) sizes, and evaluate the models on various aspects.
Thesis defense year
Estonian, English