Training small language models on high-quality or limited-vocabulary datasets for Estonian

Organisation name
TartuNLP
Summary
Large language models are knowledgeable in almost any domain, but recent papers show that small language models can also generate meaningful text when the training data is limited in vocabulary or domain, or is of very high quality (see https://arxiv.org/pdf/2305.07759.pdf and https://arxiv.org/pdf/2306.11644.pdf). The goal of this thesis is to investigate this for Estonian: the student will generate or collect one or more new datasets, use them to train language models of several (small) sizes, and evaluate the models on various aspects.
Thesis defence year
2023-2024
Supervisor
Hele-Andra Kuulmets
Communication language(s)
Estonian, English
Requirements for applicants
Level
Master's
Keywords

Application contact

 
Name
Hele-Andra Kuulmets
Phone
E-mail
hele-andra.kuulmets@ut.ee