Training small language models on high-quality or limited-vocabulary datasets for Estonian

Organization
TartuNLP
Abstract
While large language models are knowledgeable in almost any domain one can think of, recent work shows that small language models can also generate meaningful text if the training data has a limited vocabulary or domain, or is of very high quality (see https://arxiv.org/pdf/2305.07759.pdf and https://arxiv.org/pdf/2306.11644.pdf). The goal of this thesis is to investigate this question for Estonian: the student will generate or collect new dataset(s), train language models of different (small) sizes on them, and evaluate the resulting models on various aspects.
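To illustrate the intuition behind the cited work (not a method prescribed by this topic description): when the vocabulary and domain are very narrow, even an extremely small model can produce fluent-looking text. The sketch below uses a toy bigram model over a hypothetical few-word corpus; the corpus and function names are invented for illustration only.

```python
from collections import defaultdict, Counter

# Illustrative toy corpus with a tiny, repetitive vocabulary
# (hypothetical example, not part of the thesis description).
corpus = (
    "the cat sat on the mat . the dog sat on the rug . "
    "the cat saw the dog . the dog saw the cat ."
).split()

# Count bigram transitions: counts[w1][w2] = how often w2 follows w1.
counts = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    counts[w1][w2] += 1

def generate(start, n_words):
    """Greedy generation: always pick the most frequent successor."""
    out = [start]
    for _ in range(n_words):
        successors = counts[out[-1]]
        if not successors:
            break
        out.append(successors.most_common(1)[0][0])
    return " ".join(out)
```

With such a restricted vocabulary, `generate("the", 4)` already yields the grammatical string "the cat sat on the"; the thesis would of course use neural models and real Estonian data, but the same limited-vocabulary effect is what makes small models viable.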
Graduation thesis defence year
2023-2024
Supervisor
Hele-Andra Kuulmets
Spoken language(s)
Estonian, English
Requirements for candidates
Level
Masters
Keywords

Contact for applications

Name
Hele-Andra Kuulmets
Phone
E-mail
hele-andra.kuulmets@ut.ee