Practical NLP Models for Estonian Language

Organisation name
TartuNLP
Abstract
Currently, few pre-trained language models are available for the Estonian language; multilingual models, however, often do include Estonian. Such multilingual language models have been shown to perform well on various Estonian NLP tasks (Kittask et al., 2020, Evaluating Multilingual BERT for Estonian). However, they are not optimized for use in single-language practical applications, because they still contain the vocabulary and embeddings for all the languages they were trained on. This proposal suggests several modifications to multilingual language models to solve that. The core modification entails removing from a given model the tokens (and their embeddings) that do not appear in Estonian-language texts.
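The core modification can be sketched as follows: tokenize an Estonian corpus, collect the set of token IDs that actually occur, and keep only the corresponding rows of the embedding matrix (plus special tokens). This is a minimal illustration on a toy embedding matrix; the function name, the toy data, and the corpus are hypothetical, and in practice the embeddings and token IDs would come from a real multilingual model and its tokenizer (e.g. via the Hugging Face transformers library).

```python
import numpy as np

def prune_vocabulary(embeddings, corpus_token_ids, special_ids):
    """Keep only the embedding rows for tokens that occur in the corpus
    (plus special tokens); return the pruned matrix and an old->new ID map."""
    seen = set(special_ids)
    for ids in corpus_token_ids:
        seen.update(ids)
    kept = sorted(seen)
    old_to_new = {old: new for new, old in enumerate(kept)}
    return embeddings[kept], old_to_new

# Toy example: a 10-token vocabulary with 4-dimensional embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(10, 4))

# Hypothetical tokenized Estonian sentences (IDs into the toy vocabulary).
corpus = [[2, 5, 7], [5, 8]]
pruned, mapping = prune_vocabulary(emb, corpus, special_ids=[0, 1])

print(pruned.shape)  # (6, 4): 4 corpus tokens + 2 special tokens kept
print(mapping)       # {0: 0, 1: 1, 2: 2, 5: 3, 7: 4, 8: 5}
```

The old-to-new ID mapping is the essential by-product: the tokenizer's vocabulary must be remapped with it so that future inputs index into the smaller embedding matrix consistently.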
Year of thesis defence
2022-2023
Supervisors
Aleksei Dorkin, Kairit Sirts
Language(s) of communication
English
Requirements for candidates
A decent knowledge of Python is mandatory. A reasonable level of familiarity with transformer-based language models and the corresponding approaches to tokenization is strongly advised (this is not an opportunity to learn these completely from scratch).
Level
Master's
Keywords
#transformers #tokenizers #embeddings #language_model

Application contact

Name
Aleksei Dorkin
Phone
E-mail
aleksei.dorkin@ut.ee
More details
https://docs.google.com/document/d/19O8Llco9ZKxpeoZsRjw9GhxccQYsLA_eVOJ1-0WODQw