Preparing Text Data for Training Large Language Models

Name
Tanel Pastarus
Abstract
This bachelor’s thesis focuses on restoring the original order of translated text data by referencing the original text corpus documents. After the translation process, some sentences contained errors, which the author tried to fix by processing them. Additionally, a pilot test was conducted by fine-tuning three GPT-2 models on the processed data to assess the viability of using translated text data for training language models.
Graduation Thesis language
Estonian
Graduation Thesis type
Bachelor - Computer Science
Supervisor(s)
Mark Fišel
Defence year
2024
 
PDF