Paragraph-Level Translation of Low-Resource Finno-Ugric Languages

Name
Dmytro Pashchenko
Abstract
The emergence of massively multilingual neural machine translation models has made it possible to efficiently translate many languages simultaneously, including those with extremely limited resources. The recent record holder, MADLAD-400, which spans over 400 languages, remains largely unexplored. In this work, we investigate the capabilities of MADLAD by fine-tuning it to translate four low-resource Finno-Ugric languages (Karelian Proper, Livvi, Ludian, and Veps, none of which is included in MADLAD's collection) into Russian and back. We also explore the impact of paragraph-level translation on the model's performance, leveraging the document-level capabilities of MADLAD. We find that (1) the MADLAD-based system achieves results comparable to those of state-of-the-art models and (2) the paragraph-level version of the system outperforms the sentence-level version by up to 3 BLEU points, significantly improving the consistency between sentences.
Graduation Thesis language
English
Graduation Thesis type
Master - Computer Science
Supervisor(s)
Mark Fishel, Elizaveta Yankovskaya
Defence year
2024
 