Low-Resource Grammatical Error Correction via Synthetic Pre-training and Monolingual Zero-shot Translation

Name
Agnes Luhtaru
Abstract
We compare two approaches to training a grammatical error correction (GEC) model without annotated data, both as independent systems and as initialisation for fine-tuning with error correction examples. The first method we explore is pre-training on mainly language-independent synthetic data. The second is correcting errors with a multilingual neural machine translation (NMT) model via monolingual zero-shot translation. We found that the model trained only on synthetic data suffers from low recall but achieves decent precision. The NMT model behaves in the opposite way: it corrects mistakes with high recall but adds many unnecessary edits. Fine-tuning narrows the differences between the models: the synthetic model gains recall, and the NMT model's precision increases. After fine-tuning, the model trained on artificial data remains more precise, and its recall is only slightly lower, making it the more usable of the two options.
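The abstract does not name the NMT system used; as a minimal illustrative sketch of monolingual zero-shot translation, the snippet below runs an ungrammatical English sentence through a publicly available multilingual NMT model (facebook/m2m100_418M via Hugging Face transformers, an assumed stand-in rather than the thesis's model), forcing the target language to equal the source language so that the model "translates" English into English and, in doing so, can normalise grammatical errors.

```python
# Minimal sketch of monolingual zero-shot translation for GEC.
# Assumption: facebook/m2m100_418M stands in for the thesis's multilingual
# NMT model, which the abstract does not name.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model_name = "facebook/m2m100_418M"
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

source = "She go to school every days."
tokenizer.src_lang = "en"  # source language: English
inputs = tokenizer(source, return_tensors="pt")

# Zero-shot monolingual translation: force the decoder to also produce
# English, so the model rewrites English into English, which tends to
# normalise (i.e. correct) ungrammatical input.
outputs = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.get_lang_id("en"),
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```

Consistent with the abstract's finding, such zero-shot output tends to over-edit: it catches many genuine errors (high recall) but also rewrites acceptable phrasing, adding unnecessary edits (lower precision).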
Graduation Thesis language
English
Graduation Thesis type
Master - Computer Science
Supervisor(s)
Mark Fišel
Defence year
2022