Low-Resource Grammatical Error Correction via Synthetic Pre-training and Monolingual Zero-shot Translation

Name
Agnes Luhtaru
Abstract
We compare two approaches to training a grammatical error correction (GEC) model without annotated data, both as independent systems and as initialisation for fine-tuning with error correction examples. The first method we explore is pre-training on mainly language-independent synthetic data. The second is correcting errors with a multilingual neural machine translation (NMT) model via monolingual zero-shot translation. We found that the model trained only on synthetic data suffers from low recall but achieves decent precision. The NMT model behaves in the opposite way: it corrects mistakes with high recall but adds many unnecessary edits. Fine-tuning narrows the differences between the models: the synthetic model gains recall, and the NMT model's precision increases. After fine-tuning, the model trained on artificial data remains more precise, and its recall is only slightly lower, making it the more usable of the two options.
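The abstract does not name the NMT system used; as a minimal illustrative sketch of monolingual zero-shot translation, the snippet below runs an ungrammatical English sentence through a publicly available multilingual NMT model (facebook/m2m100_418M via Hugging Face transformers, an assumed stand-in rather than the thesis's model), forcing the target language to equal the source language so that the model "translates" English into English and, in doing so, can normalise grammatical errors.

```python
# Minimal sketch of monolingual zero-shot translation for GEC.
# Assumption: facebook/m2m100_418M stands in for the thesis's multilingual
# NMT model, which the abstract does not name.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model_name = "facebook/m2m100_418M"
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

source = "She go to school every days."
tokenizer.src_lang = "en"  # source language: English
inputs = tokenizer(source, return_tensors="pt")

# Zero-shot monolingual translation: force the decoder to also produce
# English, so the model rewrites English into English, which tends to
# normalise (i.e. correct) ungrammatical input.
outputs = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.get_lang_id("en"),
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```

Consistent with the abstract's finding, such zero-shot output tends to over-edit: it catches many genuine errors (high recall) but also rewrites acceptable phrasing, adding unnecessary edits (lower precision).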
Graduation Thesis language
English
Graduation Thesis type
Master - Computer Science
Supervisor(s)
Mark Fišel
Defence year
2022