Estonian Synthetic Error Generation by Prompting for Grammatical Error Correction

Martin Vainikko
Estonian grammatical error correction (GEC) lacks sufficient data to train end-to-end models. However, recent advances in large language models (LLMs) offer new opportunities. We use OpenAI’s GPT models (GPT-3.5-Turbo, GPT-4-Turbo, and GPT-4) to generate synthetic errors and analyze these errors across model versions, prompting strategies, and data domains. By fine-tuning models on these synthetic datasets and conducting human evaluations, we assess the effectiveness of various prompting strategies for synthetic error generation. Our findings indicate that, within the GEC domain, the errors generated by GPT models are comparable to those made by humans. However, human evaluations also revealed that GPT models produce problematic edits. Both results highlight significant potential for further research in this area.
Graduation Thesis language
Graduation Thesis type
Master - Computer Science
Agnes Luhtaru, Mark Fišel
Defence year