Unsupervised Machine Translation Using Cross-Lingual N-gram Embeddings
Name
Andre Tättar
Abstract
The current best machine translation systems have achieved excellent results but rely heavily on large parallel corpora. There have been many attempts to achieve comparably good results for low-resource languages, but these attempts have been somewhat unsuccessful.
In this work, I propose a novel unsupervised machine translation system that produces translations using n-gram embeddings, obtained by learning cross-lingual embeddings. The solution requires only monolingual corpora; not a single parallel sentence is needed, which is achieved by using unsupervised word translation. I report my findings for the Estonian-English-Estonian language pair. The solution does not work as well as expected, but tests suggest that it works better than simple word-by-word translation.
Graduation Thesis language
English
Graduation Thesis type
Master - Computer Science
Supervisor(s)
Mark Fishel
Defence year
2018