Morpheme-Aware Subword Segmentation for Neural Machine Translation
Name
Kaspar Papli
Abstract
Neural machine translation combined with subword segmentation has recently achieved state-of-the-art translation performance. The commonly used segmentation algorithm, based on byte-pair encoding (BPE), does not consider the morphological structure of words. This occasionally causes misleading segmentation and incorrect translation of rare words. In this thesis, we explore the use of morphological structure in subword segmentation and develop a novel segmentation algorithm that prevents the misleading segmentations BPE produces by disregarding morphology. Analysis shows that the proposed algorithm decreases translation performance by 0.9 BLEU points, while producing subjectively more intuitive segmentations and mildly better translations for sentences whose baseline segmentation was inaccurate.
Graduation Thesis language
English
Graduation Thesis type
Bachelor - Computer Science
Supervisor(s)
Mark Fišel
Defence year
2017