Semi-Supervised Automatic Speech Recognition for Low Resource Languages

Name
Karel Roots
Abstract
Automatic speech recognition (ASR) is a field of computer science that focuses on developing methods and technologies that recognise spoken language and transform it into text. It has a wide variety of applications in human-computer interaction. It can be used to help people with disabilities understand speech through transcription and control computer systems with speech-based input.
Presently, the major challenge in developing speech recognition models for low resource languages such as Estonian is the lack of the large amounts of data needed for neural network based machine learning models. Recently, however, multilingual self-supervised pre-training of machine learning models on large datasets, followed by fine-tuning on small amounts of labelled data in the target language, has shown great promise in improving speech recognition for low resource languages.
In this thesis, the wav2vec 2.0 machine learning model architecture for speech recognition is explored. We evaluate and compare two models: a monolingual model that is pre-trained exclusively on unlabelled Estonian data, and a multilingual model that is pre-trained on unlabelled English and Estonian data. Both models are fine-tuned and evaluated on labelled Estonian data.
The experiments show that the multilingual model achieves an average word error rate of 12.1% and a character error rate of 5%, compared with 26.9% and 5.9%, respectively, for the monolingual model. These results represent a 53.6% relative decrease in word errors and a 15.3% relative decrease in character errors for the multilingual model, and highlight the potential of improving speech recognition for low resource languages by means of self-supervised learning on multilingual unlabelled speech data.
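For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how word error rate and character error rate are commonly computed: the Levenshtein edit distance between a reference transcript and a model hypothesis, divided by the reference length. The function names and the example sentences are illustrative assumptions and are not taken from the thesis, which may normalise text or aggregate scores differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of tokens."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Hypothetical example pair (not from the thesis data):
reference = "tere tulemast tartu ülikooli"
hypothesis = "tere tulemas tartu ülikooli"

wer = word_error_rate(reference, hypothesis)  # 1 wrong word out of 4
cer = char_error_rate(reference, hypothesis)  # 1 missing character out of 28
```

A single dropped character makes one whole word wrong, so it costs far more in word error rate than in character error rate. This is one reason the two metrics can differ widely for the same system.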
Graduation Thesis language
English
Graduation Thesis type
Master - Software Engineering
Supervisor(s)
Mark Fišel
Defence year
2022
 