Multi-speaker Text-to-speech Synthesis in Estonian

Name
Oleh Matsuk
Abstract
Text-to-speech synthesis is a challenging problem, but in recent years neural network models have provided convincing solutions to it. Specialized model architectures have been proposed that control the speaker identity features of the synthesized speech without training separate models, thus reducing the requirements for data volume and training time.
In this work, we implement a recently proposed neural architecture and train it on a limited amount of Estonian speech data to obtain a model capable of multi-speaker text-to-speech synthesis. We then evaluate the overall quality of the synthesized speech and the model's ability to assume speaker identity features for speakers both seen and unseen in training. We compare the results of multiple models trained on different sets of training data.
Graduation Thesis language
English
Graduation Thesis type
Master - Computer Science
Supervisor(s)
Mark Fišel
Defence year
2021
 