Creation of HMM-based Speech Model for Estonian Text-to-Speech Synthesis

Name
Tõnis Nurk
Abstract
The main purpose of this thesis is to create hidden Markov model based speech models for both male and female voice for Estonian text-to-speech synthesis. To begin with, a brief overview of text-to-speech synthesis process is given, alongside with description of components in a typical speech synthesis system and popular techniques in common use. Subsequently, the thesis focuses on statistical parametric speech synthesis in particular. The technique called hidden Markov model-based speech synthesis which is utilized in the system HTS (HMM-based Speech Synthesis System) is described. HTS is employed to generate voice models needed for this bachelor work. Discussed are the advantages and drawbacks of the system HTS and described are solutions to some of the problems. In the practical part of the work the creation of speech corpus in Institute of the Estonian Language is analyzed. Presented are the guidelines for creation of the corpus as well as its connection with text analysis module and related constraints. Described are the changes to phonetic system in use followed from development of text analysis modules. Given are the aspects related to recording the speech corpus and guidelines to evaluate the quality of the signal produced. Analyzed are the unforeseen findings that affect quality of the corpus and from these new guidelines for corpus construction are derived. Described is the process of adapting Estonian-related training data and linguistic specification to the system HTS. Linguistic specification is compatible with text analysis module in order to enable implementation of the trained voice models to Estonian speech synthesis applications. Experiments are carried out on data from different speakers, subcorpora and linguistic specifications. Presented are examples of generated speech for both male and female voice models trained with HTS. Speech model evaluation process has given expected findings. The most important factors that affect voice model quality are the quality and size of training corpus. It is followed by the ability of text analysis module to generate accurate pronounciation text and optimizing of phonetical and phonological contextual factors. In the end, proposed are two possible courses of action to improve the quality of HMM-based speech models trained: implementation of STRAIGHT vocoder to reduce buzzyness of synthesized speech and optimizing of phonetical and phonological contextual factors.
Graduation Thesis language
Estonian
Graduation Thesis type
Bachelor - Computer Science
Supervisor(s)
Meelis Mihkla
Defence year
2012
 
PDF Extras