An analysis-by-synthesis approach to vocal tract modeling for robust speech recognition

Ziad Al Bawab

doi:10.5339/qfarf.2012.AESNP6

Abstract

I. Background
Articulatory modeling is used to incorporate speech production information into automatic speech recognition (ASR) systems. It is believed that solutions to the problems of co-articulation, pronunciation variation, and other speaking-style-related phenomena rest in how accurately we capture the production process.

II. Objective
In this work we present a novel approach to speech recognition that incorporates knowledge of the speech production process. We discuss our contribution in moving from a purely statistical speech recognizer to one that is motivated by the physical generative process of speech.

III. Methods
We follow an analysis-by-synthesis approach. First, we attribute a physical meaning to the inner states of the recognition system, relating them to the configurations the human vocal tract takes over time. We utilize a geometric model of the vocal tract, adapt it to our speakers, and derive realistic vocal tract shapes from electromagnetic articulograph (EMA) measurements in the MOCHA database. Second, we synthesize speech from the vocal tract configurations using a physiologically motivated articulatory synthesis model of speech generation. Third, the observation probability of the Hidden Markov Model (HMM), which is used for phone classification, is a function of the distortion between the speech synthesized from the vocal tract configurations and the real speech. The output of each state in the HMM is based on a mixture of density functions. Each density models the distribution of the distortion at the output of one vocal tract configuration. During training, we initialize the model parameters using ground-truth articulatory knowledge. During testing, only the acoustic data is used.

IV. Results and conclusion
We present phone classification results using our novel dynamic articulatory model and following our adaptation procedure. The table below shows phone error rates (PER) for a female and a male speaker.
We use a three-state HMM with different observation densities and initialization techniques. We combine the probabilities of the baseline topology with the new ones. Our novel framework provides a 10.9% relative reduction in phone error rate over our baseline, which uses MFCC features. This is achieved using the distortion features with linear discriminant analysis (LDA) and cepstral mean normalization (CMN). We conclude that incorporating articulatory knowledge in the combined statistical framework we devised contributes to lowering the error rates in speech recognition.

| Features (dimension) | Topology | Observation Prob. / Initialization | Female PER | Male PER | Both PER | Improvement (relative) |
| Baseline Features: MFCC + CMN (13) | 3S-128M-HMM | Gaussian/VQ | 61.6% | 55.9% | 58.8% | |
| Distortion Features (1024) (Prob. Combination with MFCC, α = 0.2) | 3S-1024M-HMM | Exponential/Flat, Sparsity = 21% | 57.6% | 53.7% | 55.7% | 5.3% |
| Distortion Features (1024) (Prob. Combination with MFCC, α = 0.2) | 3S-1024M-HMM | Exponential/EMA, Sparsity = 51% | 58.3% | 53.9% | 56.1% | 4.6% |
| Adapted Distortion Features (1024) (Prob. Combination with MFCC, α = 0.25) | 3S-1024M-HMM | Exponential/EMA, Sparsity = 51% | 58.4% | 53.1% | 55.7% | 5.3% |
| Distortion Features + LDA + CMN (20) (Prob. Combination with MFCC, α = 0.6) | 3S-128M-HMM | Gaussian/VQ, Sparsity = 0% | 54.9% | 49.8% | 52.4% | 10.9% |
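The observation model described under Methods can be illustrated with a minimal Python sketch. It assumes that each vocal tract configuration contributes one synthesized feature vector (a "codeword"), that the distortion is a squared Euclidean distance, and that each HMM state scores a frame with a mixture of exponential densities over those distortions. The distortion measure, the function names, and the parameter names are illustrative assumptions; the abstract does not specify them.

```python
import math

def distortion_features(frame, codebook):
    """Distortion between one observed frame and every synthesized codeword.

    Assumption: squared Euclidean distance; the abstract does not name
    the distortion measure that was actually used.
    """
    return [
        sum((c - x) ** 2 for c, x in zip(codeword, frame))
        for codeword in codebook
    ]

def state_log_likelihood(distortions, weights, rates):
    """log sum_k w_k * rate_k * exp(-rate_k * d_k) for one HMM state.

    Components with zero weight are skipped. At least one weight must be
    positive. The sum is evaluated with the log-sum-exp trick for stability.
    """
    log_terms = [
        math.log(w) + math.log(r) - r * d
        for d, w, r in zip(distortions, weights, rates)
        if w > 0.0
    ]
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))

# Toy usage: two hypothetical codewords in a 2-dimensional feature space.
codebook = [[0.0, 0.0], [1.0, 1.0]]
frame = [0.0, 0.0]
d = distortion_features(frame, codebook)
log_b = state_log_likelihood(d, weights=[0.5, 0.5], rates=[1.0, 1.0])
```

In this sketch a frame that lies close to a codeword with a large mixture weight receives a high state likelihood, which is the sense in which the observation probability is a function of the synthesis distortion.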
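The arithmetic behind the reported improvements can be checked with a short Python sketch. The relative reduction is computed from the "Both PER" values in the table. The combination function is only an assumption: the abstract gives the weight α but not the combination rule, so a log-linear interpolation in which α weights the distortion-based stream is shown purely for illustration.

```python
def relative_reduction(baseline_per, new_per):
    """Relative reduction in phone error rate, in percent."""
    return (baseline_per - new_per) / baseline_per * 100.0

def combine_log_probs(log_p_mfcc, log_p_distortion, alpha):
    """Hypothetical log-linear combination of two observation scores.

    Assumption: alpha weights the distortion stream and (1 - alpha) the
    MFCC baseline; the source does not state the actual rule.
    """
    return (1.0 - alpha) * log_p_mfcc + alpha * log_p_distortion

baseline_both = 58.8  # MFCC + CMN baseline, both speakers
for new_both in (55.7, 56.1, 52.4):
    print(round(relative_reduction(baseline_both, new_both), 1))
# prints 5.3, 4.6, 10.9 (one value per line)
```

The printed values agree with the improvement column, which confirms that the improvements are relative to the 58.8% baseline rather than absolute differences.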
