Abstract

I. Background

Articulatory modeling is used to incorporate speech production information into automatic speech recognition (ASR) systems. It is believed that solutions to the problems of co-articulation, pronunciation variation, and other speaking-style-related phenomena rest in how accurately we capture the production process.

II. Objective

In this work we present a novel approach to speech recognition that incorporates knowledge of the speech production process. We discuss our contribution in going from a purely statistical speech recognizer to one that is motivated by the physical generative process of speech.

III. Methods

We follow an analysis-by-synthesis approach. Firstly, we attribute a physical meaning to the inner states of the recognition system, corresponding to the configurations the human vocal tract takes over time. We utilize a geometric model of the vocal tract, adapt it to our speakers, and derive realistic vocal tract shapes from electromagnetic articulograph (EMA) measurements in the MOCHA database. Secondly, we synthesize speech from the vocal tract configurations using a physiologically motivated articulatory synthesis model of speech generation. Thirdly, the observation probability of the Hidden Markov Model (HMM) used for phone classification is a function of the distortion between the speech synthesized from the vocal tract configurations and the real speech. The output of each state in the HMM is based on a mixture of density functions, where each density models the distribution of the distortion at the output of one vocal tract configuration (a sketch of this observation model follows the results table). During training, we initialize the model parameters using ground-truth articulatory knowledge. During testing, only the acoustic data is used.

IV. Results and conclusion

We present phone classification results using our novel dynamic articulatory model and following our adaptation procedure. The table below shows phone error rates (PER) for a female and a male speaker. We use a three-state HMM with different observation densities and initialization techniques, and we combine the probabilities of the baseline topology with the new ones (sketches of the probability combination and of the LDA + CMN front end also follow the table). Our novel framework provides a 10.9% relative reduction in phone error rate over our baseline, which uses MFCC features. This is achieved using the distortion features with linear discriminant analysis (LDA) and cepstral mean normalization (CMN). We conclude that incorporating articulatory knowledge in the combined statistical framework we devised contributes to lowering the error rates in speech recognition.

| Features (dimension) | Topology | Observation prob. / initialization | Female PER | Male PER | Both PER | Relative improvement |
|---|---|---|---|---|---|---|
| Baseline: MFCC + CMN (13) | 3S-128M-HMM | Gaussian / VQ | 61.6% | 55.9% | 58.8% | (baseline) |
| Distortion features (1024), prob. combination with MFCC, α = 0.2 | 3S-1024M-HMM | Exponential / flat; sparsity = 21% | 57.6% | 53.7% | 55.7% | 5.3% |
| Distortion features (1024), prob. combination with MFCC, α = 0.2 | 3S-1024M-HMM | Exponential / EMA; sparsity = 51% | 58.3% | 53.9% | 56.1% | 4.6% |
| Adapted distortion features (1024), prob. combination with MFCC, α = 0.25 | 3S-1024M-HMM | Exponential / EMA; sparsity = 51% | 58.4% | 53.1% | 55.7% | 5.3% |
| Distortion features + LDA + CMN (20), prob. combination with MFCC, α = 0.6 | 3S-128M-HMM | Gaussian / VQ; sparsity = 0% | 54.9% | 49.8% | 52.4% | 10.9% |
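The abstract describes each HMM state's output as a mixture of densities over the distortion between synthesized and observed speech, and the table lists exponential observation densities. The following is a minimal sketch of such a mixture-of-exponentials state likelihood; the distortion measure (squared Euclidean distance here) and the parameterization are assumptions, since the abstract does not specify them.

```python
import numpy as np

def distortion(obs_frame, synth_frame):
    """Assumed distortion between an observed speech frame and a frame
    synthesized from one vocal tract configuration (squared Euclidean
    distance in feature space; the abstract does not fix the measure)."""
    return float(np.sum((obs_frame - synth_frame) ** 2))

def state_observation_prob(obs_frame, synth_frames, weights, rates):
    """Mixture-of-exponentials observation probability for one HMM state.

    synth_frames : (K, D) array, speech synthesized from the K vocal tract
                   configurations associated with this state
    weights      : (K,) mixture weights, summing to 1
    rates        : (K,) exponential rate parameters, one per configuration
    """
    d = np.array([distortion(obs_frame, s) for s in synth_frames])
    # Each component is an exponential density evaluated at the distortion
    # produced by its vocal tract configuration.
    component_probs = rates * np.exp(-rates * d)
    return float(np.dot(weights, component_probs))
```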
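The table reports probability combination with the MFCC baseline under a weight α. The abstract does not give the combination rule; a convex log-linear weighting of the two streams' state log-likelihoods is a common choice and is assumed in this sketch.

```python
def combine_log_likelihoods(logp_distortion, logp_mfcc, alpha):
    """Assumed convex log-linear combination of the distortion-stream and
    MFCC-stream state log-likelihoods; alpha weights the distortion stream
    (0.2, 0.25, or 0.6 in the table). With alpha = 0 this reduces to the
    MFCC baseline."""
    return alpha * logp_distortion + (1.0 - alpha) * logp_mfcc
```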
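The best result projects the 1024-dimensional distortion features to 20 dimensions with LDA and applies CMN. Below is a sketch of such a front end under stated assumptions: scikit-learn's LinearDiscriminantAnalysis stands in for the paper's unspecified LDA estimation, phone identities are assumed as class labels, and per-utterance mean subtraction plays the role of CMN.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def lda_cmn_frontend(train_feats, train_labels, utterance_feats, n_dims=20):
    """Project 1024-dim distortion features to n_dims with LDA, then apply
    per-utterance mean normalization (the CMN analogue). Sketch only: the
    abstract names LDA and CMN but not the estimation details.

    train_feats     : (N, 1024) distortion feature vectors
    train_labels    : (N,) assumed phone labels (LDA needs > n_dims classes)
    utterance_feats : (T, 1024) frames of one utterance to transform
    """
    lda = LinearDiscriminantAnalysis(n_components=n_dims)
    lda.fit(train_feats, train_labels)
    projected = lda.transform(utterance_feats)  # (T, n_dims)
    # Subtract the utterance mean, mirroring cepstral mean normalization.
    return projected - projected.mean(axis=0, keepdims=True)
```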
