Abstract

Background and Objectives: In dysprosodic speech, the prosody does not match the expected intonation pattern, which can result in robotic-sounding speech in which each syllable is produced with equal stress. These errors manifest as inconsistent lexical stress, as measured by perceptual judgments and/or acoustic variables. Lexical stress is produced through variations in syllable duration, peak intensity and fundamental frequency. The presented technique automatically evaluates the unequal lexical stress patterns Strong-Weak (SW) and Weak-Strong (WS) in American English continuous speech using a multi-layer feed-forward neural network with seven acoustic features chosen to target the lexical stress variability between two consecutive syllables. Methods: The speech corpus used in this work is PTDB-TUG. Five female and three male speakers were chosen to form the training set, and one female and one male speaker were used for testing. The CMU Pronouncing Dictionary, which marks lexical stress levels, was used to assign a stress level to each syllable of every word in the speech corpus. Lexical stress is phonetically realized through the manipulation of signal intensity, the fundamental frequency (F0) and its dynamics, and the syllable/vowel duration. The nucleus duration, syllable duration, mean pitch, maximum pitch over the nucleus, peak-to-peak amplitude integral over the syllable nucleus, mean energy and maximum energy over the nucleus were calculated for each syllable in the collected speech. As lexical stress errors are identified by evaluating the variability between consecutive syllables in a word, we computed the pairwise variability index (PVI) for each acoustic measure. The PVI for any acoustic feature $f_i$ is given by $\mathrm{PVI}_i = \frac{f_{i1} - f_{i2}}{(f_{i1} + f_{i2})/2}$ (1), where $f_{i1}$ and $f_{i2}$ are the values of that feature for the first and second syllables, respectively.
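The PVI in Eq. (1) can be illustrated with a minimal Python sketch. The feature names, the dictionary layout and the helper names `pvi` and `pvi_vector` are hypothetical and are not taken from the original work. The sketch also assumes that each acoustic measure is a positive quantity, so that the mean in the denominator is non-zero.

```python
# Hypothetical names for the seven acoustic measures listed in the abstract.
FEATURES = (
    "nucleus_duration",
    "syllable_duration",
    "mean_pitch",
    "max_pitch",
    "amplitude_integral",
    "mean_energy",
    "max_energy",
)


def pvi(first: float, second: float) -> float:
    """Pairwise variability index of one feature, following Eq. (1).

    The difference between the two syllables is normalized by their mean,
    so the result is positive when the first syllable has the larger value.
    """
    mean = (first + second) / 2.0
    if mean == 0.0:
        raise ValueError("PVI is undefined when the two values average to zero")
    return (first - second) / mean


def pvi_vector(syllable_1: dict, syllable_2: dict) -> list:
    """Return the seven PVI values for a pair of consecutive syllables."""
    return [pvi(syllable_1[name], syllable_2[name]) for name in FEATURES]


# Illustrative check with made-up numbers (not measurements from the corpus).
print(pvi(3.0, 1.0))  # 1.0
```

Under these assumptions, a strongly positive PVI for a measure such as duration suggests that the first syllable is the more prominent one, and a negative value suggests the opposite.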
A multi-layer feed-forward neural network consisting of input, hidden and output layers was used to classify the stress patterns of the words in the database. Results: The presented system had an overall accuracy of 87.6%. It correctly classified 92.4% of the SW stress patterns and 76.5% of the WS stress patterns. Conclusions: A feed-forward neural network classified SW and WS stress patterns in American English continuous speech with an overall accuracy of 87.6%.
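The classifier's structure can be sketched as a forward pass through a small feed-forward network that takes the seven PVI values as input. The abstract does not report the hidden-layer size, the activation functions or the training procedure, so the five hidden units, the tanh and sigmoid activations, the random initial weights and all function names below are assumptions made purely for illustration. Training the weights (for example by backpropagation) is omitted.

```python
import math
import random

N_FEATURES = 7  # one PVI value per acoustic measure
N_HIDDEN = 5    # assumption: the hidden-layer size is not given in the abstract


def init_network(seed: int = 0) -> dict:
    """Create untrained weights; a real system would learn these from data."""
    rng = random.Random(seed)
    return {
        "w_hidden": [[rng.uniform(-0.5, 0.5) for _ in range(N_FEATURES)]
                     for _ in range(N_HIDDEN)],
        "b_hidden": [0.0] * N_HIDDEN,
        "w_out": [rng.uniform(-0.5, 0.5) for _ in range(N_HIDDEN)],
        "b_out": 0.0,
    }


def forward(net: dict, pvi_features: list) -> float:
    """Map seven PVI values to a score in (0, 1), read here as P(SW)."""
    if len(pvi_features) != N_FEATURES:
        raise ValueError("expected exactly seven PVI features")
    # Hidden layer: weighted sum of the inputs followed by tanh (assumed).
    hidden = [
        math.tanh(sum(w * x for w, x in zip(row, pvi_features)) + b)
        for row, b in zip(net["w_hidden"], net["b_hidden"])
    ]
    # Output layer: a single sigmoid unit (assumed).
    z = sum(w * h for w, h in zip(net["w_out"], hidden)) + net["b_out"]
    return 1.0 / (1.0 + math.exp(-z))


def classify(net: dict, pvi_features: list, threshold: float = 0.5) -> str:
    """Label a syllable pair as Strong-Weak or Weak-Strong."""
    return "SW" if forward(net, pvi_features) >= threshold else "WS"


net = init_network()
print(classify(net, [0.4, 0.3, 0.1, 0.2, 0.5, 0.2, 0.3]))  # "SW" or "WS"
```

Because the weights here are random, the printed label carries no meaning; the sketch only shows how seven pairwise features could flow through input, hidden and output layers to a two-class decision.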

/content/papers/10.5339/qfarf.2012.CSP22
2012-10-01
2020-09-26