Long audio alignment is a known problem in speech processing in which the goal is to align a long audio input with its corresponding text. Accurate alignments help in many speech processing tasks, such as audio indexing, acoustic model training for speech recognizers, and audio summarization and retrieval. In this work, we collected more than 1400 hours of conversational Arabic speech extracted from Al-Jazeerah podcasts, together with the corresponding non-aligned text transcriptions. Each podcast is 20 to 50 minutes long. Five episodes were manually aligned for use in evaluating alignment accuracy. For each episode, a split-and-merge segmentation approach is applied to divide the audio file into short segments with an average length of 5 seconds, with filled pauses at the boundaries of each segment. A pre-processing stage is applied to the corresponding raw transcriptions to remove titles, headings, images, speakers' names, etc. A biased language model (LM) is trained on the fly using the processed text. Conversational Arabic speech is mostly spontaneous and influenced by dialectal Arabic. Since phonemic pronunciation modeling is not always possible for non-standard Arabic words, a graphemic pronunciation model (PM) is used to generate one pronunciation variant for each word. Unsupervised acoustic model adaptation is applied to a pre-trained Arabic acoustic model (AM) using the current podcast audio. The adapted AM, along with the biased LM and the graphemic PM, is used in a fast speech recognition pass over the current podcast's segments. The recognizer's output is aligned with the processed transcriptions using the Levenshtein distance algorithm. This ensures error recovery: misalignment of a given segment does not affect the alignment of later segments. The proposed approach resulted in an alignment accuracy of 97% on the evaluation set.
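The alignment step described above can be illustrated with a minimal sketch. It assumes that the recognizer output is available as one word list per audio segment and that the processed transcription is a flat word list. The function names `align_words` and `transcript_for_segments` are hypothetical, and the rule of attaching unmatched transcript words to the most recent segment is an assumption made for illustration, not a detail stated in this work.

```python
def align_words(hyp, ref):
    """Word-level Levenshtein alignment.

    Returns a list of (hyp_index, ref_index) pairs; None marks a word
    that has no counterpart on the other side.
    """
    n, m = len(hyp), len(ref)
    # cost[i][j] = edit distance between hyp[:i] and ref[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace back from the bottom-right corner to recover one optimal path.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])):
            pairs.append((i - 1, j - 1))   # match or substitution
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((i - 1, None))    # extra recognizer word
            i -= 1
        else:
            pairs.append((None, j - 1))    # transcript word the recognizer missed
            j -= 1
    pairs.reverse()
    return pairs


def transcript_for_segments(segments, ref):
    """Assign transcript words to audio segments.

    segments: list of recognized word lists, one per audio segment.
    ref: the processed transcription as a flat word list.
    Assumption: unmatched transcript words go to the most recent segment.
    """
    hyp = [w for seg in segments for w in seg]
    owner = [k for k, seg in enumerate(segments) for _ in seg]
    out = [[] for _ in segments]
    current = 0
    for h, r in align_words(hyp, ref):
        if h is not None:
            current = owner[h]
        if r is not None:
            out[current].append(ref[r])
    return out
```

Because the alignment is computed globally over the whole episode, a badly recognized segment in the middle only disturbs the transcript words near it; segments that are recognized well later in the episode are still anchored to their own words.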
Most misalignment errors were found in segments with significant background noise (music, channel noise, cross-talk, etc.) or significant speech disfluencies (truncated words, repeated words, hesitations, etc.). For some speech processing tasks, such as acoustic model training, misaligned segments must be eliminated from the training data. A confidence scoring metric is therefore proposed to accept or reject the aligner's output. A score is provided for each segment; it is essentially the minimum edit distance between the recognizer's output and the aligned text. By using confidence scores, it was possible to reject the majority of misaligned segments, resulting in 99% alignment accuracy. This work was funded by a grant from the Qatar National Research Fund under its National Priorities Research Program (NPRP), award number NPRP 09-410-1-069. The reported experimental work was performed at Qatar University in collaboration with the University of Illinois.
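The confidence-based filtering can be sketched as follows. This is a minimal illustration, assuming each segment is represented as a pair of word lists (recognizer output, aligned transcript text). The names `edit_distance` and `filter_segments` and the threshold `max_distance` are hypothetical; the acceptance threshold actually used in this work is not stated.

```python
def edit_distance(hyp, ref):
    """Minimum word-level edit distance between two word lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            curr.append(min(prev[j - 1] + (h != r),  # match or substitution
                            prev[j] + 1,             # extra recognizer word
                            curr[j - 1] + 1))        # missing recognizer word
        prev = curr
    return prev[-1]


def filter_segments(segments, max_distance=1):
    """Keep segments whose score is within the (assumed) threshold.

    segments: list of (recognized_words, aligned_words) pairs.
    max_distance: hypothetical acceptance threshold, chosen for illustration.
    """
    return [(hyp, ref) for hyp, ref in segments
            if edit_distance(hyp, ref) <= max_distance]
```

A lower distance means the recognizer and the aligned text agree closely, so the segment is more likely to be correctly aligned. In practice the score might be normalized by segment length so that long and short segments are treated comparably; that choice is also an assumption here.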

