Translate or Transliterate? Modeling the Decision For English to Arabic Machine Translation

Mahmoud Azab

doi:10.5339/qfarf.2013.ICTP-066

Abstract

Translation of named entities (NEs) is important for NLP applications such as Machine Translation (MT) and Cross-lingual Information Retrieval. For MT, named entities are a major subset of the out-of-vocabulary terms. Due to their diversity, they cannot always be found in parallel corpora, dictionaries or gazetteers. Thus, state-of-the-art MT systems need to handle NEs in specific ways: (i) direct translation, which results in missing many out-of-vocabulary terms, and (ii) blind transliteration of out-of-vocabulary terms, which does not necessarily contribute to translation adequacy and may actually create noisy contexts for the language model and the decoder. For example, in the sentence "Dudley North visits North London", the MT system is expected to transliterate "North" in the former case and translate "North" in the latter. In this work, we present a classification-based framework that enables an MT system to automate the decision of translation vs. transliteration for different categories of NEs. We model the decision as a binary classification at the token level: each token within a named entity gets a decision label, to be translated or transliterated. Training the classifier requires a set of NEs with token-level decision labels. For this purpose, we automatically construct a bilingual lexicon of NEs paired with the translation/transliteration decisions from two different domains: we heuristically extract and label parallel NEs from a large word-aligned news parallel corpus, and we use a lexicon of bilingual NEs collected from Arabic and English Wikipedia titles. We then designed a procedure to clean up the noisy Arabic NE spans by part-of-speech verification and by heuristically filtering impossible items (e.g. verbs). For training, the data is automatically annotated using a variant of edit distance measuring the similarity between an English word and its Arabic transliteration. For the test set, we manually reviewed the labels and fixed the incorrect ones.
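The automatic annotation step can be illustrated with a minimal sketch. The abstract does not specify which edit-distance variant was used, so the code below substitutes plain Levenshtein distance over consonant skeletons; the letter map `AR2LAT`, the helper names, and the 0.5 threshold are all illustrative assumptions and not the authors' settings.

```python
# Sketch only: labels an (English token, Arabic token) pair as "transliterate"
# when the two are phonetically close, and as "translate" otherwise.
# AR2LAT is a rough, partial letter map chosen for illustration (an assumption).
AR2LAT = {
    "ا": "a", "ب": "b", "ت": "t", "ث": "th", "ج": "j", "د": "d",
    "ر": "r", "س": "s", "ش": "sh", "ف": "f", "ك": "k", "ل": "l",
    "م": "m", "ن": "n", "ه": "h", "و": "w", "ي": "y",
}


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def romanize(arabic: str) -> str:
    """Map Arabic letters to rough Latin equivalents; unmapped letters are dropped."""
    return "".join(AR2LAT.get(ch, "") for ch in arabic)


def skeleton(latin: str) -> str:
    """Lowercase and drop vowels, since short vowels are usually unwritten in Arabic."""
    return "".join(ch for ch in latin.lower()
                   if ch.isascii() and ch.isalpha() and ch not in "aeiou")


def similarity(english: str, arabic: str) -> float:
    """Edit distance normalized into a similarity score in [0, 1]."""
    a, b = skeleton(english), skeleton(romanize(arabic))
    longest = max(len(a), len(b)) or 1  # avoid division by zero
    return 1.0 - levenshtein(a, b) / longest


def label(english: str, arabic: str, threshold: float = 0.5) -> str:
    """Hypothetical decision rule: a high similarity suggests a transliteration."""
    return "transliterate" if similarity(english, arabic) >= threshold else "translate"


# The two occurrences of "North" from the example sentence:
print(label("North", "نورث"))  # Arabic side spells out the sound of the name
print(label("North", "شمال"))  # Arabic side is the word for the compass direction
```

Under these assumptions the first pair is labeled for transliteration and the second for translation, which mirrors the "Dudley North visits North London" example. Labels produced this way are noisy, which is consistent with the manual review applied to the test set.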
As part of our project, this bilingual corpus of named entities has been released to the research community. Using Support Vector Machines, we trained the classifier on a set of token-based, contextual and semantic features of the NEs. We evaluated our classifier in both the limited news domain and the diverse Wikipedia domain, and achieved a promising accuracy of 89.1%. To study the utility of our classifier in an English-to-Arabic statistical MT (SMT) system, we deployed it as a pre-translation component of the MT system. We automatically located the NEs in the source-language sentences and used the classifier to find those which should be transliterated. For such terms, we offer the transliterated form as an option to the decoder. Adding the classifier to the SMT pipeline resulted in a major reduction of out-of-vocabulary terms and a modest improvement in the BLEU score. This research is supported by the Qatar National Research Fund (a member of the Qatar Foundation) through grants NPRP-09-1140-1-177 and YSREP-1-018-1-004. The statements made herein are solely the responsibility of the authors.
