Translation of named entities (NEs) is important for NLP applications such as Machine Translation (MT) and Cross-lingual Information Retrieval. For MT, named entities are major subset of the out-of-vocabulary terms. Due to their diversity, they cannot always be found in parallel corpora, dictionaries or gazetteers. Thus, state-of-the-art MT systems need to handle NEs in specific ways: (i) direct translation which results in missing many out of vocabulary terms and (ii) blind transliteration of out of vocabulary terms which does not necessarily contribute to translation adequacy and may actually create noisy contexts for the language model and the decoder. For example, in the sentence "Dudley North visits North London", the MT system is expected to transliterate "North" in the former case, and translate "North" in the latter. In this work, we present a classification-based framework, that enables MT system to automate the decision of translation vs. transliteration for different categories of NEs. We model the decision as a binary classification at the token level: each token within a named-entity gets a decision label to be translated or transliterated. Training the classifier requires a set of NEs with token-level decision labels. For this purpose, we automatically construct a set of bilingual lexicon of NEs paired with the translation/transliteration decisions from two different domains: We heuristically extract and label parallel NEs from a large word aligned news parallel corpus and we use a lexicon of bilingual NEs collected from Arabic and Wikipedia titles. Then, we designed a procedure to clean up the noisy Arabic NE spans by part-of-speech verification, and heuristically filtering impossible items (e.g. verbs). For training, the data is automatically annotated using a variant of edit distance measuring the similarity between an English word and its Arabic transliteration. For test set, we manually reviewed the labels and fixed the incorrect ones. As part of our project, this bilingual corpus of named entities has been released to the research community. Using Support Vector Machines, we trained the classifier using a set of token-based, contextual and semantic features of the NEs. We evaluated our classifier both in the limited news and diverse Wikipedia domains, and achieved promising accuracy of 89.1%. To study the utility of using our classifier on an English to Arabic statistical MT system, we deployed it as a pre-translation component to the MT system. We automatically located the NEs in the source language sentences and used the classifier to find those which should be transliterated. For such terms, we offer the transliterated form as an option to the decoder. The impact of adding the classifier to the SMT pipeline resulted in a major reduction of out of vocabulary terms and a modest improvement of the BLEU score. This research is supported by the Qatar National Research Fund (a member of the Qatar Foundation) through grants NPRP-09-1140-1-177 and YSREP-1-018-1-004. The statements made herein are solely the responsibility of the authors.


Article metrics loading...

Loading full text...

Full text loading...

This is a required field
Please enter a valid email address
Approval was a Success
Invalid data
An Error Occurred
Approval was partially successful, following selected items could not be processed due to error