Utilizing Monolingual Gulf Arabic Data For Qatari Arabic-English Statistical Machine Translation

Kamla Al-mannai; Hassan Sajjad; Alaa Khader; Fahad Al Obaidli; Preslav Nakov; Stephan Vogel

doi:10.5339/qfarc.2014.ITPP1152

Abstract

With the recent rise of social media, Arabic speakers increasingly write in dialect, which has made dialectal Arabic (DA) a field of interest in natural language processing (NLP). DA NLP is still in its infancy in terms of both computational resources and tools; for example, dialectal morphological segmentation tools are lacking. In this work, we present a 2.7M-token collection of monolingual Gulf Arabic corpora extracted from the Web. The data is unique in that it is genre-specific (romance), although it covers several sub-dialects of Gulf Arabic, e.g., Qatari, Emirati, and Saudi. In addition to the monolingual Qatari data collected, we use existing parallel corpora pairing English with Qatari (0.47M tokens), Egyptian (0.3M tokens), Levantine (1.2M tokens), and Modern Standard Arabic (MSA) (3.5M tokens) to develop a Qatari Arabic to English statistical machine translation system (QA-EN SMT). We exploit the monolingual data to 1) develop a morphological segmentation tool for Qatari Arabic, 2) generate a uniform segmentation scheme for the Arabic variants employed, and 3) build a Qatari language model for the opposite translation direction. Proper morphological segmentation of Arabic plays a vital role in the quality of an SMT system. Using the collected monolingual Qatari data in combination with the QA side of the small existing QA-EN parallel data, we trained Morfessor, an unsupervised morphological segmentation model, to create a word segmenter for Qatari Arabic. We then extrinsically evaluate the impact of the resulting segmentation, as opposed to segmentation with tools built for MSA, on the quality of QA-EN machine translation. The results show that this unsupervised segmentation can yield better translation quality. Unsurprisingly, removing the monolingual data from the segmenter's training set reduces translation quality by 0.9 BLEU points.
Arabic dialect resources, when adapted for the translation of one dialect, are generally helpful in achieving better translation quality. We show that a standard segmentation scheme can improve vocabulary overlap between dialects by segmenting words that take different morphological forms in different dialects to a common root form. We train a generic segmentation model for Qatari Arabic and the other variants on the monolingual Qatari data and the Arabic side of the parallel corpora. We then train the QA-EN SMT system on each of the other parallel corpora (one at a time) in addition to the QA-EN parallel corpus, with the training data segmented by the generic statistical segmenter. We show a consistent improvement of 1.5 BLEU points over the respective baselines with no segmentation. In the reverse translation direction, i.e., EN-QA, we show that adding a small amount of in-domain data to the language model yields a relatively large improvement, whereas adding a large amount of out-of-domain data degrades quality.
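
The vocabulary-overlap argument can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: `toy_segment` is a hypothetical prefix stripper standing in for a learned segmenter (it is not Morfessor and not the model described here), the prefix list is arbitrary, and the two tiny token lists are invented toy data in Latin transliteration rather than samples from the corpora above.

```python
def toy_segment(word, prefixes=("w", "b", "Al")):
    """Strip at most one hypothetical prefix.

    A stand-in for a learned segmenter; the prefix list is an assumption.
    """
    for p in prefixes:
        # Require a stem of at least three characters after the prefix.
        if word.startswith(p) and len(word) > len(p) + 2:
            return [p + "+", word[len(p):]]
    return [word]


def vocab(tokens, segment=None):
    """Return the set of types, optionally after segmenting each token."""
    if segment is None:
        return set(tokens)
    return {morph for token in tokens for morph in segment(token)}


def overlap(vocab_a, vocab_b):
    """Fraction of types in vocab_a that also occur in vocab_b."""
    return len(vocab_a & vocab_b) / len(vocab_a) if vocab_a else 0.0


# Invented toy tokens for two hypothetical corpora (not real corpus data).
corpus_a = ["wktAb", "bbyt", "qlm"]
corpus_b = ["AlktAb", "byt", "qlm"]

raw = overlap(vocab(corpus_a), vocab(corpus_b))
seg = overlap(vocab(corpus_a, toy_segment), vocab(corpus_b, toy_segment))

print(f"overlap without segmentation: {raw:.2f}")
print(f"overlap with toy segmentation: {seg:.2f}")
```

On this toy input, surface forms that differ only in an attached prefix share a stem once segmented, so the proportion of shared types rises. Whether and how much this holds on real dialectal data depends on the segmenter and the corpora.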
