With the recent rise of social media, Arabic speakers have increasingly begun writing in their dialects, which has made dialectal Arabic (DA) a topic of interest in natural language processing (NLP). DA NLP is still in its infancy in terms of both computational resources and tools; for example, dialectal morphological segmentation tools are lacking. In this work, we present a 2.7M-token collection of monolingual corpora of Gulf Arabic extracted from the Web. The data is unique in that it is genre-specific (romance) while covering several sub-dialects of Gulf Arabic, e.g., Qatari, Emirati, and Saudi. In addition to the monolingual Qatari data collected, we use existing parallel corpora pairing Qatari (0.47M tokens), Egyptian (0.3M tokens), Levantine (1.2M tokens), and Modern Standard Arabic (MSA) (3.5M tokens) with English to develop a Qatari Arabic to English statistical machine translation system (QA-EN SMT). We exploit the monolingual data to 1) develop a morphological segmentation tool for Qatari Arabic, 2) generate a uniform segmentation scheme for the Arabic variants employed, and 3) build a Qatari language model for the opposite translation direction. Proper morphological segmentation of Arabic plays a vital role in the quality of an SMT system. Using the collected monolingual Qatari data together with the QA side of the small existing QA-EN parallel data, we trained Morfessor, an unsupervised morphological segmentation model, to create a word segmenter for Qatari Arabic. We then extrinsically evaluate the impact of the resulting segmentation, as opposed to segmentation with tools built for MSA, on the quality of QA-EN machine translation. The results show that this unsupervised segmentation can yield better translation quality. Unsurprisingly, removing the monolingual data from the segmenter's training set reduces translation quality by 0.9 BLEU points.
Arabic dialect resources, when adapted for the translation of one dialect, are generally helpful in achieving better translation quality. We show that a standard segmentation scheme can improve vocabulary overlap between dialects by segmenting words that take different morphological forms in different dialects into a common root form. We train a generic segmentation model for Qatari Arabic and the other variants on the monolingual Qatari data and the Arabic side of the parallel corpora. We train the QA-EN SMT system on the different parallel corpora (one at a time) in addition to the QA-EN parallel corpus, segmented with the generic statistical segmenter. We show a consistent improvement of 1.5 BLEU points over the respective baselines without segmentation. In the reverse translation direction, i.e. EN-QA, we show that adding a small amount of in-domain data to the language model yields a relatively large improvement, compared with the degradation caused by adding a large amount of out-of-domain data.
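To illustrate the vocabulary-overlap claim, the following is a minimal sketch in Python, not the system described here. It assumes a toy prefix-stripping segmenter in place of the Morfessor-trained model, and the token lists, the prefixes `AA` and `BB`, and the function names are all hypothetical placeholders rather than real dialect data. It shows how splitting dialect-specific affixes from a shared stem can raise the fraction of word types that two corpora have in common.

```python
def segment(word, prefixes):
    """Toy segmenter (a stand-in for a trained model): split off the
    longest matching prefix and mark it with a trailing '+'."""
    for p in sorted(prefixes, key=len, reverse=True):
        if word.startswith(p) and len(word) > len(p):
            return [p + "+", word[len(p):]]
    return [word]


def segment_corpus(tokens, prefixes):
    """Apply the toy segmenter to every token and flatten the result."""
    return [morph for word in tokens for morph in segment(word, prefixes)]


def type_overlap(corpus_a, corpus_b):
    """Fraction of distinct types in corpus_a that also occur in corpus_b."""
    types_a, types_b = set(corpus_a), set(corpus_b)
    return len(types_a & types_b) / len(types_a) if types_a else 0.0


# Hypothetical placeholder tokens: the two "dialects" attach different
# prefixes (AA vs. BB) to the same stems.
dialect_a = ["AAstem1", "AAstem2", "stem3"]
dialect_b = ["BBstem1", "BBstem2", "stem3"]
prefixes = {"AA", "BB"}

overlap_before = type_overlap(dialect_a, dialect_b)
overlap_after = type_overlap(
    segment_corpus(dialect_a, prefixes),
    segment_corpus(dialect_b, prefixes),
)
print(f"type overlap before segmentation: {overlap_before:.2f}")
print(f"type overlap after segmentation:  {overlap_after:.2f}")
```

In this toy setting only one of three types is shared before segmentation, while three of four are shared afterwards, because the stems become separate tokens common to both corpora. Real dialectal data would need a learned segmenter, since affix inventories are not known in advance.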

