QALB: Qatar Arabic Language Bank

Behrang Mohit

doi:10.5339/qfarf.2013.ICTP-032

Abstract

Automatic text correction has been attracting research attention for English and some other Western languages. Applications of automatic text correction range from improving language learning for humans, to reducing noise in the text input to natural language processing tools, to correcting grammatical and lexical choice errors in machine translation output. Despite the recent focus on some Arabic language technologies, Arabic automatic correction is still a fairly understudied research problem. Modern Standard Arabic (MSA) is a morphologically and syntactically complex language, which poses multiple writing challenges not only to language learners, but also to Arabic speakers, whose dialects differ substantially from MSA. We are currently creating resources to address these challenges. Our project has two components: the first is QALB (Qatar Arabic Language Bank), a large parallel corpus of Arabic sentences and their corrections, and the second is ACLE (Automatic Correction of Language Errors), an Arabic text correction system trained and tested on the QALB corpus. The QALB corpus is unique in that: a) it will be the largest Arabic text correction corpus available, spanning two million words; b) it will cover errors produced by native speakers, non-native speakers, and machine translation systems; and c) it will contain a trace of all the actions performed by the human annotators to achieve the final correction. This presentation describes the creation of two major components of the project: the web-based annotation interface and the annotation guidelines. QAWI (QALB Annotation Web Interface) is our web-based, language-independent annotation framework used for manual correction of the QALB corpus. Our framework provides intuitive interfaces for annotating text, managing a large number of human annotators, and performing quality control.
Our annotation interface, in particular, provides a novel token-based editing model for correcting Arabic text that allows us to reliably track all modifications. We demonstrate details of both the annotation and the administration interfaces as well as the back-end engine. Furthermore, we show how this framework is able to speed up the annotation process by employing automated annotators to correct basic Arabic spelling errors. We also discuss the evolution of our annotation guidelines from their early development through their actual use in group annotation. The guidelines cover a variety of linguistic phenomena, from spelling errors to dialectal variations and grammatical considerations. The guidelines also include a large number of examples to help annotators understand the general principles behind the correction rules rather than simply memorize them. The guidelines were written in parallel with the development of our web-based annotation interface and went through several iterations and revisions. We periodically provided new training sessions to the annotators and measured their inter-annotator agreement. Furthermore, the guidelines were updated and extended using feedback from the annotators and the inter-annotator agreement evaluations. This project is supported by the National Priority Research Program (NPRP grant 4-1058-1-168) of the Qatar National Research Fund (a member of the Qatar Foundation). The statements made herein are solely the responsibility of the authors.
