Building an Arabic Punctuated Corpus

Wajdi Zaghouani; Dana Awad

doi:10.5339/qfarc.2016.SSHAPP3148

Abstract

1. Introduction

Punctuation can be defined as the use of spacing and conventional signs to aid the understanding of handwritten and printed texts. Punctuation marks are used to create sense, clarity and stress in sentences, and also to structure and organize the text.

Punctuation rules vary with language and register, and some aspects of punctuation are stylistic choices. For a language such as Arabic, punctuation marks are a relatively modern innovation, since pre-modern Arabic did not use punctuation; as a result, punctuation rules in Arabic are not always applied consistently. From a Natural Language Processing (NLP) perspective, punctuation marks are useful in automatic sentence segmentation tasks, since sentence and phrase boundaries can be estimated from them. Moreover, as shown by a number of studies, the absence of punctuation can be confusing for both humans and computers.
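As a rough illustration of how punctuation marks can drive sentence segmentation, the following Python sketch splits text after sentence-final marks. The function name `split_sentences` and the chosen set of marks (period, exclamation mark, Latin and Arabic question marks) are assumptions made for this example; they are not taken from the QALB tooling.

```python
import re

# Sentence-final marks assumed for this sketch: period, exclamation mark,
# Latin question mark and Arabic question mark (U+061F).
# The pattern matches whitespace that directly follows one of these marks.
SENTENCE_END = re.compile(r"(?<=[.!?\u061F])\s+")


def split_sentences(text: str) -> list:
    """Split text at whitespace that follows a sentence-final mark."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


if __name__ == "__main__":
    sample = "ذهب الولد إلى المدرسة. هل عاد؟ نعم!"
    for sentence in split_sentences(sample):
        print(sentence)
```

Text without any punctuation comes back as a single segment, which is one way the absence of punctuation makes boundary detection harder for a rule of this kind.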

Furthermore, many NLP systems trained on well-formatted text often have problems when dealing with unstructured text. To build robust automatic punctuation systems, large-scale manually punctuated corpora are usually needed. In this abstract, we present our effort to build a large-scale, error-corrected punctuated corpus for Arabic. We present our punctuation annotation guidelines, designed specifically to improve inter-annotator agreement. The guidelines were used by trained annotators, and inter-annotator agreement was measured regularly to ensure annotation quality.

2. Corpus Description

In this work, we describe the 2M-word corpus developed for the Qatar Arabic Language Bank (QALB) project, a large-scale error annotation effort that aims to create a manually corrected corpus of errors, including punctuation errors, for a variety of Arabic texts (Zaghouani et al., 2014; Zaghouani et al., 2015). The goal of the annotation in this project is twofold: first, to correct the existing punctuation found in the text; then, to add missing punctuation where it is needed. The comments are selected from the available comments related to news stories. The native student essays data consists of 150k words extracted from the Arabic Learners Corpus (ALC). The non-native student essays data is a 150k-word corpus selected from the Arabic Learners Written Corpus (ALWC). The data is categorized by student level (beginner, intermediate, advanced), learner type (L2 vs. heritage), and essay type (description, narration, instruction).

Finally, the machine translation output data is derived from 100k words of English news articles taken from the collaborative journalism website Wikinews. The corpus includes 520 articles with an average of 192 words per article. The original English files were in HTML format and were exported to UTF-8 plain text so that they could be used later in the annotation tool. The collected corpus was then automatically translated from English to Arabic using the Google Translate API service.
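The HTML-to-plain-text export step described above can be sketched with Python's standard library. This is only an illustration under assumptions: the class `TextExtractor` and the helpers `html_to_text` and `save_utf8` are hypothetical names, and the source does not say which tool was actually used for the conversion.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def save_utf8(text: str, path: str) -> None:
    """Write the extracted text to a UTF-8 encoded plain-text file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
```

A real pipeline would probably also strip navigation menus and other page furniture, which this sketch does not attempt.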

3. Punctuation Annotation Guidelines

Our punctuation guidelines focus on the types of punctuation errors that are targeted, and describe how to correct them and when to add missing punctuation marks. Many annotated examples are provided in the guidelines to illustrate the various annotation rules and exceptions. Since Arabic punctuation rules are not always clearly defined, we adopted an iterative approach to developing the guidelines, with multiple rounds of annotation, revision and updating needed to reach a consistent set of directions.

To help our annotators with some complex punctuation rules, we wrote a summary of the most common rules for Arabic punctuation marks as an appendix to the guidelines.

The rules of punctuation vary with language and register. Moreover, aspects of punctuation use vary from author to author, and can be considered a stylistic choice.

While punctuation in the English or French language is guided by a series of grammar-related rules, in other languages such as Arabic, punctuation is a recent innovation as pre-modern Arabic did not use punctuation.

According to Awad (2013), punctuation rules and usage in Arabic are inconsistent, and omitting punctuation marks is a very frequent error. We use the standard general Arabic punctuation rules commonly used today and described in Awad (2013).

Punctuation errors are especially present in student essays and online news comments. This is mainly due to the fact that some punctuation mark rules are not clearly defined in Arabic writing references. We created a set of simple rules for correcting punctuation and adding missing ones.

4. Annotation Procedure

The lead annotator is also the annotation workflow manager of this project. He frequently evaluates the quality of the annotation, and monitors and reports on the annotation progress.

A clearly defined protocol is in place, including a routine for annotation job assignment and inter-annotator agreement evaluation. The lead annotator is also responsible for the corpus selection and normalization process, as well as for annotating the gold standard for the portion of the corpus used to compute the Inter-Annotator Agreement (IAA).

The annotators in this project are five university graduates with a good Arabic language background. To ensure annotation quality, each annotator went through an extensive training phase. Afterwards, the annotator's performance was closely monitored during an initial period before the annotator was allowed to join the official annotation production phase. Moreover, a dedicated online discussion group is frequently used by the annotation team to keep track of the punctuation questions and issues raised during the annotation process. This mechanism proved to improve communication between the annotators and the lead annotator.

The annotation framework includes two major components:

1. The annotation management interface, which assists the lead annotator in the general workflow process. It allows the user to upload, assign, monitor, evaluate and export annotation tasks.

2. The annotation interface, which is the actual annotation tool. It allows the annotators to manually correct the Arabic text, adding missing punctuation or correcting existing punctuation.

The entire annotation history is recorded in a database and can be exported to an XML file to keep a trace of all the correction actions for a given file.

5. Evaluation

To evaluate the punctuation annotation quality, we measure the inter-annotator agreement (IAA) on randomly selected files to ensure that the annotators are consistently following the annotation guidelines. A high annotation agreement is a good indicator of data quality. The IAA is measured over all pairs of annotations to compute the AWER (Average Word Error Rate). In this evaluation, the WER measures the punctuation error against all punctuation marks in the text. The average IAA result obtained was 89.84% (WER), computed over 10 files from each corpus (4,116 words in total), each annotated by at least three different annotators. Overall, the results showed that the annotators are consistently following the punctuation guidelines.
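The pairwise computation can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the evaluation script used in the project: punctuation marks are split off as separate tokens, WER is the token-level edit distance divided by the length of the first annotation in each pair, and the AWER is the mean over all pairs. The names `tokenize`, `edit_distance`, `wer` and `average_wer`, and the set of punctuation marks, are hypothetical. The source does not spell out its exact normalisation.

```python
import re
from itertools import combinations

# Punctuation set assumed for this sketch: common Latin marks plus the
# Arabic comma (U+060C), semicolon (U+061B) and question mark (U+061F).
TOKEN = re.compile(
    r"[.,;:!?\u060C\u061B\u061F]|[^\s.,;:!?\u060C\u061B\u061F]+"
)


def tokenize(text):
    """Split text into words and individual punctuation marks."""
    return TOKEN.findall(text)


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(ref_text, hyp_text):
    """Token-level error rate of hyp_text against ref_text."""
    ref, hyp = tokenize(ref_text), tokenize(hyp_text)
    if not ref:
        raise ValueError("reference annotation has no tokens")
    return edit_distance(ref, hyp) / len(ref)


def average_wer(annotations):
    """Mean WER over all unordered pairs of annotations of the same file."""
    pairs = list(combinations(annotations, 2))
    if not pairs:
        raise ValueError("need at least two annotations")
    return sum(wer(a, b) for a, b in pairs) / len(pairs)
```

Because WER depends on which annotation is treated as the reference, a symmetric variant (for example, normalising by the longer of the two token lists) may be preferable when no annotator is privileged.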

6. Conclusions

We presented our method for creating a manually punctuated Arabic corpus, including the writing of the guidelines, the annotation procedure, and the quality control procedure used to verify annotation quality. We showed that there is wide variation in the use of punctuation in Arabic texts: despite the existence of punctuation rules, the use of punctuation in Arabic is highly individual and depends on the style of the individual author.

7. References

Awad, D. (2013). La ponctuation arabe: histoire et règles.

Zaghouani, W., Mohit, B., Habash, N., Obeid, O., Tomeh, N., Rozovskaya, A., Farra, N., Alkuhlani, S., and Oflazer, K. (2014). Large scale Arabic error annotation: Guidelines and framework. In International Conference on Language Resources and Evaluation (LREC 2014).

Zaghouani, W., Habash, N., Bouamor, H., Rozovskaya, A., Mohit, B., Heider, A., and Oflazer, K. (2015). Correction annotation for non-native Arabic texts: Guidelines and corpus. In Proceedings of The 9th Linguistic Annotation Workshop, pages 129–139.
