On Arabic Multi-Genre Corpus Diacritization

Houda Bouamor; Wajdi Zaghouani; Mona Diab; Ossama Obeid; Kemal Oflazer; Mahmoud Ghoneim; Abdelati Hawwari

doi:10.5339/qfarc.2016.ICTPP2921

Abstract

One of the characteristics of writing in Modern Standard Arabic (MSA) is that the commonly used orthography is mostly consonantal and does not provide full vocalization of the text. It sometimes includes optional diacritical marks (henceforth, diacritics or vowels).

Arabic script consists of two classes of symbols: letters and diacritics. Letters comprise long vowels such as A, y, w as well as consonants. Diacritics on the other hand comprise short vowels, gemination markers, nunation markers, as well as other markers (such as hamza, the glottal stop which appears in conjunction with a small number of letters, dots on letters, elongation and emphatic markers) which in all, if present, render a more or less exact precise reading of a word. In this study, we are mostly addressing three types of diacritical marks: short vowels, nunation, and shadda (gemination).

Diacritics are extremely useful for text readability and understanding. Their absence in Arabic text adds another layer of lexical and morphological ambiguity. Naturally occurring Arabic text has some percentage of these diacritics present depending on genre and domain. For instance, religious text such as the Quran is fully diacritized to minimize chances of reciting it incorrectly. So are children's educational texts. Classical poetry tends to be diacritized as well. However, news text and other genre are sparsely diacritized (e.g., around 1.5% of tokens in the United Nations Arabic corpus bear at least one diacritic (Diab et al., 2007)).

In general, building models to assign diacritics to each letter in a word requires a large amount of annotated training corpora covering different topics and domains to overcome the sparseness problem. The currently available diacritized MSA corpora are generally limited to the newswire genres (those distributed by the LDC) or religion related texts such as Quran or the Tashkeela corpus. In this paper we present a pilot study where we annotate a sample of non-diacritized text extracted from five different text genres. We explore different annotation strategies where we present the data to the annotator in three modes: basic (only forms with no diacritics), intermediate (basic forms–POS tags), and advanced (a list of forms that is automatically diacritized). We show the impact of the annotation strategy on the annotation quality.

It has been noted in the literature that complete diacritization is not necessary for readability Hermena et al. (2015) as well as for NLP applications, in fact, (Diab et al., 2007) show that full diacritization has a detrimental effect on SMT. Hence, we are interested in discovering the optimal level of diacritization. Accordingly, we explore different levels of diacritization. In this work, we limit our study to two diacritization schemes: FULL and MIN. For FULL, all diacritics are explicitly specified for every word. For MIN, we explore what is a minimum and optimal number of diacritics that needs to be added in order to disambiguate a given word in context and make a sentence easily readable and unambiguous for any NLP application.

We conducted several experiments on a set of sentences that we extracted from five corpora covering different genres. We selected three corpora from the currently available Arabic Treebanks from the Linguistic Data Consortium (LDC). These corpora were chosen because they are fully diacritized and had undergone significant quality control, which will allow us to evaluate the anno tation accuracy as well as our annotators understanding of the task. We select a total of 16,770 words from these corpora for annotation. Three native Arabic annotators with good linguistic background annotated the corpora samples. Diab et al. (2007), define six different diacritization schemes that are inspired by the observation of the relevant naturally occurring diacritics in different texts. We adopt the FULL diacritization scheme, in which all the diacritics should be specified in a word. Annotators were asked to fully diacritize each word.

The text genres were annotated following the different strategies:

- Basic: In this mode, we ask for annotation of words where all diacritics are absent, including the naturally occurring ones. The words are presented in a raw tokenized format to the annotators in context.

- Intermediate: In this mode, we provide the annotator with words along with their POS information. The intuition behind adding POS is to help the annotator disambiguate a word by narrowing down on the diacritization possibilities.

- Advanced: In this mode, the annotation task is formulated as a selection task instead of an editng task. Annotators are provided with a list of automatically diacritized candidates and are asked to choose the correct one, if it appears in the list. Otherwise, if they are not satisfied with the given candidates, they can manually edit the word and add the correct diacritics. This technique is designed in order to reduce annotation time and especially reduce annotator workload. For each word, we generate a list of vowelized candidates using MADAMIRA (Pasha et al., 2014). MADAMIRA is able to achieve a lemmatization accuracy 99.2% and a diacritization accuracy of 86.3%. We present the annotator with the top three candidates suggested by MADAMIRA, when possible. Otherwise, only the available candidates are provided.

We also provided annotators with detailed guidelines, describing our diacritization scheme and specifying how to add diacritics for each annotation strategy. We described the annotation procedure and specified how to deal with borderline cases. We also provided in the guidelines many annotated examples to illustrate the various rules and exceptions.

In order to determine the most optimized annotation setup for the annotators, in terms of speed and efficiency, we test the results obtained following the three annotation strategies. These annotations are all conducted for the FULL scheme. We first calculated the number of words annotated per hour, for each annotator and in each mode. As expected, following the Advanced mode, our three annotators could annotate an average of 618.93 words per hour which is double those annotated in the Basic mode (only 302.14 words). Adding POS tags to the Basic forms, as in the Intermediate mode, does not accelerate the process much. Only − 90 more words are diacritized per hour compared to the basic mode.

Then, we evaluated the Inter-Annotator Agreement (IAA) to quantify the extent to which independent annotators agree on the diacritics chosen for each word. For every text genre, two annotators were asked to annotate independently a sample of 100 words.

We measured the IAA between two annotators by averaging WER (Word Error Rate) over all pairs of words. The higher the WER between two annotations, the lower their agreement. The results obtained show clearly that the Advanced mode is the best strategy to adopt for this diacritization task. It is the less confusing method on all text genres (with WER between 1.56 and 5.58).

We also conducted a preliminary study for a minimum diacritization scheme. This is a diacritization scheme that encodes the most relevant differentiating diacritics to reduce confusability among words that look the same (homographs) when undiacritized but have different readings. Our hypothesis in MIN is that there is an optimal level of diacritization to render a text unambiguous for processing and enhance its readability. We showed the difficulty in defining such a scheme and how subjective this task can be.

Acknowledgement

This publication was made possible by grant NPRP-6-1020-1-199 from the Qatar National Research Fund (a member of the Qatar Foundation).

oa On Arabic Multi-Genre Corpus Diacritization

Abstract

Metrics

Most Read This Month

Most Cited Most Cited RSS feed

Barriers and facilitators influencing the physical activity of Arabic adults: A literature review

AI and the evolution of journalistic practices

Multiple organ dysfunction syndrome: Contemporary insights on the clinicopathological spectrum

Effect of green marketing on consumer purchase behavior

Prevalence of Multi-Antibiotic Resistant Escherichia coli and Klebsiella species obtained from a Tertiary Medical Institution in Oyo State, Nigeria