Annotating a Multi-Topic Corpus for Arabic Natural Language Processing

Behrang Mohit; Nathan Schneider; Kemal Oflazer; Noah A. Smith

doi:10.5339/qfarf.2011.CSO4

Abstract

Human-annotated data is an important resource for most natural language processing (NLP) systems. Most linguistically annotated text for Arabic NLP is in the news domain, but systems that rely on this data do not generalize well to other domains. We describe ongoing efforts to compile a dataset of 28 Arabic Wikipedia articles spanning four topical domains: sports, history, technology, and science. Each article in the dataset is annotated with three types of linguistic structure: named entities, syntax, and lexical semantics. We adapted traditional approaches to linguistic annotation in order to make them accessible to our annotators (undergraduate native speakers of Arabic) and to better represent the important characteristics of the chosen domains.

For the named entity (NE) annotation, we start with the task of marking boundaries of expressions in the traditional Person, Location and Organization classes. However, these categories do not fully capture the important entities discussed in domains like science, technology, and sports. Therefore, where our annotators feel that these three classes are inadequate for a particular article, they are asked to introduce new classes. Our data analysis indicates that both the designation of article-specific entity classes and the token-level annotation are accomplished with a high level of inter-annotator agreement.
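As an illustration of how token-level agreement between two annotators might be quantified, the following is a minimal sketch using Cohen's kappa. The abstract does not state which agreement measure was used, so the choice of kappa, the function name `cohen_kappa`, and the example tag sequences are all assumptions made for illustration only.

```python
from collections import Counter


def cohen_kappa(tags_a, tags_b):
    """Cohen's kappa for two equal-length sequences of token-level tags.

    Hypothetical helper: the agreement measure used in the study is not
    specified in the abstract; kappa is assumed here as a common choice.
    """
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("tag sequences must be non-empty and of equal length")

    n = len(tags_a)
    # Observed agreement: fraction of tokens given the same tag by both annotators.
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n

    # Chance agreement: product of each annotator's marginal tag frequencies.
    freq_a = Counter(tags_a)
    freq_b = Counter(tags_b)
    expected = sum(freq_a[t] * freq_b[t] for t in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical tag throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up example: two annotators tagging the same six tokens.
annotator_1 = ["O", "PER", "PER", "O", "LOC", "O"]
annotator_2 = ["O", "PER", "O", "O", "LOC", "O"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Kappa corrects raw agreement for the agreement expected by chance, which matters in NE annotation because most tokens fall outside any entity and inflate simple percentage agreement.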

Syntax is our most complex annotation layer: it includes morphological information, part-of-speech tags, syntactic governance, and dependency roles of individual words. While following a standard annotation framework, we perform quality control by evaluating inter-annotator agreement and by eliciting new annotations for previously annotated sentences so that the results can be compared.

The lexical semantics annotation consists of supersense tags, coarse-grained representations of noun and verb meanings. The 30 noun classes include person, quantity, and artifact; the 15 verb tags include motion, emotion, and perception. These classes provide a middle-ground abstraction of the large semantic space of the language. We have developed a flexible web-based interface, which allows annotators to review preprocessed text and add the semantic tags.

Ultimately, these linguistic annotations will be publicly released, and we expect that they will facilitate NLP research and applications for an expanded variety of text domains.
