Human-annotated data is an important resource for most natural language processing (NLP) systems. Most linguistically annotated text for Arabic NLP is in the news domain, but systems that rely on this data do not generalize well to other domains. We describe ongoing efforts to compile a dataset of 28 Arabic Wikipedia articles spanning four topical domains: sports, history, technology, and science. Each article in the dataset is annotated with three types of linguistic structure: named entities, syntax, and lexical semantics. We adapted traditional approaches to linguistic annotation to make them accessible to our annotators (undergraduate native speakers of Arabic) and to better represent the important characteristics of the chosen domains.

For the named entity (NE) annotation, we start with the task of marking boundaries of expressions in the traditional Person, Location and Organization classes. However, these categories do not fully capture the important entities discussed in domains like science, technology, and sports. Therefore, where our annotators feel that these three classes are inadequate for a particular article, they are asked to introduce new classes. Our data analysis indicates that both the designation of article-specific entity classes and the token-level annotation are accomplished with a high level of inter-annotator agreement.
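The abstract does not say which agreement measure was used for the token-level annotation. As an illustration only, the sketch below computes Cohen's kappa, one common chance-corrected measure, over two annotators' token labels. The function name, the BIO-style labels, and the example sequences are assumptions made for this sketch, not data from the project.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators' token labels.

    Illustrative helper (hypothetical name); the source does not specify
    which agreement statistic the authors computed.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: fraction of tokens given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled independently
    # according to their own label distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up example: two annotators label the same six tokens.
annotator_1 = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]
annotator_2 = ["B-PER", "I-PER", "O", "B-ORG", "B-LOC", "O"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Kappa is used here rather than raw percentage agreement because most tokens fall outside any entity, so two annotators would agree on many tokens by chance alone.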

Syntax is our most complex layer of linguistic annotation: it includes morphological information, part-of-speech tags, syntactic governance, and the dependency roles of individual words. While following a standard annotation framework, we perform quality control in two ways: by evaluating inter-annotator agreement, and by eliciting new annotations for previously annotated sentences so that the results can be compared.
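One simple way to compare two dependency annotations of the same sentence is to count how often they assign the same governor (head) to a token, and how often they assign the same governor and the same role. The sketch below assumes each annotation is a list of (head index, role) pairs, one per token, with 0 standing for the root. This representation, the function name, and the role labels are hypothetical, and the abstract does not state that the authors measured agreement this way.

```python
def attachment_agreement(parse_a, parse_b):
    """Return (unlabeled, labeled) agreement between two dependency annotations.

    Each parse is a list of (head_index, role) pairs, one per token.
    Illustrative helper under the assumptions stated above.
    """
    if len(parse_a) != len(parse_b) or not parse_a:
        raise ValueError("parses must be non-empty and cover the same tokens")
    n = len(parse_a)
    # Unlabeled: the two annotators chose the same governor for the token.
    same_head = sum(head_a == head_b
                    for (head_a, _), (head_b, _) in zip(parse_a, parse_b))
    # Labeled: same governor and same dependency role.
    same_both = sum(a == b for a, b in zip(parse_a, parse_b))
    return same_head / n, same_both / n


# Made-up four-token sentence annotated twice; role labels are placeholders.
first_pass = [(2, "SBJ"), (0, "ROOT"), (2, "OBJ"), (3, "MOD")]
second_pass = [(2, "SBJ"), (0, "ROOT"), (2, "MOD"), (2, "MOD")]
unlabeled, labeled = attachment_agreement(first_pass, second_pass)
print(f"unlabeled = {unlabeled:.2f}, labeled = {labeled:.2f}")
```

The same comparison applies whether the two annotations come from different annotators or from one annotator revisiting a sentence.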

The lexical semantics annotation consists of supersense tags, coarse-grained representations of noun and verb meanings. The 30 noun classes include person, quantity, and artifact; the 15 verb tags include motion, emotion, and perception. These classes provide a middle-ground abstraction of the large semantic space of the language. We have developed a flexible web-based interface, which allows annotators to review preprocessed text and add the semantic tags.

Ultimately, these linguistic annotations will be publicly released, and we expect that they will facilitate NLP research and applications for an expanded variety of text domains.

