A new generic approach for information extraction
- Publisher: Hamad bin Khalifa University Press (HBKU Press)
- Source: Qatar Foundation Annual Research Forum Proceedings, Volume 2012, Issue 1, Oct 2012, CSP17
Abstract
Automatic Information Extraction (IE) is a challenging task because it depends on expert skills and requires well-developed Natural Language Processing (NLP) algorithms. Moreover, IE is domain dependent and context sensitive. In this research, we present a general learning approach that can be applied to different types of events. We observed that even when a natural language text containing a target event appears unstructured, it may contain a segment that can be mapped automatically into a structured form. Segments representing the same kind of event share a similar structure, or pattern. Each pattern is an ordered sequence of named entities, keywords, and articulation words. Some generic named entities, such as organizations, persons, locations, and dates, together with grammatical annotations, are generated by an automatic part-of-speech identification tool. During the learning step, each relevant segment is manually annotated with the targeted entities (roles) that structure an event of the ontology. IE is then performed by associating a role with a specific entity: by aligning generic entities with specific entities, some strings of a text are annotated automatically. The alignment between patterns and a new text is not always guaranteed because of the diversity of writing styles found in news texts. For that reason, we propose soft matching between reduced formats, with the objective of making maximal use of pattern expressiveness. In several cases, this reduced format successfully allows the same role to be assigned to similar entities cited on the same side of a keyword or cue word. The experimental results are very promising: we obtained an average recognition rate of 76.90%.
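The idea of matching a learned pattern against a new segment can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the matching algorithm, so the sketch assumes that a "reduced format" means the tag sequence left after dropping untagged tokens, and it uses the standard library's `difflib.SequenceMatcher` as a stand-in for soft matching. The pattern, the role names, the tags, the 0.6 threshold, and the example sentence are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical learned pattern: an ordered sequence of (tag, role) pairs.
# A role of None marks a keyword or articulation word that carries no role.
PATTERN = [
    ("ORG", "acquirer"),
    ("KW:acquire", None),
    ("ORG", "target"),
    ("KW:for", None),
    ("MONEY", "price"),
    ("DATE", "date"),
]

# Hypothetical new segment, already tagged as (surface string, tag).
# "O" marks tokens that are neither entities nor keywords.
SEGMENT = [
    ("Acme Corp", "ORG"),
    ("said", "O"),
    ("it", "O"),
    ("will", "O"),
    ("acquire", "KW:acquire"),
    ("Beta Ltd", "ORG"),
    ("for", "KW:for"),
    ("$2 billion", "MONEY"),
]


def reduce_segment(segment):
    """Assumed reduction step: keep only entities and keywords."""
    return [(surface, tag) for surface, tag in segment if tag != "O"]


def soft_match(pattern, segment, threshold=0.6):
    """Align the pattern's tags with the reduced segment's tags.

    Returns (similarity, roles). Roles are assigned only when the
    similarity reaches the (assumed) threshold, and only for pattern
    positions that fall inside a matching block.
    """
    reduced = reduce_segment(segment)
    pattern_tags = [tag for tag, _ in pattern]
    segment_tags = [tag for _, tag in reduced]

    matcher = SequenceMatcher(None, pattern_tags, segment_tags, autojunk=False)
    similarity = matcher.ratio()

    roles = {}
    if similarity < threshold:
        return similarity, roles

    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            role = pattern[block.a + offset][1]
            if role is not None:
                roles[role] = reduced[block.b + offset][0]
    return similarity, roles


if __name__ == "__main__":
    similarity, roles = soft_match(PATTERN, SEGMENT)
    print(f"similarity = {similarity:.2f}")
    for role, value in roles.items():
        print(f"{role}: {value}")
```

In this sketch the segment still matches the pattern although it contains extra words ("said it will") and lacks a date, which is the kind of tolerance to writing-style variation that soft matching is meant to provide. A real system would need a matching criterion and threshold tuned on annotated data.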