Abstract

Entity type recognition is used as a pre-processing step in common applications such as text summarization, document classification, and automatic answering of questions posed in natural language. Here, ‘entity’ refers to concrete and abstract objects identified by proper and common nouns. Entity recognition focuses on detecting instances of types such as person, location, and organization. For example, an entity recognizer would take as input:

. and output:

.

The task can be performed by using machine learning techniques to train a system that recognizes entities with performance comparable to that of a human annotator. Several challenges make this problem hard: the lack of a large annotated training corpus, the impossibility of listing all entity types, and ambiguity in language. Existing entity recognizers perform this task, but only with fair performance. One way to improve the performance of an existing entity recognizer is feature engineering. We first determine which of the features already used in the recognizer affect performance most strongly. We do this by adding or removing one or more features at a time from the feature list, training a model on the training data with each feature set, and testing it to find out which features are important. We evaluate using precision, recall, and F-score (the harmonic mean of precision and recall). As a next step, we add new features, such as word clusters and bigram word features, to see whether they improve performance. Word clusters help when a word is missing from the training data but other words from the same cluster are present, which helps in tagging unseen words in the test set. We also vary the size of the training data to find out how it affects performance. Additionally, we look into Wikipedia as a source of additional features for the training data. Wikipedia has an elaborate internal link structure that can provide vital information about the category of a word. This category can be linked to a broader entity type.
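As a rough illustration of the evaluation metric and of the word-cluster and bigram features mentioned above, the following Python sketch may help. It is not the recognizer described in the abstract: the span representation, the function names, the cluster table, and the feature names are all hypothetical, and exact-match scoring is assumed because the abstract does not specify how matches are counted.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical span representation: (start_token, end_token, entity_type).
Span = Tuple[int, int, str]


def precision_recall_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match scores: a predicted span counts only if boundaries and type both match."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # F-score as the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def token_features(tokens: List[str], i: int, clusters: Dict[str, str]) -> Dict[str, str]:
    """Features for tokens[i]: word identity, word cluster, and a bigram with the previous token."""
    word = tokens[i].lower()
    previous = tokens[i - 1].lower() if i > 0 else "<s>"
    return {
        "word": word,
        # A word unseen in training can still share a cluster ID with seen words.
        "cluster": clusters.get(word, "UNKNOWN"),
        "bigram": previous + "_" + word,
    }
```

Under these assumptions, a feature ablation of the kind described would train one model per feature subset (for example, with and without the "cluster" key) and compare the resulting F-scores on held-out data.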

/content/papers/10.5339/qfarf.2010.CSPS2
2010-12-13
2024-03-29
References

  1. R. Bhowmick, M. Heilman, K. Oflazer, B. Mohit, N. Smith, Rich entity recognition in English text, QFARF Proceedings, 2010, CSPS2.
http://instance.metastore.ingenta.com/content/papers/10.5339/qfarf.2010.CSPS2