-
View Affiliations Hide Affiliations
- Publisher: Hamad bin Khalifa University Press (HBKU Press)
- Source: Qatar Foundation Annual Research Forum Proceedings, Qatar Foundation Annual Research Forum Volume 2011 Issue 1, Nov 2011, Volume 2011, CSP6
oa Hierarchical Clustering for Keyphrase Extraction from Arabic Documents Based on Word Context
Abstract
Keyphrase extraction is a process by which the set of words or phrases that best describe a document is specified. The phrases could be extracted from the document words itself, or they could be external and specified from an ontology for a given domain. Extracting keyphrases from documents is critical for many applications such as information retrieval, document summarization or clustering. Many keyphrase extractors view the problem as a classification problem and therefore they need training documents (i.e. documents which their keyphrases are known in advance). Other systems view keyphrase extraction as a ranking problem. In the latter approach, the words or phrases of a document are ranked based on their importance and phrases with high importance (usually located at the beginning of the list) are recommended as possible keyphrases for a document.
This abstract explains Shihab; a system for extracting keyphrases from Arabic documents. Shihab views keyphrase extraction as a ranking problem. The list of keyphrases is generated by clustering the phrases of a document. Phrases are built from words which appear in the document. These phrases consist of 1-, 2- or 3-words. The idea is to group phrases which are similar into one cluster. The similarity between phrases is determined by calculating the Dice value of their corresponding contexts. A phrase context is the sentence in which that phrase appears. Agglomerative hierarchical clustering is used in the clustering phase. Once the clusters are ready, then each cluster will nominate a phrase to the set of candidate keyphrases. This phrase is called cluster representative and is determined according to a set of heuristics. Shihab results were compared with other existing keyphrase extractors such as KP-Miner and Arabic-KEA and the results were encouraging.