Keyphrase extraction is the process of identifying the set of words or phrases that best describes a document. The phrases may be drawn from the document's own words, or they may be external, taken from an ontology for a given domain. Extracting keyphrases from documents is critical for many applications, such as information retrieval, document summarization, and clustering. Many keyphrase extractors treat the task as a classification problem and therefore need training documents (i.e., documents whose keyphrases are known in advance). Other systems treat keyphrase extraction as a ranking problem: the words or phrases of a document are ranked by importance, and the highest-ranked phrases (those at the top of the list) are recommended as possible keyphrases for the document.

This abstract presents Shihab, a system for extracting keyphrases from Arabic documents. Shihab treats keyphrase extraction as a ranking problem. It generates the list of keyphrases by clustering the phrases of a document. Phrases are built from words that appear in the document and consist of one, two, or three words. The idea is to group similar phrases into one cluster. The similarity between two phrases is determined by calculating the Dice coefficient of their contexts, where a phrase's context is the sentence in which the phrase appears. Agglomerative hierarchical clustering is used in the clustering phase. Once the clusters are formed, each cluster nominates one phrase to the set of candidate keyphrases. This phrase is called the cluster representative and is selected according to a set of heuristics. Shihab's results were compared with those of existing keyphrase extractors such as KP-Miner and Arabic-KEA, and the results were encouraging.
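
The pipeline described above can be illustrated with a minimal Python sketch. It is not Shihab's implementation, and several details are assumptions made only for illustration: a phrase's context is taken to be the set of words in every sentence containing that phrase, clusters are merged by average linkage until the best similarity falls below a hypothetical threshold of 0.5, and the cluster representative is chosen by a placeholder heuristic (most frequent phrase, with longer phrases winning ties) because the abstract does not list the actual heuristics. All function names are hypothetical.

```python
import re
from collections import defaultdict
from itertools import combinations


def split_sentences(text):
    """Split text on sentence-final punctuation (Latin and Arabic marks)."""
    return [s.strip() for s in re.split(r"[.!?؟]+", text) if s.strip()]


def phrase_contexts(text, max_len=3):
    """Collect every 1- to max_len-word phrase with its context and frequency.

    Assumption: a phrase's context is the set of words of all sentences
    in which the phrase occurs.
    """
    contexts = defaultdict(set)
    counts = defaultdict(int)
    for sentence in split_sentences(text):
        words = re.findall(r"\w+", sentence)
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                phrase = " ".join(words[i:i + n])
                contexts[phrase].update(words)
                counts[phrase] += 1
    return contexts, counts


def dice(a, b):
    """Dice coefficient of two sets: 2|A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def cluster_phrases(contexts, threshold=0.5):
    """Naive agglomerative clustering over Dice similarity.

    Assumptions: average linkage, and merging stops once the most
    similar pair of clusters scores below `threshold`.
    """
    clusters = [[p] for p in contexts]

    def similarity(c1, c2):
        pairs = [(p, q) for p in c1 for q in c2]
        return sum(dice(contexts[p], contexts[q]) for p, q in pairs) / len(pairs)

    while len(clusters) > 1:
        best_sim, (i, j) = max(
            (similarity(clusters[i], clusters[j]), (i, j))
            for i, j in combinations(range(len(clusters)), 2)
        )
        if best_sim < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into i
        del clusters[j]                          # j > i, so index i stays valid
    return clusters


def representative(cluster, counts):
    """Placeholder heuristic: most frequent phrase, longer phrase on ties."""
    return max(cluster, key=lambda p: (counts[p], len(p.split()), p))
```

To use the sketch, call `phrase_contexts` on a document, pass the resulting contexts to `cluster_phrases`, and apply `representative` to each cluster to obtain the candidate keyphrases. The pairwise search makes this version suitable only for short texts; a real system would also need Arabic-specific preprocessing, which is outside the scope of this sketch.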

