Frequency tree for clustering categorical data with similarity measure based on items' weights

Reem Bahgat; Mahmoud A. Mahdi; Samir E. Abdel-Rahman; Ismail A. Ismail

doi:10.5339/qfarf.2012.AESNP4

Abstract

Clustering is an important data mining technique that groups similar data records. In this research, the problem of clustering categorical (or transactional) data is studied. An algorithm for clustering transactional data, called F-Tree (frequent tree), is proposed, and a new similarity measure function, called PWO (probability of the weights of overlapped items), is introduced. Then, a framework to detect the best similarity value for different datasets is developed. Based on the items' frequencies in transactions, the F-Tree algorithm first generates small but highly pure clusters. It then merges similar clusters using the PWO similarity measure function, which depends on the probability of weights of overlapped items. One advantage of PWO is that it outputs the percentage of similarity between items; this is a clearer and more accurate measure of similarity than other similarity measures. Our experimental evaluation on real categorical datasets such as (Mushrooms, KrVskp, Congressional Voting, Soybean-Large, Soybean-Small, Hepatitis, Zoo, Lenses, and Adult-Stretch) shows that: Firstly, F-Tree is effective in finding rare or interesting clusters, and produces clusters with higher purity than LargeItem, WCD and CLOPE. Secondly, PWO is more effective in measuring the similarity between categorical data than LargeItem and CLOPE. Thirdly, clustering using the similarity measure PWO with pre-defined number of clusters results in separate classes with a good purity of average 80% coverage of real classes. Fourthly, the overlap estimator perfectly estimates the value of the overlap threshold using a small sample of dataset of around 5%. Finally, it is seen that the process of merging pure and small clusters increases the purity of resulted clusters as well as reduces time of clustering better than the process of generating clusters directly from the dataset then refining clusters. Based on the items' frequencies in transactions, The F-Tree algorithm first generates small but highly pure clusters. It then merges similar clusters using the PWO similarity measure function, which depends on the probability of weights of overlapped items. One advantage of PWO is that it outputs the percentage of similarity between items; this is a clearer and more accurate measure of similarity than other similarity measures. Our experimental evaluation on real categorical datasets such as "Mushrooms, KrVskp, Congressional Voting, Soybean-Large, Soybean-Small, Hepatitis, Zoo, Lenses, and Adult-Stretch" shows that: Firstly, F-Tree is effective in finding rare or interesting clusters, and produces clusters with higher purity than LargeItem, WCD and CLOPE. Secondly, PWO is more effective in measuring the similarity between categorical data than LargeItem and CLOPE. Thirdly, Clustering using the similarity measure PWO with pre-defined number of clusters results in separate classes with a good purity of average 80% coverage of real classes. Fourthly, the overlap estimator perfectly estimates the value of the overlap threshold using a small sample of dataset of around 5 percent. Finally, It is seen that the process of merging pure and small clusters increases the purity of resulted clusters as well as reduces time of clustering better than the process of generating clusters directly from the dataset then refining clusters.

oa Frequency tree for clustering categorical data with similarity measure based on items' weights

Abstract

Most Read This Month

Most Cited Most Cited RSS feed

Barriers and facilitators influencing the physical activity of Arabic adults: A literature review

Osteoporosis: An under-recognized public health problem

E-learning in Saudi Arabia: Past, present and future

Association of erythrocytes antioxidant enzymes and their cofactors with markers of oxidative stress in patients with sickle cell anemia

Qatar’s economy: Past, present and future