Partition Clustering Techniques for Big LIDAR Dataset

Ahmad Q. Al Shami

doi:10.5339/qfarc.2016.ICTPP1880

Abstract

I. Abstract: Smart cities collect and produce massive amounts of data from sources such as local weather stations, LIDAR, mobile phone sensors and the Internet of Things (IoT). To turn such large volumes of data into everyday computing benefits, urban data must be stored and analysed with readily available computing resources and algorithms. This is problematic because of many challenges. This article explores some of these challenges and tests the performance of two partitional algorithms for clustering Big LIDAR datasets. Two readily available clustering algorithms, K-Means and Fuzzy c-Means (FCM), were put to the test to assess their suitability for clustering such a large dataset. The purpose of clustering urban data is to categorise it into homogeneous groups according to specific attributes. Clustering Big LIDAR Data into a compact format that represents the information of the whole dataset lets researchers work with the reorganised data much more efficiently. To this end, the two techniques were applied to a large LIDAR dataset to show how they perform on the same hardware set-up. Our experiments conclude that FCM outperformed K-Means on this type of dataset; however, the latter is lighter on hardware utilisation.

II. Introduction: Recent and ongoing research and development in computation and data storage technologies have contributed to the Big Data phenomenon. The challenges of Big Data are due to the 5 V's: Volume, Velocity, Variety, Veracity and the Value to be gained from the analysis of Big Data [1]. A survey of the literature shows agreement among data scientists about the general attributes that characterise Big Data, which can be summed up as follows. The data is very large, mainly in the range of terabytes, petabytes or exabytes (Volume).
Data can be found in structured, unstructured and semi-structured forms (Variety). Data is often incomplete or inaccessible, so data sets should be extracted from reliable and verified sources (Veracity). Data can be streaming at very high speed (Velocity). Data can be very complex, with high dimensionality and interrelationships between different elements. The challenges of Big Data are ongoing and the problem grows every year. A report by Cisco estimated that by the end of 2017 annual global data traffic would reach 7.7 zettabytes, that global internet traffic would triple over the following five years, and that global data traffic overall would grow at a Compound Annual Growth Rate (CAGR) of 25% through 2017. It is essential to take steps toward tackling these challenges, because a day may come when today's Big Data tools become obsolete in the face of such an enormous data flow.

III. Clustering Methods: Researchers deal with many types of large datasets. The question is whether to introduce new algorithms, to adapt existing algorithms to large datasets, or to adapt the data itself to suit the available algorithms. Currently, two approaches are predominant. The first, known as "Scaling-Up", focuses on enhancing the available algorithms. This approach risks the algorithms becoming useless tomorrow as the data continues to grow, so dealing with continuously growing datasets means scaling up the algorithms again and again over time. The second approach is to "Scale-Down", or skim, the data itself and to run existing algorithms on the reduced version of the data. This article focuses on the scale-down of data sets by comparing clustering techniques.
Clustering is defined as the process of grouping a set of items or objects so that those with the same attributes or characteristics fall into the same group, called a cluster, and differ from the objects in other groups. A useful clustering offers between-cluster separation, within-cluster homogeneity and a good representation of the data by its centroids. Clustering is applied in many fields: in Biology to find groups of genes that have the same functions or similarities, in Medicine to find patterns in the symptoms of disease, and in Business to find and target potential customers.

IV. Compared Techniques K-Means vs. Fuzzy c-Means: To highlight the advantages of everyday computing for Big Data, this article compares two popular and computationally attractive partitional techniques, which can be explained as follows. 1) K-Means Clustering: This is a widely used clustering algorithm. It partitions a data set into K clusters (C1, C2, ..., CK), each represented by its arithmetic mean, called the "centroid", which is calculated as the mean of all data points (records) belonging to that cluster. 2) Fuzzy c-Means Clustering: FCM was introduced by Bezdek et al. and is derived from the K-Means concept described above. It differs in that an object may belong to more than one cluster, each with a degree of membership, which is also calculated on the basis of the distances (usually Euclidean) between the data points and the cluster centres.

V. Experiments Set-up: The experiments compare and illustrate how the candidate K-Means and FCM clustering techniques cope with clustering a Big LIDAR dataset on readily available computer hardware. The experiments were performed using an AMD8320, 4.1 GHz, 8-core processor with 8 GB of RAM, running a 64-bit Windows 8.1 OS.
The algorithms were run against a set of LIDAR data points taken for our campus location at Latitude 52.23–52.22 and Longitude 1.335–1.324. This location represents the International University of Sarajevo main campus with an initialization of 1000000 × 1000 digital surface data points. Both clustering techniques were applied to the dataset starting with a small cluster number, K = 5, which was gradually increased to K = 25 clusters.

VI. Conclusions: The lowest time measured for FCM to regroup the data into 5 clusters was 42.18 seconds, while K-Means took 161.17 seconds to form the same number of clusters. The highest time recorded for K-Means to converge was 484.01 seconds, while FCM took 214.15 seconds to cluster the same dataset. There is a strong positive correlation between running time and the number of clusters assigned: as the cluster count increases, so does the running time of both algorithms. On average FCM used between 5 and 7 of the eight available cores, with 63.2 percent of the CPU processing power and 77 percent of the RAM. K-Means, on the other hand, used between 4 and 6 cores and left the rest idle, with an average of 37.4 percent of the CPU processing power and 47.2 percent of the RAM. Overall, both algorithms scale to deal with Big Data, but FCM is the faster of the two and would make an excellent clustering algorithm for everyday computing. It would also offer added advantages such as the ability to handle different data types. In addition, because of its fuzzy partitioning capability, FCM could produce a better quality of clustering output, which could benefit many data analysts.
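
The two techniques described in Section IV can be illustrated with a minimal pure-Python sketch. This is not the implementation used in the experiments, which the abstract does not describe. The function names, the optional init argument, the fuzzifier m = 2 and the stopping tolerance are all assumptions made for illustration, and a practical run on a large LIDAR dataset would need a vectorised numerical library rather than plain lists.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def weighted_mean(points, weights):
    """Weighted mean of a list of equal-length tuples."""
    total = sum(weights)
    dims = range(len(points[0]))
    return tuple(sum(w * p[d] for p, w in zip(points, weights)) / total for d in dims)


def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))


def kmeans(points, k, init=None, iters=100, seed=0):
    """Hard partitioning: every point belongs to exactly one cluster."""
    # Assumption: random initial centroids unless the caller supplies them.
    centroids = list(init) if init else random.Random(seed).sample(points, k)
    for _ in range(iters):
        labels = [nearest(p, centroids) for p in points]
        new = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old centroid if a cluster happens to be empty.
            new.append(weighted_mean(members, [1.0] * len(members)) if members else centroids[j])
        if new == centroids:  # assignments stable, so the centroids stopped moving
            break
        centroids = new
    return centroids, [nearest(p, centroids) for p in points]


def memberships(points, centroids, m):
    """u[i][j]: degree to which point i belongs to cluster j (each row sums to 1)."""
    u = []
    for p in points:
        # Squared distances are floored to avoid division by zero.
        inv = [max(sq_dist(p, v), 1e-12) ** (-1.0 / (m - 1.0)) for v in centroids]
        total = sum(inv)
        u.append([x / total for x in inv])
    return u


def fuzzy_c_means(points, c, m=2.0, init=None, iters=100, tol=1e-9, seed=0):
    """Soft partitioning: every point has a membership degree in every cluster."""
    centroids = list(init) if init else random.Random(seed).sample(points, c)
    for _ in range(iters):
        u = memberships(points, centroids, m)
        # Each centroid is the mean of all points, weighted by membership ** m.
        new = [weighted_mean(points, [row[j] ** m for row in u]) for j in range(c)]
        shift = max(sq_dist(a, b) for a, b in zip(new, centroids))
        centroids = new
        if shift < tol:
            break
    return centroids, memberships(points, centroids, m)
```

Calling kmeans(points, 5) returns five centroids and one hard label per point, while fuzzy_c_means(points, 5) returns five centroids and, for each point, a list of five membership degrees. In both cases the centroids serve as the compact representation of the data that the scale-down approach relies on.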
