sPCA: Scalable Principal Component Analysis for Big Data

Web sites, social networks, sensors, and scientific experiments today generate massive amounts of data, i.e., Big Data. Owners of this data strive to obtain insights from it, often by applying machine learning algorithms. Designing scalable machine learning algorithms that run on cloud computing infrastructure is therefore an important area of research. Many of these algorithms use the MapReduce programming model. In this poster presentation, we show that MapReduce machine learning algorithms often face scalability bottlenecks, commonly because the distributed MapReduce algorithms for linear algebra do not scale well. We identify several optimizations that are crucial for scaling various machine learning algorithms in distributed settings, and we apply these optimizations to the popular Principal Component Analysis (PCA) algorithm. PCA is an important tool in many areas, including image processing, data visualization, information retrieval, and dimensionality reduction.

We refer to the proposed optimized PCA algorithm as sPCA. sPCA is implemented in the MapReduce framework. It achieves scalability by employing efficient large-matrix operations, effectively leveraging matrix sparsity, and minimizing intermediate data. Experiments show that sPCA outperforms the PCA implementation in the popular Mahout machine learning library by wide margins in terms of accuracy, running time, and the volume of intermediate data generated during the computation. For example, on a 94 GB dataset of tweets from Twitter, sPCA achieves almost 100% accuracy and terminates in less than 10,000 s (about 2.8 hours), whereas the accuracy of Mahout PCA reaches only up to 70% after running for more than 259,000 s (about 3 days). In addition, both sPCA and Mahout PCA are iterative algorithms, in which accuracy improves with more iterations until a target accuracy is achieved. In our experiments, when we fix the target accuracy at 95%, Mahout PCA takes at least two orders of magnitude longer than sPCA to reach it. Furthermore, Mahout PCA generates about 961 GB of intermediate data, whereas sPCA produces about 131 MB of such data, a factor of 3,511x less. This means that, compared to Mahout PCA, sPCA achieves more than three orders of magnitude of savings in network and I/O operations, which enables sPCA to scale well.
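For readers unfamiliar with the sparsity trick the abstract alludes to, the sketch below illustrates one common way to leverage matrix sparsity in PCA: centering a sparse matrix explicitly would destroy its sparsity, so the covariance is instead assembled as the Gram matrix minus the outer product of the column means. This is a minimal single-machine illustration using NumPy/SciPy; the function name, toolchain, and toy data are our own assumptions, and sPCA's actual distributed MapReduce algorithm is not reproduced here.

```python
import numpy as np
from scipy import sparse

def sparsity_aware_pca(X, k):
    """Top-k principal components of a sparse matrix X (illustrative sketch).

    Avoids densifying X: cov = (X^T X)/n - mu mu^T equals the (biased)
    covariance of the centered data, so X never needs explicit centering.
    """
    n = X.shape[0]
    mu = np.asarray(X.mean(axis=0)).ravel()   # column means (small, dense)
    gram = (X.T @ X).toarray() / n            # X^T X stays cheap: X is sparse
    cov = gram - np.outer(mu, mu)             # covariance without centering X
    vals, vecs = np.linalg.eigh(cov)          # eigh returns ascending order
    idx = np.argsort(vals)[::-1][:k]          # keep the k largest
    return vecs[:, idx], vals[idx]

# Usage on a small random sparse matrix: 500 rows, 20 features, 5% nonzero
X = sparse.random(500, 20, density=0.05, format="csr", random_state=0)
components, variances = sparsity_aware_pca(X, 3)
```

The identity used here is exact: (1/n) Σ x xᵀ − μ μᵀ = (1/n) Σ (x − μ)(x − μ)ᵀ, so the result matches PCA on the explicitly centered dense matrix while the expensive product involves only the sparse X.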

