Large-scale graph processing is becoming central to our modern life. For instance, graph pattern matching (GPM) can be utilized to search and analyze social graphs, biological data and road networks, to mention a few. Conceptually, a GPM algorithm is typically defined in terms of subgraph isomorphism, whereby it seeks to find subgraphs in an input data graph, G, which are similar to a given query graph, Q. Although subgraph isomorphism forms a uniquely important class of graph queries, it is NP-complete and very restrictive in capturing sensible matches for emerging applications like software plagiarism detection, protein interaction networks, and intelligence analysis, among others. Consequently, GPM has been recently relaxed and defined in terms of graph simulation. As opposed to subgraph isomorphism, graph simulation can run in quadratic time, return more intuitive matches, and scale well with modern big graphs (i.e., graphs with billions of vertices and edges). Nonetheless, the current state-of-the-art distributed graph simulation systems still rely on graph partitioning (which is also NP-complete), induce significant communication overhead between worker machines to resolve local matches, and fail to adapt to various complexities of query graphs.

In this work, we observe that big graphs are not big data. That is, the largest big graph that we know of can still fit on a single physical or virtual disk (e.g., 6TB physical disks are cheaply available nowadays and AWS EC2 instances can offer up to 24 ×  2048GB virtual disks). However, since graph simulation requires exploring the entire input big graph, G, and naturally lacks data locality, existing memory capacities can get significantly dwarfed by G's size. As such, we propose GraphSim, a novel distributed and adaptive system for efficient and scalable graph simulation. GraphSim precludes graph partitioning altogether, yet still exploits parallel processing across cluster machines. In particular, GraphSim stores G at each machine but only matches an interval of G's vertices at the machine. All machines are run in parallel and each machine simulates its interval locally. Nevertheless, if necessary, a machine can inspect remaining dependent vertices in G to fully resolve its local matches without communicating with any other machine. Hence, GraphSim does not shuffle intermediate data whatsoever. In addition, it attempts not to overwhelm the memory of any machine via employing a mathematical model to predict the best number of machines for any given query graph, Q, based on Q's complexity, G's size and the memory capacity of each machine. Subsequently, GraphSim renders adaptive as well. We experimentally verified the efficiency and the scalability of GraphSim over private and public clouds using real-life and synthetic big graphs. Results show that GraphSim can outperform the current fastest distributed graph simulation system by several orders of magnitude.


Article metrics loading...

Loading full text...

Full text loading...

This is a required field
Please enter a valid email address
Approval was a Success
Invalid data
An Error Occurred
Approval was partially successful, following selected items could not be processed due to error