Graph Degree Distributions Using R Over Hadoop
There are two common types of graph engines. One type is focused on providing real-time, traversal-based algorithms over linked-list graphs represented on a single server. Such engines are typically called graph databases, and vendors include Neo4j, OrientDB, DEX, and InfiniteGraph. The other type of graph engine is focused on batch processing using vertex-centric message passing within a graph represented across a cluster of machines. Graph engines of this form include Hama, GoldenOrb, Giraph, and Pregel.
The purpose of this post is to demonstrate how to express the computation of two fundamental graph statistics, each as a graph traversal and as a MapReduce algorithm. The graph engines explored for this purpose are Neo4j and Hadoop. However, with respect to Hadoop, instead of focusing on a particular vertex-centric, BSP-based graph-processing package such as Hama or Giraph, the results presented are via native Hadoop (HDFS + MapReduce). Moreover, instead of developing the MapReduce algorithms in Java, the R programming language is used. RHadoop is a small, open-source package developed by Revolution Analytics that binds R to Hadoop and allows MapReduce algorithms to be expressed in native R.
The two graph algorithms presented compute degree statistics: vertex in-degree and graph in-degree distribution. The two are related; in fact, the results of the first can be used as the input to the second. That is, graph in-degree distribution is a function of vertex in-degree. Together, these two fundamental statistics serve as a foundation for the more elaborate statistics developed in the domains of graph theory and network science.
- Vertex in-degree: how many incoming edges does vertex x have?
- Graph in-degree distribution: how many vertices have x incoming edges?
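To make the two definitions concrete, here is a minimal plain-Python sketch on a hypothetical five-edge toy graph. The edge list, vertex set, and function names are illustrative assumptions and are not taken from the article's dataset. Each edge is a (tail, head) pair, meaning tail -> head.

```python
from collections import Counter

# Hypothetical toy graph: each edge is a (tail, head) pair, i.e. tail -> head.
edges = [(1, 0), (2, 0), (2, 1), (3, 0), (3, 2)]
vertices = [0, 1, 2, 3]

def vertex_in_degree(vertices, edges):
    """Vertex in-degree: how many incoming edges does each vertex have?"""
    incoming = Counter(head for _tail, head in edges)
    # Vertices that never appear as a head have in-degree 0.
    return {v: incoming.get(v, 0) for v in vertices}

def in_degree_distribution(in_degree):
    """Graph in-degree distribution: how many vertices have x incoming edges?"""
    return dict(Counter(in_degree.values()))

degree_v = vertex_in_degree(vertices, edges)
degree_g = in_degree_distribution(degree_v)
print(degree_v)  # {0: 3, 1: 1, 2: 1, 3: 0}
print(degree_g)
```

The second function only ever looks at the output of the first, which is the sense in which the distribution is a function of vertex in-degree.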
These two statistics are computed over an artificially generated graph that contains 100,000 vertices and 704,002 edges. A subset is diagrammed on the left. The algorithm used to generate the graph is called preferential attachment. Preferential attachment yields graphs with “natural statistics”, that is, degree distributions analogous to those of real-world graphs/networks. The respective igraph R code is provided below. Once the graph is constructed and simplified (i.e. no more than one edge between any two vertices and no self-loops), the vertices and edges are counted. Next, the first five edges are displayed. The first edge reads, “vertex 2 is connected to vertex 0.” Finally, the graph is persisted to disk as a GraphML file.
~$ R
R version 2.13.1 (2011-07-08)
Copyright (C) 2011 The R Foundation for Statistical Computing
> library(igraph)
> g <- simplify(barabasi.game(100000, m=10))
> length(V(g))
[1] 100000
> length(E(g))
[1] 704002
> E(g)[1:5]
Edge sequence:
[1] 2 -> 0
[2] 2 -> 1
[3] 3 -> 0
[4] 4 -> 0
[5] 4 -> 1
> write.graph(g, '/tmp/barabasi.xml', format='graphml')
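The idea behind preferential attachment can be sketched in a few lines of plain Python. This is one common formulation (each new vertex links to earlier vertices with probability proportional to in-degree plus one) and is not igraph's implementation; the function name, the `urn` trick, and the `seed` parameter are assumptions made for illustration.

```python
import random

def preferential_attachment(n, m, seed=0):
    """Return directed (tail, head) edges of a simple preferential-attachment graph.

    Each new vertex v attaches to up to m distinct earlier vertices, chosen
    with probability proportional to (in-degree + 1).
    """
    rng = random.Random(seed)
    edges = []
    # 'urn' holds one entry per existing vertex plus one entry per incoming
    # edge, so a uniform draw from it is proportional to in-degree + 1.
    urn = [0]
    for v in range(1, n):
        want = min(m, v)  # cannot pick more distinct targets than exist
        targets = set()
        while len(targets) < want:
            targets.add(rng.choice(urn))
        for head in sorted(targets):
            edges.append((v, head))  # v -> head, head is always older than v
            urn.append(head)
        urn.append(v)  # v becomes eligible as a target for later vertices
    return edges

edges = preferential_attachment(n=1000, m=10)
print(len(edges))
```

Because targets are drawn as a set of earlier vertices, this sketch never produces self-loops or parallel edges, so no separate simplification step is needed here.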
Graph Statistics Using Neo4j
When a graph is on the order of 10 billion elements (vertices+edges), a single-server graph database is sufficient for performing graph analytics. As a side note, when those analytics/algorithms are “ego-centric” (i.e. when the traversal emanates from a single vertex or a small set of vertices), they can typically be evaluated in real time (e.g. < 1000 ms). To compute the in-degree statistics, Gremlin is used. Gremlin is a graph traversal language developed by TinkerPop that is distributed with Neo4j, OrientDB, DEX, InfiniteGraph, and the RDF engine Stardog. The Gremlin code below loads the GraphML file created by R in the previous section into Neo4j. It then counts the vertices and edges in the graph.
~$ gremlin
         \,,,/
         (o o)
-----oOOo-(_)-oOOo-----
gremlin> g = new Neo4jGraph('/tmp/barabasi')
==>neo4jgraph[EmbeddedGraphDatabase [/tmp/barabasi]]
gremlin> g.loadGraphML('/tmp/barabasi.xml')
==>null
gremlin> g.V.count()
==>100000
gremlin> g.E.count()
==>704002
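For readers who want to sanity-check a GraphML file outside of a graph database, the same vertex and edge counts can be obtained with Python's standard XML parser. The sketch below uses a hypothetical three-vertex document as a stand-in for /tmp/barabasi.xml; the sample content and the helper name are assumptions.

```python
import xml.etree.ElementTree as ET

# Namespace used by GraphML documents.
GRAPHML_NS = "{http://graphml.graphdrawing.org/xmlns}"

# Hypothetical three-vertex document standing in for /tmp/barabasi.xml.
SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph id="G" edgedefault="directed">
    <node id="n0"/>
    <node id="n1"/>
    <node id="n2"/>
    <edge source="n2" target="n0"/>
    <edge source="n2" target="n1"/>
  </graph>
</graphml>
"""

def count_elements(graphml_text):
    """Return (vertex_count, edge_count) for a GraphML document."""
    root = ET.fromstring(graphml_text)
    nodes = root.findall(f".//{GRAPHML_NS}node")
    edges = root.findall(f".//{GRAPHML_NS}edge")
    return len(nodes), len(edges)

print(count_elements(SAMPLE))  # (3, 2)
```

For a file on disk, `ET.parse(path).getroot()` could replace `ET.fromstring`; for very large files a streaming parser would be a better fit than loading the whole tree.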
The Gremlin code to calculate vertex in-degree is provided below. The first line iterates over all vertices and outputs each vertex and its in-degree. The second line adds a range filter in order to display only the first five vertices and their in-degree counts. Note that the clarifying diagrams demonstrate the transformations on a toy graph, not the 100,000-vertex graph used in the experiment.
gremlin> g.V.transform{[it, it.in.count()]}
...
gremlin> g.V.transform{[it, it.in.count()]}[0..4]
==>[v[1], 99104]
==>[v[2], 26432]
==>[v[3], 20896]
==>[v[4], 5685]
==>[v[5], 2194]
Next, to calculate the in-degree distribution of the graph, the following Gremlin traversal can be evaluated. This expression iterates through all the vertices in the graph, emits their in-degree, and then counts the number of times a particular in-degree is encountered. These counts are saved into an internal map maintained by groupCount. The final cap step yields the internal groupCount map. In order to display only the top five counts, a range filter is applied. The first line emitted says: “there are 52,611 vertices that do not have any incoming edges.” The second line says: “there are 16,758 vertices that have one incoming edge.”
gremlin> g.V.transform{it.in.count()}.groupCount.cap
...
gremlin> g.V.transform{it.in.count()}.groupCount.cap.next()[0..4]
==>0=52611
==>1=16758
==>2=8216
==>3=4805
==>4=3191
To calculate both statistics in one pass, feeding the results of the first computation into the second, the following traversal can be executed. This representation corresponds directly to how vertex in-degree and graph in-degree distribution are calculated using MapReduce (demonstrated in the next section).
gremlin> degreev = [:]
gremlin> degreeg = [:]
gremlin> g.V.transform{[it, it.in.count()]}.sideEffect{degreev[it[0]] = it[1]}.transform{it[1]}.groupCount(degreeg)
...
gremlin> degreev[0..4]
==>v[1]=99104
==>v[2]=26432
==>v[3]=20896
==>v[4]=5685
==>v[5]=2194
gremlin> degreeg.sort{a,b -> b.value <=> a.value}[0..4]
==>0=52611
==>1=16758
==>2=8216
==>3=4805
==>4=3191
Graph Statistics Using Hadoop
When a graph is on the order of 100+ billion elements (vertices+edges), a single-server graph database will not be able to represent or process the graph. A multi-machine graph engine is required. While native Hadoop is not a graph engine, a graph can be represented in its distributed file system (HDFS) and processed using its distributed processing framework (MapReduce). The graph generated previously is loaded into R and its vertices and edges are counted. Next, the graph is represented as an edge list. An edge list (for a single-relational graph) is a list of pairs, where each pair is ordered and denotes the tail vertex id and the head vertex id of an edge. The edge list can be pushed to HDFS using RHadoop. The variable edge.list represents a pointer to this HDFS file.
> library(igraph)
> library(rmr)
> g <- read.graph('/tmp/barabasi.xml', format='graphml')
> length(V(g))
[1] 100000
> length(E(g))
[1] 704002
> edge.list <- to.dfs(get.edgelist(g))
In order to calculate vertex in-degree, a MapReduce job is evaluated on edge.list. The map function is fed key/value pairs where the key is an edge id and the value is the ids of the tail and head vertices of the edge (represented as a list). For each key/value input, the head vertex (i.e. incoming vertex) is emitted along with the number 1. The reduce function is fed key/value pairs where the keys are vertices and the values are a list of 1s. The output of the reduce function is a vertex id and the length of the list of 1s (i.e. the number of times that vertex was seen as an incoming/head vertex of an edge). The results of this MapReduce job are saved to HDFS and degree.v is the pointer to that file. The final expression in the code chunk below reads the first key/value pair from degree.v: vertex 10030 has an in-degree of 5.
> degree.v <- mapreduce(edge.list, map=function(k,v) keyval(v[2],1), reduce=function(k,v) keyval(k,length(v)))
> from.dfs(degree.v)[[1]]
$key
[1] 10030
$val
[1] 5
attr(,"rmr.keyval")
[1] TRUE
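The map, group-by-key, and reduce phases described above can be imitated in memory with plain Python. The `run_mapreduce` helper below is a hypothetical stand-in written for this illustration; it is not the rmr API and does nothing distributed. The sample edge list reuses the first five edges printed earlier in the article.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Tiny in-memory stand-in for one MapReduce job (not the rmr API).

    records: iterable of (key, value) pairs.
    map_fn(key, value) and reduce_fn(key, values) each return one (key, value) pair.
    """
    groups = defaultdict(list)
    for key, value in records:
        out_key, out_value = map_fn(key, value)
        groups[out_key].append(out_value)  # the "shuffle": group values by key
    return [reduce_fn(key, values) for key, values in groups.items()]

# key = edge id, value = (tail, head); edges 2->0, 2->1, 3->0, 4->0, 4->1.
edge_list = [(0, (2, 0)), (1, (2, 1)), (2, (3, 0)), (3, (4, 0)), (4, (4, 1))]

degree_v = run_mapreduce(
    edge_list,
    map_fn=lambda k, v: (v[1], 1),             # emit (head vertex, 1)
    reduce_fn=lambda k, ones: (k, len(ones)),  # in-degree = number of 1s
)
print(sorted(degree_v))  # [(0, 3), (1, 2)]
```

One thing this sketch makes visible: a vertex with no incoming edges never appears as a key, so it produces no output pair at all. In this formulation, in-degree 0 is implied by absence rather than reported.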
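The graph in-degree distribution is then computed by a second MapReduce job over degree.v: its map function emits each in-degree along with the number 1, and its reduce function counts the 1s per in-degree. In concert, these two computations can be composed into a single MapReduce expression.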
> degree.g <- mapreduce(mapreduce(edge.list,
+                                 map=function(k,v) keyval(v[2],1),
+                                 reduce=function(k,v) keyval(k,length(v))),
+                       map=function(k,v) keyval(v,1),
+                       reduce=function(k,v) keyval(k,length(v)))
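The nesting in the expression above, where the output of one job is the input of the next, can be mirrored in plain Python. The block repeats a small in-memory `mapreduce` helper so that it stands alone; the helper and the toy edge list are assumptions for illustration and are not the rmr API.

```python
from collections import defaultdict

def mapreduce(records, map_fn, reduce_fn):
    """In-memory stand-in for a MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        out_key, out_value = map_fn(key, value)
        groups[out_key].append(out_value)
    return [reduce_fn(key, values) for key, values in groups.items()]

def count_values(key, values):
    """Shared reducer: the number of values seen for a key."""
    return key, len(values)

# Hypothetical toy edge list: key = edge id, value = (tail, head).
edge_list = list(enumerate(
    [(1, 0), (2, 0), (3, 0), (2, 1), (3, 1), (4, 2), (5, 2), (5, 3)]
))

degree_g = mapreduce(
    # Inner job: (edge id, (tail, head)) -> (vertex, in-degree)
    mapreduce(edge_list, lambda k, v: (v[1], 1), count_values),
    # Outer job: (vertex, in-degree) -> (in-degree, number of vertices)
    lambda k, v: (v, 1),
    count_values,
)
print(sorted(degree_g))  # [(1, 1), (2, 2), (3, 1)]
```

Both jobs share the same reducer; only the map function changes, from "emit the head vertex" to "emit the in-degree".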
Note that while a graph can be on the order of 100+ billion elements, the degree distribution is much smaller and can typically fit into memory. In general, in terms of size, edge.list > degree.v > degree.g. Because of this, it is possible to pull the degree.g file off of HDFS, place it into main memory, and plot the results stored within. The degree.g distribution is plotted on a log/log plot. As expected, the preferential attachment algorithm generated a graph with natural “scale-free” statistics: most vertices have a small in-degree and very few have a large in-degree.
> degree.g.memory <- from.dfs(degree.g)
> plot(keys(degree.g.memory), values(degree.g.memory), log='xy', main='graph in-degree distribution', xlab='in-degree', ylab='frequency')
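What a log/log plot does to the data can be sketched without any plotting library. The plain-Python block below converts a distribution into log-log coordinates and fits a least-squares line through them. The function names are assumptions, the sample values are the five counts reported in the Gremlin section, and the slope is only a crude visual aid, not a rigorous power-law fit.

```python
import math

def loglog_points(distribution):
    """Map {in_degree: frequency} to sorted (log10 degree, log10 frequency) pairs.

    Entries with in-degree 0 are dropped because log10(0) is undefined.
    """
    return [
        (math.log10(degree), math.log10(freq))
        for degree, freq in sorted(distribution.items())
        if degree > 0 and freq > 0
    ]

def slope(points):
    """Ordinary least-squares slope of y on x for a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx

# First five counts reported by the Gremlin traversal in the article.
distribution = {0: 52611, 1: 16758, 2: 8216, 3: 4805, 4: 3191}
points = loglog_points(distribution)
print(len(points), slope(points) < 0)  # 4 True
```

A roughly straight, downward-sloping line in these coordinates is the visual signature the article refers to as scale-free: frequency falls off as a power of in-degree.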
Published at DZone with permission of Marko Rodriguez, DZone MVB. See the original article here.