Practicing Graph Computation With GraphX in Nebula Graph
In this article, practice graph computation with GraphX in Nebula Graph.
Join the DZone community and get the full member experience.Join For Free
With the rapid development of network information technology, data is gradually developing towards multi-source heterogeneity. Inside the multi-source heterogeneous data lies countless inextricable relations. And this kind of relations, together with the network structure, are surely essential for data analysis. Unfortunately, when it comes to large scale data analysis, the traditional relational databases are poor in association detection and expressions. Therefore, graph data has attracted great attention for its powerful ability in expressions. Graph computing uses a graph model to express and solve the problem. Graphs can integrate with multi-source data types.
In addition to displaying the static basic features for data, graph computing also finds its chance to display the graph structure and relationships hidden in the data. Thus graph becomes an important analysis tool in social network, recommendation system, knowledge graph, financial risk control, network security, and text retrieval.
In order to support the business needs of large-scale graph computing, Nebula Graph provides PageRank and Louvain community-discovered graph computing algorithms based on GraphX. Users can execute these algorithm applications by submitting Spark tasks. In addition, users can also write Spark programs by using Spark Connector to call other built-in graph algorithms in GraphX, such as LabelPropagation, ConnectedComponent, etc.
PageRank is an algorithm raised by Google to rank web pages in their search engine results. PageRank is a way of measuring the importance of website pages.
Introduction to PageRank
PageRank was named after Larry Page and Sergey Brin from Stanford in the United States. PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.
Applications of PageRank
Content recommendation based on similarity
With the help of the PageRank, you can recommend similar content to users based on their browse history and view time when analyzing social applications such as Twitter and Facebook.
Analyze the social influence of users
You can analyze the social influence of users according to their PageRank values in social network analysis.
Importance research for papers
Judge a paper’s quality according to its PageRank value. Actually the PageRank algorithm comes from the idea of judging the quality of the papers. In addition, PageRank also finds its usage in data analysis and mining.
PageRank in GraphX is implemented based on the Pregel computing model. The algorithm contains three procedures:
- Set a same initial PageRank value for every vertex (web page) in the graph;
- The first iteration: Send a message along the edge. Each vertex receives all the vertices information along its related edges and gets a new PageRank value;
- The second iteration: Put the PageRank values you get in the first iteration to the formulas corresponding to different algorithm models and get the new PageRank values.
The Louvain method for community detection is a method to extract communities from large networks. The method is an aggregation algorithm for graphs.
Introduction to Louvain
The inspiration for Louvain is the optimization of modularity as the algorithm progresses. If a node joins a certain community and maximizes the modularity of the community compared to other communities, then the node belongs to the community. If a node doesn’t increase the modularity after joining other communities, it should stay in the current community.
The modularity Q measures the density of links inside communities compared to links between communities. Modularity is defined as:
Deformation for the Modularity Formula
In this formula, the formula is only meaningful when node i and node j belong to the same community. Therefore the formula measures the closeness within a certain community. The simplified deformation of this formula is as follows:
Calculate the Modularity Change
In the Louvain method, there is no need to calculate the specific modularity for each community. You only need to compare the modularity changes after adding a certain node to the community. That is to say you only need to calculate △Q.
When inserting node i to a certain community, the modularity change is:
Applications of Louvain
- Financial risk control
In financial risk control scenarios, you can use the method to identify fraud gangs based on the users behaviors.
- Social network
Louvain divides the social network based on the breadth and strength of nodes association in a graph. Louvain also analyzes complex networks and the closeness among a group of people.
- Recommendation system
Community detection based on the users’ interests. More accurate and effective customized recommendation can be implemented based on the community and collaborative filtering algorithm.
The Louvain method contains two stages. The procedure is actually the iteration of the two stages.
Stage 1: Continuously traverse the nodes in the graph, and compare the modularity changes introduced by nodes in each neighbor community. Then add a single node to the community that can maximize the modularity. (For example, node v is added to communities A, B, and C respectively. The modularity increments of the three communities are -1, 1, 2. Then node v should be added to community C.)
Stage 2: Process based on the first stage. The nodes belonging to the same community are merged into a super node to reconstruct the graph. That is, the community is regarded as a new node in the graph. At this time, the weight of the edges between the two super nodes are the sum of the weight of the edges attached to the original nodes in the two super nodes. That is, the sum of the weight of the edges in the two communities.
Following are details for the two stages.
In the first stage, the nodes are traversed and then added to the communities they belong to. In this way, we get the middle graph and the four communities.
In the second stage, the nodes in the community are merged into a super node. The community nodes have self-connected edges whose weight is twice the sum of the weights of the edges connected between all nodes in the community. The edge weight between the communities is the sum of the weights of the edges connecting the two communities. For example, the red community and the light green community are connected by (8,11), (10,11), (10,13). So the weight of the edges between the two communities is 3.
Note: The weight inside the community is twice the weight of the edges between all internal nodes, because the concept of Kin is the sum of the edges of all nodes in the community and node i. When calculating the Kin for a certain community, each edge is actually counted twice.
The Louvain method continuously iterates stage 1 and stage 2 until the algorithm is stable (the modularity for the graph does not change) or the iterations reach the maximum.
Practice the Proceeding Algorithms
- Three virtual machines:
- Cpu name: Intel(R) Xeon(R) Platinum 8260M CPU @ 2.30GHz
- Processors: 32
- CPU Cores: 16
- Memory Size: 128G
- Software environment:
- Spark: spark-2.4.6-bin-hadoop2.7 a cluster with three nodes
- yarn V2.10.0: a cluster with three nodes
- Nebula Graph V1.1.0: distributed deployed, used the default configurations
Dataset for Testing
- Create graph space
- Create schema
- Import data
Use Spark Writer to import data offline into Nebula Graph.
- Test results
The resource allocation fot the Spark job is
--driver-memory=20G --executor-memory=100G --executor-cores=3.
- PageRank algorithm execution time on a dataset with hundred million nodes is 21 minutes.
- Louvain algorithm execution time on a dataset with hundred million nodes is 1.3 hours.
How to Use Nebula Graph Algorithm
- Download and pack the nebula-algorithm project to a jar package.
- Configure the file
- Make sure that you have installed and started the Spark service on your machine.
- Submit the nebula algorithm application:
If you are interested in the content, welcome to try nebula-algorithm.
- Nebula Graph: https://github.com/vesoft-inc/nebula
- GraphX: https://github.com/apache/spark/tree/master/graphx
- nebula-algorithm: https://github.com/vesoft-inc/nebula-java/tree/master/tools/nebula-algorithm
You Might Also Like
Published at DZone with permission of Jamie Liu. See the original article here.
Opinions expressed by DZone contributors are their own.