Using Big Data Analytics With Graql to Determine Love or Hate
Using Big Data Analytics With Graql to Determine Love or Hate
Look at an example of a dataset containing people and their relations. Learn how to query it to see who a particular person loves, who the most hated person is, and more!
Join the DZone community and get the full member experience.Join For Free
Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.
- Grakn, a distributed hyper-relational database for knowledge-oriented systems. Grakn enables machines to manage complex data that serves as a knowledge base for cognitive/AI systems.
- Graql is Grakn’s reasoning and analytics query language. Graql is a much higher level abstraction over traditional query language — SQL, NoSQL, or graphs.
Most AI applications involve complex computation on big data. For example, training a neural network requires the system to feed all the data into the model and to update its parameters iteratively. Analytics, a component of Graql, is designed specifically for carrying out these complex operations on big data.
OLTP vs. OLAP
So how does Analytics do it? How is it different from executing a regular Graql query? In general, we can divide data processing into two types: online transaction processing (OLTP) and online analytical processing (OLAP). Let me explain the difference using an example.
Imagine you have a social network dataset containing people and relations between them. For simplicity, let's assume there are only two types of relations in the data: loves and hates. Both relations are undirected, so if Person X hates Person Y, Person Y also hates Person X back. The dataset is not big — say only a few million people and about the same amount of relations. Now, given this dataset, we can do something interesting on it.
To begin with, we would like to query the data regarding a particular person. For example, let's assume we have someone called Jason in the dataset, and we want to find all the people that Jason loves.
Naturally, we want to locate the entity or node representing Jason first, then follow all the "loves" relations. A really stupid way to start the query is going through all the vertices we have in the database and checking whether they represent Jason. This can be quite a slow process when there are millions of entities in the data.
Fortunately, with Graql, we don't have to do this to find Jason, thanks to our clever indexes. Given the type (person) and value (Jason) of the entity, we can quickly locate Jason. From Jason, we follow all his relations and find the person(s) he loves. Note that to find the answer, we only need to start from one person, who serves as a "single source of truth." This type of operation is called OLTP. It is used when the user only cares about a small subset of the data. Therefore, OLTP queries are usually fast, as the system only needs to read a small fraction of the data. Most regular Graql queries fall into this category.
Let's take this one step further. We want to find out who's the most hated person. You can imagine that to do this, we probably need to find out the number of relations "hates" for each person in the network. There is no "single source of truth." Therefore, in contrast to the OLTP query, we need to start from every person in our dataset this time. This type of query is called OLAP query. In Graql, OLAP is executed by the Analytics module.
As you can probably guess, OLAP usually takes longer than OLTP because it can involve all the data in the database. Typical applications of OLAP include business and management reporting, budgeting and forecasting, data mining, and knowledge discovery. To carry out OLAP, we need some big data framework, unless we want to reinvent the wheels.
Big Data Frameworks: Pregel vs. MapReduce
Many people think of Hadoop and its MapReduce engine when it comes to big data frameworks. Indeed, together with HDFS — its distributed framework — Hadoop makes big data on large distributed datasets possible. It has incredible scalability potential and has been used in production in many different fields, such as biomedics, finance, and physics.
However, when it comes to highly connected data, MapReduce might not be the best solution. In such data, graph or network structure is used to model the data, so entities in the database can be linked together directly and in many cases retrieved with one operation. MapReduce is essentially functional. Expressing a graph algorithm as a chained MapReduce requires passing the entire state of the graph from one stage to the next.
To solve the problem of MapReduce, engineers from Google created Pregel. It is the framework Google use to perform OLAP on large graph data. In Pregel, everything is either a vertex or an edge. Programs are expressed as a sequence of iterations, in each of which a vertex can receive messages sent in the previous iteration and send messages to other vertices. Pregel keeps vertices and edges on the machine that performs computations. Only messages need to be transferred over the network. This leads to much better performance for graph data. In Graql Analytics, we follow the idea of Pregel to build scalable and efficient OLAP methods.
The art of Pregel is that everything can be done via message passing and some local computation on the vertex. I found it hard to believe at the beginning. However, once I started to think from the perspective of vertices and edges, message passing becomes quite natural.
Let's look at a simple example first: degree. In hyper-relational databases, we sometimes want to compute how many relations each entity has. This can help us understand how connected our data is, identify super nodes, or (following our social network example) identify who's the most hated person in the network.
The Pregel program takes two iterations to compute the degrees. In the first iteration of the program, we go to every person vertex and send a message via each of its hates edges. In the next iteration, each person vertex simply counts the number of messages it has received. This number is the degree. After this, we can easily find out who is the most hated person or, on average, how many people we hate.
This is pretty straightforward, right? If you are a Graql user, you can use our degree method simply by executing the following Graql command:
compute degrees of person in hates
Transforming Graph Algorithms
Let's look at a more complex problem that can show the true strength of Pregel. Still based on the same social network data, we'd like to find out the groups of people that are connected by love. Formally, this means every pair of people in the same group is connected by relations "loves," either directly or through other people. For example, Sheldon loves Felix, and Felix loves Filipe, so we say Filipe and Sheldon are also connected.
This is a common problem called "connected component" in graph theory. Let's solve it using message passing. For simplicity, let's assume every person in the database has a unique integer ID. In the first iteration, every person simply sends their own ID via relation "loves" to their neighbors. From the second iteration, each vertex sends out the largest ID it has received in all previous iterations and saves this ID on the vertex as its group ID. The program terminates when every person's group ID stops changing.
The idea here is that if two people are connected by love, they would eventually have the same group ID. So when the program terminates, all persons connected by love will have the same group ID.
Of course, if you use Graql, you don't have worry about these implementation details. All you need is the following Graql command:
compute cluster in person, loves;
Spark for Better Performance
As you may notice, the Pregel program for "connected component" may take many iterations. Under the hood, we use Spark to make this iterative process fast and efficient. Apache Spark is a massively parallel computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance centered on its in-memory caching abstraction. This makes Spark ideal for tasks where multiple operations access the same input data.
If you enjoyed this article, please share it! Also, please get in touch if you have any questions or comments.
Published at DZone with permission of Jason Liu , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.