In part 1 of this series, we looked at why graph databases are worth considering. In summary, graph databases can answer complex questions that relational databases can’t. In this installment, we will look at how the Cray Graph Engine (CGE) represents graphs and what makes it a robust solution for graph queries.
Conceptually, graphs are easy to think about—bubbles and arrows or bubbles and lines. They come in two main flavors, directed and undirected.
In directed graphs, the bubbles (vertices) are connected by an arrow (directed edge). They are used for relationships that are not symmetric.
In the above example, "parent-of" isn’t a symmetric relationship. If John is a parent of Susan, Susan can’t be a parent of John.
Undirected graphs are good for representing symmetric relationships. When we say "John is a friend-of Tom," we generally assume that Tom is a friend of John, also. Graphs with symmetric relationships are represented with lines or double arrows:
When you represent data inside a computer, though, you’ll be a little short of bubbles and arrows. You’ve got to use something else to represent components of the graph. In the Cray Graph Engine (CGE), we use RDF (Resource Description Framework) triples. They consist of three rather unwieldy character strings called the "subject," the "predicate," and the "object" of the triple. This fits the concept that every triple represents a subject-predicate-object fact stored in the database (those of us who received good grades in English would probably prefer "subject-verb-object," but let’s leave it alone).
For example, the graph component:
Would be represented inside the computer memory by three (subject-predicate-object) strings that would look something like the following:
You can tell from this example the three most important characteristics of RDF triples:
- They’re designed to be self-defining. You don’t have to look anywhere else to figure out what this triple is talking about.
- They’re designed to be unique across the entire Internet. That’s what all that "<http://cray.com/example…" is all about.
- As you can tell from how I represented the "John parent-of Susan" example, they’re inherently graph-oriented.
So RDF triples are graph oriented, but a bit verbose and clumsy. Be assured that we represent them more compactly in memory. Each of these strings is mapped to a unique integer.
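To make the string-to-integer mapping concrete, here is a minimal Python sketch of dictionary encoding. It is only an illustration of the general idea: the URIs are hypothetical placeholders, the `Dictionary` class is invented for this example, and nothing here describes CGE's actual in-memory layout.

```python
# Hypothetical URIs standing in for the subject, predicate, and object strings.
JOHN = "<http://example.org/people#John>"
PARENT_OF = "<http://example.org/relations#parentOf>"
SUSAN = "<http://example.org/people#Susan>"


class Dictionary:
    """Maps each distinct string to a unique integer and back (illustrative only)."""

    def __init__(self):
        self._ids = {}       # string -> integer
        self._strings = []   # integer -> string

    def encode(self, term):
        # Assign the next free integer the first time a string is seen.
        if term not in self._ids:
            self._ids[term] = len(self._strings)
            self._strings.append(term)
        return self._ids[term]

    def decode(self, term_id):
        return self._strings[term_id]


dictionary = Dictionary()
triple = (JOHN, PARENT_OF, SUSAN)

# The verbose triple becomes three small integers.
encoded = tuple(dictionary.encode(term) for term in triple)

# Decoding recovers the original strings when results are shown to the user.
decoded = tuple(dictionary.decode(term_id) for term_id in encoded)
```

With this kind of encoding, comparing two vertices or predicates is an integer comparison rather than a long string comparison, which is one reason compact representations of this sort are common in triple stores.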
Why did we pick RDF triples as our data representation? That’s simple: they come with SPARQL.
Here’s an example of relational tuples + SQL compared to RDF triples + SPARQL:
Suppose you had a relational database of cities, countries, and continents. It might look something like the following:
How might it look as RDF triples? Well, there wouldn’t be separate tables for cities and countries. You could have separate graphs of triples, but they’d all be triples:
Notice how in relational tuples, the "meaning" of a data value (what country, whether or not it’s the capital, etc.) is implicit in its position in the table, but in the RDF triples, the "meaning" is explicitly defined by the triple’s predicate.
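The difference between positional meaning and predicate meaning can be sketched in a few lines of Python. The rows, the column order, and the predicate names (`locatedIn`, `capitalOf`, `onContinent`) are all assumptions made up for this illustration, and short names stand in for full URIs.

```python
# Hypothetical relational rows: meaning depends on column position.
cities = [  # (city, country, is_capital)
    ("Paris", "France", True),
    ("Lyon", "France", False),
    ("Tokyo", "Japan", True),
]
countries = [  # (country, continent)
    ("France", "Europe"),
    ("Japan", "Asia"),
]


def rows_to_triples(cities, countries):
    """Turn both tables into one flat list of (subject, predicate, object) triples."""
    triples = []
    for city, country, is_capital in cities:
        # The second column becomes an explicit "locatedIn" predicate.
        triples.append((city, "locatedIn", country))
        # The boolean third column becomes a "capitalOf" fact only when true.
        if is_capital:
            triples.append((city, "capitalOf", country))
    for country, continent in countries:
        triples.append((country, "onContinent", continent))
    return triples


triples = rows_to_triples(cities, countries)
```

Every fact now names its own relationship, so there is no need to know which table or column a value came from.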
Now, suppose we wanted to ask the query:
Show me every capital city, its country, and the continent it’s on.
In SQL, the query would be expressed like this:
In SPARQL, you’d get a similar-looking query, except that it is obviously expressed in terms of triples:
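As a rough, runnable comparison, the sketch below answers the capital-city question twice: once with SQL through Python's built-in `sqlite3` module, and once with a tiny hand-written triple-pattern matcher that mimics how a SPARQL basic graph pattern joins on shared variables. The schema, predicate names, and the `match` and `query` helpers are assumptions for illustration only; they are not the tables or queries from the original example, and the matcher is not a SPARQL engine.

```python
import sqlite3

# --- Relational side: hypothetical schema, queried with SQL ---
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE country (name TEXT PRIMARY KEY, continent TEXT);
    CREATE TABLE city (name TEXT, country TEXT, is_capital INTEGER);
    INSERT INTO country VALUES ('France', 'Europe'), ('Japan', 'Asia');
    INSERT INTO city VALUES ('Paris', 'France', 1), ('Lyon', 'France', 0),
                            ('Tokyo', 'Japan', 1);
""")
sql_rows = conn.execute("""
    SELECT city.name, country.name, country.continent
    FROM city JOIN country ON city.country = country.name
    WHERE city.is_capital = 1
    ORDER BY city.name
""").fetchall()

# --- Triple side: the same facts, with hypothetical predicate names ---
triples = [
    ("Paris", "capitalOf", "France"),
    ("Lyon", "locatedIn", "France"),
    ("Tokyo", "capitalOf", "Japan"),
    ("France", "onContinent", "Europe"),
    ("Japan", "onContinent", "Asia"),
]


def match(pattern, binding):
    """Yield bindings that extend `binding` so that `pattern` matches a triple.

    Terms starting with '?' are variables; everything else must match exactly.
    """
    for triple in triples:
        new = dict(binding)
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against its earlier binding.
                if new.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield new


def query(patterns):
    """Join the patterns one after another on their shared variables."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for b2 in match(pattern, b)]
    return bindings


# Roughly: SELECT ?city ?country ?continent
#          WHERE { ?city capitalOf ?country . ?country onContinent ?continent }
results = query([
    ("?city", "capitalOf", "?country"),
    ("?country", "onContinent", "?continent"),
])
triple_rows = sorted((b["?city"], b["?country"], b["?continent"]) for b in results)
```

The SQL version joins two tables on a column; the triple version joins two patterns on the shared variable `?country`. Both produce the same rows.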
Well, if we chose RDF triples for the sake of SPARQL, why did we choose SPARQL as our query language?
- It’s graph oriented, since it’s aimed at RDF triples databases.
- It’s a fully specified query language, with a lot of the operators query writers expect: FILTER, GROUP BY, ORDER BY, UNION, OPTIONAL, etc.
- Syntactically, it’s quite similar to SQL, which a lot of people are familiar with.
- It’s a World Wide Web Consortium (W3C) standard, with a full suite of compliance tests.
We believed that we needed to go beyond SPARQL, though, to fully equip our graph database system with analytic capabilities. There are some types of important graph-oriented queries and analysis that SPARQL can’t express. For example, SPARQL can’t express an indefinite-length search such as "How many hops are on the shortest path from Vertex A to Vertex B?" The best SPARQL can do with a question like this is a series of fixed-length queries: "Can I reach B from A in one hop? Two hops? Three hops?" and so on.
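For contrast, a general-purpose language answers the hop-count question with a single breadth-first search that runs for as many hops as it takes. The following Python sketch uses a made-up edge list and a hypothetical `shortest_hops` helper; it illustrates the kind of open-ended search being described, not how CGE implements it.

```python
from collections import deque


def shortest_hops(edges, start, goal):
    """Return the number of hops on a shortest directed path, or None if unreachable."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)

    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        vertex, hops = queue.popleft()
        if vertex == goal:
            return hops
        # Breadth-first order guarantees the first arrival uses the fewest hops.
        for nxt in adjacency.get(vertex, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None


# Hypothetical directed graph: A -> C -> D -> B, plus a dead end A -> E.
edges = [("A", "C"), ("C", "D"), ("D", "B"), ("A", "E")]
hops_a_to_b = shortest_hops(edges, "A", "B")
hops_b_to_a = shortest_hops(edges, "B", "A")
```

The loop has no fixed depth: it keeps expanding the frontier until it reaches the goal or runs out of vertices, which is exactly what a series of one-hop, two-hop, three-hop queries cannot do in one step.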
For another example, there are classical graph algorithms, such as betweenness centrality, that treat the graph as a communication network, and analyze which vertices would be getting the most traffic through them. If you applied this to, say, a Facebook graph, you’d get an idea of which people were most central, and thus possibly most influential, among all their friends. But this requires a specialized statistical analysis of the vertices of the graph, which SPARQL just can’t express.
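To give a feel for what such an analysis involves, here is a compact Python sketch of betweenness centrality using Brandes' algorithm on an unweighted, undirected graph. The friend graph is invented for the example, and this is a textbook formulation, not a description of the implementation shipped with CGE.

```python
from collections import deque


def betweenness(adjacency):
    """Brandes' algorithm for an unweighted, undirected graph {vertex: [neighbors]}."""
    score = {v: 0.0 for v in adjacency}
    for source in adjacency:
        # Phase 1: BFS from the source, counting shortest paths (sigma)
        # and remembering each vertex's predecessors on those paths.
        order = []
        preds = {v: [] for v in adjacency}
        sigma = {v: 0 for v in adjacency}
        dist = {v: -1 for v in adjacency}
        sigma[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Phase 2: walk back from the farthest vertices, accumulating how much
        # of the source's shortest-path traffic passes through each vertex.
        delta = {v: 0.0 for v in adjacency}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: s / 2 for v, s in score.items()}


# Hypothetical friend graph: Bob is the only link between the other three.
friends = {
    "Ann": ["Bob"],
    "Bob": ["Ann", "Cat", "Dan"],
    "Cat": ["Bob"],
    "Dan": ["Bob"],
}
centrality = betweenness(friends)
```

In this small graph every shortest path between two of Ann, Cat, and Dan runs through Bob, so Bob gets all of the betweenness and the others get none. Note that the computation needs a traversal from every vertex plus a back-propagation step, which is the kind of whole-graph procedure a pattern-matching query language is not designed to express.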
We wanted a way to provide users with a library of classical graph algorithms incorporated in such a way that a graph algorithm could be nested inside a SPARQL query. To accomplish this, we extended SPARQL with two new operators, and changed the way an existing one could be used:
- Our new INVOKE operator calls a graph algorithm and hands it a graph to operate on, possibly along with some other arguments.
- A new PRODUCING operator expresses how the results of the graph algorithm will be turned back into data items SPARQL can work with. This allows graph algorithms to be completely nested inside a SPARQL query, in the sense that after the graph algorithm finishes, control can return to the SPARQL query processing.
- We extended the way that the existing CONSTRUCT operator can be used. In standard SPARQL, it’s used to produce a set of results that are structured as a graph—e.g., a set of RDF triples—and output it to the user. In our extension, a nested CONSTRUCT operator always precedes an INVOKE, and it is used to construct a temporary graph and pass it to the graph algorithm.
Here’s an example of how you might use the extended SPARQL syntax. Suppose you had a Facebook-like database in which friends were connected to friends. Now suppose you wanted to examine just those people between the ages of 40 and 50, and you wanted to know which clumps of mutually acquainted people, almost cliques, they divide into. This is an ideal application of a graph algorithm called "community detection," which basically divides a graph’s vertices into subsets that are more densely interconnected with each other than they are with vertices from other subsets.
The extended SPARQL query would look something like this:
Let’s break this query down into its components, so that we can explain each of them.
This is the final result that is being requested by the query: I want to end up with each person on a vertex of the graph, paired with the community to which that person has been assigned.
This is the embedded CONSTRUCT operator that builds the graph of friends-linked-to-friends that we want to analyze with the community detection algorithm.
Pick out all the pairs of friends, and get each of their ages.
The community detection algorithm expects integer weights on the edges of the graph it’s analyzing. In this example, we don’t consider any edge to be more important than any other, so we give them all a weight of 1.
For this query, we’re only interested in constructing a graph of friendships of people between the ages of 40 and 50, so this filters the rest of them out.
Here’s the new syntax we introduced. The INVOKE operator calls the community detection algorithm. It’s handed the graph that we built using the CONSTRUCT operator, and it’s also handed a scalar argument of 20. For the community detection algorithm, this means "only iterate the algorithm 20 times." The PRODUCING clause maps the outputs of the community detection algorithm back into SPARQL variables, so that the SPARQL code outside the nested community detection algorithm can work with these results. To write this PRODUCING clause correctly, we had to know that the community detection algorithm produces two vectors of results: the original vertex ID (?person), paired with the ID of the community that person was assigned to (?communityID). As you might guess, we document in the CGE User Guide what each graph algorithm produces as output. In this example, the two output vectors tied to ?person and ?communityID will be handed to the original SELECT clause and generated as the result of the query.
One last thing the query does before it returns the answers to the original SELECT clause: sort them by communityID. This is just a convenience, enabling the user to see all the vertices assigned to a particular community gathered together in the output. It also illustrates how a query can go from SPARQL to a graph algorithm and then back to a SPARQL operator.
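The overall flow of the query (select and filter, build a weighted graph, run a graph algorithm with an iteration cap, map the results back to variables, then sort) can be mimicked in plain Python. The sketch below is only an analogy: the data is invented, the age range is assumed to be inclusive, and the `detect_communities` function is a simple label-propagation stand-in, not the community detection algorithm that CGE provides.

```python
# Hypothetical data: ages and friendships in a small social graph.
ages = {"Ann": 42, "Bob": 45, "Cat": 48, "Dan": 41, "Eve": 44, "Fay": 49, "Gus": 30}
friendships = [
    ("Ann", "Bob"), ("Bob", "Cat"), ("Ann", "Cat"),
    ("Dan", "Eve"), ("Eve", "Fay"), ("Dan", "Fay"),
    ("Cat", "Gus"),
]


def construct_graph(friendships, ages, low, high):
    """Stand-in for the nested CONSTRUCT: keep in-range friendships, weight 1 each."""
    return [(a, 1, b) for a, b in friendships
            if low <= ages[a] <= high and low <= ages[b] <= high]


def detect_communities(weighted_edges, max_iterations):
    """Simple label propagation (an assumption, used here as a stand-in).

    Returns (person, community_id) pairs, like the two PRODUCING vectors.
    """
    neighbors = {}
    for a, weight, b in weighted_edges:
        neighbors.setdefault(a, []).append((b, weight))
        neighbors.setdefault(b, []).append((a, weight))
    vertices = sorted(neighbors)
    label = {v: i for i, v in enumerate(vertices)}
    for _ in range(max_iterations):
        changed = False
        for v in vertices:
            totals = {}
            for n, weight in neighbors[v]:
                totals[label[n]] = totals.get(label[n], 0) + weight
            # Heaviest neighboring label wins; ties go to the smallest label.
            best = min(totals, key=lambda lab: (-totals[lab], lab))
            if best != label[v]:
                label[v] = best
                changed = True
        if not changed:
            break
    return [(v, label[v]) for v in vertices]


graph = construct_graph(friendships, ages, 40, 50)   # CONSTRUCT + FILTER
produced = detect_communities(graph, 20)             # INVOKE(20) + PRODUCING
results = sorted(produced, key=lambda row: row[1])   # ORDER BY community ID
```

Here Gus is filtered out before the algorithm ever runs, and the two triangles of friends end up in two separate communities, listed together in the sorted output.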
To me, this is an example of the whole being greater than the sum of its parts. You can use the powerful selection and filtering capabilities of a SPARQL query to pick out just the graph that you want to analyze, and then invoke a classical graph algorithm to analyze its structural properties in a way that’s difficult or impossible for SPARQL alone.
So CGE combines both graph database structure and mathematical algorithms to give you a very comprehensive capability to analyze graph-oriented data.
What else do you get?
Speed. At scale.
The speed of CGE on a Cray platform is often an order of magnitude or two higher than that of our competition running on a similarly sized system. Furthermore, we often scale, especially on complex queries, to data sets an order of magnitude larger than the competition we’ve been able to measure against.