Opensource Graph Technologies Meetup: The Inaugural Event
Read through the Q&A sessions from a recent open-source and graph technology meetup in London with GRAKN.AI, Apache Giraph, and OrientDB.
As advocates of open-source and graph technologies, we came up with the idea of running a meetup that would merge the two! And so, we had our first London-based one on March 13 at Campus London.
If you weren’t able to make it or if you just want a refresher, you can find the slides linked below and some of the (paraphrased) Q&A. Videos will be uploaded later.
Our founder, Haikal Pribadi, gave an overview of some of the core features and differentiators of GRAKN.AI — from how to define your relationships in your ontology to the query compression that it is able to perform using its inferencing capabilities. You can view the presentation here.
It seems that GRAKN.AI can support more than two entities in a relationship, so does that mean it can also support hyper-relationships?
Yes — you can consider a relationship as a hyperedge.
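To make the hyperedge idea concrete, here is a minimal Python sketch of a relationship that links three role players at once. It is not Graql and not GRAKN.AI's API; the `Relationship` class, the role names, and the sample values are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Relationship:
    """Hypothetical hyperedge: one relationship instance, any number of role players."""
    type_name: str
    # Each entry is a (role, entity) pair; there is no limit of two entries.
    role_players: Tuple[Tuple[str, str], ...]

    def players(self, role: str) -> List[str]:
        """Return every entity that plays the given role in this relationship."""
        return [entity for r, entity in self.role_players if r == role]

    def arity(self) -> int:
        """Number of role players connected by this single relationship."""
        return len(self.role_players)


# Illustrative data only: one relationship connecting three entities.
delivery = Relationship(
    "delivery",
    (("driver", "driver-1"), ("vehicle", "van-7"), ("destination", "London")),
)
```

In a plain property graph the same fact would usually need an intermediate node or several binary edges; here a single relationship instance carries all three participants.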
Since you drew a query comparison between SQL, Gremlin, and Graql, it would be valid to compare Graql to its Semantic Web counterpart, SPARQL. How does Graql compare to SPARQL?
First of all, SPARQL works in the RDF/OWL world. You can represent anything from the OWL/RDF world in the GRAKN world. What is different is that we have higher-level constructs for you to do the modeling. These higher-level constructs are always in the form of relationships and entities; that is not necessarily the case in OWL/RDF. So, to be able to query this structure, we needed to create Graql.
I guess your data model is not only about its logical properties — it can easily go from relational models to directed graphs and back. It also implicitly suggests a certain modeling philosophy — you model n-ary patterns directly whereas in RDF you think in ontology patterns.
Is there support for negation in the inference?
Not in the current release, but we are currently working on it. Right now, we can do negation on concept values — such as x value is not equal to y value. But negation on the patterns themselves is still in progress.
The inference is classical deduction; is there a plan to support abductive reasoning? That is, what needs to be in the database in order for some statement to be true? For example, if you ask the database whether there is a driver going to London, your query would return an answer like “Yes, if xyz location is in London.”
Not in the current release — but potentially in a future one!
Can you perform multiple inheritance, i.e., have one type inherit the properties of two types?
There is some debate within our team about whether or not we should support it — but for the time being, no.
We have been creating a graph engine in-house (which is hard to do), so we would like to know how you go about storing data to get the queries to execute within a reasonable time.
Under the hood, Graql queries get translated down to Gremlin queries. So, the way Gremlin manages large amounts of data is basically how we do it. If you were writing the same query in Gremlin, you would have to specify the sequence and path that you want the query to follow. In GRAKN.AI, we have an optimization algorithm that optimizes the query path for you. However, the optimization of the query planner is an ongoing project.
Why did OrientDB choose the Multi-Model structure?
It was actually part of an evolution. OrientDB originally started as an object database used to map objects onto documents, but eventually saw more demand for a graph structure. In the 1.0 release, we started to support TinkerPop 2.x and have continued to do so in subsequent releases. We are currently working on supporting TinkerPop 3.x. In version 3.0, we are bringing all of the graph concepts inside the core so you can use OrientDB without using TinkerPop. In the older versions of OrientDB, you had to go through TinkerPop in order to use it as a graph database.
Claudio Martella, a current Googler and a PMC member and committer on the Apache Giraph project, showed us how an iterative graph processing framework can help your performance. You can view the presentation here.
How does Giraph deal with two marginals in the path?
Every vertex in the graph is in one of two states: halted or active. The computation is over when every vertex is halted. If a vertex is halted and someone sends it a message, it automatically becomes active again. So, in the shortest-path example, you start with everybody active, and everybody except the source goes to sleep because they know that they are not the source vertex. Then they get “woken up” as the “wave” of messages reaches them. In PageRank, you wouldn’t need to wake them, because you keep sending them messages and iterating on them.
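The halt-and-wake behaviour described above can be sketched as a small single-machine simulation in Python. This is not Giraph code (Giraph is a distributed Java framework); the function name `sssp`, the dictionary-based graph layout, and the sample graph are assumptions made for illustration, and every edge target is assumed to appear as a key in the graph.

```python
import math


def sssp(graph, source):
    """Pregel-style single-source shortest paths with vote-to-halt.

    graph: dict mapping vertex -> dict of {target: edge weight}.
    Returns (distances, number of supersteps executed).
    """
    dist = {v: math.inf for v in graph}
    inbox = {v: [] for v in graph}
    active = set(graph)  # superstep 0: every vertex starts active
    supersteps = 0

    while active:  # the computation ends when every vertex is halted
        outbox = {v: [] for v in graph}
        for v in active:
            # Only the source proposes distance 0, and only in the first superstep.
            start = 0 if (supersteps == 0 and v == source) else math.inf
            candidate = min([start] + inbox[v])
            if candidate < dist[v]:
                dist[v] = candidate
                # Tell neighbours about the improved distance.
                for target, weight in graph[v].items():
                    outbox[target].append(candidate + weight)
            # The vertex now votes to halt; it stays halted unless a message arrives.
        inbox = outbox
        # Receiving a message is the only thing that wakes a vertex up.
        active = {v for v, messages in inbox.items() if messages}
        supersteps += 1

    return dist, supersteps


# Illustrative graph: a -> b (1), a -> c (4), b -> c (2).
example = {"a": {"b": 1, "c": 4}, "b": {"c": 2}, "c": {}}
distances, steps = sssp(example, "a")
```

The loop condition is the termination test: no extra bookkeeping is needed to detect the end of the run, because an empty active set means no vertex has anything left to do.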
What’s the point of halting?
It’s to avoid calling a vertex when there is no reason to. It also tells you when the computation is over. For example, in the shortest path, how would you otherwise know when it’s finished?
How do you compare Giraph to Spark Graph Computer? Let’s assume that enough memory is given.
Let me put it this way: Spark is a simpler system, I think. In comparison, Giraph is faster, but it requires more engineering and a little more programming overhead. The big difference is that Spark and similar systems, such as Flink, are general-purpose. Because they apply some optimizations similar to the Pregel model, they will be fast. They will be faster than MapReduce, but they are not as fast as Giraph, because Giraph does only one thing and is very optimized to do that one thing.
If you look, you will see that Spark’s tests claim they are faster, but they are not. We spent one year trying to replicate the result at the university I was working at before, and we never managed to do it. Not even Facebook managed to do it, and Facebook is the major developer of Giraph. So, they have published some spooky benchmarks there. That’s why I wanted the question: they are slower.
What can you model for an edge, because in the second example, there were numbers — are they relationships or properties?
The system runs on bare metal. We’re trying to minimize the overhead as much as possible. This runs on a trillion edges at Facebook. So, you only want the data that is useful to that particular algorithm; in effect, you don’t have properties unless you really need them. If you need some properties for your algorithm, you will have an edge value that you model for your algorithm and that holds that particular information. The vertex is just a vertex class and the edges are just a hashmap indexed by target ID. Or even less: it could be an array list if you want to minimize memory and don’t need random access to the edges. Most of the time, about 99%, you just want to iterate, so it’s going to be just an array and you will save some memory.
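The trade-off between the two edge layouts mentioned above can be sketched in Python as follows. These classes are hypothetical and do not reproduce Giraph's Java classes; they only contrast a map indexed by target ID with a flat list that is cheap to iterate but slow to look up.

```python
class MapEdges:
    """Out-edges stored in a dict keyed by target ID: fast random access."""

    def __init__(self):
        self._edges = {}

    def add(self, target, value):
        self._edges[target] = value

    def get(self, target):
        # Direct lookup by target ID (raises KeyError if the edge is missing).
        return self._edges[target]

    def __iter__(self):
        return iter(self._edges.items())


class ListEdges:
    """Out-edges stored in two parallel lists: compact, iteration-friendly."""

    def __init__(self):
        self._targets = []
        self._values = []

    def add(self, target, value):
        self._targets.append(target)
        self._values.append(value)

    def get(self, target):
        # No index: a lookup has to scan the list linearly.
        for t, v in zip(self._targets, self._values):
            if t == target:
                return v
        raise KeyError(target)

    def __iter__(self):
        return iter(zip(self._targets, self._values))


# Both layouts expose the same iteration, which is what most algorithms use.
map_edges, list_edges = MapEdges(), ListEdges()
for store in (map_edges, list_edges):
    store.add(2, 0.5)
    store.add(3, 1.5)
```

If an algorithm only ever loops over a vertex's out-edges to send messages, the list layout gives up nothing it actually uses, which is the memory-saving argument made in the answer.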
So, I assume Giraph will be integrated into some data store, so that Giraph’s Pregel-style functions can run on top of it. What does it take to integrate with Giraph?
Depends on your use case. The very first problem is that for every algorithm you need to write a separate class, job, or application. A data store will require a query language. We have run some tests using a subset of Cypher to do some distance-related, shortest-path queries. In order to integrate, the data store will have to generate code in Giraph, compile it, and send it to your Hadoop cluster, and the Hadoop cluster will read the graph. For example, if you have HBase, then you can push some of your filters to the data store and use only the vertices that you care about. It would be useless to load the whole graph. Essentially, Giraph would load the entire graph unless you specify otherwise.
Does Giraph have an efficient data structure to enable analytical querying on data?
If it’s a graph, then maybe. It’s really down to the bare metal. This is for very large scale. If you go to something like Giraph, then you really are going for scale and speed. If you are looking for flexibility with different models, properties, or aggregations, then no. With technology like Hive and Pig, the idea is that you write a query and then it automatically builds a graph of computations with a mapper job. Giraph would be one part of this data flow, where the graph parts are handled by Giraph and the other parts get sent back to MapReduce or something.
Thanks to everyone who attended and to Campus London for providing us with the space for the event.
Published at DZone with permission of Precy Kwan, DZone MVB. See the original article here.