Titan Can Handle 2400 Concurrent Users Against a Graph Cluster in Real-Time
This thing is really new and revolutionary. This is not just another Hadoop or Giraph approach for big data processing. This is distributed in real-time! I am almost confident if the benchmark hold what it promised Titan will be one of the fastest growing technologies we have seen so far.
I met Matthias Bröcheler (CTO of Aurelius the company behind Titan graph) 4 years ago in a teaching situation for the German national student high school academy. It was the time when I was still more mathematician than computer scientist, but my journey in becoming a computer scientist had just started. Matthias was in the middle of his PhD program and I valued his insights and experiences a lot. It was for him that my eyes got opened for the first time about what big data really means and how companies like Facebook, Google, etc. knit their business model around collecting data. Matthias truly influenced me and I have a lot of respect of him.
We lost contact, and I did not start my PhD right away. I knew he was interested in graphs but that was about it. When I first started to use Neo4j, I realized that Matthias was also one of the authors of the tinkerpop blueprints, which are interfaces to talk to graphs. Most vendors of graph data bases use them. At that time, I looked him up again and I realized he was working on Titan – a distributed graph data base. I found this promising looking slide deck:
But at that time for me there wasn’t much evidence that Titan would really deliver on the promise that is given in slides 106 and 107. In fact, those goals seemed as crazy and unreachable as my former PhD proposal on distributed graph databases (By the way: Reading the PhD Proposal now, I am kind of amused since I did not really aim for the important points like Titan did.)
During the redesign phase of metalcon we started playing around with HBase to support the architecture of our like button and especially to be able to integrate this with recommendations coming from Mahout. I started to realize the big fundamental differences between HBase (the implementation of Google Bigtable) and Cassandra (an implementation of Amazon Dynamo) which result from the CAP theorem about distributed systems. Looking around for information about distributed storage engines, I stumbled again onto Titan, and seeing Matthias’ talk on the Cassandra summit 2013 got me excited. The 21 - 22 minute talk is really interesting. I also suggest you skip the first 15 minutes of the talk:
Let me sum up the amazing parts of the talk:
- 2400 concurrent users against a graph cluster!
- real time!
- 16 different (non trivial queries) queries
- achieving more than 10k requests answered per second!
- graph with more than a billion nodes!
- graph partitioning is plugable
- graph schema helps indexing for queries
- Scaling data size
- scaling data access in terms of concurrent users (especially write operations) is fundamentally integrated and seems also to be successful integrated.
- making partitioning pluggable
- requiring an schema for the graph (to enable efficient indexing)
- being able on runtime to extend the schema.
- building on top of ether Cassandra (for realtime) or HBase for consistency
- being compatible with the tinkerpop techstack
- bringing up an entire framework for analytics and graph processing.
- https://github.com/renepickhardt/metalcon/wiki/Technologytitan (our wiki page in metalcon where we will collect more information about titan.)
- There is also a screencast by Marko available on how to set up titan on an amazon cluster and querying music brainz rdf data: