Titan Provides Real-Time Big Graph Data
The content of this article was originally co-authored by Marko Rodriguez, Dan LaRocque, and Matthias Broecheler over at the Aurelius blog.
Titan is an Apache 2 licensed, distributed graph database capable of supporting tens of thousands of concurrent users reading and writing to a single massive-scale graph. In order to substantiate the aforementioned statement, this post presents empirical results of Titan backing a simulated social networking site undergoing transactional loads estimated at 50,000–100,000 concurrent users. These users are interacting with 40 m1.small Amazon EC2 servers, which in turn transact with a 6-machine Amazon EC2 cc1.4xlarge Titan/Cassandra cluster.
The presentation to follow discusses the simulation's social graph structure, the types of processes executed on that structure, and the various runtime analyses of those processes under normal and peak load. The presentation concludes with a discussion of the Amazon EC2 cluster architecture used and the associated costs of running that architecture in a production environment. In short, Titan performs well under substantial load with a relatively inexpensive cluster and, as such, is capable of backing online services requiring real-time big graph data.
The Social Graph's Structure and Processes
An online social networking service like Twitter typically supports the five operations enumerated below.
- Create an account: create a new user with the provided handle.
- Publish a tweet: disseminate a <140 character message.
- Read stream: get a time-ordered list of 10 tweets from the followed users.
- Follow a user: subscribe to the tweets of another user.
- Get a recommendation: receive a solicitation of potentially interesting users to follow.
These operations lead to the emergence of a property graph structure epimorphic to the schema diagrammed on the right. In this schema, there are user vertices and tweet vertices. When a user tweets, a tweets edge connects the user to their tweet. Moreover, all of the followers of that user (i.e., subscribers) have a timestamped outgoing stream edge attaching their vertex to the tweet. For each user vertex, the stream edges are sorted by time since, in this system, time is a declared primary key. Titan supports vertex-centric indices, which ensure O(log(n)) lookups of adjacent vertices based on the incident edge labels and properties, where n is the number of edges emanating from the vertex. For the sake of simulation, the artificially generated tweets are randomly selected snippets from Homer's The Odyssey (as provided by Project Gutenberg), where the length is sampled from a Gaussian distribution with a mean of 70 characters.
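A minimal sketch of how such a schema could be declared follows, assuming Titan's early TypeMaker API as exposed in the Gremlin console; the exact method names are an assumption and may differ across Titan versions.

```groovy
// Sketch of the schema described above, assuming Titan's TypeMaker API;
// method names may differ across early Titan versions.

// A 'time' property key used to sort incident edges chronologically.
time = graph.makeType().name('time').dataType(Long.class).functional().makePropertyKey()

// Plain label connecting a user vertex to the tweet vertices they publish.
tweets = graph.makeType().name('tweets').makeEdgeLabel()

// Declaring 'time' as a primary key gives these labels a vertex-centric
// index: incident edges are stored sorted by time, so the most recent
// ones are retrievable in O(log(n)).
stream  = graph.makeType().name('stream').primaryKey(time).makeEdgeLabel()
follows = graph.makeType().name('follows').primaryKey(time).makeEdgeLabel()
```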
To provide a foundational layer of data, the Twitter graph as of 2009 was first loaded into the Titan cluster. This data includes 41.7 million user vertices and 1.47 billion follows edges. After loading, the 40 m1.small machines are put into a "while(true) loop" (in fact, there are 10 concurrent threads on each worker running 125,000 iterations). During each iteration of the loop, a worker selects an operation to enact using a biased coin toss (see the diagram on the left). The distribution heavily favors stream reading, as this is typically the most prevalent operation in such online social systems. Next, if a recommendation is provided, then there is a 30% chance that the user will follow one of the recommended users. This is how follows edges are added to the graph.
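A minimal sketch of such a worker loop is shown below. The branch weights are inferred from the per-operation transaction counts reported later in this article (roughly 76% read stream, 15% publish tweet, 8% get recommendation, 1% create account), and the operation bodies are passed in as closures, since the worker source itself is not published.

```groovy
// Hypothetical worker loop; weights inferred from the transaction counts
// in the tables below. Each of the 10 threads per worker runs 125,000
// iterations of this loop.
def runWorker(Closure readStream, Closure publishTweet, Closure getRecommendation,
              Closure follow, Closure createAccount, int iterations = 125000) {
    def random = new Random()
    iterations.times {
        double p = random.nextDouble()      // the biased coin toss
        if (p < 0.76) {
            readStream()                    // the dominant operation
        } else if (p < 0.91) {
            publishTweet()
        } else if (p < 0.99) {
            def recommended = getRecommendation()
            // 30% chance the user follows one of the recommended users
            if (random.nextDouble() < 0.30 && recommended) {
                follow(recommended[random.nextInt(recommended.size())])
            }
        } else {
            createAccount()                 // rare, locking operation
        }
    }
}
```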
A follows recommendation (e.g., "Who to Follow") makes use of the existing follows edges to determine, for a particular user, other users that they might find interesting. Typically, some variant of a triangle closure is computed in such situations. In plain English: if the users that user A follows tend to follow user B, then user B is likely a good user for user A to follow. To capture this notion as a real-time graph algorithm, the Gremlin graph traversal language is used.
```groovy
follows = g.V('name',name).out('follows').toList()
follows20 = follows[(0..19).collect{random.nextInt(follows.size)}]
m = [:]
follows20.each { it.outE('follows')[0..29].inV.except(follows).groupCount(m).iterate() }
m.sort{a,b -> b.value <=> a.value}[0..4]
```
1. Retrieve all the users that the user follows, where name is the user's unique Twitter handle.
2. Randomly select 20 of those followed users (provides variation on each invocation; non-deterministic).
3. Create an empty associative array/map that will be populated with recommendation rankings.
4. For each of the 20 randomly selected followed users, get their 30 most recently followed users that are not already followed, and score them in the map.
5. Reverse sort the map and return the top 5 users as recommendations.
Note that vertex-centric indices come into play again on line 4, where follows edges (like stream edges) have a primary key of time and are thus chronologically ordered. Retrieving the 30 most recently followed users is therefore a single O(log(n)) lookup, where again, n is the number of edges emanating from the vertex.
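For comparison, the read-stream operation exploits the same kind of index. A minimal sketch, assuming the tweet body is stored in a text property on the tweet vertex:

```groovy
// Hypothetical "read stream": the 10 most recent tweets from followed users.
// Because 'stream' edges are sorted by the 'time' primary key, this slice is
// an O(log(n)) lookup rather than a scan of the full edge list.
stream10 = g.V('name',name).outE('stream')[0..9].inV.text.toList()
```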
Titan Serving 50,000–100,000 Concurrent Users
Titan is an OLTP graph database. It is designed to handle numerous short, concurrent transactions like the ones discussed previously. In this section, Titan's performance under normal (5,900 transactions per second) and peak (10,200 transactions per second) load is presented. We consider what follows to be a reasonable benchmark: no specialized hardware is required (standard EC2 machines), no complex configuration or tuning of either Cassandra or Titan was performed, and all worker code goes through the standard Blueprints API.
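For context, a worker needs only a few lines to obtain a Blueprints-compatible graph handle over the Cassandra cluster. A sketch, where the hostname is illustrative and the configuration keys are assumptions that may vary by Titan version:

```groovy
import com.thinkaurelius.titan.core.TitanFactory
import org.apache.commons.configuration.BaseConfiguration

// Minimal sketch of connecting to the Titan/Cassandra cluster; hostname
// is illustrative and configuration keys may differ across versions.
conf = new BaseConfiguration()
conf.setProperty('storage.backend', 'cassandra')
conf.setProperty('storage.hostname', '10.0.0.1')
graph = TitanFactory.open(conf)   // a Blueprints-compatible TitanGraph
```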
Normal Load
The normal load simulation ran for 2.3 hours, and during that time, 49 million transactions occurred. This comes to approximately 5,900 transactions a second. Assuming that a human user does a transaction every 5–10 seconds (e.g., reads their stream and then publishes a tweet), this Titan cluster is supporting approximately 50,000 concurrent users. In the table below, the number of transactions per operation, the average transaction times, the standard deviation of those times, and the 3-sigma times are presented. 3 sigma is three standard deviations greater than the mean and represents the expected worst-case time that 0.1% of the users will experience. Finally, note that creating an account is a slower transaction because it is a locking operation that ensures that no two users have the same username (i.e., Twitter handle).
| Action | Number of tx | Mean tx time | Std of tx time | 3 sigma tx time |
|---|---|---|---|---|
| Create an account | 379,019 | 115.15 ms | 5.88 ms | 132.79 ms |
| Publish a tweet | 7,580,995 | 18.45 ms | 6.34 ms | 37.48 ms |
| Read stream | 37,936,184 | 6.29 ms | 1.62 ms | 11.15 ms |
| Get recommendation | 3,793,863 | 67.65 ms | 13.89 ms | 109.33 ms |
| Total | 49,690,061 | | | |
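As a back-of-the-envelope check of the figures above (the 5–10 second per-user interval is the article's own assumption):

```groovy
// Rough sanity check of the quoted numbers; not from the original source.
txPerSec   = 49000000 / (2.3 * 3600)   // ≈ 5,900 transactions per second
users      = txPerSec * 8.5            // one tx every ~8.5 s per user → ≈ 50,000
threeSigma = 18.45 + 3 * 6.34          // "publish a tweet": mean + 3*std ≈ 37.48 ms
```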
After 2.3 hours of the aforementioned transactions, the following types of vertices and edges were added to the pre-existing 2009 Twitter graph, alongside the same statistics extrapolated to a full day.

[Table: counts of vertices and edges added over 2.3 hours, with 1-day extrapolation.]
Peak Load
To determine how Titan would perform in a peak load environment, the 40 worker machines together executed 10,200 transactions a second over 1.3 hours (49 million total transactions). This simulates approximately 100,000 concurrent users.
Transaction numbers and timing statistics are provided in the table below. Note that higher latencies are expected given the higher load, and that even though the transaction times are longer than those under normal load, the times are still acceptable for a real-time online service.
| Action | Number of tx | Mean tx time | Std of tx time | 3 sigma tx time |
|---|---|---|---|---|
| Create an account | 374,860 | 172.74 ms | 10.52 ms | 204.29 ms |
| Publish a tweet | 7,517,667 | 70.07 ms | 19.43 ms | 128.35 ms |
| Read stream | 37,618,648 | 24.40 ms | 3.18 ms | 33.93 ms |
| Get recommendation | 3,758,266 | 229.83 ms | 29.08 ms | 317.06 ms |
| Total | 49,269,441 | | | |
Amazon EC2 Machinery and Costs

The simulation presented was executed on Amazon EC2. The software infrastructure to run this simulation made use of CloudFormation. In terms of hardware infrastructure, this section discusses the instance types, their physical statistics during the experiment, and the cost of running this architecture in a production environment.
The 40 workers were m1.small Amazon EC2 instances (1.7 GB of memory with 1 virtual core). The Titan/Cassandra cluster was composed of 6 machines, each with the following specification.
- 23 GB of memory
- 33.5 EC2 Compute Units (2 × Intel Xeon X5570, quad-core "Nehalem" architecture)
- 1,690 GB of storage
- 64-bit platform
- 10 Gigabit Ethernet
- EC2 API name: cc1.4xlarge
Under the normal load simulation, the 6-machine Titan cluster experienced the following CPU utilization, disk reads (in bytes), and disk writes (in bytes), where each colored line represents one of the 6 cc1.4xlarge machines. Note that the disk read chart is a 1-hour snapshot taken during the middle of the experiment and, therefore, the caches are warm. In summary, Titan is able to consistently, and without exertion, maintain the normal transactional load.
The cost of running all these machines is provided in the table below. Note that in a production environment (non-simulation), the 40 workers can be interpreted as web servers taking user requests and processing results returned from the Titan cluster.
| Instance | Cost per hour | Cost per day | Cost per year |
|---|---|---|---|
| 6 cc1.4xl | $7.80 | $187.20 | $68,328 |
| 40 m1.small | $3.20 | $76.80 | $28,032 |
| Total | $11.00 | $264.00 | $96,360 |
For serving 50,000–100,000 concurrent users, $96,360 a year is inexpensive considering the incoming revenue from a user base of that size (assuming 5% of the user base is concurrent, this implies roughly 2 million registered users). Moreover, Titan can be deployed over an arbitrary number of machines and dynamically scale to meet load requirements (see the benefits of Titan). Therefore, this 6 cc1.4xl architecture is not a necessity, but a particular configuration that was explored for the purpose of the presented social simulation. For environments with less load, a smaller cluster can and should be used.
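A quick check of those figures (the per-user cost at the end is a derived, illustrative number, not from the original article):

```groovy
// Rough check of the cost figures; the per-user cost is illustrative only.
costPerYear     = 11.00 * 24 * 365              // $96,360
registeredUsers = 100000 / 0.05                 // 5% concurrency → ~2,000,000
costPerUser     = costPerYear / registeredUsers // ≈ $0.05 per registered user/year
```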
Conclusion
Titan has been in research and development for the past 4 years. In Spring 2012, Titan was made freely available by Aurelius under the liberal Apache 2 license. It is currently distributed as a 0.1-alpha, with a 0.1 release planned by the end of Summer 2012.
Note that Titan is but one piece of the larger graph puzzle. Titan serves the OLTP aspect of graph processing. By the middle of Fall 2012, Aurelius will release a collection of OLAP graph technologies to support global graph processing and analytics. All of the Aurelius technologies will integrate with one another as well as with the suite of open source, BSD-licensed graph technologies provided by TinkerPop. By standing on the shoulders of giants (e.g., Cassandra, TinkerPop, Amazon EC2), great leaps and bounds in applied graph theory and network science are possible.
References

- Kwak, H., Lee, C., Park, H., Moon, S., "What is Twitter, a Social Network or a News Media?," World Wide Web Conference, 2010.
- Rodriguez, M.A., Broecheler, M., "Titan: The Rise of Big Graph Data," public lecture at Jive Software, Palo Alto, 2012.
- Broecheler, M., LaRocque, D., Rodriguez, M.A., "Titan: A Highly Scalable, Distributed Graph Database," GraphLab Workshop 2012, San Francisco, 2012.