Trident Storm, Real-Time Analytics for Big Data
In a previous post I mentioned Storm already. Trident is an extension of Storm that makes it an easy-to-use distributed real-time analytics framework for Big Data. Both Trident and Storm were developed by Twitter.
One of Twitter’s major problems is keeping statistics of Tweets and Tweeted URLs that get retweeted by millions of followers. Imagine a famous person who tweets a URL to millions of followers. Lots of followers will retweet the URL. So how do you calculate how many Tweeters have seen the URL? This is important for features like “Top retweeted URLs”.
The answer was Storm but with the addition of Trident, it has become a lot easier to manage. Trident is doing to Storm what Pig and Cascading are doing to Hadoop: simplification. Instead of having to create a lot of Spouts and Bolts and take care of how messages are distributed, Trident comes with a lot of the work already done.
In a few lines of code, you set-up a Distributed RPC server, send it URLs, have it collect the tweeters and followers, and count them. Fail-over and resiliance as well as massive distribution throughput are build into the platform. You can see it in this example code:
TridentState urlToTweeters = topology.newStaticState(getUrlToTweetersState()); TridentState tweetersToFollowers = topology.newStaticState(getTweeterToFollowersState()); topology.newDRPCStream("reach") .stateQuery(urlToTweeters, new Fields("args"), new MapGet(), new Fields("tweeters")) .each(new Fields("tweeters"), new ExpandList(), new Fields("tweeter")) .shuffle() .stateQuery(tweetersToFollowers, new Fields("tweeter"), new MapGet(), new Fields("followers")) .parallelismHint(200) .each(new Fields("followers"), new ExpandList(), new Fields("follower")) .groupBy(new Fields("follower")) .aggregate(new One(), new Fields("one")) .parallelismHint(20) .aggregate(new Count(), new Fields("reach"));
The possibilities of Trident + Storm, combined with fast scalable
datastores, like for instance Cassandra, are enormous. Everything from
real-time counters, filtering, complex event processing, machine
The Storm concept of Spout [data generation] and Bolt [data processing] can be easily understood by most programmers. Storm is an asynchronous highly distributed framework but with a simple distributed RPC server it can easily be used in synchronous code.
The only drawback I have seen is that DRPC is focused only on Strings (and other primitive types that can be contained in a String). Adding more complex objects (via Kryo, Avro, Protocol Buffers, etc.), or at least bytes, would be useful for companies that do not only focus on Tweets.