Apache Storm is an open-source and distributed stream processing computation framework written predominantly in the Clojure programming language. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing. Storm allows developers to build powerful applications that are highly responsive and can find trends between topics on Twitter, monitor spikes in payment failures, and so on.
Storm is simple. The best part about Apache Storm is a number of things that can be done with it. It is compatible with multiple languages, is extremely fast for processing through large datasets, is scalable, fault-tolerant, and packed with more amazing features.
History of Apache Storm
Storm was originally created by Nathan Marz and the team at BackType. BackType is a social analytics company. Later, Storm was acquired and open-sourced by Twitter. In a short time, Apache Storm became the standard for distributed real-time processing systems in that it allows you to process a large amount of data, similar to Hadoop. Apache Storm is written in Java and Clojure. It is continuing to be a leader in real-time analytics.
Nathan announced that he would be open-sourcing Storm to GitHub on September 19, 2011, during his talk at Strange Loop, and it quickly became the most-watched JVM project on GitHub. Production deployments soon followed, and the Storm development community rapidly expanded.
At the time Storm was introduced, big data analytics largely involved batch processing in MapReduce on Apache Hadoop or one of the higher level abstractions like Apache Pig and Cascading.
Before Storm was written, the usual way of processing data in real time was using queues and worker thread approaches. For example, some threads will be continuously writing data to some queues like rabbitMq and some worker threads will be continuously reading data from these queues and processing them. The output might be written again to some other queues and chained as input to some other worker threads to process further.
Such design is possible but obviously very fragile. Much of the time would be spent in maintaining the entire framework, serializing/deserializing messages, dealing with data loss, and resolving many other issues rather than doing the actual processing work.
Nathan Marz came up with the nice idea of creating the abstraction to all these in an efficient way in a program where we just have to create SPOUT and BOLT to do necessary processing and submit the job as TOPOLOGY and the framework will take care of everything else.
Some of the really beautiful abstractions he came up with are:
- Streams: The stream is the core abstraction in Storm. A stream is an unbounded sequence of tuples that is processed and created in parallel in a distributed fashion. Streams are defined with a schema that names the fields in the stream's tuples. By default, tuples can contain integers, longs, shorts, bytes, strings, doubles, floats, booleans, and byte arrays. You can also define your own serializers so that custom types can be used natively within tuples. Every stream is given an ID when declared. Since single-stream spouts and bolts are so common,
OutputFieldsDeclarerhas convenience methods for declaring a single stream without specifying an ID. In this case, the stream is given the default ID of
- Tracking algorithms: It has a very efficient algorithm which guaranteed that every message will be processed. It ensured, no matter how much a message is going to process downstream, fixed amount of space (about 20 bytes) would be needed to keep track of the state of every message tuple.
Benefits of Using Apache Storm
Storm advantages include:
- Real-time stream processing.
- Scalability, where throughput rates of even one million 100 byte messages per second per node can be achieved.
- Low latency. Storm performs data refresh and end-to-end delivery response in seconds or minutes depends upon the problem.
- Reliable. Storm guarantees that each unit of data (tuple) will be processed at least once or exactly once. Messages are only replayed when there are failures.
- Easy to operate. Standard configurations are suitable for production on day one. Once deployed, Storm is easy to operate.
- Fault-tolerant: The ability to be fault-tolerant is extremely important for Storm, as it processes massive data all the time and should not be interrupted by a minimal failure, such as hardware fail in nodes of the storm cluster. Storm can redeploy tasks when it is necessary.
- Data guarantee: No data loss is one of the essential requirements for a data processing system. The risk of losing data would not be accepted in the use of most fields, especially when it comes to getting accurate results. Storm makes sure that all the data would be processed as they are designed during their processing in the topology.
- Ease of use in deploying and operating the system.
- Support for multiple programming languages.
- Fraud can be detected the moment it happens and proper measures can be taken to limit the damage.
Similarities Among Hadoop and Storm
- All three are open-source processing frameworks.
- All these frameworks can be used for business intelligence and big data analytics.
- Each of these frameworks provides fault tolerance and scalability.
- Both are distributed.
- These frameworks are preferred choices for big data developers due to their simple installation methods.
- Hadoop and Storm have implementations in JVM-based programming languages (Java and Clojure, respectively).
Apache Storm vs. Hadoop
Hadoop and Storm frameworks are both fundamentally used for analyzing big data. They both complement each other and differ in some aspects. Apache Storm does all the operations except persistency, while Hadoop is good at everything but lags in real-time computation. The following table compares the attributes of Storm and Hadoop.
Use Cases of Apache Storm
Storm powers Twitter's publisher analytics product, processing every tweet and click that happens on Twitter to provide analytics for Twitter's publisher partners. Storm integrates with the rest of Twitter's infrastructure, including Cassandra, the Kestrel infrastructure, and Mesos. Many other projects are underway using Storm, including projects in the areas of revenue optimization, anti-spam, and content discovery.
Wego is a travel metasearch engine located in Singapore. Travel related data comes from many sources all over the world with different timing. Storm helps Wego to search real-time data, resolves concurrency issues and find the best match for the end-user.
Yahoo! is developing a next generation platform that enables the convergence of big-data and low-latency processing. While Hadoop is our primary technology for batch processing, Storm empowers stream/micro-batch processing of user events, content feeds, and application logs.
Navsite is using Apache Storm as part of their server event log monitoring & auditing system. The log messages from thousands of servers are sent to RabbitMQ cluster and Storm is used to compare each message with a set of regular expressions. If there is a match, then the message is sent to a bolt that stores data in MongoDB. At the moment, 5-10k messages per second are being handled, however the existing RabbitMQ + Storm clusters have been tested up to about 50k per second.
I highly recommend reading this excellent post by Nathan Marz, in which he beautifully explains his experience, how he came up with the idea of Storm, what issues he faced, and how he took things forward.