New frameworks for solving complex problems appear in the Big Data world every day, though Hadoop was the one that first opened the gate to analyzing huge volumes of data. Knowing the capabilities of Storm and Hadoop helps users choose the right technology for their business needs.
According to the Hadoop website, "the Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures."
Apache Storm runs continuously, consuming data from the configured sources (spouts) and passing it down the processing pipeline (bolts). Together, spouts and bolts form a topology, which can be written in any language. Storm can integrate with any queuing system and any database system (e.g., RDBMS, NoSQL).
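To make the spout-and-bolt idea concrete, here is a minimal sketch in plain Python. It does not use the Storm API; the names `sentence_spout`, `split_bolt`, `count_bolt`, and `run_topology` are hypothetical and exist only for illustration. Generators stand in for the streams that connect the components, and a real topology would run them in parallel across a cluster rather than in one process.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def sentence_spout(lines: Iterable[str]) -> Iterator[str]:
    """Hypothetical spout: emits one tuple (here, a sentence) per source item."""
    for line in lines:
        yield line


def split_bolt(sentences: Iterable[str]) -> Iterator[str]:
    """Hypothetical bolt: splits each incoming sentence into lowercase words."""
    for sentence in sentences:
        for word in sentence.lower().split():
            yield word


def count_bolt(words: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Hypothetical bolt: emits the running count each time a word arrives."""
    counts = Counter()
    for word in words:
        counts[word] += 1
        yield word, counts[word]


def run_topology(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Wire spout -> split bolt -> count bolt and collect everything emitted."""
    return list(count_bolt(split_bolt(sentence_spout(lines))))


if __name__ == "__main__":
    for word, count in run_topology(["the cat", "the dog"]):
        print(word, count)
```

Each stage handles one item at a time and passes its output straight to the next stage, which is the property that lets a streaming system produce results while data is still arriving.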
Storm does not natively run on top of typical Hadoop clusters. Instead, it uses Apache ZooKeeper together with its own master process (Nimbus) and worker-node processes (Supervisors) to coordinate topologies, the master and worker states, and the message guarantee semantics.
Quoting from the project site:
"Storm has many use cases: realtime analytics, online Machine Learning, continuous computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate."
In this table, I compare the trade-offs involved when choosing between Storm and Hadoop for data processing.
|Storm|Hadoop|
|---|---|
|Distributed real-time processing of large volumes of high-velocity data.|Distributed batch processing of large volumes of stored data.|
|Data is mostly dynamic and continuously streamed.|Data is mostly static and stored in persistent storage.|
|Relatively fast: low latency, because data is processed as it arrives.|Relatively slow: high latency, because results are available only after a batch job completes.|
|Architecture consists of spouts and bolts.|Architecture consists of HDFS and MapReduce.|
|Scalable and fault-tolerant.|Scalable and fault-tolerant.|
|Implemented in Clojure.|Implemented in Java.|
|Simple and can be used with any programming language.|Complex and can be used with any programming language.|
|Easy to set up and operate.|Easy to set up but difficult to operate.|
|Used in business intelligence and Big Data analytics.|Used in business intelligence and Big Data analytics.|
|Open source (Apache license).|Open source (Apache license).|
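
The central trade-off in the comparison, batch versus streaming, can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not Hadoop or Storm code: `batch_word_count` and `streaming_word_count` are hypothetical names, and both run in a single process with no distribution or fault tolerance.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List


def batch_word_count(stored_lines: List[str]) -> Dict[str, int]:
    """Batch style: the whole dataset is available up front.

    A map step turns every word into a (word, 1) pair, a reduce step
    sums the pairs per word, and one final result is returned at the end.
    """
    mapped = [(word, 1) for line in stored_lines for word in line.split()]
    totals: Dict[str, int] = {}
    for word, one in mapped:
        totals[word] = totals.get(word, 0) + one
    return totals


def streaming_word_count(incoming_lines: Iterable[str]) -> Iterator[Dict[str, int]]:
    """Streaming style: lines arrive one at a time.

    The counts are updated incrementally and a fresh snapshot is emitted
    after every line, without waiting for the input to end.
    """
    totals = Counter()
    for line in incoming_lines:
        totals.update(line.split())
        yield dict(totals)


if __name__ == "__main__":
    data = ["a b", "b c"]
    print(batch_word_count(data))
    for snapshot in streaming_word_count(data):
        print(snapshot)
```

Both functions arrive at the same totals once all input has been seen. The difference is when results become available: the batch version answers once at the end, while the streaming version has a usable answer after every line.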