In our previous blog, Apache Storm: The Hadoop of Real-Time we have discussed introduction of apache storm. So, to explore more about apache storm, we will be going to talk about the basic architecture of Apache Storm.
Components of a Storm Cluster
An Apache Storm cluster is superficially similar to a Hadoop cluster. Whereas on Hadoop you run MapReduce jobs, on Storm, you run topologies. Jobs and topologies themselves are very different — one key difference being that a MapReduce job eventually finishes, whereas a topology processes messages forever (or until you kill it).
There are two kinds of nodes in a Storm cluster: master node and worker nodes.
1. Master Node (Nimbus)
The master node runs a daemon called Nimbus that is similar to Hadoop’s JobTracker. Nimbus is responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures.
Nimbus is an Apache Thrift service enabling you to submit code in any programming language. This way, you can always utilize the language that you are proficient in without needing to learn a new language to utilize Apache Storm.
The Nimbus service relies on Apache ZooKeeper to monitor the message processing tasks as all the worker nodes update their tasks status in the Apache ZooKeeper service.
2. Worker Nodes (Supervisor)
Each worker node runs a daemon called the Supervisor. The supervisor listens for work assigned to its machine and starts and stops worker processes as necessary based on what Nimbus has assigned to it. Each worker process executes a subset of a topology; a running topology consists of many worker processes spread across many machines.
All coordination between Nimbus and the Supervisors is done through a ZooKeeper cluster. Additionally, the Nimbus daemon and Supervisor daemons are fail-fast and stateless. Even though stateless nature has its own disadvantages, it actually helps Storm process real-time data in the best possible and quickest way.
Storm is not entirely stateless, though. It stores its state in Apache ZooKeeper. Since the state is available in Apache ZooKeeper, a failed Nimbus can be restarted and made to work from where it left. Service monitoring tools can monitor Nimbus and restart it if there is any failure.
Apache Storm also has an advanced topology called Trident Topology with state maintenance. and it also provides a high-level API like Pig.
To do real-time computation on Storm, you create what are called topologies. A topology is a graph of computation and is implemented as DAG (directed acyclic graph) data structure.
Each node in a topology contains processing logic (bolts) and links between nodes indicate how data should be passed around between nodes (streams).
When a topology is submitted to a Storm cluster, the Nimbus service on master node consults the supervisor services on different worker nodes and submits the topology. Each supervisor creates one or more worker processes, each having its own separate JVM. Each process runs within itself threads that we call Executors.
The thread/executor processes the actual computational tasks: Spout or Bolt.
Running a topology is straightforward. First, you package all your code and dependencies into a single JAR. Then, you run a command like the following:
storm jar all-my-code.jar org.apache.storm.MyTopology arg1 arg2
Streams represent the unbounded sequences of tuples (collection of key-value pairs) where a tuple is a unit of data.
A stream of tuples flows from spout to bolt(s) or from bolt(s) to another bolt(s). There are various stream grouping techniques to let you define how the data should flow in topology like global grouping, etc.
A spout is the entry point in a Storm topology. It represents the source of data in Storm. Generally, spouts will read tuples from an external source and emit them into the topology. You can write spouts to read data from data sources such as a database, distributed file systems, messaging frameworks, or a message queue as Kafka from where it gets continuous data, converts the actual data into a stream of tuples, and emits them to bolts for actual processing. Spouts run as tasks in worker processes by Executor threads.
Spouts can broadly be classified as follows:
- Reliable: These spouts have the ability to replay the tuples (a unit of data in the data stream). This helps applications achieve the "at least once message processing" semantic as, in case of failures, tuples can be replayed and processed again. Spouts for fetching data from messaging frameworks are generally reliable, as these frameworks provide a mechanism to replay the messages.
- Unreliable: These spouts don’t have the ability to replay the tuples. Once a tuple is emitted, it cannot be replayed, regardless of whether it was processed successfully. This type of spout follows the "at most once message processing" semantic.
All processing in topologies is done in bolts. Bolts can do anything from filtering and functions to aggregations, joins, talking to databases, and more.
Bolts can do simple stream transformations. Doing complex stream transformations often requires multiple steps and thus multiple bolts. For example, transforming a stream of tweets into a stream of trending images requires at least two steps: a bolt to do a rolling count of retweets for each image and one or more bolts to stream out the top X images (you can do this particular stream transformation in a more scalable way with three bolts than with two).
Bolts can also emit more than one stream.
What Makes a Running Topology: Worker Processes, Executors, and Tasks
Storm distinguishes between the following three main entities that are used to actually run a topology in a Storm cluster:
- Worker processes
- Executors (threads)
Here is a simple illustration of their relationships:
A worker process executes a subset of a topology. A worker process belongs to a specific topology and may run one or more executors for one or more components (spouts or bolts) of this topology. A running topology consists of many such processes running on many machines within a Storm cluster.
An executor is a thread that is spawned by a worker process. It may run one or more tasks for the same component (spout or bolt).
A task performs the actual data processing — each spout or bolt that you implement in your code executes as many tasks across the cluster. The number of tasks for a component is always the same throughout the lifetime of a topology, but the number of executors (threads) for a component can change over time. This means that the following condition holds true:
#threads ≤ #tasks.
By default, the number of tasks is set to be the same as the number of executors, i.e. Storm will run one task per thread.
This pretty much sums up the architecture of Apache Storm. I hope it was helpful!
This article was first published on the Knoldus blog.