
Apache Storm: Architecture

Storm is simple, can be used with any programming language, is used by many companies, and is a lot of fun to use! Let's dive into its architecture.

By Ayush Tiwari · Updated Nov. 17, 2017 · Tutorial

In our previous blog, Apache Storm: The Hadoop of Real-Time, we introduced Apache Storm. To explore it further, this post walks through the basic architecture of Apache Storm.


Components of a Storm Cluster

An Apache Storm cluster is superficially similar to a Hadoop cluster. Whereas on Hadoop you run MapReduce jobs, on Storm, you run topologies. Jobs and topologies themselves are very different — one key difference being that a MapReduce job eventually finishes, whereas a topology processes messages forever (or until you kill it).

There are two kinds of nodes in a Storm cluster: the master node and the worker nodes.

1. Master Node (Nimbus)

The master node runs a daemon called Nimbus that is similar to Hadoop’s JobTracker. Nimbus is responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures.

Nimbus is an Apache Thrift service enabling you to submit code in any programming language. This way, you can always utilize the language that you are proficient in without needing to learn a new language to utilize Apache Storm.

The Nimbus service relies on Apache ZooKeeper to monitor the message-processing tasks, as all the worker nodes update their task status in the Apache ZooKeeper service.

2. Worker Nodes (Supervisor)

Each worker node runs a daemon called the Supervisor. The supervisor listens for work assigned to its machine and starts and stops worker processes as necessary based on what Nimbus has assigned to it. Each worker process executes a subset of a topology; a running topology consists of many worker processes spread across many machines.

All coordination between Nimbus and the Supervisors is done through a ZooKeeper cluster. Additionally, the Nimbus and Supervisor daemons are fail-fast and stateless. Although statelessness has its disadvantages, it is what allows Storm to process real-time data as quickly and robustly as possible.

Storm is not entirely stateless, though. It stores its state in Apache ZooKeeper. Since the state is available in ZooKeeper, a failed Nimbus can be restarted and resume from where it left off. Service monitoring tools can watch Nimbus and restart it on failure.

Apache Storm also offers an advanced topology called the Trident topology, which maintains state and provides a high-level API similar to Pig.


Topologies

To do real-time computation on Storm, you create what are called topologies. A topology is a graph of computation, implemented as a directed acyclic graph (DAG) data structure.

Each node in a topology contains processing logic (bolts), and the links between nodes indicate how data should be passed around between them (streams).

A Storm topology

When a topology is submitted to a Storm cluster, the Nimbus service on the master node consults the Supervisor services on the worker nodes and distributes the topology to them. Each Supervisor creates one or more worker processes, each with its own separate JVM. Within each worker process run threads that we call executors.

Each executor thread carries out the actual computational tasks: spout or bolt logic.

Running a topology is straightforward. First, you package all your code and dependencies into a single JAR. Then, you run a command like the following: 

storm jar all-my-code.jar org.apache.storm.MyTopology arg1 arg2

Streams are unbounded sequences of tuples, where a tuple, a collection of key-value pairs, is the unit of data.

A stream of tuples flows from a spout to one or more bolts, or from bolt to bolt. Various stream grouping techniques (shuffle grouping, fields grouping, global grouping, and so on) let you define how the data should flow through the topology.
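The routing idea behind groupings can be sketched outside of Storm. The toy Python function below is not the Storm API; it only illustrates the contract of a fields grouping: tuples that share a field value must always land on the same bolt task, so per-key state stays local to one task.

```python
import hashlib

def fields_grouping_task(value, num_tasks):
    # Hash the field value deterministically, so equal values
    # always map to the same bolt task index.
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_tasks

words = ["storm", "kafka", "storm", "flink", "storm"]
routes = [(w, fields_grouping_task(w, 4)) for w in words]
# Every "storm" tuple maps to the same task index:
storm_tasks = {task for w, task in routes if w == "storm"}
print(storm_tasks)  # a single-element set
```

A shuffle grouping, by contrast, would pick the task randomly per tuple to balance load, with no per-key locality guarantee.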

Spouts

A spout is the entry point of a Storm topology and represents the source of data. Generally, spouts read tuples from an external source, such as a database, a distributed file system, or a messaging framework or queue such as Kafka, from which they receive continuous data. The spout converts the raw data into a stream of tuples and emits them to bolts for the actual processing. Spouts run as tasks in worker processes, executed by executor threads.

Spouts can broadly be classified as follows:

  • Reliable: These spouts have the ability to replay the tuples (a unit of data in the data stream). This helps applications achieve the "at least once message processing" semantic as, in case of failures, tuples can be replayed and processed again. Spouts for fetching data from messaging frameworks are generally reliable, as these frameworks provide a mechanism to replay the messages.
  • Unreliable: These spouts don’t have the ability to replay the tuples. Once a tuple is emitted, it cannot be replayed, regardless of whether it was processed successfully. This type of spout follows the "at most once message processing" semantic.
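The reliable case can be modeled in a few lines. This is a toy Python simulation of the ack/fail protocol, not Storm's spout interface: an emitted tuple is remembered until it is acked, and a failure puts it back for replay, which is exactly what yields at-least-once semantics.

```python
from collections import deque
import itertools

class ReliableSpout:
    """Toy model of a reliable spout: emitted tuples are kept as
    pending until acked; fail() re-queues the tuple for replay."""
    def __init__(self, source):
        self.queue = deque(source)
        self.pending = {}            # msg_id -> tuple awaiting ack
        self.ids = itertools.count()

    def next_tuple(self):
        if not self.queue:
            return None
        msg_id = next(self.ids)
        tup = self.queue.popleft()
        self.pending[msg_id] = tup   # remember until acked
        return msg_id, tup

    def ack(self, msg_id):
        self.pending.pop(msg_id, None)   # fully processed; forget it

    def fail(self, msg_id):
        tup = self.pending.pop(msg_id)   # replay on failure
        self.queue.append(tup)

spout = ReliableSpout(["a", "b"])
mid, tup = spout.next_tuple()
spout.fail(mid)                      # downstream processing failed
replayed = [spout.next_tuple()[1] for _ in range(2)]
print(replayed)  # ['b', 'a'] — the failed tuple comes around again
```

An unreliable spout is the same class with the `pending` bookkeeping deleted: once emitted, a tuple is gone, giving at-most-once semantics.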

Bolts

All processing in topologies is done in bolts. Bolts can do anything from filtering and functions to aggregations, joins, talking to databases, and more.

Bolts can do simple stream transformations. Doing complex stream transformations often requires multiple steps and thus multiple bolts. For example, transforming a stream of tweets into a stream of trending images requires at least two steps: a bolt to do a rolling count of retweets for each image and one or more bolts to stream out the top X images (you can do this particular stream transformation in a more scalable way with three bolts than with two).
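The tweets-to-trending-images pipeline can be sketched as two chained stages. The classes below are plain Python stand-ins for bolts (the names and the two-stage split are illustrative, not Storm code): the first keeps a running count per image, the second ranks the counts it receives.

```python
from collections import Counter

class RollingCountBolt:
    """First stage: count retweets per image."""
    def __init__(self):
        self.counts = Counter()
    def execute(self, image_url):
        self.counts[image_url] += 1
        return image_url, self.counts[image_url]  # emit (image, count)

class TopNBolt:
    """Second stage: track the top-N images seen on the count stream."""
    def __init__(self, n):
        self.n = n
        self.best = {}
    def execute(self, image_url, count):
        self.best[image_url] = count
        return sorted(self.best, key=self.best.get, reverse=True)[: self.n]

counter, ranker = RollingCountBolt(), TopNBolt(n=2)
top = []
for img in ["cat.png", "dog.png", "cat.png", "cat.png", "dog.png", "owl.png"]:
    top = ranker.execute(*counter.execute(img))
print(top)  # ['cat.png', 'dog.png']
```

In a real topology, many parallel counting tasks would feed one or more ranking tasks, which is why the three-bolt variant (partial ranks merged by a final bolt) scales better.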


Bolts can also emit more than one stream.

What Makes a Running Topology: Worker Processes, Executors, and Tasks

Storm distinguishes between the following three main entities that are used to actually run a topology in a Storm cluster:

  1. Worker processes
  2. Executors (threads)
  3. Tasks

Here is a simple illustration of their relationships:

The relationships of worker processes, executors (threads) and tasks in Storm

A worker process executes a subset of a topology. A worker process belongs to a specific topology and may run one or more executors for one or more components (spouts or bolts) of this topology. A running topology consists of many such processes running on many machines within a Storm cluster.

An executor is a thread that is spawned by a worker process. It may run one or more tasks for the same component (spout or bolt).

A task performs the actual data processing — each spout or bolt that you implement in your code executes as many tasks across the cluster. The number of tasks for a component is always the same throughout the lifetime of a topology, but the number of executors (threads) for a component can change over time. This means that the following condition holds true: #threads ≤ #tasks.

By default, the number of tasks is set to be the same as the number of executors, i.e. Storm will run one task per thread.
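The #threads ≤ #tasks constraint is easy to see with a small sketch. The function below is a toy round-robin assignment, not Storm's actual scheduler; it just shows how a fixed task count can be spread over a variable number of executor threads.

```python
def assign_tasks(num_tasks, num_executors):
    """Spread a component's tasks over its executor threads
    round-robin. Requires #threads <= #tasks, since every
    executor must run at least one task."""
    assert num_executors <= num_tasks, "#threads must be <= #tasks"
    assignment = [[] for _ in range(num_executors)]
    for task in range(num_tasks):
        assignment[task % num_executors].append(task)
    return assignment

# A bolt configured with 6 tasks running on 2 executor threads:
print(assign_tasks(6, 2))  # [[0, 2, 4], [1, 3, 5]]
```

Because the task count is fixed for the topology's lifetime, rebalancing only changes which executor runs which tasks, never the set of tasks itself.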

This pretty much sums up the architecture of Apache Storm. I hope it was helpful!

  • Reference: Apache Storm 1.1.1 documentation: http://storm.apache.org/releases/1.1.1/index.html

This article was first published on the Knoldus blog.


Published at DZone with permission of Ayush Tiwari, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
