Apache Spark vs Apache Storm

In this article, we discuss the differences between Apache Storm and Apache Spark, feature by feature.

Introduction

Apache Storm and Apache Spark are two powerful open-source tools used extensively in the Big Data ecosystem. Many people have doubts about the suitability and applicability of these tools. In this post, I would like to draw a comparison between them.

Apache Storm: Apache Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Apache Storm is simple, can be used with any programming language, and is a lot of fun to use!

Apache Spark: A general-purpose platform that can handle a wide range of workloads: batch, interactive, iterative, real-time, graph, and so on.

Spark Streaming is the component of the Spark ecosystem that handles real-time streams, so let’s compare it with Storm.

Differences Between Apache Storm and Spark Streaming

These differences will help you decide which is the better choice between Apache Storm and Spark Streaming. Let’s look at each feature one by one.

1. Processing Model

  • Storm: It supports a true stream processing model through the core Storm layer.
  • Spark Streaming: Apache Spark Streaming is a wrapper over Spark batch processing: it handles a stream as a series of small batches (micro-batches).
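
To make the difference concrete, here is a minimal sketch in plain Python. It does not use either framework: the function names, the event format, and the one-second batch interval are all assumptions made for illustration. It only shows when a result can become available under each model.

```python
from typing import Callable, Dict, List, Tuple

Event = Tuple[float, str]  # (arrival time in seconds, payload)


def process_per_tuple(events: List[Event], handle: Callable[[str], str]) -> List[Event]:
    """Per-tuple model: every event is handled at its own arrival time."""
    return [(arrival, handle(payload)) for arrival, payload in events]


def process_micro_batches(
    events: List[Event], handle: Callable[[str], str], batch_interval: float
) -> List[Event]:
    """Micro-batch model: events are grouped into fixed intervals and
    handled together when their interval closes."""
    batches: Dict[int, List[str]] = {}
    for arrival, payload in events:
        batches.setdefault(int(arrival // batch_interval), []).append(payload)

    results: List[Event] = []
    for index in sorted(batches):
        closes_at = (index + 1) * batch_interval
        results.extend((closes_at, handle(p)) for p in batches[index])
    return results


events = [(0.2, "a"), (0.7, "b"), (1.1, "c")]
print(process_per_tuple(events, str.upper))
print(process_micro_batches(events, str.upper, batch_interval=1.0))
```

In the per-tuple case each result carries the event's own arrival time, while in the micro-batch case every result is stamped with the end of the interval it fell into, so the batch interval sets a floor on latency in this toy model.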

2. Primitives

  • Storm: It provides a very rich set of primitives to perform tuple-level processing at intervals of a stream (filters, functions). Aggregations over messages in a stream are possible through group-by semantics. It supports left join, right join, and inner join (the default) across streams.
  • Spark Streaming: It provides two broad categories of operators. The first is stream transformation operators, which transform one DStream into another DStream. The second is output operators, which write information to external systems. The former includes stateless operators (filter, map, mapPartitions, union, distinct, etc.) as well as stateful window operators (countByWindow, reduceByWindow, etc.).
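
The idea behind the stateful window operators can be sketched without Spark. The snippet below is a plain-Python simulation, not the Spark API: a stream is modeled as a list of micro-batches, and the window length and slide interval are expressed in numbers of batches rather than durations (an assumption made to keep the example small). The function names only echo countByWindow and reduceByWindow.

```python
from functools import reduce
from typing import Callable, Iterator, List, Optional


def windows(batches: List[List[int]], window_length: int, slide_interval: int) -> Iterator[List[List[int]]]:
    """Yield the batches covered by each window; both sizes are in batches."""
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        start = max(0, end - window_length)
        yield batches[start:end]


def count_by_window(batches: List[List[int]], window_length: int, slide_interval: int) -> List[int]:
    """Number of elements seen in each sliding window."""
    return [sum(len(b) for b in w) for w in windows(batches, window_length, slide_interval)]


def reduce_by_window(
    batches: List[List[int]],
    func: Callable[[int, int], int],
    window_length: int,
    slide_interval: int,
) -> List[Optional[int]]:
    """Fold all elements of each sliding window with func (None if empty)."""
    results: List[Optional[int]] = []
    for w in windows(batches, window_length, slide_interval):
        flat = [x for batch in w for x in batch]
        results.append(reduce(func, flat) if flat else None)
    return results


stream = [[1, 2], [3], [4, 5, 6], [7]]
print(count_by_window(stream, window_length=2, slide_interval=1))
print(reduce_by_window(stream, lambda a, b: a + b, window_length=2, slide_interval=1))
```

Each window spans the two most recent batches and advances by one batch, which is why the same elements contribute to more than one result.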

3. State Management

  • Storm: Core Storm by default doesn’t offer any framework-level support for storing intermediate bolt output (the result of a user operation) as state. Hence, any application has to create and update its own state as and when required.
  • Spark Streaming: Spark by default treats the output of every RDD operation (transformations and actions) as intermediate state and stores it as an RDD. Spark Streaming permits maintaining and changing state via the updateStateByKey API. No pluggable mechanism is available for keeping that state in an external system.
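
As a rough illustration of what updateStateByKey does conceptually, the following plain-Python sketch applies an update function to every key once per batch. It is a simulation under stated assumptions, not Spark code: the dictionary-based state, the helper name update_state_by_key, and the running_count function are hypothetical, and only the general contract (new values plus previous state in, new state out, None drops the key) is modeled.

```python
from typing import Callable, Dict, List, Optional, Tuple


def update_state_by_key(
    state: Dict[str, int],
    batch: List[Tuple[str, int]],
    update_func: Callable[[List[int], Optional[int]], Optional[int]],
) -> Dict[str, int]:
    """Return the new per-key state after one batch.

    update_func receives the new values for a key and its previous state
    (None if the key is new); returning None drops the key.
    """
    grouped: Dict[str, List[int]] = {}
    for key, value in batch:
        grouped.setdefault(key, []).append(value)

    new_state: Dict[str, int] = {}
    # Keys with existing state are visited even when the batch has no data for them.
    for key in set(state) | set(grouped):
        updated = update_func(grouped.get(key, []), state.get(key))
        if updated is not None:
            new_state[key] = updated
    return new_state


def running_count(new_values: List[int], previous: Optional[int]) -> int:
    """Add this batch's values to the running total for the key."""
    return sum(new_values) + (previous or 0)


state: Dict[str, int] = {}
for batch in [[("a", 1), ("b", 1), ("a", 1)], [("b", 1)]]:
    state = update_state_by_key(state, batch, running_count)
print(sorted(state.items()))
```

The state carried from one batch to the next is exactly what a core Storm application, as described above, would have to maintain on its own.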

4. Message Delivery Guarantees (Handling message level failures)

  • Storm: It supports three message processing guarantees: at-least-once, at-most-once, and exactly-once. Storm’s reliability mechanisms are distributed, scalable, and fault-tolerant.
  • Spark Streaming: Apache Spark Streaming defines its fault-tolerance semantics in terms of the guarantees provided by the receivers and output operators. As per the Apache Spark architecture, incoming data is read and replicated across different Spark executor nodes. This creates failure scenarios in which data has been received but may not be reflected in the results. Fault tolerance is handled differently for worker failure and driver failure.
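
The practical difference between at-most-once and at-least-once delivery can be shown with a small plain-Python simulation. Nothing here is Storm or Spark code: FlakySink and deliver are hypothetical names, and the single injected failure is an assumption chosen to expose message loss in one mode and duplication in the other.

```python
from collections import deque
from typing import List


class FlakySink:
    """Hypothetical sink that fails exactly once, on the message fail_on.

    With crash_after_write=True the record is written before the failure
    (written but never acknowledged); otherwise nothing is written.
    """

    def __init__(self, fail_on: str, crash_after_write: bool) -> None:
        self.fail_on = fail_on
        self.crash_after_write = crash_after_write
        self.written: List[str] = []
        self._already_failed = False

    def write(self, message: str) -> None:
        failing = message == self.fail_on and not self._already_failed
        if failing and not self.crash_after_write:
            self._already_failed = True
            raise RuntimeError("failed before write")
        self.written.append(message)
        if failing:
            self._already_failed = True
            raise RuntimeError("failed before ack")


def deliver(messages: List[str], sink: FlakySink, replay_on_failure: bool) -> List[str]:
    """Send messages to the sink; optionally replay the ones that failed."""
    pending = deque(messages)
    while pending:
        message = pending.popleft()
        try:
            sink.write(message)
        except RuntimeError:
            if replay_on_failure:
                pending.append(message)  # at-least-once: try again later
            # at-most-once: the message is simply dropped
    return sink.written


msgs = ["m1", "m2", "m3"]
print(deliver(msgs, FlakySink("m2", crash_after_write=False), replay_on_failure=False))
print(deliver(msgs, FlakySink("m2", crash_after_write=True), replay_on_failure=True))
```

Without replay, the failed message never reaches the sink. With replay, it arrives, but a failure after the write means it can arrive twice, which is why at-least-once pipelines usually need idempotent or deduplicating output.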

5. Fault Tolerance (Handling process/node level failures)

  • Storm: Storm was designed with fault tolerance at its core. Storm daemons (Nimbus and Supervisor) are made to be fail-fast (meaning the process self-destructs whenever an unexpected scenario is encountered) and stateless (all state is kept in ZooKeeper or on disk).
  • Spark Streaming: The driver node (the equivalent of the Hadoop JobTracker, JT) is a single point of failure (SPOF). If the driver node fails, all executors are lost along with their received and replicated in-memory data. Hence, Spark Streaming uses data checkpointing to recover from driver failure.
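
The general idea of recovering from a checkpoint can be sketched in a few lines of plain Python. This is not how Spark implements checkpointing: the JSON file, the run_driver function, and the simulated failure are all assumptions for illustration. The point is only that progress saved to durable storage lets a restarted driver resume instead of starting over.

```python
import json
import os
import tempfile
from typing import List, Optional


def run_driver(
    batches: List[List[int]],
    checkpoint_path: str,
    fail_before_batch: Optional[int] = None,
) -> int:
    """Sum all batches, saving progress to checkpoint_path after each one."""
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)  # recover where the previous run stopped
    else:
        state = {"next_batch": 0, "total": 0}

    for index in range(state["next_batch"], len(batches)):
        if fail_before_batch == index:
            raise RuntimeError("simulated driver failure")
        state = {"next_batch": index + 1, "total": state["total"] + sum(batches[index])}
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)  # swap in the new checkpoint
    return state["total"]


with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "checkpoint.json")
    data = [[1, 2], [3], [4]]
    try:
        run_driver(data, path, fail_before_batch=2)
    except RuntimeError:
        pass  # the "driver" died after finishing batches 0 and 1
    print(run_driver(data, path))  # resumes at batch 2 and prints 10
```

In a real deployment the checkpoint location would be fault-tolerant storage rather than a local temporary directory.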

6. Debuggability and Monitoring

  • Storm: The Apache Storm UI provides a visualization of every topology with a full breakdown of its internal spouts and bolts. The UI also reports any errors occurring in tasks, along with fine-grained stats on the throughput and latency of every part of the running topology. This helps in debugging problems at a high level. For metric-based monitoring, Storm’s built-in metrics feature provides framework-level support for applications to emit any metrics, which can then be easily integrated with external metrics/monitoring systems.
  • Spark Streaming: The Spark web UI displays an additional Streaming tab that shows statistics of running receivers (whether receivers are active, the number of records received, receiver errors, and so on) and completed batches (batch processing times, queuing delays, and so on). This is useful for observing the execution of the application. The following two metrics in the Spark web UI are particularly important for tuning the batch size:
  1. Processing Time – The time to process each batch of data.
  2. Scheduling Delay – The time a batch waits in the queue for the processing of previous batches to complete.
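
How these two metrics relate can be seen in a small queueing sketch. It is plain Python, not Spark: the function name and the assumption that batches are processed strictly one after another are simplifications for illustration. A batch becomes ready when its interval ends and starts as soon as the previous batch has finished.

```python
from typing import List


def scheduling_delays(batch_interval: float, processing_times: List[float]) -> List[float]:
    """Scheduling delay of each batch, assuming one batch is processed at a time."""
    delays: List[float] = []
    free_at = 0.0  # time at which the previous batch finishes
    for index, processing_time in enumerate(processing_times):
        ready_at = (index + 1) * batch_interval  # the batch's interval has closed
        start = max(ready_at, free_at)
        delays.append(start - ready_at)
        free_at = start + processing_time
    return delays


print(scheduling_delays(1.0, [0.5, 0.5, 0.5]))  # processing keeps up
print(scheduling_delays(1.0, [1.5, 1.5, 1.5]))  # processing falls behind
```

When the processing time stays below the batch interval, the delay remains at zero. When it exceeds the interval, the delay grows with every batch, which is the signal to increase the batch interval or reduce the work done per batch.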

7. Auto Scaling

  • Storm: It allows configuring the initial parallelism at various levels per topology: the number of worker processes, executors, and tasks. Additionally, it supports dynamic rebalancing, which permits increasing or reducing the number of worker processes and executors without restarting the cluster or the topology.

    However, the number of tasks initially configured stays constant throughout the life of the topology. Once all supervisor nodes are fully saturated with worker processes and there is a need to scale out, one simply has to start a new supervisor node and point it to the cluster-wide ZooKeeper.

    It is possible to automate the logic of monitoring the current resource consumption on every node in a Storm cluster and dynamically adding more resources. STORM-594 describes such an auto-scaling mechanism using a feedback system.
  • Spark Streaming: The community is currently working on dynamic scaling for streaming applications. At present, elastic scaling of Spark Streaming applications isn’t supported.
    Essentially, dynamic allocation isn’t meant to be used in Spark Streaming at the moment (1.4 or earlier). The reason is that the receiving topology is currently static: the number of receivers is fixed. One receiver is allotted to every DStream instantiated, and it uses one core in the cluster. Once the StreamingContext is started, this topology cannot be modified. Killing receivers results in stopping the topology.
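
The relationship between Storm's fixed task count and its adjustable executor count can be sketched as follows. This is an illustrative plain-Python model only: the round-robin placement and the capping of executors at the number of tasks are assumptions of the sketch, not a description of Storm's actual scheduler.

```python
from typing import List


def assign_tasks(num_tasks: int, num_executors: int) -> List[List[int]]:
    """Spread a fixed set of task ids over executors (hypothetical round-robin)."""
    # An executor without a task would have nothing to run, so this sketch
    # never creates more executors than there are tasks.
    usable = min(num_executors, num_tasks)
    executors: List[List[int]] = [[] for _ in range(usable)]
    for task_id in range(num_tasks):
        executors[task_id % usable].append(task_id)
    return executors


print(assign_tasks(num_tasks=4, num_executors=2))  # before rebalancing
print(assign_tasks(num_tasks=4, num_executors=4))  # after scaling executors up
print(assign_tasks(num_tasks=4, num_executors=8))  # extra executors add nothing
```

Because the tasks are fixed for the life of the topology, rebalancing only changes how they are spread out, and in this model the initial task count is the ceiling on how far a component can be scaled.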

8. YARN Integration

  • Storm: Integrating Storm with YARN is recommended through Apache Slider. Slider is a YARN application that deploys non-YARN distributed applications over a YARN cluster. It interacts with the YARN ResourceManager (RM) to spawn containers for a distributed application and then manages the lifecycle of those containers. Slider provides out-of-the-box application packages for Storm.
  • Spark Streaming: The Spark framework provides native integration with YARN. Spark Streaming, as a layer above Spark, simply leverages that integration. Every Spark Streaming application is spawned as an individual YARN application. The ApplicationMaster runs the Spark driver and initializes the SparkContext. Every executor and receiver runs in a container managed by the ApplicationMaster. The ApplicationMaster then periodically submits one job per micro-batch on the YARN containers.

9. Isolation

  • Storm: Each worker process runs executors for a particular topology. That is, mixing tasks from different topologies is not allowed at the worker process level, which provides topology-level runtime isolation. Further, every executor thread runs one or more tasks of the same component (spout or bolt); that is, tasks are not mixed across components.
  • Spark Streaming: Each Spark application runs as a separate application on the YARN cluster, where every executor runs in a separate YARN container. Thus, JVM-level isolation is provided by YARN, since two different topologies can’t execute in the same JVM. In addition, YARN provides resource-level isolation, so that container-level resource constraints (CPU, memory limits) can be configured.

11. Open Source Apache Community

  • Storm: The Apache Storm powered-by page provides a healthy list of companies that are running Storm in production for many use cases. Many of them are large-scale web applications that are pushing the boundaries of performance and scale. For instance, the Yahoo deployment consists of 2,300 nodes running Storm for near-real-time event processing, with the largest topology spanning 400 nodes.
  • Spark Streaming: Apache Spark Streaming is still emerging and has limited experience in production clusters. However, the overall umbrella Apache Spark community is easily one of the largest and most active open-source communities out there today. The overall charter is rapidly evolving given the large developer base. This should lead to the maturity of Spark Streaming in the near future.

12. Ease of development

  • Storm: It provides extremely easy, rich, and intuitive APIs that simply describe the DAG nature of the processing flow (topology). Storm tuples, which provide the abstraction of data flowing between nodes in the DAG, are dynamically typed. The motivation is to simplify the APIs for ease of use. Any new custom tuple can be plugged in after registering its Kryo serializer. Developers can start by writing topologies and running them in local cluster mode.

    In local mode, threads are used to simulate worker nodes, allowing the developer to set breakpoints, halt execution, inspect variables, and profile the topology before deploying it to a distributed cluster, where all of this is much harder.
  • Spark Streaming: It offers Scala and Java APIs that have more of a functional programming flavor (transformation of data). As a result, the topology code is much more concise. There is a rich set of API documentation and illustrative samples available to the developer.

13. Ease of Operability

  • Storm: It is a little tricky to deploy and install Storm through the many available tools (Puppet and so on) and bring up the cluster. Apache Storm depends on a ZooKeeper cluster for coordination across the cluster and for storing state and statistics. It provides CLI support for actions such as submit, activate, deactivate, list, and kill topology. Strong fault tolerance means that downtime of any daemon doesn’t impact executing topologies.

    In standalone mode, Storm daemons run in supervised mode. In YARN cluster mode, Storm daemons are spawned as containers and driven by the Application Master (Slider).
  • Spark Streaming: It uses Spark as the underlying execution framework. It should be easy to bring up a Spark cluster on YARN. There are several deployment requirements. Usually, we enable checkpointing for fault tolerance of the application driver, which brings a dependency on fault-tolerant storage (HDFS).

Published at DZone with permission of Aditya Bhuyan.
