Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Apache Storm vs WSO2 Stream Processor, Part 1

DZone's Guide to

Apache Storm vs WSO2 Stream Processor, Part 1

In this article, a developer gives a a side-by-side comparison of two open source stream processors. Read on to learn more!

· Big Data Zone ·
Free Resource

How to Simplify Apache Kafka. Get eBook.

Abstract

Data stream processors have gained significant attention from the industry in recent years due to the increasing trend of data stream processing applications. This article compares the features of Apache Storm with WSO2 Stream Processor which are two open source data stream processors. The comparison includes more than 20 different important features of modern stream processors. The article identifies the strong points of WSO2 Stream Processor and Apache Storm in terms of their different application scenarios.

1. Introduction

Data stream processing has become the main paradigm of data analytics in recent years due to the growing requirement of stream processing applications. Multiple applications of stream processing can be found in areas such as telecommunications, financial trading [7], cybersecurity [6], advertising, retail, manufacturing, energy, transportation [4],[5], etc [3].

In a typical data stream processing scenario (see Figure 1) streams of data flowing from external sensors are collected at a message queue. These messages are then fetched by the data stream processor and the data gets processed by a set of continuous queries in real-time. The operations they do are mostly in-memory processing such as filtering, aggregation, projection, etc. which could leverage an ephemeral store such as Random Access memory (RAM). Occasionally the stream processing applications may have to deal with data stored in secondary storage or databases (i.e., permanent store).

Image title

Figure 1: Reference architecture of a typical data stream processor

Recently, several notable open source stream processors have appeared targeting different application scenarios. Apache Storm has been one of the leaders in distributed event stream processors and it created a significant impact on the industry due to its heavy use by well-known companies such as Yahoo!, Twitter, Weather Channel, Spotify, and Rocket Fuel [1][2]. WSO2 Stream Processor is an open source, scalable, and feature-rich stream processing platform from WSO2. It has been used by hundreds of enterprises worldwide, including many famous organizations such as Uber, Transport For London (TFL), Experian, United Airlines, etc. 

2. Overview

Understanding the basics on which the stream processing system has been developed is essential to clearly identify their capabilities. In this section, we provide a brief description of the system architectures of Apache Storm 1.1.x and WSO2 Stream processor.

2.1. Apache Storm

Storm is a distributed computation system which supports real-time processing of large-scale streaming data. It was initially developed at Twitter and was converted as an Apache incubator project later on. The architecture of Storm is shown in Figure 2 which consists of two types of nodes called master nodes and worker nodes. The master node runs a daemon called Nimbus which is responsible for task assignment, code distribution, and fault monitoring. Stream processing applications in Storm are represented as graphs of computation (see Figure 3). A Storm application is called Topology. It consists of at least one input component called Spout which receives data from external streaming data sources. Furthermore, a topology may include at least one processing component called Bolt. The Topology is created by grouping multiple bolts into a data flow graph as shown in Figure 3.

Image title

Figure 2: An overview of Apache Storm architecture

Image title

Figure 3: Anatomy of a stream processing application developed with Storm

2.2. WSO2 Stream Processor

WSO2 Stream Processor (WSO2 SP) is an open source, scalable, and feature-rich stream processing platform from WSO2. It can ingest data from Kafka, HTTP requests, message brokers and you can query data stream using a “Streaming SQL” language. With just two commodity servers it can provide high availability and can handle 100K+ TPS throughput. It can scale up to millions of TPS on top of Kafka. The architecture of WSO2  Stream Processor is shown in Figure 4.

Image title

Figure 4: Architecture of WSO2 Stream Processor

How a typical stream processing application has been developed for WSO2 Stream Processor is shown in Figure 5. It receives input from external parties via event receivers. The received events are added to an event stream and gets processed.  These data streams are processed by the Siddhi engine which generates useful results. The results get published to the external parties via an Event Publisher.

Image title

Figure 5: Anatomy of a stream processing application developed with WSO2 Stream Processor

3. Selection of the Dimensions to Compare

The selection of the dimensions for comparison was made based on a few well known previous stream processor comparisons such as references [9],[10], and [14]. The stream processing system must provide support for conducting analytics operations and pattern detection within time windows on real-time data from multiple sources [27]. Hence we have used types of operators supported by the stream processing system. The way stream processing computation is specified has a significant impact on its usage. For example, the users need their applications to utilize third-party libraries. Extensibility of a stream processing platform is a very important feature because then only the stream processing system gets the ability to reuse existing code as well as accommodate the implementation of complex queries which could not be implemented with the existing programming model. System operation characteristics such as the ability to do distributed processing, scalability, and fault tolerance are also important characteristics for a stream processor. Furthermore, out-of-order event handling, resource-aware scheduling, etc. are important factors for operating IoT data processing and cloud data processing applications.

4. Comparison

When comparing two stream processors, multiple aspects need to be taken into consideration. These aspects can be related to either Quality of Service Features, System Characteristics, or Stream Processing Use Case support. Table I provides a summary comparison between the two stream processors.

Table I: Comparison between Storm 1.1.x and Stream Processor

Feature Apache Storm WSO2 Stream Processor

Query Language

Storm SQL language. Storm SQL compiles the SQL queries to Trident topologies and executes them in Storm clusters. Storm SQL is still an experimental language which is yet to go to production [12].

Siddhi query language is a complete streaming query language which supports defining basic as well as complex event processing queries. Siddhi query language has been used in real-world product deployment (e.g., Uber [11]).

Types of operators in query language

Insert

Update

Merge

Delete

Select

Join

Window

Select

Filter

Window

Aggregations

Group By

Having

Join

Pattern

Window

Both count and time-based windows are supported. These two types of windows can be implemented either as tumbling windows or as sliding windows. The time can be both event time as well as processing time. The interface IWindowedBolt is implemented by bolts that need windowing support. Typically Bolts which need to implement windowing capabilities could support that by extending from BaseWindowedBolt class. In Storm, time lag parameters can also be specified which indicate the maximum time limit for tuples with out-of-order timestamps.

Both count and time-based windows are supported. These two types of windows can be implemented either as tumbling windows or as sliding windows. The time can be both event time as well as processing time.

Message guarantee (i.e., Processing Semantics)

Storm, out-of-the-box, supports only two message processing guarantees [16].

  • At-most-once.
  • At-least-once (default).

Although exactly-once is a desirable feature it comes with added complexity. If the data we deal with is idempotent, having at-least-once delivery guarantee is sufficient.

Note that Storm provides support for implementing at-least-once but we need to do more work to get at-least-once semantics to really work in Storm.

Stream Processor supports both at-least-once and exactly-once. Exactly once is provided using Kafka [20].

Programming languages support (i.e., API languages)

Java

Clojure

Scala

Python

Ruby

Java

Python


Ability to Conduct Batch Processing

Storm departs considerably from the batch processing scenario.

Stream Processor provides support for specifying batch operations using Siddhi’s incremental processing feature for many of the use cases.

State Management

The operators’ state can be saved and retrieved using the Storm core’s built-in abstractions. Storm has both default in-memory state implementation as well as Redis backed implementation [24].

Stream Processor has a built-in checkpointing mechanism where, once enabled, it checkpoints the events it passes to Siddhi Apps to Apache Kafka.

Code Level Debugging

No explicit support has been provided for debugging. General practice is to use logs.

Siddhi Editor of Stream Processor supports writing Siddhi apps as well as debugging them.

Support for edge analytics

None

Supports edge analytics use cases. The light storage footprint (< 2MB) and very low resource requirements allow Siddhi to be embeddable in Android and RaspberryPi. Streaming SQL programs can be deployed as self-contained Siddhi applications via a single file.

Predictive analytics on streaming data (i.e., Streaming Machine Learning) [26]

None

Provides streaming machine learning extension. Supports key streaming machine learning techniques such as classification, clustering, and regression.

Support for developing information dashboards

None

Enables users to get a holistic view of their data via dashboards. Users can create custom gadgets and visualize the outcome of their processing.

Business user friendliness

None

Stream Processor provides a graphical user interface where users can design a complete end-to-end application using gadgets such as dropdowns, etc.

Extensibility using custom code

Storm is designed to be usable with any programming language. Since Storm uses Thrift definitions for defining and submitting topologies, Thrift can be used in any language, topologies can be created and submitted using any language [15].

Siddhi can be extended via creating Siddhi extensions. Siddhi extensions written in Java can be linked with other programming languages via Java’s language interoperability support.

Fault Tolerance

Storm has an application-level fault tolerance mechanism which completely addresses the scenarios of component failures (i.e., both worker and Nimbus) [21]. Nimbus and Supervisor daemons are designed to be stateless and fail-fast. Nimbus and Supervisor daemons must be run under supervision via a tool like daemontools or monit, which restarts these daemons if they die [20].

Stream Processor fault tolerance has been implemented via periodic snapshotting and replaying via the use of Kafka. Stream Processor has a mode where the applications which do not require exactly-once processing can run without the use of Kafka [20].

Performance Monitoring

Storm UI provides performance details of internal spouts and bolts, as well as the UI, and provides detailed statistics on the throughput and latency of every part of the running topology. Storm exposes a metrics interface to report summary statistics across the Storm topologies. These metrics can be accessed via Storm’s REST API.

Siddhi has probes built into it to report statistics. Once statistics collection enabled probes are enabled, they can send out performance numbers. Furthermore, Stream Processor consists of an information dashboard which indicates the lower-level system performance.

Supported input sources/transports

Multiple input sources are supported by Apache Storm. The Apache Storm 1.1.0 API consists of about 45 unique different Spouts which could handle reading data from CSV files and HDFS [25]. Furthermore, it has support for reading data from Kafka, Hive, etc. Moreover, multiple spouts are available for generating synthetic streaming data which can be used to test the behavior of Storm topologies. For others, connectors need to be created using custom code.


Siddhi provides IO mappers for Kafka, TCP, JMS, WSO2 Event, HTTP, files [18]. Each mapper can handle different data formats such as binary, JSON, Text, XML, Key-value. Since the Siddhi language itself provides access to these mapper objects it provides a significant advantage. Stream Processor consists of an event simulator which can be used to produce synthetic data required for offline testing.

Late arrival and out of order event handling

In most applications, the out-of-order handling needs to be implemented manually. However, certain operators such as Windows has support for dropping out-of-order events which arrive after a specified time period.

The reorder extension of Siddhi support handling out-of-order [13]. This extension has to be used in front of the operators which are order sensitive.

Data Model

Based on Tuples. A tuple is a named list of values, and a field in a tuple can be an object of any type.

Uses an event as its data model. An event is an object with a list of named fields.

Programming Model

Based on three entities, Spout (which receives input), Stream (represents the data flow), Bolt (processing entity)

Based on Siddhi

Transport

Netty

Stream Processor has several transports, such as TCP, Kafka, wso2-event, etc., which are used to communicate between Stream processor components.

Secure event processing

None. Has to be manually implemented.

Transports such as WSO2-Event has support for encrypting data with SSL.

Resource Management

YARN and Mesos

YARN

Back Pressure

Yes Yes

Notion of time

Event time (the time at which the event was generated), stream time (injection time/time at which the event was received by the system). But does not have support for processing time.

Event time and process time. The time window in Siddhi operates on processing time.

Stream grouping and partitioning support

A stream grouping defines how the stream should be partitioned among the bolt's tasks. Storm consists of eight different stream grouping approaches such as Shuffle grouping, Fields grouping, Partial Key grouping, All grouping, Global grouping, None grouping, Direct grouping, and Local or shuffle grouping.

WSO2 Stream Processor provides stream partitioning capabilities via Siddhi’s partitioning capabilities. There are variable partitions and range partitions [22].

Capable of handling the straggler nodes?

No. The default behavior of Storm is that if a worker dies, the supervisor will restart it. If it continuously fails on startup and is unable to heartbeat to Nimbus, Nimbus will reassign the worker to another machine. But this does not address the scenario of the worker still heartbeat to Nimbus but the worker is also acting as a straggler.

No. But Stream Processor has the functionality to support detecting straggler nodes via its built-in performance monitoring infrastructure.

Conduct resource aware scheduling?

Yes. Storm comes with a resource-aware scheduler [23].

Yes

That's all for Part 1. Tune back in tomorrow when we'll cover to which use cases these technologies best apply. 

References

[1]    Apache Software Foundation (2015), Apache Storm, http://storm.apache.org/

[2]   Forrester (2014), The Forrester Wave™: Big Data Streaming Analytics Platforms, Q3
        2014
, https://www.forrester.com/report
        /The+Forrester+Wave+Big+Data+Streaming+Analytics+Platforms+Q3+2014/-/E-
        RES113442


[3]   WSO2 (2018), Analytics Solutions, https://wso2.com/analytics/solutions/

[4]   De Silva, R. and Dayarathna, M.(2017), Processing Streaming Human Trajectories with
        WSO2 CEP
, https://www.infoq.com/articles/smoothing-human-trajectory-streams

[5]   WSO2 (2017), Video Analytics: Technologies and Use Cases,
        http://wso2.com/whitepapers/innovating-with-video-analytics-technologies-and-use-cases

[6]   WSO2 (2015), Fraud Detection and Prevention: A Data Analytics
        Approach
http://wso2.com/whitepapers/fraud-detection-and-prevention-a-data-analytics-
        approach


[7]   WSO2 (2018), WSO2 Helps Safeguard Stock Exchange via Real-Time Data Analysis and         Fraud Detection, http://wso2.com/casestudies/wso2-helps-safeguard-stock-exchange-via-
        real-time-data-analysis-and-fraud-detection


[8]   Apache Software Foundation (2015), Trident Tutorial, http://storm.apache.org/releases
        /1.1.2/Trident-tutorial.html


[9]  Luckham, D.(2016), Proliferation of Open Source Technology for Event Processing,
        http://www.complexevents.com/2016/06/15/proliferation-of-open-source-technology-for-
        event-processing/


[10] Zapletal, P.(2016), Comparison of Apache Stream Processing Frameworks: Part 1http://www.cakesolutions.net/teamblogs/comparison-of-apache-stream-processing-frameworks-part-1

[11] WSO2 (2017), [WSO2Con USA 2017] Scalable Real-time Complex Event Processing at Uber, http://wso2.com/library/conference/2017/2/wso2con-usa-2017-scalable-real-time-complex-event-processing-at-uber/

[12] Apache Software Foundation (2015), Storm SQL integrationhttp://storm.apache.org/releases/2.0.0-SNAPSHOT/storm-sql.html

[13] GitHub (2018), siddhi-execution-reorder, https://github.com/wso2-extensions/siddhi-execution-reorder

[14] Microsoft (2018), Stream Analytics Documentationhttps://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-comparison-storm

[15] Apache Software Foundation (2015), Apache Stormhttp://storm.apache.org/about/multi-language.html

[16] Tsai, B. (2014), Fault Tolerant Message Processing in Storm, https://bryantsai.com/fault-tolerant-message-processing-in-storm-6b57fd303512

[17] Blogger (2015), Why We need SQL like Query Language for Realtime Streaming Analytics?http://srinathsview.blogspot.com/2015/02/why-we-need-sql-like-query-language-for.html

[18] GitHub (2018), WSO2 Siddhi, https://github.com/wso2/siddhi

[19] Apache Software Foundation (2015), Apache Storm, http://storm.apache.org/releases/current/Fault-tolerance.html

[20] WSO2 (2018), Introduction - Stream Processor 4.0.0, https://docs.wso2.com/display/SP400/Introduction

[21] Andrade, H.C.M. and Gedik, B. and Turaga, D.S. (2014), Fundamentals of Stream Processing: Application Design, Systems, and Analytics, 9781107434004, Cambridge University Press

[22] GitHub (2018), Siddhi Query Guide - Partition, https://wso2.github.io/siddhi/documentation/siddhi-4.0/#partition

[23] Apache Software Foundation (2015), Resource Aware Scheduler, http://storm.apache.org/releases/1.1.2/Resource_Aware_Scheduler_overview.html


[24] Apache Software Foundation (2015), Storm State Managementhttp://storm.apache.org/releases/1.1.2/State-checkpointing.html


[25] Apache Software Foundation (2018), Interface IRichSpout, http://storm.apache.org/releases/1.1.2/javadocs/org/apache/storm/topology/IRichSpout.html

[26] Bigml (2013), Machine Learning From Streaming Data: Two Problems, Two Solutions, Two Concerns, and Two Lessons, https://blog.bigml.com/2013/03/12/machine-learning-from-streaming-data-two-problems-two-solutions-two-concerns-and-two-lessons/

[27] SAP (2017), Forrester Research Names SAP in Leaders Category for Streaming Analytics, https://reprints.forrester.com/#/assets/2/308/%27RES136545%27/reports

[28] Apache Software Foundation (2015), Storm JDBC Integration, http://storm.apache.org/releases/1.1.2/storm-jdbc.html


[29]  T. Hunter, T. Das, M. Zaharia, P. Abbeel and A. M. Bayen, Large-Scale Estimation in Cyberphysical Systems Using Streaming Data: A Case Study With Arterial Traffic Estimation, in IEEE Transactions on Automation Science and Engineering, vol. 10, no. 4, pp. 884-898, Oct. 2013. doi: 10.1109/TASE.2013.2274523

[30] A. Biem, B. Elmegreen, O. Verscheure, D. Turaga, H. Andrade and T. Cornwell, A streaming approach to radio astronomy imaging, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, Dallas, TX, 2010, pp. 1654-1657.

[31] Pathirage, M.(2018), Kappa Architecture, http://milinda.pathirage.org/kappa-architecture.com/
 
[32] MapR Technologies (2018), Architecture, https://mapr.com/developercentral/lambda-architecture/

[33] Hausenblas, M., Bijnens, N. (2017), Lambda Architecture, http://lambda-architecture.net/

[34] Hortonworks (2018), Apache Storm Component Guide, https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.0/bk_storm-component-guide/content/storm-trident-intro.html

[35] Weinberger, Y. (2015), Exactly-Once Processing with Trident - The Fake Truth, https://www.alooma.com/blog/trident-exactly-once

12 Best Practices for Modern Data Ingestion. Download White Paper.

Topics:
apache storm ,stream processing ,scalability ,wso2 ,big data

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}