Until recently, data technology in business was fairly homogeneous: data was commonly found in one of two places, namely data warehouses and operational data stores. Moreover, a large share of data collection and processing was executed as big batch jobs, such as dumping CSV files from a database or gathering log files at the end of the day. Modern businesses now operate in real time and rely on advanced software to do so. Stream processing allows companies to react promptly and continuously as data arrives, as opposed to processing it at the end of the day.
Stream processing is often preferred because it moves analytics and reporting into real time. Rather than treating data as static files or tables, stream processing treats it as a continuous stream that runs from past events into future ones.
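To make the contrast concrete, here is a minimal plain-Python sketch. It does not use Spark or Storm; the function names and the sample readings are made up for illustration. The batch version produces one answer once all the data is in, while the streaming version produces an updated answer after every event.

```python
from typing import Iterable, Iterator, List


def batch_average(readings: List[float]) -> float:
    """Batch style: wait until all readings are collected, then compute once."""
    return sum(readings) / len(readings)


def streaming_average(readings: Iterable[float]) -> Iterator[float]:
    """Stream style: emit an updated average as each reading arrives."""
    count = 0
    total = 0.0
    for value in readings:
        count += 1
        total += value
        yield total / count


# Hypothetical sample data standing in for a day's worth of events.
readings = [10.0, 20.0, 30.0, 40.0]

print(batch_average(readings))            # 25.0
print(list(streaming_average(readings)))  # [10.0, 15.0, 20.0, 25.0]
```

Because the streaming version only keeps a running count and total, it never needs the whole data set in memory, which is what allows it to run over an input that never ends.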
This article compares two popular projects in the stream processing community, namely Apache Spark and Apache Storm Trident. However, Storm Trident is a relatively new project, so the scope for comparing and contrasting is quite restricted.
It's important to mention that both styles of stream processing are good, with Apache Storm being ideal for event stream processing (ESP) cases and Spark being useful for micro-batching and complex event processing (CEP) cases. The two frameworks have different strengths when it comes to ease of deployment, fault tolerance, and compatibility with YARN. Both platforms represent a streaming architecture; thus, each takes in an inbound, unending sequence of tuples.
Comparing Storm Nomenclature and Spark Nomenclature
In Storm's nomenclature, a stream is divided into finite-sized tuples. In Spark's nomenclature, the data processing engine discretizes the stream into finite RDDs, which are fundamentally micro-batches of messages. Both streaming styles enable stateful processing, which helps you recover if you lose a driver node or a worker node, and also allows you to replay any data that came in during that period. Storm Trident guarantees exactly-once semantics, which lets you maintain accurate message counts or time-based averages when you're making recommendations or running trend analyses.
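The difference between tuple-at-a-time processing and micro-batching can be sketched in plain Python. This is an illustration only and uses neither framework's API. One simplifying assumption: the sketch cuts batches by element count so the result is deterministic, whereas Spark Streaming cuts them by time interval. The name `micro_batches` is hypothetical.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, Tuple


def micro_batches(stream: Iterable[str], batch_size: int) -> Iterator[Tuple[str, ...]]:
    """Cut a (possibly unbounded) stream into finite, immutable batches."""
    iterator = iter(stream)
    while True:
        # A tuple is immutable, loosely mirroring the finite, immutable RDD.
        batch = tuple(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


words = ["a", "b", "a", "c", "a"]

# Tuple-at-a-time: state is updated once per incoming element.
running = Counter()
for word in words:
    running[word] += 1

# Micro-batch: each finite batch is processed as a unit, then merged.
merged = Counter()
for batch in micro_batches(words, batch_size=2):
    merged.update(Counter(batch))

print(list(micro_batches(words, batch_size=2)))  # [('a', 'b'), ('a', 'c'), ('a',)]
print(running == merged)                         # True
```

Both approaches arrive at the same counts. What changes is the unit of work, and therefore the unit that has to be replayed after a failure: a single tuple in one case, a whole batch in the other.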
The cornerstone of Spark Streaming is the RDD, which stands for Resilient Distributed Dataset. An RDD is an immutable data structure that is also finite. You can replay this data and replicate it over HDFS or another persistent file system. Storm lets you plug in HBase, Memcached, or another resilient data store, whether it relies on I/O or on an alternative type of persistent storage, allowing you to automate this state preservation.
Storm also features transactional spouts which, coupled with bolts, guarantee exactly-once semantics even though there are different types of spouts. Storm has a notion of worker nodes, which are the workhorses that do the actual processing, and Spark features an equivalent notion of a worker node. Both streaming frameworks maintain fault tolerance by employing an external data store, though the semantics differ a bit.
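One common way to get exactly-once state updates on top of replayable batches is to store a transaction ID next to each value and skip any batch whose ID has already been applied. The following is a plain-Python sketch of that idea, not Storm's API. The class name `TransactionalCounter` is hypothetical, and an in-memory dict stands in for an external store such as HBase or Memcached. It assumes that a replayed batch carries the same transaction ID and the same tuples as the original, and that batches are applied in transaction-ID order.

```python
from typing import Dict, List, Tuple


class TransactionalCounter:
    """Word counts that survive batch replays without double counting."""

    def __init__(self) -> None:
        # word -> (count, ID of the last transaction applied to this word).
        # This dict is a stand-in for an external, persistent data store.
        self.store: Dict[str, Tuple[int, int]] = {}

    def apply_batch(self, txid: int, words: List[str]) -> None:
        # Aggregate within the batch first, so each key is written once.
        partial: Dict[str, int] = {}
        for word in words:
            partial[word] = partial.get(word, 0) + 1

        for word, delta in partial.items():
            count, last_txid = self.store.get(word, (0, -1))
            if last_txid == txid:
                continue  # this batch was already applied to this key; skip
            self.store[word] = (count + delta, txid)

    def count(self, word: str) -> int:
        return self.store.get(word, (0, -1))[0]


counter = TransactionalCounter()
counter.apply_batch(1, ["a", "b", "a"])
counter.apply_batch(2, ["a"])
counter.apply_batch(2, ["a"])  # replay after a simulated failure

print(counter.count("a"))  # 3
print(counter.count("b"))  # 1
```

Without the stored transaction ID, the replay of batch 2 would have pushed the count for "a" to 4. The check is made per key, so a batch that failed halfway through can be replayed and only the keys it had not yet reached will be updated.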
What Types of Programming Languages Are Available?
Both Storm and Spark comfortably support Java and other JVM-based languages, so you can work in territory with which you're already familiar. Apache Spark employs Scala, a language in which functional programming meets object orientation. Scala brings together ideas from both the object-oriented and functional worlds, yielding a captivating mix of reusable, extendable code and higher-order functions. Apache Storm, on the other hand, employs Clojure, a dialect of Lisp that targets the JVM and carries the Lisp philosophy with it. Clojure is principally functional in nature; where state or side effects are required, they are handled through a transactional memory model, which helps keep multi-threaded applications consistent and safe.
When to Choose One Streaming Style Over the Other
If you need strict, stateful processing, you will want to go for Spark Streaming for its exactly-once semantics. Alternatively, you could select Storm Trident, but remember that this is a relatively new project. It could therefore suffer some performance degradation while maintaining state within the spouts and bolts. You can measure the magnitude of this degradation by benchmarking your application. When it comes to development effort, you will want to choose Spark, especially if you are already familiar with it, because its learning curve is very gentle.
Both Storm Trident and Spark Streaming offer application masters for compatibility with YARN, so you can deploy either of them on a cluster that is running YARN. However, Apache Spark applications do not require YARN, because Spark has its own server processes.
Apache Spark is a full-blown Apache project, whereas Apache Storm is currently undergoing incubation. While this doesn't strictly reflect on their stability or completeness, it does say something important about the state of their communities.