A Brief History of Apache Storm

Storm started as an idea to bring the power of Hadoop to real-time data, and has only grown since then.

By Taylor Goetz · May 25, 2016 · Opinion

In this series of blog posts, we will provide an in-depth look at select features introduced with the release of Apache Storm (Storm) 1.0. To kick off the series, we’ll look at how Storm has evolved over the years from its beginnings as an open source project up to the 1.0 milestone release.

“The Hadoop of Real-Time”

Storm was originally created by Nathan Marz while he was at Backtype (later acquired by Twitter) working on analytics products based on historical and real-time analysis of the Twitter firehose. Nathan envisioned Storm as a replacement for the real-time component that was based on a cumbersome and brittle system of distributed queues and workers. Storm introduced the concept of the “stream” as a distributed abstraction for data in motion, as well as a fault tolerance and reliability model that was difficult, if not impossible, to achieve with a traditional queues and workers architecture.

Nathan open sourced Storm to GitHub on September 19th, 2011 during his talk at Strange Loop, and it quickly became the most watched JVM project on GitHub. Production deployments soon followed, and the Storm development community rapidly expanded.

At the time Storm was introduced, Big Data analytics largely involved batch processing with MapReduce on Apache Hadoop or one of the higher-level abstractions like Apache Pig and Cascading. The introduction of Storm helped drive a change in the way people thought about large-scale analytics, spurring the rise of stream processing and real-time analytics.

Early versions of Storm introduced the familiar stream abstraction and the corresponding Spout/Bolt/Topology API that allowed developers to easily reason about streaming computations. Another feature Storm introduced, and one that to this day remains unique to Storm, is the concept of Distributed Remote Procedure Calls (DRPC), in which the inherent parallelism and scalability of Storm can be leveraged in a synchronous, request-response paradigm.
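
To make the Spout/Bolt/Topology model concrete, here is a minimal word-count sketch using the 1.0-era core API (the org.apache.storm packages; earlier releases used the backtype.storm prefix). SentenceSpout, SplitSentenceBolt, and WordCountBolt are hypothetical user-defined components, not classes that ship with Storm:

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // A spout emits a stream of tuples; bolts consume and transform streams.
        // SentenceSpout, SplitSentenceBolt, and WordCountBolt are hypothetical
        // user-defined components shown only to illustrate the wiring.
        builder.setSpout("sentences", new SentenceSpout(), 1);
        builder.setBolt("split", new SplitSentenceBolt(), 4)
               .shuffleGrouping("sentences");                 // distribute tuples randomly
        builder.setBolt("count", new WordCountBolt(), 4)
               .fieldsGrouping("split", new Fields("word"));  // same word -> same task

        // Run in-process for local testing; a real deployment uses StormSubmitter.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("word-count", new Config(), builder.createTopology());
        Thread.sleep(10_000);
        cluster.shutdown();
    }
}
```

The groupings are where the stream abstraction shows through: shuffleGrouping spreads tuples randomly across bolt tasks, while fieldsGrouping guarantees that all tuples with the same "word" value reach the same task.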

The Storm 0.8.x line of releases introduced the Trident API, which added support for exactly-once semantics, micro-batching for increased throughput, stateful processing, and a high-level API for joins, aggregations, grouping, functions, and filters. Other improvements at the time included pluggable schedulers, the introduction of the LMAX Disruptor for higher throughput, tick tuples, and improvements to the Storm UI.
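
As a sketch of what the Trident API looks like (package names from the 1.0 line; the sentence spout and Split function are hypothetical stand-ins for user code), a word count with exactly-once state updates reads roughly like this:

```java
import org.apache.storm.generated.StormTopology;
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.builtin.Count;
import org.apache.storm.trident.spout.IBatchSpout;
import org.apache.storm.trident.testing.MemoryMapState;
import org.apache.storm.tuple.Fields;

public class TridentWordCount {
    // sentenceSpout and Split are hypothetical: a source emitting a "sentence"
    // field and a Function that splits each sentence into "word" tuples.
    public static StormTopology build(IBatchSpout sentenceSpout) {
        TridentTopology topology = new TridentTopology();
        topology.newStream("sentences", sentenceSpout)
                .each(new Fields("sentence"), new Split(), new Fields("word"))
                .groupBy(new Fields("word"))
                // persistentAggregate is where the exactly-once guarantee lives:
                // Trident tracks batch ids so each batch's update is applied once.
                // MemoryMapState stands in for a durable store such as HBase.
                .persistentAggregate(new MemoryMapState.Factory(),
                                     new Count(), new Fields("count"));
        return topology.build();
    }
}
```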

Storm Moves to Apache

With encouragement from Andy Feng at Yahoo!, Nathan decided to propose moving Storm to Apache, and the project officially entered the Apache Incubator on September 18, 2013. This move marked the beginning of a fundamental shift in the Storm community away from a model where a single individual leads a project, to the consensus-driven Apache development model. The move to Apache ensured that the project would be governed by a sustainable community, and that no single individual would represent a bottleneck in terms of decision making.

During Storm’s time in the Apache Incubator, the 0.9.x line of releases was introduced. In 0.9.x, Storm’s underlying transport layer based on ZeroMQ (0mq) was replaced with an implementation based on Netty. Not only was the new Netty-based transport almost twice as fast as the previous implementation, but because Netty is a pure-Java framework, it freed users from the requirement of difficult-to-install, platform-specific binaries. Finally, the Netty transport set the stage for authorization and authentication between worker processes.

The 0.9.x line of releases also introduced Apache Kafka integration as a first-class component in the Apache Storm distribution. Prior to this point, Kafka integration was a separate project that had been forked many times, and it was difficult for users to understand which versions of the integration were compatible with which versions of Storm and Kafka. Bringing Kafka integration into the Apache distribution ensured that compatibility was maintained with each Storm release. Similarly, the Storm community added support for HDFS and Apache HBase integration as well.
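
As an illustration of the storm-kafka spout from that era (the ZooKeeper address, topic, and ids below are hypothetical; package names follow the 1.0 layout, whereas 0.9.x used the storm.kafka prefix under a different root), wiring Kafka into a topology looked roughly like this:

```java
import org.apache.storm.kafka.BrokerHosts;
import org.apache.storm.kafka.KafkaSpout;
import org.apache.storm.kafka.SpoutConfig;
import org.apache.storm.kafka.StringScheme;
import org.apache.storm.kafka.ZkHosts;
import org.apache.storm.spout.SchemeAsMultiScheme;
import org.apache.storm.topology.TopologyBuilder;

public class KafkaSpoutWiring {
    public static TopologyBuilder wire() {
        // Brokers are discovered, and consumer offsets stored, via ZooKeeper.
        BrokerHosts hosts = new ZkHosts("zookeeper.example.com:2181"); // hypothetical host
        SpoutConfig spoutConfig = new SpoutConfig(
                hosts,
                "events",        // Kafka topic (hypothetical)
                "/kafka-spout",  // ZooKeeper root path for offset storage
                "event-reader"); // unique consumer id
        // Deserialize raw Kafka messages into single-field string tuples.
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka", new KafkaSpout(spoutConfig), 4);
        return builder;
    }
}
```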

Other notable improvements from this time included performance improvements to the Netty transport, pluggable serialization for multi-lang components, topology visualization, a logviewer, and support for Microsoft Windows deployments.

Apache Storm became a top-level Apache project on September 17, 2014. In the time since first entering the Apache Incubator with a group of 7 initial committers, the Apache Storm PMC has grown to 28 members, with contributions from 342 individuals.

Enterprise Readiness

Before graduating from the Apache Incubator, the Storm community was already working on the next major iteration of the platform: Version 0.10, which focused primarily on security and enterprise readiness. Much like the early days of Apache Hadoop, Apache Storm originally evolved in an environment where security was not a high-priority concern. Rather, it was assumed that Storm would be deployed to environments suitably cordoned off from security threats. While a large number of users were comfortable setting up their own security measures for Storm, this proved a hindrance to broader adoption among larger enterprises where security policies prohibited deployment without specific safeguards.

The implementation of enterprise-grade security in Apache Storm was a momentous effort that involved active collaboration between Yahoo!, Hortonworks, Symantec, and the broader Apache Storm community. Some of the highlights of Storm’s security features include:

  • Kerberos Authentication with Automatic Credential Push and Renewal
  • Pluggable Authorization and ACLs
  • Multi-Tenant Scheduling with per-user isolation and configurable resource limits
  • User Impersonation
  • SSL Support for Storm UI, Log Viewer, and DRPC (Distributed Remote Procedure Call)
  • Secure integration with other Hadoop Projects (such as ZooKeeper, HDFS, HBase, etc.)
  • User isolation (Storm topologies run as the user who submitted them)

Another important feature of Storm 0.10 was the introduction of Flux, a framework and set of utilities that make defining and deploying Storm topologies less developer-intensive. Prior to Flux, a common complaint from users was that the definition of a topology DAG was often tied up in Java code, and any change required recompiling and repackaging the topology. Flux addressed that problem by providing a YAML DSL for defining topologies in a simple text file, decoupling the DAG definition from Java code (a sample definition follows the feature list below).

Some of the key features of Flux include:

  • Easily configure and deploy Storm topologies (both Storm core and micro-batch API) without embedding configuration in your topology code
  • Support for existing topology code
  • Define Storm Core API (Spouts/Bolts) using a flexible YAML DSL
  • YAML DSL support for most Storm components (storm-kafka, storm-hdfs, storm-hbase, etc.)
  • Convenient support for multi-lang components
  • External property substitution/filtering for easily switching between configurations/environments (similar to Maven-style `${variable.name}` substitution)
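
As an illustration of the Flux YAML DSL (the overall structure follows Flux's topology schema, but the component class names here are hypothetical), a minimal definition might look like this:

```yaml
# word-count.yaml -- hypothetical Flux topology definition
name: "word-count-topology"

config:
  topology.workers: 1

spouts:
  - id: "sentence-spout"
    className: "com.example.SentenceSpout"      # hypothetical spout class
    parallelism: 1

bolts:
  - id: "split-bolt"
    className: "com.example.SplitSentenceBolt"  # hypothetical bolt class
    parallelism: 2

streams:
  - name: "sentence-spout --> split-bolt"
    from: "sentence-spout"
    to: "split-bolt"
    grouping:
      type: SHUFFLE
```

A definition like this can be edited and redeployed without recompiling anything, typically via something like `storm jar mytopology.jar org.apache.storm.flux.Flux --local word-count.yaml` (the jar name is hypothetical).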

Development of the 0.10 line of releases also saw a proliferation of additional integration components, including JDBC/RDBMS integration, streaming ingest into Apache Hive, Microsoft Azure Event Hubs integration, and Redis support. Other important features included rolling-upgrade support, logging improvements, and a partial key grouping implementation.

1.0 Milestone

On April 12, 2016, the Apache Storm community announced the release of Apache Storm 1.0, representing yet another major milestone in the evolution of the project. Version 1.0 includes a tremendous number of new features as well as usability, management, and performance improvements.

In the coming weeks we will continue this blog series with more in-depth articles covering the important new features included in 1.0, including:

  • Performance improvements
  • Windowing and State Management
  • Nimbus High Availability
  • Management and Debugging improvements
  • Distributed Cache API
  • Automatic Backpressure Support
  • Resource Aware Scheduling
  • New Integration Components

For a sneak preview of these features, please see the Apache Storm 1.0 release announcement and stay tuned for more in-depth detail.
