What Is Data Streaming?

Data streaming is an extremely important process in the world of big data. Read on to learn a little more about how it helps in real-time analyses and data ingestion.

Data Streaming Defined

Visualize a river. Where does the river begin? Where does the river end? Intrinsic to our understanding of a river is the idea of flow: the river has no beginning and no end. Data streaming is ideally suited to data like this, which has no discrete beginning or end. For example, data from a traffic light is continuous and has no "start" or "finish."

Data streaming is the process of sending data records continuously rather than in batches. It is generally useful for data sources that send small records (often only kilobytes in size) in a continuous flow as the data is generated. Such sources include telemetry from connected devices, log files generated by customers using your web applications, e-commerce transactions, and information from social networks or geospatial services.

Traditionally, data is moved in batches. Batch processing typically handles a large volume of data at once, with long intervals between runs; for example, a job might run every 24 hours. While this can be an efficient way to handle large volumes of data, it doesn't work well for data that is meant to be streamed, because that data can be stale by the time it is processed.
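The contrast between the two models can be sketched in a few lines of Python. This is only an illustration, not how any particular product works: the function names and the sample readings are hypothetical, and a plain list stands in for a real data source.

```python
from typing import Iterable, Iterator, List


def batch_average(records: List[float]) -> float:
    """Batch style: wait until every record is collected, then compute once."""
    return sum(records) / len(records)


def streaming_average(records: Iterable[float]) -> Iterator[float]:
    """Streaming style: emit an updated average as each record arrives."""
    count = 0
    total = 0.0
    for value in records:
        count += 1
        total += value
        yield total / count


readings = [10.0, 20.0, 30.0, 40.0]  # hypothetical sensor readings

print(batch_average(readings))            # one result, after all data is in
print(list(streaming_average(readings)))  # a fresh result after every record
```

The streaming version never needs the whole data set in memory and always has a current answer, which is the property that keeps results from going stale.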

Data streaming is well suited to time series data and to detecting patterns over time, such as tracking the length of a web session. Most IoT data is also a good fit. Traffic sensors, health sensors, transaction logs, and activity logs are all good candidates for data streaming.
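As a rough sketch of the web session example, the Python below tracks session length incrementally as events arrive. Everything here is an assumption made for illustration: the `(user_id, timestamp)` event shape, the 30-minute inactivity timeout, and the rule that a session closes once a user has been idle longer than that timeout.

```python
from typing import Dict, Iterable, Iterator, Tuple

SESSION_GAP_SECONDS = 1800  # assumed inactivity timeout (30 minutes)


def session_lengths(
    events: Iterable[Tuple[str, int]],
    gap: int = SESSION_GAP_SECONDS,
) -> Iterator[Tuple[str, int]]:
    """Yield (user_id, session_length_seconds) each time a session closes.

    Events are (user_id, unix_timestamp) pairs, assumed to arrive in time order.
    """
    open_sessions: Dict[str, Tuple[int, int]] = {}  # user -> (start, last_seen)
    for user, ts in events:
        if user in open_sessions:
            start, last_seen = open_sessions[user]
            if ts - last_seen > gap:
                # Idle for too long: close the old session and start a new one.
                yield user, last_seen - start
                open_sessions[user] = (ts, ts)
            else:
                open_sessions[user] = (start, ts)
        else:
            open_sessions[user] = (ts, ts)
    # A real stream never ends; this flush exists only so the demo terminates.
    for user, (start, last_seen) in open_sessions.items():
        yield user, last_seen - start
```

Only one small tuple per active user is kept in memory, so the state stays bounded by the number of open sessions rather than by the total volume of events.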

This streamed data is often used for real-time aggregation and correlation, filtering, or sampling. Data streaming allows you to analyze data in real time and gives you insights into a wide range of activities, such as metering, server activity, geolocation of devices, or website clicks.
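To make filtering and real-time aggregation concrete, here is a minimal Python sketch that drops unwanted records and counts page views in fixed, non-overlapping ("tumbling") time windows. The event shape, the `/heartbeat` page name, and the 60-second window are all hypothetical, and sampling is not shown.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[int, str]  # (unix_timestamp, page): an assumed record shape


def drop_heartbeats(events: Iterable[Event]) -> Iterator[Event]:
    """Filtering: pass through only the records we care about."""
    return (event for event in events if event[1] != "/heartbeat")


def tumbling_window_counts(
    events: Iterable[Event], window_seconds: int = 60
) -> Iterator[Tuple[int, Dict[str, int]]]:
    """Aggregation: yield (window_start, counts per page) as each window closes.

    Events are assumed to arrive in timestamp order.
    """
    current_window = None
    counts: Counter = Counter()
    for ts, page in events:
        window_start = ts - ts % window_seconds
        if current_window is None:
            current_window = window_start
        elif window_start != current_window:
            # The first event of a new window closes the previous one.
            yield current_window, dict(counts)
            counts = Counter()
            current_window = window_start
        counts[page] += 1
    if current_window is not None:
        yield current_window, dict(counts)  # flush the last window (demo only)


clicks = [(0, "/home"), (15, "/heartbeat"), (20, "/home"),
          (30, "/cart"), (65, "/home"), (130, "/cart")]
for window_start, page_counts in tumbling_window_counts(drop_heartbeats(clicks)):
    print(window_start, page_counts)
```

Because both steps are generators, they compose into a pipeline that handles one record at a time and emits a result as soon as each window is complete.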

Consider the following scenarios:

  • A financial institution tracks market changes and adjusts settings to customer portfolios based on configured constraints (such as selling when a certain stock value is reached).
  • A power grid monitors throughput and generates alerts when certain thresholds are reached.
  • A news source streams clickstream records from its various platforms and enriches the data with demographic information so that it can serve articles that are relevant to the audience demographic.
  • An e-commerce site streams clickstream records to find anomalous behavior in the data stream and generates a security alert if the clickstream shows abnormal behavior.
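The alerting scenarios above share a common shape: watch a stream and react when a condition is met. The Python below is one possible sketch, assuming hypothetical `(timestamp, value)` readings and an alert rule based on the average of the last few readings, which avoids firing on a single noisy spike. A real system would likely use a more robust rule.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple


def threshold_alerts(
    readings: Iterable[Tuple[int, float]],
    threshold: float,
    window: int = 3,
) -> Iterator[Tuple[int, float]]:
    """Yield (timestamp, rolling_average) whenever the average of the
    last `window` readings exceeds `threshold`."""
    recent: Deque[float] = deque(maxlen=window)  # oldest values fall off automatically
    for timestamp, value in readings:
        recent.append(value)
        if len(recent) == window:
            average = sum(recent) / window
            if average > threshold:
                yield timestamp, average


# Hypothetical grid throughput readings: (timestamp, value)
grid = [(1, 90.0), (2, 95.0), (3, 100.0), (4, 110.0), (5, 120.0)]
for timestamp, average in threshold_alerts(grid, threshold=100.0):
    print(f"ALERT at t={timestamp}: rolling average {average:.1f}")
```

The same structure could back the clickstream example by swapping the rule, for instance alerting when the click rate in the window departs sharply from a baseline.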

Data Streaming Challenges

Data streaming is a powerful tool, but a few challenges are common when working with streaming data sources. The following list shows a few things to plan for when streaming data:

  • Plan for scalability.
  • Plan for data durability.
  • Incorporate fault tolerance in both the storage and processing layers.
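One common way to approach fault tolerance in the processing layer is checkpointing: record how far processing has gotten so that a restarted process can resume instead of starting over. The Python sketch below illustrates the idea only. The in-memory `log` sequence stands in for a durable, replayable storage layer, and the JSON checkpoint file and all function names are hypothetical.

```python
import json
import os
from typing import Callable, Sequence


def load_checkpoint(path: str) -> int:
    """Return the offset of the next record to process (0 if no checkpoint yet)."""
    try:
        with open(path) as f:
            return json.load(f)["next_offset"]
    except FileNotFoundError:
        return 0


def save_checkpoint(path: str, next_offset: int) -> None:
    """Write to a temp file, then rename, so a crash cannot leave a half-written file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"next_offset": next_offset}, f)
    os.replace(tmp_path, path)


def process_from_checkpoint(
    log: Sequence[str], path: str, handle: Callable[[str], None]
) -> None:
    """Process records starting at the last saved offset."""
    for offset in range(load_checkpoint(path), len(log)):
        handle(log[offset])
        # Saved only after the record is handled, so a crash in between means
        # the record is handled again on restart (at-least-once processing).
        save_checkpoint(path, offset + 1)
```

If `handle` raises partway through, calling `process_from_checkpoint` again with the same path picks up at the record that failed. The trade-off in this design is that a record may be processed more than once, so handlers should tolerate duplicates.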

Data Streaming Tools

As streaming data has grown, so has the number of solutions for working with it. The following list shows a few popular tools:

  • Amazon Kinesis Firehose. Firehose is part of Amazon Kinesis, a managed, scalable, cloud-based service that allows real-time processing of large data streams.
  • Apache Kafka. Apache Kafka is a distributed publish-subscribe messaging system that integrates applications and data streams.
  • Apache Flink. Apache Flink is a streaming data flow engine that provides facilities for distributed computation over data streams.
  • Apache Storm. Apache Storm is a distributed real-time computation system. Storm is used for distributed machine learning, real-time analytics, and numerous other use cases, especially those with high data velocity.

Topics:
big data, data streaming, real-time analysis, edge data collection

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.
