
IoT: Device Data and Stream Processing

Check out ways to solve the most challenging aspects of device data processing.


This article is featured in the new DZone Guide to IoT: Connecting Devices & Data. Get your free copy for insightful articles, industry stats, and more! 


IoT brings a whole new world of data: real-time streaming requirements, operational difficulties, security concerns, and massive streams of data that need to be made available for use at scale. Many of these challenges can be solved with best-of-breed open-source tools, which open a whole new world of data to your stream processing frameworks. In this article, I will show you ways to solve these problems and go from batch to stream.


Devices can output large amounts of data very frequently, which can be valuable to enterprises both as live signals the moment they arrive and later as historical data for forecasting. This device data usually comes from onboard or connected sensors that measure things like GPS position, temperature, humidity, air pressure, luminance, time of flight, gases, electricity, air quality, and motion. Once we have acquired data from a sensor on a device, we need to be able to transmit it to our remote servers for processing at scale. There are a few caveats when doing so. The first is that we need a secure channel between our many devices and our receiver. We also need to be able to process the millions of device records being sent per second; if we can't process the data as it comes in, we lose the value of this live data. Finally, we need to be able to filter, alert, route, combine, query, store, aggregate, and discard the data.

Let's break down the problem into manageable steps.

Variety, Volume, Velocity, Variability of Data

First, we need to understand the data. There are many IoT devices in both the consumer and business landscape, each generating many types of data at rapid speed. This data can have a fixed schema or change frequently, and it can be very sparse, numeric, or binary. Given this reality, along with the sheer volume and velocity of the data, I recommend using an edge client and receiving server that can handle any type of data. Fortunately, one exists in the open-source space: Apache NiFi. You can easily have thousands of devices with dozens of data points each, reporting back every second, which can quickly grow to terabytes of data.

Device Data

An example of device data may look like this:

cputemp: 56.7
memory: 22.6
y: 471
Device data is often numeric, requires conversions, and appears in massive streams of often unchanging values, sometimes binary or strings. You never know what you may get until you start testing the data locally on the device. For example, I can take a reading from an LTR-559 light and proximity sensor and receive decimal data or whole numbers ranging from less than 1 to 64,000. You have to find out what your range of data is. Once you know that, you can build an Avro schema that will let you ensure data range accuracy. This will also let you query that data on the device in MiniFi, or in Apache Calcite SQL using Apache NiFi at the router or gateway level and beyond. Being able to accept many types of data, including large strings, JSON, images, large sets of numbers, and unstructured data, is critical when you're working with edge devices. They often get updates or variations on what they can send, and you will sometimes only get periodic network connectivity to update those sensors.
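As a minimal sketch of pairing a schema with a range check: the field names and the numeric ranges below are illustrative assumptions based on locally observed sensor output, not the LTR-559 datasheet. Avro itself does not enforce value ranges, so on the edge device we can validate readings against the schema's fields and our observed ranges before forwarding them.

```python
import json

# Hypothetical Avro-style schema for an LTR-559 reading; field names
# and ranges are illustrative assumptions, not a vendor specification.
LTR559_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "ltr559_reading",
  "fields": [
    {"name": "lux",       "type": "float"},
    {"name": "proximity", "type": "int"}
  ]
}
""")

# Observed value ranges per field (assumed): lux up to 64,000.
RANGES = {"lux": (0.0, 64000.0), "proximity": (0, 2047)}

def validate(reading: dict) -> bool:
    """Check that every schema field is present and within range."""
    for field in LTR559_SCHEMA["fields"]:
        name = field["name"]
        if name not in reading:
            return False
        lo, hi = RANGES[name]
        if not (lo <= reading[name] <= hi):
            return False
    return True

print(validate({"lux": 354.2, "proximity": 12}))    # in-range reading
print(validate({"lux": 70000.0, "proximity": 12}))  # lux out of range
```

Rejecting out-of-range values this early keeps bad sensor readings from polluting downstream aggregates.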

You also have to work with hundreds of different devices, different sensors, different versions of sensors and devices, different operating systems, and other factors. See my previous article and an example on my GitHub Page.

Edge Processing

As mentioned, we have a lot of unchanging data that may not need to be sent to a remote source every second. We may want to send periodic data every 15 minutes or every hour, but we definitely want to send any alert conditions or changes right away. So, we need some ability to analyze data and act on it at the edge. The MiniFi C++ or Java agents are good options for this. We want to collect data at the edge as well as route and monitor it. This is where we can check for edge values and set up alert conditions to trigger an immediate local action (such as restarting a service) or contact our remote resources. To enable more powerful transformations and enrichment, we will usually have a flow management solution at or near the edge, usually in the same building or network, hosting an Apache NiFi server or a small cluster. This will allow for more powerful transformations, queries, alerting conditions, rule processing, and machine learning classifications and analytics.
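The alert-versus-batch routing described above can be sketched in a few lines. This is a language-neutral illustration of the logic a MiniFi flow would implement, not MiniFi itself; the 70-degree threshold and the temperature values are hypothetical.

```python
# Edge-side routing sketch: readings at or above the threshold are sent
# immediately as alerts; everything else waits for the periodic batch send.
ALERT_THRESHOLD_C = 70.0  # assumed threshold for illustration

def route(readings):
    """Split temperature readings into (alerts, batch) lists."""
    alerts, batch = [], []
    for temp in readings:
        if temp >= ALERT_THRESHOLD_C:
            alerts.append(temp)   # forward right away
        else:
            batch.append(temp)    # hold for the 15-minute/hourly send
    return alerts, batch

alerts, batch = route([56.7, 71.2, 58.0, 80.5])
print(alerts)  # [71.2, 80.5]
print(batch)   # [56.7, 58.0]
```

Keeping the steady-state readings in the batch path is what lets the edge avoid flooding the network with unchanging values.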

Transportation Options

Next, we need to get the data from the devices that read the sensors to another machine. The first step is usually a local router/gateway, such as a flow management engine, which can do some aggregation and operate on local networking protocols before we send our streams to another server for processing, analytics, and storage. For transport, there are three options that I recommend. The first is to use MiniFi's HTTPS transport, which can make it through almost any router, gateway, or proxy in the IoT network layer. This allows for secure, fast, robust transport of your data without loss and with full data provenance and lineage.

The second option is to have MiniFi send your messages via MQTT, a fast and open messaging protocol, though this option will reduce some of MiniFi's data provenance and lineage. The last option is for IoT deployments within a cloud or internal network that has Kafka available. Using MiniFi to send messages directly to Apache Kafka provides a durable queue that can store a lot of data and serve many consumers. All the major stream processing frameworks support reading from Apache Kafka in a streaming manner. The best approach I have seen in the real world is using MiniFi's Site-to-Site (S2S) HTTPS transport to an Apache NiFi cluster that can do more filtering, conversion, routing, aggregation, and cleanup before routing the data to another stream processing framework via Apache Kafka.
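Whichever transport you pick, the message itself is usually a small JSON record addressed by a topic. The sketch below shows one way to package a reading for an MQTT topic or a Kafka key; the `iot/<device>/<sensor>` topic scheme and the field names are my own assumptions, not a MiniFi or NiFi convention.

```python
import json

def build_message(device_id: str, sensor: str, value: float, ts: float):
    """Package one sensor reading as (topic, JSON payload).

    Topic layout iot/<device>/<sensor> is an illustrative convention.
    """
    topic = f"iot/{device_id}/{sensor}"   # e.g. iot/rpi-01/cputemp
    payload = json.dumps(
        {"device": device_id, "sensor": sensor, "value": value, "ts": ts},
        sort_keys=True,
    )
    return topic, payload

topic, payload = build_message("rpi-01", "cputemp", 56.7, 1528000000.0)
print(topic)    # iot/rpi-01/cputemp
print(payload)
```

A per-device, per-sensor topic hierarchy like this lets downstream consumers subscribe to exactly the streams they need, whether over MQTT wildcards or per-topic Kafka consumers.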

Data Lineage and Provenance

One of the most difficult aspects of ingesting device data and stream processing is the distributed nature of the systems involved. You are running on many devices, networks, clouds, computers, containers, and VMs. An event can fail or be duplicated, and this is hard to monitor if you don't have full lineage of all the steps and hops involved. Apache NiFi provides a rich set of provenance data that can be processed easily in many ways, since it is in JSON format.

This can be done with Apache NiFi itself or any other processing or monitoring tool that we can send these deep logs to.
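Because the lineage records are JSON, a downstream monitor can scan them for trouble with ordinary tooling. The events below are hypothetical stand-ins (the exact field names NiFi emits may differ); the sketch only shows the idea of flagging a duplicated send from the event stream.

```python
import json

# Hypothetical provenance-style events; field names are illustrative.
events = [
    json.dumps({"eventId": 1, "eventType": "RECEIVE", "flowFileUuid": "a1"}),
    json.dumps({"eventId": 2, "eventType": "SEND",    "flowFileUuid": "a1"}),
    json.dumps({"eventId": 3, "eventType": "SEND",    "flowFileUuid": "a1"}),
]

def duplicate_sends(raw_events):
    """Return eventIds of SEND events whose flow file was already sent."""
    seen, dups = set(), []
    for raw in raw_events:
        e = json.loads(raw)
        if e["eventType"] == "SEND":
            if e["flowFileUuid"] in seen:
                dups.append(e["eventId"])
            seen.add(e["flowFileUuid"])
    return dups

print(duplicate_sends(events))  # [3]
```

The same scan generalizes to detecting drops: a flow file with a RECEIVE event but no matching SEND never made it through a hop.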

Stream Processing

Once the data starts flowing into our stream processing engine via Apache Kafka, we can start doing advanced analytics, windowing, joins, complex aggregations, machine learning, deep learning, and more. I recommend using either Kafka Streams, Spark Streaming, or Streaming Analytics Manager for your complex stream processing.

But this is after Apache NiFi has performed routing, enrichment, transformation, cleanup, and prefiltering of the data and assigned it a schema. Without a schema, our stream processing frameworks are not as efficient and often have to infer what your schema is, with varying degrees of success. They work best with fixed records whose schemas can be in the header, embedded in Avro data, or available in a schema registry via REST API. Kafka Streams lets you do stream processing in ways you will be familiar with if you have done MapReduce or Spark programming. You can do aggregations, counts, reductions, grouping, or windowed aggregations. I have a simple example that you can see on my GitHub page, using Java to develop a very simple Kafka Streams microservice for processing data sent via Apache Kafka from Apache NiFi. In my example, I am checking for an alert condition and, when it fires, sending an MQTT message. I could, of course, do many more things inside my Kafka Streams application.
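Kafka Streams itself is a Java API; the sketch below is a language-neutral illustration of the windowed-aggregation-plus-alert pattern it enables. The 60-second tumbling window, the 70-degree threshold, and the record layout are all assumptions for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 60   # tumbling window size (assumed)
ALERT_C = 70.0        # alert threshold (assumed)

def window_averages(records):
    """records: (timestamp, device, temp) tuples -> per-window averages.

    Keys are (window_start, device), mirroring a windowed groupByKey.
    """
    buckets = defaultdict(list)
    for ts, device, temp in records:
        window = int(ts // WINDOW_SECONDS) * WINDOW_SECONDS
        buckets[(window, device)].append(temp)
    return {k: sum(v) / len(v) for k, v in buckets.items()}

def alerts(averages):
    """Windows whose average crossed the threshold -> alert downstream."""
    return [k for k, avg in averages.items() if avg >= ALERT_C]

records = [(0, "rpi-01", 55.0), (30, "rpi-01", 57.0),
           (61, "rpi-01", 71.0), (90, "rpi-01", 73.0)]
avgs = window_averages(records)
print(avgs)          # {(0, 'rpi-01'): 56.0, (60, 'rpi-01'): 72.0}
print(alerts(avgs))  # [(60, 'rpi-01')]
```

In a real Kafka Streams application, the alert list would instead be produced to an output topic or, as in my example, published as an MQTT message.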


We have seen that we can collect, route, monitor, transform, enrich, and process device data at scale using open-source tools. This process is made more complex because of the nature of edge devices and the constant stream of data that they send. Using the correct open-source tools, we can make this process quick and easy.
