Summary: This is the first in a series of articles aimed at providing a complete foundation and broad understanding of the technical issues surrounding an IoT or streaming system so that the reader can make intelligent decisions and ask informed questions when planning their IoT system.
In talking to clients and prospects who are at the beginning of their IoT streaming projects, it’s clear that there are a lot of misunderstandings and gaps in their knowledge. You can find hundreds of articles on IoT, but inevitably they focus on some portion of the whole without an overall context or foundation. This is understandable since the topic is big, far ranging, and changing fast.
So our intent is to provide a broad foundation for folks who are starting to think about streaming and IoT. We’ll start with the basics and move up through some of the more advanced topics, hopefully leaving you with enough information to begin designing the details of your project, or at least equipped to ask the right questions.
Since this is a large topic, we’ll spread it out over several articles with the goal of starting with the basics and adding detail in logical building blocks.
Is it IoT or Is it Streaming?
The very first thing we need to clear up for beginners is the nomenclature. You will see the terms “IoT” and “Streaming” used to mean different things as well as parts of the same thing. Here’s the core of the difference: if the signal derives from sensors, it’s IoT (Internet of Things). The problem is that there are plenty of situations where the signal doesn’t come from sensors but is handled in essentially the same way. Web logs, click streams, streams of text from social media, and streams of stock prices are examples of non-sensor streams that are therefore not “IoT."
What they share however is that all are data-in-motion streams of data. Streaming is really the core concept and we could just as easily have called this “Event Stream Processing," except that focusing on streaming leaves out several core elements of the architecture such as how we capture the signal, store the data, and query it.
In terms of the architecture, the streaming part is only one of the four main elements we’ll discuss here. Later we’ll also talk about the fact that although the data may be streaming, you may not need to process it as a stream depending on what you think of as real time. It’s a little confusing but we promise to clear that up below.
The architecture needed to handle all types of streaming data is essentially the same regardless of whether the source is specifically a sensor or not so throughout, we’re going to refer to this as “IoT Architecture." And since this is going to be a discussion that focuses on architecture, if you’re still unclear about streaming in general you might start with these overviews: Stream Processing – What Is It and Who Needs It and Stream Processing and Streaming Analytics – How It Works.
Basics of IoT Architecture: Open Source
Open source in Big Data has become a huge driver of innovation, so much so that probably 80 percent of the information available online deals with some open source element or package for data handling. Open source is also almost completely synonymous with the Apache Software Foundation. So to understand the basics of IoT architecture, we’re going to start by focusing on open source tools and packages.
If you’re at all familiar with IoT, you cannot have avoided learning something about Spark and Storm, two of the primary Apache open source streaming projects, but these are only part of the overall architecture. Later in this series we’ll also turn our attention to the emerging proprietary options and why you may want to consider them.
Your IoT architecture will consist of four components: Data Capture, Stream Processing, Storage, and Query. Depending on the specific packages you choose some of these may be combined but for this open source discussion we’ll assume they’re separate.
Think of the Data Capture component as the catcher's mitt for all your incoming sources—sensors, web streams, text, image, or social media. The Data Capture application needs to:
- Capture all your data as fast as it arrives, from all sources at the same time. In digital advertising bidding, for example, this can easily be 1 million events per second. There are applications where the rate is even higher, but it’s unlikely that yours will be. However, if you have a million sensors each transmitting once per second, you’re already there.
- Lose no events. Sensor data is notoriously dirty, whether from malfunction, age, signal drift, connectivity issues, or a variety of other network, software, and hardware problems. Depending on your use case you may be able to stand some data loss, but our assumption is that you don’t want to lose any.
- Scale easily. As your data grows, your data capture app needs to keep up. This means it will be a distributed app running on a cluster, as will all the other components discussed here.
Streaming data is time series so it arrives with at least three pieces of information: the time stamp from its moment of origination; sensor or source ID; and the value(s) being read at that moment.
Later you may combine your streaming data with static data, for instance about your customer, but that happens in another component.
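A single streaming event can be sketched as a small record carrying those three pieces of information. The field names below are illustrative, not a standard; real systems define their own schema:

```python
import json
import time

# A minimal streaming event: timestamp, source ID, and reading(s).
# Field names here are illustrative; real systems define their own schema.
event = {
    "timestamp": time.time(),     # moment of origination (epoch seconds)
    "sensor_id": "sensor-0042",   # sensor or source ID
    "values": {"temp_c": 21.7},   # the value(s) read at that moment
}

# Events are typically serialized (e.g. to JSON) before being handed
# to the message collector.
payload = json.dumps(event)
```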
Why Do You Need a Message Collector at All?
Many of the stream processing apps, including Spark and Storm, can directly ingest messages without a separate Message Collector front end. However, if a node in the cluster fails, they can’t guarantee that the data can be recovered. Since we assume your business need demands that you be able to save all the incoming data, a front-end Message Collector that can temporarily store and replay data in the case of failure is considered a safe architecture.
Open Source Options for Message Collectors
In open source, you have a number of options. Here are some of the better known Message Collectors; this is not an exhaustive list.
- FluentD – General purpose multi-source data collector.
- Flume – Large scale log aggregation framework. Part of the Hadoop ecosystem.
- MQ (e.g. RabbitMQ) – There are a number of these lightweight message brokers. The “MQ” naming traces back to IBM’s message queuing products, and many of these brokers support MQTT (Message Queuing Telemetry Transport), a protocol that originated at IBM and is common in sensor applications.
- AWS Kinesis – Not open source, but Amazon’s managed alternative; the other major cloud providers offer similar services.
- Kafka – Distributed queue publish-subscribe system for large amounts of streaming data.
Kafka is Currently the Most Popular Choice
Kafka is not your only choice but it is far and away today’s most common choice, being used by LinkedIn, Netflix, Spotify, Uber, and Airbnb among others.
Kafka is a distributed messaging system designed to tolerate hardware, software, and network failures and to allow segments of failed data to be essentially rewound and replayed, providing the needed safety in your system. Kafka came out of LinkedIn in 2011 and is known for its ability to handle very high throughput rates and to scale out.
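Kafka’s key safety property, the ability to rewind and replay a segment of data from a stored offset, can be illustrated with a toy append-only log. This is a sketch of the idea only, not Kafka’s actual implementation:

```python
class ReplayableLog:
    """Toy append-only log: consumers read by offset, so a failed
    consumer can rewind and replay any retained segment."""

    def __init__(self):
        self._messages = []

    def append(self, msg):
        self._messages.append(msg)
        return len(self._messages) - 1  # offset of the new message

    def read_from(self, offset):
        # Replay everything from a given offset onward.
        return self._messages[offset:]

log = ReplayableLog()
for reading in [10.1, 10.3, 9.8, 10.0]:
    log.append(reading)

# A consumer that crashed after processing offset 1 simply
# resumes from offset 2 -- no data is lost.
replayed = log.read_from(2)
```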
If your stream of data needed no other processing, it could be passed directly through Kafka to a data store.
Here’s a quick way to do a back-of-envelope assessment of how much storage you’ll need. For example:
- Number of sensors: 1 million
- Signal frequency: every 60 seconds
- Data packet size: 1 KB
- Events per sensor per day: 1,440
- Total events per day: 1.44 billion
- Events per second: ~16,667
- Total data size per day: 1.44 TB
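The arithmetic behind that estimate is simple enough to script, assuming roughly 1 KB per event:

```python
# Back-of-envelope storage sizing for the example above.
sensors = 1_000_000
interval_s = 60          # each sensor transmits every 60 seconds
packet_kb = 1            # ~1 KB per event

events_per_sensor_per_day = 24 * 3600 // interval_s            # 1,440
total_events_per_day = sensors * events_per_sensor_per_day     # 1.44 billion
events_per_second = round(total_events_per_day / (24 * 3600))  # ~16,667
tb_per_day = total_events_per_day * packet_kb / 1e9            # ~1.44 TB/day
```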
Your system will need two types of storage: ‘Forever’ storage and ‘Fast’ storage.
Fast storage is for real time look up after the data has passed through your streaming platform or even while it is still resident there. You might need to query 'Fast storage' in just a few milliseconds to add data and context to the data stream flowing through your streaming platform, like what were the min and max or average readings for sensor X over the last 24 hours or the last month. How long you hold data in Fast storage will depend on your specific business need.
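The kind of windowed lookup Fast storage serves can be sketched as follows. The in-memory structure here is purely for illustration; a real system would back this with a low-latency store:

```python
from collections import deque

class RollingStats:
    """Keep recent readings for one sensor and answer min/max/average
    queries over the retained window. Illustrative only; a production
    system would use a low-latency store, not process memory."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.readings = deque()  # (timestamp, value) pairs, oldest first

    def add(self, ts, value):
        self.readings.append((ts, value))
        # Evict readings that have aged out of the window.
        while self.readings and self.readings[0][0] < ts - self.window:
            self.readings.popleft()

    def stats(self):
        values = [v for _, v in self.readings]
        return min(values), max(values), sum(values) / len(values)

# Min/max/average for sensor X over the last 24 hours.
stats_24h = RollingStats(window_seconds=24 * 3600)
stats_24h.add(0, 20.0)
stats_24h.add(3600, 22.0)
stats_24h.add(7200, 21.0)
lo, hi, avg = stats_24h.stats()
```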
Forever storage isn’t really forever but you’ll need to assess exactly how long you want to hold on to the data. It could be forever or it could be a matter of months or years. Forever storage will support your advanced analytics and the predictive models you’ll implement to create signals in your streaming platform, and for general ad hoc batch queries.
RDBMS is not going to work for either of these needs because of speed, cost, and scale limitations. Both will be some version of NoSQL.
In selecting your storage platforms you’ll be concerned about scalability and reliability, but you’ll also be concerned about cost. Consider this comparison drawn from Hortonworks: for on-premise storage, a Hadoop cluster will be both the low cost and the best scalability/reliability option, while cloud storage, also based on Hadoop, is now approaching 1¢ per GB per month from Google, Amazon, and Microsoft.
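At roughly 1¢ per GB per month, the earlier 1.44 TB/day example is easy to price out. This sketch ignores retention policy and compression for simplicity:

```python
tb_per_day = 1.44
cost_per_gb_month = 0.01  # ~1 cent per GB per month

# Data ingested in one 30-day month, in GB (1 TB = 1,000 GB).
gb_per_month = tb_per_day * 1000 * 30

# Cost to store that single month of data for one month.
monthly_cost = gb_per_month * cost_per_gb_month  # ~$432
```

Note that storage accumulates: month two you are paying to hold two months of data, and so on, which is why the Forever-storage retention decision matters.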
Open Source Options for Storage
Once again we have to pause to explain nomenclature, this time about “Hadoop." Many times, indeed most times that you read about “Hadoop,” the author is speaking about the whole ecosystem of packages that are available to run on Hadoop.
Technically, however, Hadoop consists of three elements that are the minimum requirements for it to operate as a database. Those are HDFS (the Hadoop file system – how the data is stored), YARN (the scheduler), and Map/Reduce (the query system). “Hadoop” (the three-component database) is good for batch queries but has recently been largely overtaken in new projects by Spark, which runs on HDFS and has a much faster query method.
What you should really focus on is the HDFS foundation. There are alternatives to HDFS, such as Amazon S3 (a proprietary cloud service) and MongoDB, and these are viable options. However, what you will encounter almost universally are NoSQL database systems based on HDFS. These options include:
- HBase – a wide-column NoSQL database offering fast random reads and writes on HDFS.
- Hive – a data warehouse layer supporting SQL-like batch queries.
- And many others.
We said earlier that RDBMS was non-competitive based on many factors, not the least of which is that the requirement for a schema-on-write is much less flexible than the NoSQL schema-on-read (late schema). However, if you are committed to RDBMS, you should examine the new entries in NewSQL which are RDBMS with most of the benefits of NoSQL. If you’re not familiar, try one of these refresher articles here, here, or here.
The goal of your IoT streaming system is to be able to flag certain events in real time that your customer/user will find valuable. At any given moment, your system will contain two types of data: data-in-motion, as it passes through your stream processing platform, and data-at-rest, some of which will be in Fast storage and some in Forever storage.
There are two types of activity that will require you to query your data:
Real time outputs: If your goal is to send an action message to a human or a machine, or to feed a dashboard that updates in real time, you may need to enhance your streaming data with stored information. One common type is static user information. For example, adding static customer data to the data stream while it is passing through the stream processor can enhance the predictive power of the signal. A second type is signal enhancement. For example, if your sensor is telling you the current reading from a machine, you might need to compare that to the average, min, max, or other statistical variations from that same sensor over a variety of time periods, ranging from the last minute to the last month.
These data are going to be stored in your Fast storage and your query needs to be completed within a few milliseconds.
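Enriching the stream with static customer data amounts to a keyed lookup as each event passes through. This is a toy sketch with hypothetical field names; in production the static table would live in your Fast storage, not in process memory:

```python
# Static customer data, keyed by customer ID. Illustrative only --
# in production this lookup table would live in Fast storage.
customers = {
    "c-001": {"segment": "premium", "region": "EMEA"},
    "c-002": {"segment": "basic", "region": "APAC"},
}

def enrich(event):
    """Attach static customer attributes to a passing event."""
    static = customers.get(event["customer_id"], {})
    return {**event, **static}

enriched = enrich({"customer_id": "c-001", "reading": 98.6})
```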
Analysis Queries: It’s likely that your IoT system will contain some sophisticated predictive models that score the data as it passes by to predict human or machine behavior. In IoT, developing predictive analytics remains the classic two step data science process: First analyze and model known data to create the predictive model, and second, export that code (or API) into your stream processing system so that it can score data as it passes through based on the model. Your Forever data is the basis on which those predictive analytic models will be developed. You will extract that data for analysis using a batch query that is much less time sensitive.
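The two-step process can be sketched with a deliberately simple stand-in for a real model: fit a threshold on historical (Forever) data offline, then apply it to events in the stream. A real deployment would export a trained model or an API rather than a hand-rolled threshold:

```python
import statistics

# Step 1 (offline): derive a "model" from historical Forever data.
# Here the model is just a mean +/- 3 sigma anomaly threshold.
history = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 10.3]
mean = statistics.mean(history)
sigma = statistics.stdev(history)

def score(value, k=3.0):
    """Step 2 (online): flag readings far from the historical mean
    as the stream passes through."""
    return abs(value - mean) > k * sigma

flags = [score(v) for v in [10.1, 14.0, 9.9]]
```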
Open Source Options for Query
In the HDFS Apache ecosystem, there are three broad categories of query options.
- Map/Reduce: This method is one of the three legs of a Hadoop database implementation and has been around the longest. It can be complex to code, though Apache projects like Pig and Hive seek to make this easier. In batch mode, for analytic queries where time is not an issue, Map/Reduce on a traditional Hadoop cluster will work perfectly well and can return results from large scale queries in minutes or hours.
- Spark: Based on HDFS, Spark has started to replace Hadoop Map/Reduce because it is 10X to 100X faster at queries (depending on whether the data is on disk or in memory). Particularly if you have used Spark in your streaming platform, it will make sense to also use it for your real time queries. Latencies in the millisecond range can be achieved depending on memory and other hardware factors.
- SQL: Traditionally, the whole NoSQL movement was named after database designs like Hadoop that could not be queried by SQL. However, so many people were fluent in SQL and not in the more obscure Map/Reduce queries that there has been a constant drumbeat of development aimed at allowing SQL queries. Today, SQL is so common on these HDFS databases that it’s no longer accurate to say NoSQL. However, all these SQL implementations require some sort of intermediate translator so they are generally not suited to millisecond queries. They do however make your non-traditional data stores open to any analysts or data scientists with SQL skills.
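The Map/Reduce pattern underlying the first option is simple to sketch in plain Python: map each event to a (key, value) pair, group by key, then reduce each group. On a real cluster these phases run distributed across many nodes; this single-process version only shows the shape of the computation:

```python
from collections import defaultdict

events = [
    {"sensor": "a", "value": 1.0},
    {"sensor": "b", "value": 3.0},
    {"sensor": "a", "value": 2.0},
]

# Map: emit (key, value) pairs.
mapped = [(e["sensor"], e["value"]) for e in events]

# Shuffle: group values by key.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce: aggregate each group (here, a per-sensor average).
averages = {k: sum(vs) / len(vs) for k, vs in grouped.items()}
```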
Watch for Lessons 2 and 3 in the coming weeks.