IoT and Machine Learning — Intro and Infra setup


This article touches on the building blocks needed to enable machine learning on data from IoT devices.


We live in a fully connected world, so enabling machine learning on data from IoT devices is essential. As the saying goes, “data drives the world today, and insights from that data chart the path to the future.” Here, I walk through the necessary building blocks and show how cloud infrastructure helps when we use the power of open source tools effectively.

MQTT: Path From Device to Cloud

MQTT (Message Queuing Telemetry Transport, http://mqtt.org/) is an open OASIS and ISO standard (ISO/IEC 20922): a lightweight, publish-subscribe network protocol that transports messages between devices. The protocol usually runs over TCP/IP; however, any network protocol that provides ordered, lossless, bi-directional connections can support MQTT. It is designed for connections with remote locations where a "small code footprint" is required or the network bandwidth is limited. (Source: https://en.wikipedia.org/wiki/MQTT)

The diagram below shows how a device connects to the cloud:

IoT devices in the cloud

The diagram above is high level; the sections below discuss each component in detail.

MQTT Broker

The MQTT broker is the central point that receives messages from clients and delivers them to the clients subscribed to the matching topics. Clients connect to the broker, subscribe to topics to receive messages, and publish messages to topics. In our case, the clients run on the IoT devices and the broker runs on a cloud virtual machine. The cloud broker therefore receives data from the IoT devices through the topics the devices publish to, and the cloud can communicate with the devices by publishing messages to topics the devices subscribe to.
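The two-way flow described above can be sketched with a toy, in-memory stand-in for the broker. This is only an illustration of the publish/subscribe idea, not the Mosquitto API: the `ToyBroker` class, the topic names, and the payloads are all hypothetical, and a real broker also handles networking, QoS, and authentication.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Callback = Callable[[str, str], None]


class ToyBroker:
    """Hypothetical in-memory broker: routes each message to topic subscribers."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callback]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callback) -> None:
        # Register a callback to be invoked for every message on this topic.
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, payload: str) -> int:
        # Deliver the payload to every subscriber of the exact topic name
        # and return how many subscribers received it.
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(topic, payload)
        return len(callbacks)


broker = ToyBroker()
cloud_inbox = []   # what the cloud side receives from the device
device_inbox = []  # what the device receives from the cloud

# The cloud listens for telemetry; the device listens for commands.
broker.subscribe("devices/sensor-1/telemetry",
                 lambda topic, payload: cloud_inbox.append((topic, payload)))
broker.subscribe("devices/sensor-1/commands",
                 lambda topic, payload: device_inbox.append((topic, payload)))

# Device to cloud, then cloud to device.
broker.publish("devices/sensor-1/telemetry", '{"temperature_c": 21.5}')
broker.publish("devices/sensor-1/commands", "reboot")
```

The point of the sketch is that the device and the cloud never talk to each other directly; both only know the broker and the topic names.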

There are a number of broker options. One popular choice is Mosquitto, which you can download from https://mosquitto.org/download/

Eclipse Mosquitto is an open-source (EPL/EDL licensed) message broker that implements the MQTT protocol versions 5.0, 3.1.1, and 3.1. Mosquitto is lightweight and is suitable for use on all devices from low power single board computers to full servers. The Mosquitto project also provides a C library for implementing MQTT clients, and the very popular mosquitto_pub and mosquitto_sub command line MQTT clients. (source: https://mosquitto.org/)

Eclipse Paho, which provides MQTT client libraries, is a related project; more details are at https://www.eclipse.org/paho/

MQTT Client

The MQTT client runs on the IoT device. Once connected to the broker, it publishes messages to topics and receives messages from the topics it has subscribed to.

There are a number of options for the MQTT client. One is the Mosquitto C library mentioned in the section above: https://github.com/eclipse/mosquitto

Eclipse Paho is another option; its C client is at https://www.eclipse.org/paho/clients/c/

Data Transfer From Device to Cloud

Plan the data format and transfer frequency around your business use cases. Consider the points below:

  1. How the data helps in day-to-day activities, and also in future planning and forecasting.
  2. How frequently you need the data. This depends on how relevant the data is and how often it changes. In some cases we need data from devices every second; in others, every hour or every few days.
  3. Data from devices may need some transformation before it is stored in a cloud database, depending on the business use case: you may need to display it in web/mobile dashboards, transform it for data analytics, or both.
  4. How to process the data. This can be challenging. Some systems demand sequential processing (one message at a time, in the order the data originated), some can process in parallel, and some must satisfy a precondition before processing a new set of data.
  5. Processing the data for ML. This is another area to plan: do we need to process daily, hourly, or only after a week or a month?
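To make points 3 and 4 concrete, here is a minimal sketch of one possible payload format and of restoring origin order before sequential processing. The field names (`device_id`, `seq`, `ts`, `temperature_c`) and both helper functions are my assumptions for illustration; your own format should follow your use case.

```python
import json
from typing import Dict, List


def make_reading(device_id: str, seq: int, ts: int, temperature_c: float) -> str:
    """Serialize one reading as JSON (hypothetical payload format).

    `seq` is a per-device counter that lets the cloud side restore the order
    in which the data originated; `ts` is a Unix timestamp in seconds.
    """
    return json.dumps(
        {"device_id": device_id, "seq": seq, "ts": ts, "temperature_c": temperature_c},
        sort_keys=True,
    )


def in_origin_order(raw_messages: List[str]) -> List[Dict]:
    """Parse raw JSON messages and sort them per device by sequence number."""
    readings = [json.loads(message) for message in raw_messages]
    return sorted(readings, key=lambda r: (r["device_id"], r["seq"]))


# Messages may arrive out of order, for example after a network retry.
arrived = [
    make_reading("sensor-1", 2, 1700000060, 21.7),
    make_reading("sensor-1", 3, 1700000120, 21.9),
    make_reading("sensor-1", 1, 1700000000, 21.5),
]

ordered = in_origin_order(arrived)
```

When parallel processing is acceptable, the sort step can be dropped and each message handled independently.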

Preparing the ML Infrastructure

Since we are choosing the cloud here, we have many options; at the top of the list are AWS, Azure, and Google Cloud. All of these providers offer IoT- and ML-specific infrastructure and tools, but those can be costly and you may not need them in the initial stages. Instead, we can create a regular virtual machine (VM) and choose memory, CPU, disk, and so on based on the data and transaction volume. Below are the tools/frameworks needed for an Apache Spark-based ML infrastructure:

  1. Spark 2: comes with the necessary tools from the Spark ecosystem, such as Hadoop and MLlib. More details are at https://spark.apache.org/docs/latest/
  2. Hadoop: you can either install Spark 2 bundled with Hadoop or install Hadoop stand-alone.
  3. Python 3/Scala/Java: depending on the language you prefer for writing ML programs.
  4. PostgreSQL/MongoDB: install one of these if you need to store data in a traditional database outside Hadoop HDFS for future use or reference.
  5. MLlib/TensorFlow/Keras/scikit-learn: choose the ML libraries you prefer.
  6. Data analytics tools: based on your needs.

The list above is based on the Spark ecosystem; pick and choose based on the tools/frameworks you are familiar with or that suit your business and technology choices.

Common ML Scenarios in IoT

Below are a few use cases based on data from devices (strictly based on my experience; your business case may well differ):

  1. Data pattern for a specific period. For example, with a temperature sensor, the pattern of temperature readings over a day at the location where the device is installed, so that data for specific days can be analyzed against that pattern.
  2. Missing data, changes in frequency, or changes in pattern. It is important to detect missing data or changes in frequency because they call for immediate action; otherwise, they lead to errors in our analytics and forecasts.
  3. Inactivity or other ambiguities in the data flow, which need to be caught to avoid errors in data processing.
  4. Difference between forecasted and actual data. This may lead to corrections in the data models and their training.
  5. User and location behavior from device to device. Data may differ between devices when user or location behavior influences the readings.
  6. Frequency of maintenance and its root causes. These may be specific to location, usage, transaction volume, and so on.
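Scenarios 2 and 4 can be sketched in a few lines of plain Python before any ML library is involved. This is a simplified illustration under my own assumptions: the function names, the 1.5x tolerance, and the sample values are hypothetical, and a production setup would more likely run such checks in Spark or a similar tool.

```python
from typing import List, Sequence, Tuple


def find_gaps(timestamps: List[int], expected_interval_s: int,
              tolerance: float = 1.5) -> List[Tuple[int, int]]:
    """Return (previous, current) timestamp pairs that are too far apart.

    A gap is flagged when two consecutive readings are separated by more
    than `expected_interval_s * tolerance` seconds.
    """
    gaps = []
    for previous, current in zip(timestamps, timestamps[1:]):
        if current - previous > expected_interval_s * tolerance:
            gaps.append((previous, current))
    return gaps


def mean_absolute_error(forecast: Sequence[float], actual: Sequence[float]) -> float:
    """Average absolute difference between forecasted and actual values."""
    if len(forecast) != len(actual) or len(forecast) == 0:
        raise ValueError("forecast and actual must be non-empty and equal length")
    return sum(abs(f - a) for f, a in zip(forecast, actual)) / len(forecast)


# A sensor expected to report every 60 seconds; two readings are missing
# between t=120 and t=300.
timestamps = [0, 60, 120, 300, 360]
gaps = find_gaps(timestamps, expected_interval_s=60)

# Forecasted versus actual temperatures for three time steps.
mae = mean_absolute_error([20.0, 21.0, 22.0], [20.5, 21.0, 23.5])
```

A growing error between forecast and actual values is one possible signal that the model needs retraining.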

Importance of Security

Security is always a concern, and it is critical whenever you handle data. Below are a few things to take care of:

  1. Enable SSL/TLS when transferring data from device to cloud so that the data is encrypted in transit.
  2. Security in the cloud VM: enable proper security in the cloud to avoid potential data breaches or hacking.
  3. Database and big data security: set up proper users and groups/roles, and segregate data by customer/client to prevent unauthorized access.
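For point 1, the client side of a TLS setup can be sketched with Python's standard `ssl` module. This is a minimal sketch under assumptions: the `make_tls_context` helper and the choice of TLS 1.2 as the minimum version are mine, and how the context is handed to your MQTT client depends on the client library you use.

```python
import ssl
from typing import Optional


def make_tls_context(ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Build a client-side TLS context that verifies the broker certificate.

    `ca_file` is an optional path to a CA bundle (hypothetical parameter
    name); when it is None, the system's default trusted CAs are used.
    """
    context = ssl.create_default_context(cafile=ca_file)
    # Refuse older protocol versions (assumed policy for this sketch).
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


context = make_tls_context()
```

The default context already requires a valid certificate and checks the host name, so the sketch only tightens the minimum protocol version.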


I hope this article gave you a high-level understanding of the integration between IoT devices and the cloud. Apache Kafka is an alternative to MQTT, but the advantage of MQTT is its lightweight, hassle-free architecture. It may also be worth trying the cloud providers' native IoT and ML tools, which can save time compared with our approach of setting everything up from scratch.

