Enterprise and IIoT Edge Processing With Apache NiFi, MiNiFi, and Deep Learning
Feast your eyes on this sample architecture you can make use of to get the most out of edge processing and deep learning for predictive analytics.
Enterprise and industrial IoT projects have stricter requirements than personal and home devices. Security needs are tighter, with full lockdown from the source to the cloud or on-premises endpoint. At a minimum, solutions need to use TLS (SSL) or other encrypted channels for communication.
Another major requirement is support for many device types, from very tiny devices to large-scale industrial ones costing thousands of dollars. Many devices may include a GPU, like the NVIDIA Jetson TX1, or an add-on compute device like the Movidius. Important features for rugged hardware are airtight cases and backup power supplies. The ability to run without external power for extended periods of time, in at least a minimal logging mode, can be important in remote locations.
These devices need to be remotely monitored, controlled, and updated. These command and control abilities become crucial when patches are required or functionality changes, which can happen frequently due to security requirements. These processes need to be automated.
Another key requirement is full end-to-end data provenance: a record of every change to every piece of data as it travels through the system, for auditing and data governance purposes. GDPR and other laws may apply wherever sensors capture data, especially when ingesting camera images.
Sensor and device data that cannot be used with enterprise analytic tools or combined with corporate data in a Hadoop data lake is nearly worthless. Combining device data with other data sources, such as weather or transactional data, is critical for prescriptive and predictive analytics at scale.
For my clients, these are some common use cases: container truck location monitoring, delivery truck monitoring, service truck and driver monitoring, security camera monitoring, utility asset anomaly detection, and temperature/humidity filtering for devices. The thing to remember is that while you can start small with a couple of inexpensive devices, a few sensors, a few data points, hourly data, and no SLAs, you should not plan your system this way. Enterprise and industrial IoT will quickly spread to millions of sensors, millions of devices, and continuous data streams.
If you do not plan to handle the volume, velocity, variety, and veracity of data, you will be doomed. If this sounds familiar, it should: this is the big data use case. We just found the mother lode of all data. With so many sensors packaged into so many devices located everywhere, IoT data can dwarf all other sources combined. Every truck, every item on a manufacturing floor, and every field sensor can quickly produce billions of streams of data per second with no end in sight. So, I am not giving you a quick start. This is your future-proof infrastructure to scale to massive industrial IoT use cases. This is a proven approach, so let's begin.
The first step is to determine what you need to monitor and to obtain a device that has the sensors, processing power, environmental suitability, and connectivity your use case demands. The good news is that you do not have to make a difficult decision on software. Apache NiFi is the choice for ingesting any type of data from any source, and it's trivial to connect to these devices. Depending on the size and type of the device, you can choose between the MiNiFi C++ agent and the MiNiFi Java agent. If the device is too small to support those (like an Onion Omega), you can install a MicroPython or C-based library to send MQTT messages. These messages can be sent to an aggregator, say, a Raspberry Pi-sized device attached to your truck. This allows for localized aggregation, routing, filtering, compression, and even execution of machine learning and deep learning models at the edge. You also have full control over how and when data is sent upstream, which lets you limit data transmission costs, energy usage, and unnecessary data propagation.
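As a rough illustration of localized aggregation and compression on such an aggregator, here is a minimal Python sketch using only the standard library. The message format (`sensor` and `value` fields) and the function names `aggregate` and `pack` are assumptions made for this example, not part of MiNiFi or any MQTT library; receiving the messages from a broker is left out.

```python
import gzip
import json
from collections import defaultdict
from statistics import mean


def aggregate(messages):
    """Average the values reported by each sensor.

    `messages` is an iterable of JSON strings shaped like
    {"sensor": "temp-1", "value": 21.5} (a hypothetical payload format).
    """
    values = defaultdict(list)
    for raw in messages:
        msg = json.loads(raw)
        values[msg["sensor"]].append(float(msg["value"]))
    return {sensor: round(mean(vals), 3) for sensor, vals in values.items()}


def pack(summary):
    """Serialize and gzip a summary before it leaves the aggregator."""
    return gzip.compress(json.dumps(summary, sort_keys=True).encode("utf-8"))


# Three small messages, as a tiny device might publish them.
messages = [
    '{"sensor": "temp-1", "value": 21.0}',
    '{"sensor": "temp-1", "value": 23.0}',
    '{"sensor": "humidity-1", "value": 40.0}',
]
summary = aggregate(messages)
payload = pack(summary)
```

Sending one compressed summary instead of every raw message is one way to keep transmission costs down; how aggressively to aggregate depends on how much detail the downstream analytics need.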
Another feature that makes the MiNiFi and NiFi combination a no-brainer is data provenance. This is built into these tools, and it transparently tracks all of the hops that data travels through, from ingest on a device until it lands in its final home in the cloud or an on-premises data lake. Encrypting the data and using HTTPS is great, but not knowing who touched the data along the way, and when they did so, is a weakness in most IIoT dataflows. It is not a weakness in our software.
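To make the idea of hop-by-hop provenance concrete, here is a small Python sketch of what a provenance trail can record. It is an illustration only and not how NiFi or MiNiFi implement provenance internally; the record fields and the `provenance_event` helper are hypothetical, and the event type names are only loosely modeled on the kinds of events such a system tracks.

```python
import hashlib
import time


def provenance_event(component, event_type, content, previous=None):
    """Build one provenance record for a piece of data at one hop."""
    return {
        "component": component,
        "event_type": event_type,
        "timestamp": time.time(),
        # Hash of the content as it looked at this hop.
        "content_sha256": hashlib.sha256(content).hexdigest(),
        # Hash recorded at the previous hop, or None for the first event.
        "parent_sha256": previous["content_sha256"] if previous else None,
    }


def hops_that_changed_content(trail):
    """Return the components whose event altered the content."""
    return [
        event["component"]
        for event in trail
        if event["parent_sha256"] is not None
        and event["parent_sha256"] != event["content_sha256"]
    ]


raw = b'{"sensor": "temp-1", "value": 21.0}'
cleaned = b'{"sensor":"temp-1","value":21.0}'

created = provenance_event("edge-agent", "CREATE", raw)
modified = provenance_event("edge-filter", "CONTENT_MODIFIED", cleaned, created)
received = provenance_event("central-cluster", "RECEIVE", cleaned, modified)
trail = [created, modified, received]
```

With a trail like this, an auditor can answer which component touched the data, when, and whether the content changed at that hop.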
Let's dive into a use case with an NVIDIA Jetson TX1 device, with a camera attached, as our edge device. In my example setup, we have 4GB of RAM, 128GB of storage, WiFi, a USB web camera, and a 256-core Maxwell GPU. We are running a MiNiFi Java agent along with Python, Apache MXNet, and NVIDIA's TensorRT. We run deep learning models on the edge device and send images, GPS data, sensor data, and deep learning results if values exceed norms. Using the site-to-site protocol over HTTPS, data is sent to an Apache NiFi cluster (HDF 3.1).
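The "send only if values exceed norms" rule can be sketched as a small gate that runs after each model inference. This is a minimal Python sketch under stated assumptions: the result fields (`confidence`, `device_temp_c`), the thresholds, and the `should_send` function are hypothetical and would need to match whatever your model and sensors actually report.

```python
def should_send(result, min_confidence=0.8, max_temp_c=70.0):
    """Decide whether an edge result is worth transmitting upstream.

    `result` is a dict such as
    {"label": "intruder", "confidence": 0.93, "device_temp_c": 55.0}
    (hypothetical field names).
    """
    confident_detection = result.get("confidence", 0.0) >= min_confidence
    overheating = result.get("device_temp_c", 0.0) > max_temp_c
    return confident_detection or overheating


results = [
    {"label": "empty road", "confidence": 0.41, "device_temp_c": 48.0},
    {"label": "intruder", "confidence": 0.93, "device_temp_c": 50.0},
    {"label": "empty road", "confidence": 0.30, "device_temp_c": 82.5},
]
to_send = [r for r in results if should_send(r)]
```

Only the second and third results would be forwarded here; the first stays on the device, which saves bandwidth and energy.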
The data arrives securely for further processing, additional TensorFlow processing, and augmentation with weather and geolocation data. This data streams into a Hadoop-based big data platform for analysis, additional machine learning with Apache Spark, and queries via Apache Hive. The primary ingestion method is Apache NiFi, which handles hundreds of data sources and many data types, and is ideal for simple event processing.
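Augmenting device data with weather can be pictured as a simple lookup join on site and hour. The Python sketch below assumes hypothetical field names (`site`, `ts` as epoch seconds) and an in-memory weather table; in the architecture described here, that lookup would more likely run inside the ingestion or streaming layer.

```python
def enrich(reading, weather_by_hour):
    """Attach the weather observation for the reading's site and hour.

    `weather_by_hour` maps (site, hour_start_epoch) to an observation dict.
    The reading is copied, so the original is left untouched.
    """
    hour_start = reading["ts"] - (reading["ts"] % 3600)
    enriched = dict(reading)
    enriched["weather"] = weather_by_hour.get((reading["site"], hour_start))
    return enriched


weather_by_hour = {
    ("depot-7", 7200): {"temp_c": 3.5, "condition": "snow"},
}
reading = {"site": "depot-7", "ts": 7250, "engine_temp_c": 96.0}
enriched = enrich(reading, weather_by_hour)
```

A reading with no matching observation simply gets `weather` set to `None`, so downstream steps can decide how to handle the gap.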
There are many ways to process our filtered data for storage and machine learning. The most common, and my recommended method, is using Apache Kafka. This is well integrated with Apache NiFi, Apache Storm, Streaming Analytics Manager, Apache Spark, Apache Beam, Apache Flink, and more. This data bus allows for the decoupling of the ingestion platform from our streaming and processing engines. Paired with a schema registry, Apache Kafka 1.0 also makes it easy for us to treat data as records from end to end when we have data structured enough to include a schema. We often have time series-oriented data with many small values and a timestamp.
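Treating data as records with a schema can be illustrated with a tiny check for the kind of time series record described above. This Python sketch uses a plain dict as a stand-in for a real schema definition; the field names and the `conforms` helper are hypothetical, and a production flow would rely on a proper schema format and registry rather than this hand-rolled check.

```python
# Hypothetical schema: field name -> expected Python type.
READING_SCHEMA = {"sensor_id": str, "ts": int, "value": float}


def conforms(record, schema=READING_SCHEMA):
    """Return True if the record has exactly the schema's fields and types."""
    if set(record) != set(schema):
        return False
    return all(isinstance(record[name], kind) for name, kind in schema.items())


good = {"sensor_id": "temp-1", "ts": 1520000000, "value": 21.5}
missing_field = {"sensor_id": "temp-1", "value": 21.5}
wrong_type = {"sensor_id": "temp-1", "ts": 1520000000, "value": "21.5"}
```

Records that pass such a check can be handled uniformly by every downstream consumer, while the rest can be routed aside for inspection.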
Stream Processing Platform
The two main tools I recommend for most processing use cases are Streaming Analytics Manager (SAM) and Apache Spark Streaming. The combination of the two covers most major use cases, including SQL processing, joins, windowing, and executing PMML machine learning models. The stream processing platform is ideal for processing data in "real-time" as it comes out of the Apache Kafka topics. In SAM, for example, we can use Apache Calcite to query and manipulate these records via SQL in-stream.
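Windowing is easiest to understand with a small example. The Python sketch below computes a per-sensor average over fixed (tumbling) windows; it is a plain in-memory illustration of the concept, not SAM or Spark Streaming code, and the event tuple layout and the `tumbling_window_avg` name are assumptions.

```python
from collections import defaultdict


def tumbling_window_avg(events, window_seconds):
    """Average values per (window start, sensor) over fixed-size windows.

    `events` is an iterable of (epoch_seconds, sensor, value) tuples.
    """
    buckets = defaultdict(list)
    for ts, sensor, value in events:
        # Align each event to the start of the window that contains it.
        window_start = ts - (ts % window_seconds)
        buckets[(window_start, sensor)].append(value)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


events = [
    (0, "temp-1", 1.0),
    (30, "temp-1", 3.0),
    (60, "temp-1", 5.0),
    (61, "temp-2", 7.0),
]
averages = tumbling_window_avg(events, window_seconds=60)
```

A real streaming engine adds what this sketch ignores: late and out-of-order events, state that survives restarts, and emitting each window as soon as it closes.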
Scalable Storage Platform
We need to store several types of data, including key-value, time series, structured table data, unstructured data like images and videos, and semi-structured data like tweets and text blobs. The perfect, safest place to do this is in Apache Hadoop. We can store trillions of rows and petabytes of data and still query it as needed. With the upcoming Hadoop 3.0 release, the platform will support even more data, more files, and more capabilities. We store data as files in HDFS, as well as in Apache Hive ACID tables and in Apache HBase. For some of the faster ingest cases, we store data in Apache Druid for sub-second OLAP analytics.
Data Science Platform
In our case, our data science platform leverages Apache Zeppelin for notebooks to experiment, explore, and run analytics and machine learning. We use Apache Hive and Apache Phoenix to run SQL queries to analyze, transform, and organize our data. We use Apache Spark to run various machine learning algorithms and Spark SQL queries, and we have access to a steady stream of real-time data, as well as the massive historic datasets stored in our Apache Hadoop data lake. It is very easy to deploy models trained on our massive datasets to the stream processing engines to provide real-time insights with predictive models.
The nice thing is that, as shown below in the chart, this is all one platform running a common security and authentication system and common administration via Apache Ambari. Our global data management platform (GDMP) includes everything that is needed for enterprise and industrial IoT. The GDMP is made up of HDP (Hortonworks Data Platform), HDF (Hortonworks DataFlow), DPS (Hortonworks DataPlane Service), and services that are built around an open-source system.
At each layer in the architecture, we can run various deep learning libraries as needed. At the edge, we run NVIDIA TensorRT, Apache MXNet, and TensorFlow prebuilt models to scan web camera images for anomalies. In the ingestion phase, Apache NiFi can use TensorFlow, Apache OpenNLP, Apache Tika, and Apache MXNet for sentiment analysis, image analysis, document analysis, and other processing. The streaming engines are all well integrated with deep learning packages. Finally, our query and analytics platform notebooks can run various Apache MXNet and TensorFlow models, as well.
We can also run Apache Hivemall for machine learning in our Apache Hive queries. In the end, we have a continuously growing, always-learning, always-on, scalable platform for developing real solutions for IoT.
The funny part is that except for the little piece on the device and some of the ingestion logic, it's the same platform that addresses the same use cases for real-time financial information, real-time social media data, real-time CDC, REST feeds, and thousands of other data sources, types, and origins. In the final analysis, we see that enterprise and industrial IoT are not that much different in their requirements once we get past the first ten meters.