A Big Data Reference Architecture for IoT
A Big Data Reference Architecture for IoT
Explore an industrial-strength architecture built from several technology elements that enables flexible deployment for new capabilities and reduces TCO.
Join the DZone community and get the full member experience.Join For Free
Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.
The Industrial Internet of Things (IIOT) is emerging as a key business and technology trend in society. IIOT enables various entities such as municipalities, industrial manufacturing, utilities, telecom, and insurance to address key customer and operational challenges. The evolution of technological innovation in areas such as Big Data, predictive analytics, and cloud computing now enables the integration and analysis of massive amounts of device data at scale while performing a range of analytics and business process workflows on this data.
This article aims to discuss a vendor- and product-agnostic reference architecture that covers an end-to-end IIOT implementation, covering various layers of such an architecture. The end goal is to enable the creation of enterprise business applications that are data-driven.
DataFrames in General
The first requirement of IIoT implementations is to support connectivity from the Things themselves or the Device layer. The Device layer includes a whole range of sensors, actuators, smartphones, gateways, industrial equipment, etc. The ability to connect with devices and edge devices like routers and smart gateways using a variety of protocols is key. These network protocols include Ethernet, Wi-Fi, and cellular, which can all directly connect to the Internet. Other protocols that need a gateway device to connect include Bluetooth, RFID, NFC, Zigbee, et al. Devices can connect directly with the data ingest layer, shown above, but it is preferred that they connect via a gateway, which can perform a range of edge processing. This is important from a business standpoint. For example, in certain verticals like healthcare and financial services, there exist stringent regulations that govern when certain identifying data elements (e.g. video feeds) can leave the premises of a hospital or bank. A gateway cannot just perform intelligent edge processing, but can also connect thousands of device endpoints and facilitate bidirectional communication with the core IIoT architecture. For remote locations, more powerful devices like the Arrow BeagleBone Black Industrial and MyPi Industrial, you can run a tiny Java or C++ MiniFi agent for your secure connectivity needs. These agents will send the data to an Apache NiFi gateway or directly into your enterprise HDF cluster in the cloud or on-premise.
The data sent by the device endpoints are modeled into an appropriate domain representation based on the actual content of the messages. The data sent over also includes metadata around the message. A canonical model can optionally be developed (based on the actual business domain) which can support a variety of applications from a business intelligence standpoint.
The ideal tool for these constantly evolving devices, metadata, protocols, data formats, and types is Apache NiFi. Apache NiFi supports the flexibility of ingesting changing file formats, sizes, data types, and schemas. Whether your devices send XML today and send JSON tomorrow, Apache NiFi supports ingesting any file type you may have. Once inside Apache NiFi, it is enveloped insecurity, with every touch to each flow file controlled, secured, and audited. You will have full data provenance for each file, packet, or chunk of data sent through the system. Apache NiFi can work with specific schemas if you have special requirements for file types, but it can also work with unstructured or semistructured data just as well. NiFi can ingest 50,000 streams concurrently on a zero-master, shared-nothing cluster that horizontally scales via easy administration with Apache Ambari.
Data and Middleware Layer
The IIoT Architecture recommends a Big Data platform with native message-oriented middleware (MOM) capabilities to ingest device mesh data. This layer will also process device data in such a fashion — batch or real-time — as the business needs demand.
Application protocols such as AMQP, MQTT, CoAP, WebSockets, etc. are all deployed by many device gateways to communicate application-specific messages. The reason for recommending a Big Data/NoSQL dominated data architecture for IIoT is quite simple. These systems provide Schema on Read, which is an innovative data-handling technique. In this model, a format or schema is applied to data as it is accessed from a storage location, as opposed to doing the same while it is ingested. From an IIoT standpoint, one must not just deal with the data itself but also metadata such as timestamps, device ID, other firmware data, such as software version, device manufactured data, etc. The data sent from the device layer will consist of time series data and individual measurements.
The IIoT data stream can be visualized as a constantly running data pump, which is handled by a Big Data pipeline that takes the raw telemetry data from the gateways, decides which ones are of interest, and discards the ones not deemed significant from a business standpoint. Apache NiFi is your gateway and gatekeeper. It ingests the raw data, manages the flow of thousandsof producers and consumers, does basic data enrichment, sentiment analysis in stream, aggregation, splitting, schema translation, format conversion, and other initial steps to prepare the data. It does that all with a user-friendly web UI and easily extendible architecture. It will then send raw or processed data to Kafka for further processing by Apache Storm, Apache Spark, or other consumers. Apache Storm is a distributed real-time computation engine that reliably processes unbounded streams of data. Storm excels at handling complex streams of data that require windowing and other complex event processing. While Storm processes stream data at scale, Apache Kafka distributes messages at scale. Kafka is a distributed pub-sub real-time messaging system that provides strong durability and fault tolerance guarantees. NiFi, Storm, and Kafka naturally complement each other, and their powerful cooperation enables real-time streaming analytics for fast-moving big data. All the stream processing is handled by NiFi-Storm-Kafka combination.Consider it the Avengers of streaming.
Appropriate logic is built into the higher layers to support device identification, ID lookup, secure authentication, and transformation of the data. This layer will process data (cleanse, transform, apply a canonical representation) to support business automation (BPM), BI (business intelligence), and visualization for a variety of consumers. The data ingest layer will also provide notification and alerts via Apache NiFi.
Here are some typical uses for this event processing pipeline:
Real-time data filtering and pattern matching.
Enrichment based on business context.
Real-time analytics such as KPIs, complex event processing, etc.
Business workflow with decision nodes and human task nodes.
Once the device data has been ingested into a modern data lake, key functions that need to be performed include data aggregation, transformation, enriching, filtering, sorting, etc. As one can see, this can get very complex very quick — both from a data storage and processing standpoint. A cloud-based infrastructure, with its ability to provide highly scalable compute, network, and storage resources, is a natural fit to handle bursty IIoT applications. However, IIoT applications add their own diverse requirements of computing infrastructure, namely the ability to accommodate hundreds of kinds of devices and network gateways, which means that IT must be prepared to support a large diversity of operating systems and storage types.
The business integration and presentation layer is responsible for the integration of the IIoT environment into the business processes of an enterprise. The IIoT solution ties into existing line-of-business applications and standard software solutions through adapters or enterprise application integration (EAI) and business-to-business (B2B) gateway capabilities. End users in business-to-business or business-to-consumer scenarios will interact with the IIoT solution and the special-purpose IIoT devices through this layer. They may use the IIoT solution or line-of-business system UIs, including apps on personal mobile devices, such as smartphones and tablets.
Once IIoT knowledge has become part of the Hadoop-based data lake, all the rich analytics, machine learning, and deep learning frameworks, tools, and libraries now become available to data scientists and analysts. They can easily produce insights, dashboards, reports, and real-time analytics with IIoT data joined with existing data in the lake, including social media data, EDW data, and log data. All your data can be queried with familiar SQL through a variety of interfaces such as Apache Phoenix on HBase, Apache Hive LLAP, and Apache Spark SQL. Using your existing BI tools or the open-sourced Apache Zeppelin, you can produce and share live reports. You can run TensorFlow in containers onYARN for deep learning insights on your images, videos, and text data while running YARN-clustered Spark ML pipelines fed byKafka and NiFi to run streaming machine learning algorithms on trained models.
Opinions expressed by DZone contributors are their own.