Streaming IoT Data and MQTT Messages to Apache Kafka
Streaming IoT Data and MQTT Messages to Apache Kafka
Use Apache Kafka for all your IoT data streaming and MQTT message needs!
Join the DZone community and get the full member experience.Join For Free
Apache Kafka is a real-time streaming platform that is gaining broad adoption within large and small organizations. Kafka’s distributed microservices architecture and publish/subscribe protocol makes it ideal for moving real-time data between enterprise systems and applications. By some accounts, over a third of Fortune 500 companies are using Kafka. On GitHub, Kafka is one of the most popular Apache projects with over 11K stars and over 500 contributors. Without a doubt, Kafka is an open-source project that is changing how organizations move their data inside their cloud and datacenter.
The architecture of Kafka has been optimized to stream data between systems and applications as fast as possible in a scalable manner. Kafka clients/producers are tightly coupled with a Kafka cluster, requiring each client to know the IP addresses of the Kafka cluster and to have direct access to all individual nodes. Inside trusted networks, this allows changing the broker topology, which means topics and partitions can be scaled out by using multiple nodes directly from Kafka clients. Kafka topic spaces are also kept reasonably flat in most cases, as usually multiple partitions are used for scaling a single Kafka topic. It is very often undesirable to have hundreds or even thousands of topics in the Kafka system but use a few topics for most of the data streams. Kafka is well suited for communication between systems inside the same trusted network and with stable IP addresses and connections.
For IoT use cases where devices are connected to the data center or cloud over the public Internet, the Kafka architecture is not suitable out-of-the-box. If you attempt to stream data from thousands or even millions of devices using Kafka over the Internet, the Apache Kafka architecture is not suitable. There are a number of reasons Kafka alone is not well suited for IoT use cases:
- Kafka brokers need to be addressed directly by the clients, which means the Kafka brokers are required to be reached by the clients directly. Professional IoT deployments usually use a load balancer as the first line of defense in their cloud, so devices just use the IP address of the load balancer for connecting to the infrastructure and the load balancer effectively acts as a proxy. If you want your devices to connect to Kafka directly, then your Kafka brokers must be exposed to the public Internet.
- Kafka does not support large amounts of topics. When connecting millions of IoT devices over the public Internet, individual and unique topics (often with some unique IoT device identifier included in the topic name) are usually used, so read and write operations can be restricted based on the permissions of individual clients. You don’t want your smart thermostat to get hacked and the credentials used to eavesdrop on all the data streams in the system.
- Kafka clients are reasonably complex and resource intensive compared to client libraries for IoT protocols. The Kafka APIs for most programming languages are pretty straightforward and simple, but there is a lot of complexity under the hood. The client will for example use and maintain multiple TCP connections to the Kafka brokers. IoT deployments often have constrained devices that require minimal footprint but don’t require very high throughput on the device side. Kafka clients are by default optimized for throughput.
- Kafka clients require a stable TCP connection for best results. Many IoT use cases involve unreliable networks, for example, connected cars or smart agriculture, so an IoT device needs to consistently reestablish a connection to Kafka.
- It’s unusual (and very often not even possible at all) to have tens of thousands or even millions of clients connected to a single Kafka cluster. In IoT use cases, there are typically a large number of devices, which are simultaneously connected to the backend and are constantly producing data.
- Kafka is missing some key IoT features. The Kafka protocol is missing features such as keep alive and last will and testament. These features are important to build an IoT solution that is resilient even when devices experience unexpected disconnections and suffer from unreliable networks.
Kafka still brings a lot of value to IoT use cases. IoT solutions create a large amount of real-time data that is well suited for processing by Kafka. The challenge is how do you bridge the IoT data from the devices to the Kafka cluster?
Many companies implementing IoT use cases are looking at options to integrate MQTT and Kafka to process their IoT data. MQTT is another publish/subscribe protocol that has become the standard for connecting IoT device data. The MQTT standard is designed for connecting large numbers of IoT devices over unreliable networks, addressing many of the limitations of Kafka. In particular, MQTT is a lightweight protocol requiring a small client footprint on each device. It is built to securely support millions of connections over unreliable networks and works seamlessly in high latency and low throughput environments. It includes IoT features such as keep alive, last will and testament functionality, different quality of service levels for reliable messaging, as well as client-side load balancing (Shared Subscriptions) and other features designed for public Internet communication. Topics are dynamic, which means there can be an arbitrary number of MQTT topics in the system, very often up to tens of millions of topics per deployment in MQTT server clusters.
While Kafka and MQTT have different design goals, both work extremely well together. The question is not Kafka vs MQTT, but how to integrate both worlds together for an IoT end-to-end data pipeline. In order to integrate MQTT messages into a Kafka cluster, you need some type of bridge that forwards MQTT messages into Kafka. There are four different architecture approaches for implementing this type of bridge:
Kafka Connect for MQTT
Kafka has an extension framework, called Kafka Connect, that allows Kafka to ingest data from other systems. Kafka Connect for MQTT act as an MQTT client that subscribes to all the messages from an MQTT broker.
If you don’t have control of the MQTT broker, Kafka Connect for MQTT is a good approach to pursue. This approach allows Kafka to ingest the stream of MQTT messages.
There are performance and scalability limitations with using Kafka Connect for MQTT. As mentioned, Kafka Connect for MQTT is an MQTT client that subscribes to potentially ALL MQTT messages passing through a broker. MQTT client libraries are not intended to process extremely large amounts of MQTT messages, so IoT systems using this approach will have performance and scalability issues.
This approach centralizes business and message transformation logic and creates tight coupling, which should be avoided in distributed (microservice) architectures. The industry-leading consulting company Thoughtworks called this an anti-pattern and even put Kafka into the “Hold” category in their previous Technology Radar publications.
Another approach is the use of a proxy application that accepts the MQTT messages from IoT devices but does not implement the publish/subscribe or any MQTT session features and thus is not an MQTT broker. The IoT devices connect to the MQTT proxy that then pushes the MQTT messages into the Kafka broker.
The MQTT proxy approach allows for the MQTT message processing to be done within your Kafka deployment so management and operations are from a single console. An MQTT proxy is usually stateless, so it could (in theory) scale independent of the Kafka cluster by adding multiple instances of the proxy.
The limitations to the MQTT Proxy is that it is not a true MQTT implementation. An MQTT proxy is not based on pub/sub but creates a tightly coupled stream between a device and Kafka. The benefit of MQTT pub/sub is that it creates a loosely coupled system of endpoints (devices or backend applications) that can communicate and move data between each endpoint. For instance, MQTT allows communication between two devices, such as two connected cars can communicate with each other, but an MQTT proxy application would only allow data transmission from a car to a Kafka cluster, not with another car.
Some Kafka MQTT Proxy applications support features like QoS levels. It’s noteworthy that the resumption of a QoS message flow after a connection loss is only possible if the MQTT reconnects to the same MQTT Proxy instance, which is not possible if a load balancer is used that uses least-connection or round-robin strategies for scalability. So, the main reason for using QoS levels in MQTT, which is no message loss, only works for stable connections in such a scenario, which is an unrealistic assumption in most IoT scenarios.
The main risks for using such an approach is the fact that a proxy is not a full-featured MQTT broker, so it is not an MQTT implementation as defined by the MQTT specification, as it only implements a tiny subset, so it’s not a standardized solution. In order to use MQTT with MQTT clients properly, a fully featured MQTT broker is required.
If message loss is not an important factor, and if MQTT features designed for reliable IoT communication are not used, the proxy approach might be a lightweight alternative if you only want to send data unidirectional to Kafka over the Internet.
Build Your Own Custom Bridge
Some companies build their own MQTT to Kafka bridge. The typical approach is to create an application using an open-source MQTT client library and an open-source Kafka client library. The custom application is responsible for transposing and routing the data between the MQTT broker and Kafka instance.
The main challenge with this approach is the custom application is typically not designed to be fault tolerant and resilient. This becomes important if the IoT solution requires the end-to-end guarantee of at least once or exactly once message delivery. For example, an MQTT message set to quality of service level 1 or 2 that is sent to the custom application will acknowledge receipt of the message. However, if the custom application crashes before forwarding the message to Kafka, then the message is lost. Similarly, if the Kafka cluster is not available, the custom application will need to buffer the MQTT messages. If the custom application crashes before the Kafka cluster is available, all the buffered messages will be lost. To solve these issues, the custom application will require significant development effort similar to build technology already found in Kafka and an MQTT broker.
MQTT Broker Extension
A final approach is to extend the MQTT broker to create an extension that includes the native Kafka protocol. This allows the MQTT broker to act as a first-class Kafka client and stream the IoT device data to multiple Kafka clusters.
To implement this approach, you need to have access to the MQTT broker and the broker needs to have the ability to install extensions.
This approach allows the IoT solution to use a native MQTT implement and a native Kafka implementation. IoT devices use an MQTT client to send data to a full-featured MQTT broker. The MQTT broker is extended to include a native Kafka client and transposes the MQTT message to the Kafka protocol. This allows the IoT data to be routed to multiple Kafka clusters and at the same time to non-Kafka applications. Using an MQTT broker will also provide access to all the MQTT features required for IoT devices, such as last will and testament. An MQTT broker, like HiveMQ, is designed for high availability, persistence, performance, and resilience, so messages can be buffered on the broker while Kafka is not writable, so important messages are never lost from IoT devices. So, this approach delivers true end-to-end message delivery guarantees even for unreliable networks, public Internet communication, and changing network topologies (as often seen in containerized deployments, e.g. with Kubernetes).
HiveMQ Enterprise Extension for Kafka
In conversations with HiveMQ customers, some operating clusters with millions of devices and very high message throughput, we saw the need for creating an MQTT broker extension for Kafka. Our customers wanted to benefit from the native implementations of both MQTT and the Kafka protocol with all delivery guarantees of both protocols. Therefore, we are excited to announce the HiveMQ Enterprise Extension for Kafka.
Our customers see tremendous value in a joint MQTT and Kafka solution. They view Kafka as an excellent platform for processing and distributing real-time data in their data center or cloud environments. They want to use MQTT and HiveMQ to move data from devices to different back-end systems. The back-end systems include Kafka and also non-Kafka systems. They also know if they are trying to connect millions of devices, like connected cars, they need to be using a native and battle-tested implementation of MQTT, like HiveMQ.
HiveMQ Enterprise Extension for Kafka implements the native Kafka protocol inside the HiveMQ broker. This allows for seamless integration of MQTT message with a single Kafka cluster or multiple Kafka clusters at the same time. It supports 100 percent the entire MQTT 3 and MQTT 5 specification. We even make it possible to map potentially millions of MQTT topics to a limited number of Kafka topics. Finally, we have extended the HiveMQ Control Center to make it possible to monitor MQTT messages written to Kafka.
We are excited to bring this new product to our HiveMQ customers. This is the best approach for using Apache Kafka with IoT use cases. The free trial of the HiveMQ Enterprise Extension for Apache Kafka can be downloaded in our Marketplace. It was never easier integrating MQTT and Apache Kafka for IoT use cases.
Published at DZone with permission of Ian Skerrett , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.