{{announcement.body}}
{{announcement.title}}

Kafka Connect on Kubernetes The Easy Way!

DZone 's Guide to

Kafka Connect on Kubernetes The Easy Way!

Tutorial that shows how to set up and use Kafka Connect on Kubernetes using Strimzi Operator

· Cloud Zone ·
Free Resource

This is a tutorial that shows how to set up and use Kafka Connect on Kubernetes using Strimzi, with the help of an example.

Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems using source and sink connectors. Although it's not too hard to deploy a Kafka Connect cluster on Kubernetes (just "DIY"!), I love the fact that Strimzi enables a Kubernetes-native way of doing this using the Operator pattern with the help of Custom Resource Definitions.

In addition to bootstrapping/installing Kafka Connect, this also applies to operations such as scaling the Connect cluster, deploying and managing connectors, etc. (you will see this in action during the course of this blog post)

We will go through the process of deploying a Kafka Connect cluster on Kubernetes, installing a connector, and test it out — all this using kubectl and some YAML (of course!). I will be using Azure Event Hubs as the Kafka broker and Azure Kubernetes Service as the Kubernetes cluster - feel free to use other alternatives (e.g. with a local minikube cluster on your laptop)

All the artifacts are available on GitHub

Strimzi is responsible for all the heavy lifting.. In case you don't already know, here is a gist

Strimzi Overview

The Strimzi documentation is detailed yet very well organized and clear! Most of the below paragraph has been taken directly from the docs

Strimzi simplifies the process of running Apache Kafka in a Kubernetes cluster. Strimzi provides container images and Operators for running Kafka on Kubernetes. It is a part of the Cloud Native Computing Foundation as a Sandbox project (at the time of writing)

Strimzi Operators are fundamental to the project. These Operators are purpose-built with specialist operational knowledge to effectively manage Kafka. Operators simplify the process of: Deploying and running Kafka clusters and components, Configuring and securing access to Kafka, Upgrading and managing Kafka and even taking care of managing topics and users.

Here is a diagram which shows a 10,000 feet overview of the Operator roles:

I am not going to dive into the details of deploying Kafka using Strimzi in this post - probably something which I will tackle in future blogs

Pre-requisites

kubectl - https://kubernetes.io/docs/tasks/tools/install-kubectl/

If you choose to use Azure Event Hubs, Azure Kubernetes Service (or both) you will need a Microsoft Azure account. Go ahead and sign up for a free one!

Azure CLI or Azure Cloud Shell - you can either choose to install the Azure CLI if you don't have it already (should be quick!) or just use the Azure Cloud Shell from your browser.

Helm

I will be using Helm to install Strimzi. Here is the documentation to install Helm itself - https://helm.sh/docs/intro/install/

You can also use the YAML files directly to install Strimzi. Check out the quick start guide here - https://strimzi.io/docs/quickstart/latest/#proc-install-product-str

Let's start by setting up the required Azure services (if you're not using Azure, skip this section but please ensure you have the details for your Kafka cluster i.e. broker URLs and authentication credentials, if applicable)

I recommend installing the below services as a part of a single Azure Resource Group which makes it easy to clean up these services

Azure Event Hubs

Azure Event Hubs is a data streaming platform and event ingestion service. It can receive and process millions of events per second. It also provides a Kafka endpoint that can be used by existing Kafka based applications as an alternative to running your own Kafka cluster. Event Hubs supports Apache Kafka protocol 1.0 and later, and works with existing Kafka client applications and other tools in the Kafka ecosystem including Kafka Connect (demonstrated in this blog), MirrorMaker etc.

To setup an Azure Event Hubs cluster, you can choose from a variety of options including the Azure portal, Azure CLI, Azure PowerShell or an ARM template. Once the setup is complete, you will need the connection string (that will be used in subsequent steps) for authenticating to Event Hubs - use this guide to finish this step.

Please ensure that you also create an Event Hub (same as a Kafka topic) to act as the target for our Kafka Connect connector (details in subsequent sections)

Azure Kubernetes Service

Azure Kubernetes Service (AKS) makes it simple to deploy a managed Kubernetes cluster in Azure. It reduces the complexity and operational overhead of managing Kubernetes by offloading much of that responsibility to Azure. Here are examples of how you can setup an AKS cluster using Azure CLI, Azure portal or ARM template

Base install

To start off, we will install Strimzi and Kafka Connect, followed by the File Stream Source Connector

Install Strimzi

Installing Strimzi using Helm is pretty easy:

Shell


This will install the Strimzi Operator (which is nothing but a Deployment), Custom Resource Definitions and other Kubernetes components such as Cluster Roles, Cluster Role Bindings and Service Accounts

For more details, check out this link

To delete, simply helm uninstall strimzi-kafka

To confirm that the Strimzi Operator had been deployed, check it's Pod (it should transition to Running status after a while)

Shell


Check the Custom Resource Definitions as well:

Shell


I want to call out kafkas.kafka.strimzi.io which represents Kafka clusters in Kubernetes. We will focus on kafkaconnects.kafka.strimzi.io and kafkaconnectors.kafka.strimzi.io which represent Kafka Connect clusters and Connectors respectively.

I am going to skip over the other components but you can dig them out e.g. for Cluster Roles kubectl get clusterrole | grep strimzi

Now that we have the "brain" (the Strimzi Operator) wired up, let's use it!

Kafka Connect

We will need to create some helper Kubernetes components before we deploy Kafka Connect itself.

Before you proceed, clone the GitHub project

Shell


Kafka Connect will need to reference an existing Kafka cluster (which in this case is Azure Event Hubs). We can store the authentication info for the cluster as a Kubernetes Secret which can later be used in the Kafka Connect definition.

Update the eventhubs-secret.yaml file to include the credentials for Azure Event Hubs. Enter the connection string in the eventhubspassword attribute.

e.g.

YAML

Leave eventhubsuser: $ConnectionString unchanged

To create the Secret:

Shell


By default, Kafka Connect is configured to send logs to stdout. We will use a custom configuration (log4j) to ensure that logs are stored to /tmp/connect-worker.log (in addition to stdout) - you will understand why this is done, in a moment

the configuration itself is stored in a log4j.properties

The log configuration can be stored in a ConfigMap which will later be referenced by the Kafka Connect definition. For details, check https://strimzi.io/docs/latest/#con-kafka-connect-logging-deployment-configuration-kafka-connect

Shell


Before we deploy Kafka Connect, let's look into its definition. You can see it in its entirety here, but I will go through the important bits.

Notice that the resource kind is KafkaConnect - it is a Custom Resource Definition. Another interesting part is annotations (I will explain this in a bit)

YAML


bootstrapServers points to a Kafka broker. This could be a comma-separated value for nodes in a HA cluster. In this case its a single Kafka endpoint for Azure Event Hubs (yes, that's all you need!)

YAML


config is just good old Kafka Connect configuration similar to what you would use in connect-distributed.properties

YAML
 




xxxxxxxxxx
1


 
1
  config:
2
    group.id: connect-cluster
3
    offset.storage.topic: connect-cluster-offsets
4
    config.storage.topic: connect-cluster-configs
5
    status.storage.topic: connect-cluster-status



The authentication section simply refers to a Kubernetes Secret. In this case, we created one earlier with the name eventhubssecret which has the key eventhubspassword containing the connection string for azure event hubs

YAML


This is where the ConfigMap with log4j config is referenced. This will automatically configure Kafka Connect to use this configuration

YAML


tls section is used to configure TLS certificates (duh!). In case of event hubs, although we use SASL over PLAINTEXT, it required you to use SSL (i.e. set security.protocol to SASL_SSL). I initially faced an issue with this which was promptly clarified! Hence this piece of configuration was added:

YAML


Cool! We are ready to create a Kafka Connect instance. Before that, make sure that you update the bootstrapServers property with the Azure Event Hubs host name e.g.

YAML


To create the Kafka Connect instance:

Shell


To confirm:

Shell


This will create a Deployment and a corresponding Pod

Shell


You have a Kafka Connect cluster in Kubernetes! Check out the logs using kubectl logs <pod name>

Install File Source connector

Let's deploy a connector! To keep things simple, we will use the File Stream Source Connector which comes bundled with Kafka Connect by default. A common way of installing and managing connectors is to use the Kafka Connect REST API, but there is another way that Strimzi offers. This is a Kubernetes-centric approach where a Kakfa Connect connector is represented by a custom resource definition called KafkaConnector. All we need to do is create/update/delete KafkaConnector definitions with the details of our connectors, and Strimzi will take care of the rest!

check out the details in the Strimzi docs https://strimzi.io/docs/latest/#con-creating-managing-connectors-str

Here the definition of our connector:

YAML


Just like we did before, let's understand what each component means:

YAML


This is a KafkaConnector resource (specified by kind) whose name is my-source-connector

YAML


This is where we refer to the Kafka Connect cluster - remember this annotation in the Kafka Connect definition shown above?

YAML


This simply "activates" the feature and ensures that we are able to deploy connectors using the KafkaConnector CRD and we simply refer to the name of our kafkaconnect resource using the strimzi.io/cluster label

YAML


Finally, in the connector spec, we define the attributes for our connector. Notice the config property which points to the /tmp/connect-worker.log file? Recall that we modified our Kafka Connect instance to push logs to this file. Now, we have configured our File source connector to stream contents of this (log) file and send it to a Kafka topic named strimzi. This makes for a nice demo since the file will keep getting updated and we should be able to see each line as a different message in the destination Kafka topic (in Azure Event Hubs)

I have used strimzi as the topic name. This needs to be the same as the Event Hub created in the previous section (while setting up Azure Event Hubs)

To see this in action, let's deploy the connector

YAML


To confirm, simply list the connectors:

Shell

You can install other connectors as well. One of the ways (and the easiest IMO) to do this is by extending the Strimzi base image and adding the required connector artifacts on top of it. Check out the documentation https://strimzi.io/docs/latest/#using-kafka-connect-with-plug-ins-str

Kafka Connect in action...

First, let's confirm that the Kafka Connect logs are being piped to the intended location. This is important since we're using the log file as a source for the File stream connector. For this, we need to peek inside the Kafka Connect Pod e.g.

Shell


To make this easier, simply use the below command:

Shell


Now, you should see the Kafka Connect logs...

In a different terminal window, start a consumer process connecting to your Azure Event Hubs topic. I used kafkacat but there are other options such as the console consumer in the Kafka CLI itself or a programmatic consumer using Java, .NET, Go etc. (although it might a bit of an overkill in this case)

You should see the same logs here as well! For e.g.

JSON

The log itself is captured as part of the payload e.g. [2020-04-16 04:59:25,785] INFO WorkerSourceTask{id=my-source-connector-0} Finished commitOffsets successfully in 55 ms (org.apache.kafka.connect.runtime.WorkerSourceTask)

Managing Kafka Connect resources

To scale out Kafka Connect, simply update the no. of replicas in the spec e.g. from 1 to 2 in this case:

YAML


Appy the updated manifest

Shell


Please ensure that you increase the no. of replicas by updating the manifest and not by updating the Deployment using kubectl scale. This is because, the Strimzi operator reconciliation loop will check the KafkaConnect resource, find that the replicas count is 1 and scale the Deployment back

There should be two Pods now:

Shell

my-connect-cluster-connect-5bf9db5d9f-pzn95 is the new Pod

You can update the connector specification. For e.g. to allocate more tasks, update tasksMax from 2 to 5

YAML

Note: this will restart the connector

Clean up

To delete the connector and the Kafka Connect instance:

Shell


To clean up the AKS cluster and Azure Event Hubs, simply delete the resource group:

Shell


That concludes this blog post!

Like I mentioned, the Strimzi documentation is detailed, yet very clear and easy to navigate. To wrap things up, I will leave you with additional references from the Strimzi docs which I found useful in addition to the ones I mentioned in the post:

Strimzi doc references

I hope you find it useful for getting started with Kafka Connect on Kubernetes :)

Topics:
azure, cloud, cncf, kafka, kafka connect platform, kubernetes

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}