Scaling Data Collectors on Azure Kubernetes Service-StreamSets: Where DevOps Meets Data Integration

In this post, we look at how to use StreamSets and Azure Kubernetes Service to remove complexity from and add efficiency to the process of creating data pipelines and batch jobs.

By Dash Desai · Mar. 19, 2019 · Tutorial

In this blog post, I will present a step-by-step guide on how to scale Data Collector instances on Azure Kubernetes Service (AKS) using provisioning agents, which automate the upgrading and scaling of resources on demand, without having to stop the execution of pipeline jobs. AKS removes the complexity of implementing, installing, and maintaining Kubernetes in Azure, and you pay only for the resources you consume.

Provisioning Agent

Provisioning agents are containerized applications that run within a container orchestration framework, such as Kubernetes. In our case, the StreamSets Control Agent runs as a Kubernetes deployment and it automatically provisions Data Collector containers in a given Kubernetes cluster.

For more details, click here.
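To make "runs as a Kubernetes deployment" concrete, here's a minimal sketch of that shape. This is not the real template you'll download in Step 1 (which carries the actual image, environment variables, and authentication token); the names and image tag below are illustrative only.

$ cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dash-aks-agent                 # the agent name you pick in Step 1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dash-aks-agent
  template:
    metadata:
      labels:
        app: dash-aks-agent
    spec:
      containers:
        - name: control-agent
          image: streamsets/control-agent:latest   # illustrative image tag
EOF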

Prerequisites

To follow along, you'll need the following prerequisites.

  • Access to StreamSets Control Hub (SCH)
    • With Auth Token Administrator and Provisioning Operator roles
    • If you don't have access to SCH, sign up for a 30-day free trial
  • Azure Kubernetes Service
    • Access to an existing Kubernetes cluster, or privileges to create one
  • Azure CLI
  • kubectl

OK, let's get started!

Step 1. Prepare Control Agent Deployment

Download a copy of the deployment template found here, and optionally update lines 4 and 41 with your desired name. For example, I set it to 'dash-aks-agent.'
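If you'd rather script the rename than edit the file by hand, a sed one-liner does it. The placeholder pattern below is an assumption; match it to whatever name actually appears on lines 4 and 41 of your copy of the template:

$ sed -i.bak 's/control-agent/dash-aks-agent/g' control-agent.yaml   # writes a .bak backup first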

Step 2. Prepare Control Agent Deployment Script

Download a copy of the script template found here and update the following variables:

* SCH_URL = YOUR CONTROL HUB URL
* SCH_ORG = YOUR CONTROL HUB ORG
* SCH_USER = YOUR CONTROL HUB USER
* SCH_PASSWORD = YOUR CONTROL HUB PASSWORD
* CLUSTER_NAME = YOUR AZURE CLUSTER NAME
* RESOURCE_GROUP = YOUR AZURE CLUSTER RESOURCE GROUP
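
Filled in, the top of the script might look like this (every value below is an illustrative placeholder; substitute your own):

SCH_URL=https://cloud.streamsets.com       # your Control Hub URL
SCH_ORG=myorg
SCH_USER=dash@myorg
SCH_PASSWORD='********'
CLUSTER_NAME=dash-aks-cluster
RESOURCE_GROUP=dash-aks-rg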

Things to note about the above script:

  • It expects the file control-agent.yaml referenced in Step 1 to be present in the local directory
  • It requires utilities such as jq and perl, so you may need to install those.
  • If you're using an unsigned SSL certificate for your SCH installation, you may need to add a -k parameter to the curl command.
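
Since the script references CLUSTER_NAME and RESOURCE_GROUP, it needs kubectl pointed at your AKS cluster. If you'd like to confirm connectivity yourself before running it, the Azure CLI from the prerequisites can fetch the credentials:

$ az aks get-credentials --name "$CLUSTER_NAME" --resource-group "$RESOURCE_GROUP"
$ kubectl config current-context   # should print your AKS cluster's context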

Step 3. Deploy Control Agent

To deploy the control agent, run the script referenced in Step 2 as follows.

$ chmod +x deploy-control-agent-on-aks-template.sh

$ ./deploy-control-agent-on-aks-template.sh

This script might take a few minutes to run, and if all goes well, you'll see output similar to:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   203  100   203    0     0    966      0 --:--:-- --:--:-- --:--:--   962
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   405  100   405    0     0    757      0 --:--:-- --:--:-- --:--:--   758

DPM Agent "0D5DC09F-B9E3-467D-A39E-DC56644BE727" successfully registered with SCH

Step 4. Control Agent in SCH

The control agent created in Step 3 should now show up in SCH as shown below.

You may also use kubectl to confirm that the control agent was deployed successfully.

$ kubectl get deployments
NAME             DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
dash-aks-agent   1         1         1            1           14m
$ kubectl get pods
NAME                              READY     STATUS    RESTARTS   AGE
dash-aks-agent-6cf96997b7-6hhwn   1/1       Running   0          15m

Step 5. Data Collector Containers in SCH

Now let's look at how to create a logical grouping of Data Collector containers that can be deployed by the provisioning agent created in Step 3.

Select Deployments on the left sidebar menu and then click on the + shown below.

  • Fill in Name and Description.
  • For Provisioning Agent select the one created in Step 3.
  • For the Number of Data Collector Instances, we'll start with 2 and then we'll see how we can add more while the job is running.
  • For Data Collector Labels, enter 'aks-datacollectors' — more on this later.

Step 6. Activate Deployment in SCH

To activate the deployment created in Step 5, click on the Play button, as shown below.

This may take a couple of minutes and then you should see the status change to Active.

This also means that two Data Collector instances have been created and made available for running pipeline jobs. You can verify this by selecting Execute >> Data Collectors, as shown below.

You can also run the following kubectl command to confirm Data Collector container pod deployments.

$ kubectl get pods
NAME                                        READY     STATUS              RESTARTS   AGE
dash-aks-agent-6cf96997b7-6hhwn             1/1       Running             0          1h
datacollector-deployment-6468d7fbcd-snvqv   0/1       ContainerCreating   0          3s
datacollector-deployment-6468d7fbcd-spv52   0/1       ContainerCreating   0          3s
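
The Data Collector pods start out in ContainerCreating; if you'd like to watch them transition to Running without re-running the command, kubectl can stream the updates:

$ kubectl get pods -w   # press Ctrl+C to stop watching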

Step 7. Create a Pipeline and a Job

To test our deployment, let's create a simple pipeline (with Dev Raw Data Source origin and Trash destination) and a job to run the pipeline as shown below.

Important: Make sure to select the correct pipeline and enter 'aks-datacollectors' for Data Collector Labels, which matches the label of the Data Collectors created by the container deployment in Step 5. (Note: Jobs use labels as selection criteria for the Data Collectors that run their associated pipelines.)

Step 8. Execute Job

Select the job created in Step 7 and click on the Play button. If all goes well, the job will start executing the pipeline on two Data Collector instances, as shown below.

Step 9. Scale Up Data Collectors

Now let's see how we can scale up Data Collectors without having to stop job execution.

Select Execute >> Deployments from the left sidebar menu and click on the deployment created in Step 5. Then, increase the Number of Data Collector Instances to three and click on the Scale button, as shown below.

This should almost instantaneously bring up another Data Collector instance identical to the other two. To confirm, select Execute >> Data Collectors from the left sidebar menu, as shown below.
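
You can also confirm the scale-up from the command line; a third datacollector-deployment pod should appear alongside the agent pod and the original two:

$ kubectl get pods   # expect 1 agent pod + 3 Data Collector pods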

Important: Notice that the first two Data Collectors each show one pipeline running via the job, while the third Data Collector is running zero pipelines. This is because we need to synchronize the job so that it's instructed to start the pipeline(s) on the additional Data Collectors that match the labels specified in the job.

To synchronize the job without stopping its execution, click on the Sync button located on the top right corner and then select Yes, as shown below.

Give it a few minutes and you should see the pipeline running on the third Data Collector as shown below.

Similarly, you can scale down Data Collector instances without having to stop the job execution.

That's all folks!

Summary

StreamSets Control Hub makes it extremely easy to create and manage your Data Collector deployments on Kubernetes on both Azure and Google Cloud Platform (GCP). For Data Collector deployments on GCP Kubernetes Engine, check out this blog post.

To learn about StreamSets and Microsoft partnership, click here.


Published at DZone with permission of Dash Desai, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
