
CPU-Based Pod Auto-Scaling in Kubernetes


This article explains how the different types of autoscaling in Kubernetes work and takes an in-depth look at horizontal pod autoscaling, with visualizations.


Auto-scaling allows us to optimally allocate resources to an application based on its current resource consumption.

Kubernetes offers three main types of autoscaling:

  1. Horizontal Pod Autoscaler (HPA): HPA controls the number of pods
  2. Vertical Pod Autoscaler (VPA): VPA controls the resources in individual pods
  3. Cluster Autoscaler (CA): CA controls the number of nodes in a cluster

Horizontal Pod Autoscaler (HPA)

In this article, we will focus on the HPA. By default, Kubernetes supports CPU-based and memory-based pod auto-scaling. You can, however, configure it to scale based on a custom metric or multiple metrics as well.

Horizontal pod scaling
The HPA continuously checks the metric against the threshold value you configure (note: the default HPA check interval is 15 seconds, and this can be altered as per your requirement using the --horizontal-pod-autoscaler-sync-period flag). If the current metric value is higher than the specified threshold, the HPA attempts to increase the number of pods. The HPA controller assumes a linear relationship between the metric and the number of pods, and it operates on the ratio between the desired metric value and the current metric value. The formula used to compute the desired number of replicas is as follows (refer to the K8s documentation for more details).

desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]
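
For example (illustrative numbers only): if four replicas are currently running at an average CPU utilization of 80% and the target utilization is 50%, the controller computes desiredReplicas = ceil[4 * (80 / 50)] = ceil[6.4] = 7, so three additional pods are created.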

In this article, we will focus on horizontal pod autoscaling, and we will scale based on CPU utilization (one of the most commonly used metrics). Note that higher CPU utilization generally leads to higher latency; therefore, keeping the CPU utilization low allows us to keep the latency of the application low as well. The following figure shows the variation of the CPU utilization of an (I/O-bound) microservice.

You may also enjoy: Microservices Architecture: Introduction to Auto Scaling

Variation of CPU utilization

Deploying the App

Let us now deploy a microservice in Kubernetes and study its performance behavior with auto-scaling enabled. We will deploy a Spring Boot microservice (see here for the GitHub repo) in K8s. The following is the Kubernetes YAML file for the deployment.

YAML

#deployment/app/app.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: springboot-app
  labels:
    app: springboot-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: springboot-app
  template:
    metadata:
      labels:
        app: springboot-app
    spec:
      containers:
      - name: springboot-app
        image: anushiya/app:latest
        resources:
          limits:
            cpu: "100m"
          requests:
            cpu: "100m"
        ports:
        - containerPort: 9000
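
The deployment can then be created with kubectl. A minimal sketch, assuming the manifest is saved at the path shown in its comment (deployment/app/app.yaml):

Shell

#create the deployment
kubectl apply -f deployment/app/app.yaml

#verify that the pod is running
kubectl get pods -l app=springboot-app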


Configuring the Horizontal Pod Autoscaler

Let us now enable Horizontal Pod Autoscaling for the deployment created above. We configure the HPA to scale based on CPU utilization. The YAML file is shown below.

YAML

#springboot-app-hpa.yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: springboot-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: springboot-app
  minReplicas: 1
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50


The HPA is also an API resource in Kubernetes, with apiVersion, kind, metadata, and spec fields (refer to the K8s documentation for more details). The scaling condition is defined by the target averageUtilization of the cpu resource metric. Here we specify a value of 50, which means that if the average CPU utilization across the pods exceeds 50%, the scaling process starts. The value should be between 1 and 100.
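
To put the autoscaler into effect, apply the manifest and watch its status. A minimal sketch, assuming the file is saved as springboot-app-hpa.yaml (as in the comment above) and that a metrics source such as metrics-server is available in the cluster:

Shell

#create the HorizontalPodAutoscaler
kubectl apply -f springboot-app-hpa.yaml

#watch current vs. target CPU utilization and the replica count
kubectl get hpa springboot-app-hpa --watch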

Deploying JMeter

To test the performance of the application, we use JMeter as the load-testing client. To deploy JMeter, we created a Docker image. The following is the Dockerfile. The files used can be found in this repo.

Dockerfile

FROM anushiya/jmeter-plugins:v1

ADD bash /home/kubernetes-performance/bash
ADD jar /home/kubernetes-performance/jar
ADD jmx /home/kubernetes-performance/jmx
ADD python /home/kubernetes-performance/python

WORKDIR /home/kubernetes-performance/bash

RUN chmod +x start_performance_test.sh

RUN apt-get update && apt-get install python3.5 -y
RUN apt-get install python-pip -y
RUN pip install numpy requests schedule
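
The image then has to be built and pushed to a registry the cluster can pull from. A minimal sketch, assuming the image is tagged anushiya/perf-test:v1 to match the Job spec used later; replace the tag with your own repository if you build it yourself:

Shell

#build the JMeter test image from the Dockerfile above
docker build -t anushiya/perf-test:v1 .

#push the image to a registry accessible from the cluster
docker push anushiya/perf-test:v1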


Since we want to store the performance test results permanently, we use a host volume. To create one, SSH into any of the nodes and create a directory to mount.

Shell

#Get the list of nodes
kubectl get nodes

#Select a node and ssh into it
sudo gcloud beta compute --project "[name of the project]" ssh --zone "[zone]" "[name of the node]"

#example
sudo gcloud beta compute --project "performance-testing" ssh --zone "us-central1-a" "gke-performance-testing-default-pool-b6e4d476-78zn"

#Create a directory to mount
sudo mkdir -p /mnt/data/results


Create a persistent volume.

YAML

#pv-volume.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-volume
  labels:
    type: local
spec:
  storageClassName: manual
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/mnt/data/results"


Create a persistent volume claim.

YAML

#deployment/volume/pv-claim.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pv-claim
spec:
  storageClassName: manual
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 6Gi




Apply the YAML files to create the persistent volume and persistent volume claim.

Shell

#create persistent volume
kubectl apply -f deployment/volume/pv-volume.yaml

#create persistent volume claim
kubectl apply -f deployment/volume/pv-claim.yaml
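
Before running the tests, it is worth verifying that the claim has been bound to the volume. A minimal check:

Shell

#verify that the claim is bound to the volume
kubectl get pv,pvc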


For more details about PersistentVolume and PersistentVolumeClaim, see this. Now that we have created volumes to store the test results, we'll move on to creating a Job to perform the tests. The test results can be found in the directory specified above.

YAML

#perf-test.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: perf-test
spec:
  template:
    spec:
      containers:
      - name: perf-test
        image: anushiya/perf-test:v1
        imagePullPolicy: Always
        command: ["bash", "start_performance_test.sh"]
        volumeMounts:
        - mountPath: "/home/kubernetes-performance/results"
          name: pv-storage
      restartPolicy: Never
      volumes:
      - name: pv-storage
        persistentVolumeClaim:
          claimName: pv-claim
  backoffLimit: 4
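
The test is started by creating the Job and can be followed through its logs. A minimal sketch, assuming the manifest is saved as perf-test.yaml (as in the comment above):

Shell

#run the load test
kubectl apply -f perf-test.yaml

#follow the test output
kubectl logs -f job/perf-test

#check whether the Job has completed
kubectl get jobs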


Analyzing the Behavior of CPU Utilization, Latency and Pod Count

Let us now take a look at how the CPU utilization, pod count, and latency vary with time. The following figures show the variation in CPU utilization, pod count, and latency when we run the performance test with a single concurrent user. We have used the Stackdriver Monitoring API to get the performance statistics (see this link for more details).

CPU utilization vs time

Response time

Number of pods vs time

Initially, the pod count is 1 (the minimum number of pods specified in the HorizontalPodAutoscaler configuration). As the load increases, the pod count increases up to 20 to maintain the CPU utilization below 50%. When we increase the number of concurrent users to 100, we notice the following behavior.


CPU utilization vs time



Response time


Pods vs time, concurrency 10

Further Reading

Vertical Scaling and Horizontal Scaling in AWS

Reasons to Scale Horizontally

