Kubernetes Autoscaling 101


Want to learn more about autoscaling in Kubernetes? Check out this post where we take a look at working with specific autoscalers in Kubernetes.


Kubernetes, at its core, is a resource management and orchestration tool. Day-1 operations let you explore and play around with its cool features to deploy, monitor, and control your pods. However, you need to think about day-2 operations as well. In this post, we will focus on questions like:

  • How am I going to scale pods and applications?
  • How can I keep containers running in a healthy state and running efficiently?
  • With the ongoing changes in my code and my users’ workloads, how can I keep up with such changes?

How do I know? In my work at Magalix, we help companies and developers find the right balance between performance and capacity inside their Kubernetes clusters, and we went through a lot of pain and learning cycles to make Kubernetes work properly.

We have been using Kubernetes for more than a year now, and what I’m sharing here are some highlights related to autoscaling Kubernetes.

Feel free to ask questions about your specific situation in the comments below. I’ll be happy to share how we solved similar problems in our infrastructure.

Configuring Kubernetes clusters to balance resources and performance can be challenging and requires expert knowledge of the inner workings of Kubernetes. Your app or service’s workload isn’t constant; it fluctuates throughout the day (if not the hour). Think of it as a journey and an ongoing process.

Remember: to truly master Kubernetes, you need to master the different ways to manage the scale of cluster resources. That’s the core promise of Kubernetes.


Kubernetes Autoscaling Building Blocks

Effective Kubernetes autoscaling requires coordination between two layers of scalability: (1) the pod layer, which includes the Horizontal Pod Autoscaler (HPA) and the Vertical Pod Autoscaler (VPA), both of which scale the resources available to your containers; and (2) the cluster level, managed by the Cluster Autoscaler (CA), which scales the number of nodes inside your cluster up or down.

Horizontal Pod Autoscaler (HPA)

As the name implies, HPA scales the number of pod replicas. Most DevOps teams use CPU and memory as the triggers for scaling pod replicas up or down. However, you can configure it to scale your pods based on custom metrics, multiple metrics, or even external metrics.

High-Level HPA Workflow


  1. HPA continuously checks the metrics values you configure during setup, at a default 30-second interval
  2. HPA attempts to increase the number of pods if the specified threshold is met
  3. HPA mainly updates the number of replicas inside the deployment or replication controller
  4. The deployment/replication controller then rolls out any additionally needed pods
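
The workflow above can be driven by a manifest like the following minimal sketch. The names (web, web-hpa) are hypothetical, and the API version depends on your cluster version (autoscaling/v1 supports CPU-only targets):

```yaml
apiVersion: autoscaling/v2beta2     # or autoscaling/v1 for a CPU-only target
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa                     # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                       # hypothetical deployment to scale
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70      # add replicas when average CPU crosses 70%
```

HPA then adjusts the deployment's replica count between minReplicas and maxReplicas to keep average utilization near the target.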

Consider These as You Roll Out HPA:

  • The default HPA check interval is 30 seconds. This can be configured through the horizontal-pod-autoscaler-sync-period flag of the controller manager
  • The default HPA relative metrics tolerance is 10 percent
  • HPA waits 3 minutes after the last scale-up event to allow metrics to stabilize. This can also be configured through the horizontal-pod-autoscaler-upscale-delay flag
  • HPA waits 5 minutes after the last scale-down event to avoid autoscaler thrashing. This is configurable through the horizontal-pod-autoscaler-downscale-delay flag
  • HPA works best with deployment objects as opposed to replication controllers. It does not work with rolling updates that directly manipulate replication controllers; it depends on the deployment object to manage the size of the underlying replica sets when you perform a deployment
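
Note that these knobs live on the kube-controller-manager, not on the HPA object itself. A sketch of the relevant flags with the default values mentioned above (verify against your Kubernetes version; some of these were later replaced by per-HPA behavior settings):

```yaml
# kube-controller-manager flags (illustrative; values shown are the defaults)
--horizontal-pod-autoscaler-sync-period=30s        # HPA check interval
--horizontal-pod-autoscaler-tolerance=0.1          # 10% relative metrics tolerance
--horizontal-pod-autoscaler-upscale-delay=3m0s     # wait after the last scale-up
--horizontal-pod-autoscaler-downscale-delay=5m0s   # wait after the last scale-down
```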

Vertical Pod Autoscaler

Vertical Pod Autoscaler (VPA) allocates more (or less) CPU or memory to existing pods. Think of it as giving pods some growth hormones! It can work for both stateful and stateless pods, but it is built mainly for stateful services. You can use it for stateless pods as well if you would like to implement auto-correction of the resources you initially allocated for your pods. VPA can also react to OOM (out of memory) events.

VPA currently requires pods to be restarted to change their allocated CPU and memory. When VPA restarts pods, it respects the pod disruption budget (PDB) to make sure there is always the minimum required number of pods. You can set the min and max of the resources that VPA can allocate to any of your pods. For example, you can limit the maximum memory to no more than 8 GB. This is useful, in particular, when you know that your current nodes cannot allocate more than 8 GB per container. Read the VPA’s official wiki page for the detailed spec and design.

VPA also has an interesting feature called the VPA Recommender. It watches the historical resource usage and OOM events of all pods to suggest new values for the “request” resources spec. The Recommender calculates memory and CPU values based on historical metrics, and it also provides an API that takes a pod descriptor and returns suggested resource requests.

It is worth mentioning that the VPA Recommender doesn’t set the “limit” of resources. This can cause pods to monopolize resources inside your nodes. I suggest you set a “limit” value at the namespace level to avoid crazy consumption of memory or CPU.
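
One way to enforce such a namespace-level cap is a LimitRange, which applies default requests and limits to containers that don't declare their own. A minimal sketch, with a hypothetical namespace and illustrative values:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: my-namespace      # hypothetical namespace
spec:
  limits:
  - type: Container
    default:                   # default "limit" for containers without one
      cpu: "1"
      memory: 1Gi
    defaultRequest:            # default "request" for containers without one
      cpu: 250m
      memory: 256Mi
```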

High-Level VPA Workflow


  1. VPA continuously checks the metrics values you configured during setup, at a default 10-second interval
  2. VPA attempts to change the allocated memory and/or CPU if the threshold is met
  3. VPA mainly updates the resources inside the deployment or replication controller specs
  4. When the pods are restarted, the new resources are applied to the newly created instances.
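
A VPA object for the workflow above might look like this minimal sketch. VPA is a CRD, so the API version depends on the release you installed; the names (web, web-vpa) are hypothetical, and the maxAllowed value mirrors the 8 GB node example from earlier:

```yaml
apiVersion: autoscaling.k8s.io/v1beta2   # CRD; version depends on your VPA install
kind: VerticalPodAutoscaler
metadata:
  name: web-vpa                # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                  # hypothetical deployment
  updatePolicy:
    updateMode: Auto           # VPA evicts and recreates pods with new requests
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: "2"
        memory: 8Gi            # cap requests at what a node can actually hold
```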

A Few Points to Consider as You Roll Out the VPA:

  • Changes in resources are not yet possible without restarting the pod. The main rationale, so far, is that such a change could cause a lot of instability. Hence, the thinking is to restart the pods and let them be scheduled based on the newly allocated resources.
  • VPA and HPA are not yet compatible with each other and cannot work on the same pods. Make sure you separate their scopes in your setup if you are using them both inside the same cluster.
  • VPA adjusts only the resource requests of containers, based on past and current observed resource usage. It doesn’t set resource limits. This can be problematic with misbehaving applications that start using more and more resources, leading to pods being killed by Kubernetes.
  • VPA is in its early stages. It will evolve in the next few months, so be prepared for that! Details on known limitations and future work can be found in the project’s documentation.

Cluster Autoscaler

Cluster Autoscaler (CA) scales your cluster’s nodes based on pending pods. It periodically checks whether there are any pending pods and increases the size of the cluster if more resources are needed and if the scaled-up cluster would still be within the user-provided constraints. CA interfaces with the cloud provider to request more nodes or deallocate idle nodes. It works with GCP, AWS, and Azure. Version 1.0 (GA) was released with Kubernetes 1.8.

High-Level CA Workflow

  1. The CA checks for pods in a pending state at a default interval of 10 seconds.
  2. If one or more pods are pending because there are not enough available resources on the cluster to allocate to them, CA attempts to provision one or more additional nodes.
  3. When a node is granted by the cloud provider, it is joined to the cluster and becomes ready to serve pods.
  4. The Kubernetes scheduler allocates the pending pods to the new node. If some pods are still pending, the process repeats and more nodes are added to the cluster.
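
CA itself typically runs as a deployment inside the cluster. A sketch of the container spec for an AWS setup; the node group name and version tag are hypothetical, and the flag values shown match the defaults discussed below:

```yaml
# Fragment of a cluster-autoscaler Deployment pod spec (AWS example)
containers:
- name: cluster-autoscaler
  image: k8s.gcr.io/cluster-autoscaler:v1.3.0   # pick the tag matching your cluster
  command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --nodes=2:10:my-node-group        # min:max:name of the ASG (hypothetical)
  - --scan-interval=10s               # how often pending pods are checked
  - --scale-down-unneeded-time=10m    # wait before removing an unneeded node
  - --expander=least-waste            # node-group selection strategy
```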

Consider These as You Roll Out the CA

  • Cluster Autoscaler makes sure that all pods in the cluster have a place to run, whether or not there is any CPU load. Moreover, it tries to ensure that there are no unneeded nodes in the cluster.
  • CA realizes a scalability need in about 30 seconds.
  • CA waits 10 minutes by default after a node becomes unneeded before it scales it down.
  • CA has the concept of expanders. Expanders provide different strategies for selecting the node group to which new nodes will be added.
  • Use the "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" annotation responsibly. If you set it on many pods, or on enough pods spread across all your nodes, you will lose a lot of flexibility to scale down.
  • Use PodDisruptionBudgets to prevent pods from being evicted in a way that leaves part of your application fully non-functional.
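
A minimal PodDisruptionBudget sketch for that last point; the name and label are hypothetical, and policy/v1 replaces policy/v1beta1 on newer clusters:

```yaml
apiVersion: policy/v1beta1      # policy/v1 on Kubernetes 1.21+
kind: PodDisruptionBudget
metadata:
  name: web-pdb                 # hypothetical name
spec:
  minAvailable: 2               # CA will not evict pods below this count
  selector:
    matchLabels:
      app: web                  # hypothetical label selecting your pods
```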

How Kubernetes Autoscalers Interact Together

If you would like to reach the nirvana of autoscaling your Kubernetes cluster, you will need to use the pod layer autoscalers together with the CA. The way they work with each other is relatively simple:

  1. HPA or VPA updates pod replicas or the resources allocated to an existing pod.
  2. If there are not enough nodes to run the pods after the scaling event, some or all of the scaled pods go into a pending state, which the CA picks up.
  3. CA allocates new nodes.
  4. The pods are scheduled on the provisioned nodes.

Common Mistakes

I’ve seen mistakes made in many different forums, such as Kubernetes Slack channels and StackOverflow questions, as well as common issues that many DevOps teams miss while getting their feet wet with autoscalers.

HPA and VPA depend on metrics and historical data. If you don’t have enough resources allocated, your pods will be OOM killed and never get a chance to generate metrics. Your scale-up may never take place in this case.

Scaling up is mostly a time-sensitive operation. You want your pods and cluster to scale fairly quickly, before your users experience any disruption or crash in your application. You should consider the average time it takes your pods and cluster to scale up.

Best Case Scenario — 4 Minutes

  1. 30 seconds — Target metrics values updated
  2. 30 seconds — HPA checks on metrics values
  3. < 2 seconds — Pods are created and go into a pending state
  4. < 2 seconds — CA sees the pending pods and fires up the calls to provision nodes
  5. 3 minutes — The cloud provider provisions the nodes and Kubernetes waits for them until they are ready

(Reasonable) Worst Case Scenario — 12 minutes

  1. 60 seconds — Target metrics values updated
  2. 30 seconds — HPA checks on metrics values
  3. < 2 seconds — Pods are created and go into a pending state
  4. < 2 seconds — CA sees the pending pods and fires up the calls to provision nodes
  5. 10 minutes — The cloud provider provisions the nodes and Kubernetes waits for them until they are ready. This depends on multiple factors, such as provider latency, OS latency, bootstrapping tools, etc.

Do not confuse cloud provider scalability mechanisms with the CA. CA works from within your cluster, while the cloud provider’s scalability mechanisms (such as ASGs in AWS) work based on node allocation and are not aware of what’s taking place with your pods or application. Using them together will render your cluster unstable, with hard-to-predict behavior.


The insights I have shared in this article come from my work building Magalix — an AI that provides resource management and recommendations to companies using Kubernetes. Building fully elastic Kubernetes-managed microservices is difficult and still requires a lot of legwork.

Here’s the quick version of what you need to understand about Kubernetes Autoscaling:

  • Kubernetes is a resource management and orchestration tool. Day-2 operations to manage your pods and your cluster’s resources are a key milestone in your journey of mastering Kubernetes.
  • Have the right mental model in mind, focusing on pod scalability using HPA and VPA.
  • CA is recommended if you have a good understanding of your pods and containers needs.
  • Understanding how different autoscalers work together will help you configure your cluster.
  • Make sure you plan for the worst case and best case scenarios when it comes to how long it will take your pods and cluster to scale up or down.

I have also written about Kubernetes Monitoring and will continue to share what I have learned here.

Feel free to ask questions in the comments below!


Published at DZone with permission of Mohamed Ahmed. See the original article here.

