Please Don’t Evict My Pod: Priority and Budget Disruption
In this post, we are going to cover the pod priority class, the pod disruption budget, and the relationship of these constructs with pod eviction. Okay, enough talking; let's start with the pod priority class.
PriorityClass and Preemption
PriorityClass is a stable Kubernetes object since version 1.14. It is part of the scheduling API group and defines a mapping between a priority class name and an integer priority value. PriorityClass is straightforward to understand: the higher the integer value, the higher the priority. Take, for example, a PriorityClass with a value of ten and another with a value of twenty; the latter holds a higher priority than the former.
PriorityClass is a non-namespaced object and has one optional boolean field named globalDefault. Among all the PriorityClass objects in a cluster, at most one can set globalDefault=true, which means the integer value of that object becomes the default priority of every pod in the cluster that has no explicit priorityClassName in its definition. If there is no PriorityClass object with globalDefault=true, the default pod priority is zero. If we later add an object with globalDefault=true, all new pods without an explicit priorityClassName get a priority equal to the integer value of that PriorityClass object; however, the priority of existing pods remains zero. By default, a Kubernetes cluster ships with two PriorityClasses:
system-node-critical and system-cluster-critical. system-node-critical is the highest available priority, even higher than system-cluster-critical.
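A user-defined PriorityClass can be declared as in the following sketch (the name, value, and description are illustrative):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: default-priority
value: 100
globalDefault: true   # at most one PriorityClass per cluster may set this
description: "Default priority for pods without an explicit priorityClassName."
```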
Let’s see how the priority of a pod affects the behaviour of kube-scheduler and results in the eviction of other pods from a node.
Kube-scheduler tries to schedule a newly created pod on the cluster; however, if the resources the pod requires are not available on any node, the PriorityClass preemption logic comes into the picture. Based on the priority of the pod, kube-scheduler determines a node where evicting lower-priority pods would allow the new pod to run. The preemption process then evicts those lower-priority pods from the node so the higher-priority pod can be scheduled there.
A PriorityClass object has a field named preemptionPolicy, which defines how the class behaves with respect to preemption. Its default value is preemptionPolicy: PreemptLowerPriority, which allows pods of that PriorityClass to preempt lower-priority pods. If preemptionPolicy: Never, pods of that PriorityClass will not preempt other pods. Let’s quickly see an example of preempting and non-preempting classes:
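A minimal sketch of such a pair of classes (the names and values are illustrative):

```yaml
# Preempting class: pods using it may evict lower-priority pods to get scheduled.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-preempting
value: 1000000
preemptionPolicy: PreemptLowerPriority   # the default; shown here for clarity
description: "High priority; may preempt lower-priority pods."
---
# Non-preempting class: pods using it wait in the scheduling queue instead of
# evicting others, but still jump ahead of lower-priority pending pods.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting
value: 1000000
preemptionPolicy: Never
description: "High priority, but never preempts other pods."
```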
Hold that thought on preemption; we will revisit it after formalizing our understanding of the pod disruption budget.
Pod Disruption Budget
PodDisruptionBudget (PDB) is also a Kubernetes object, and it works at the application level. A PDB limits the number of pods of a replicated application that can go down simultaneously; it is an indicator of how much disruption an application can tolerate at a given time. One of the best use cases for a PDB is an application that requires quorum management, for example ZooKeeper. Below is the definition of a PDB object which specifies that the minimum availability of the pods should be two.
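A minimal sketch of such a PDB (the name and the app label are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: zookeeper-pdb
spec:
  minAvailable: 2          # at least two pods must stay up during voluntary disruptions
  selector:
    matchLabels:
      app: zookeeper
```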
Commands for PDB
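A few kubectl commands for working with PDBs (the file and object names are illustrative):

```shell
# Create the PDB from a manifest file
kubectl apply -f zookeeper-pdb.yaml

# List PDBs with their minimum availability and currently allowed disruptions
kubectl get poddisruptionbudgets

# Inspect a specific PDB, including its DisruptionsAllowed status
kubectl describe pdb zookeeper-pdb

# Delete the PDB
kubectl delete pdb zookeeper-pdb
```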
The PDB of an application is an important aspect to take into consideration while performing voluntary disruptions in a K8s cluster: the disruption process is halted to maintain the disruption budget of the app. A PDB is very helpful for cluster activities like draining a node or rebalancing the cluster with projects like Descheduler, but is a PDB useful in preemption too?
Preemption respects PDBs on a best-effort basis: the scheduler tries to pick eviction victims whose PDBs would not be violated, but if no such option is available, preemption happens anyway and the PDB of the app is dishonoured. You can try out the interaction of PDBs and eviction yourself on a test cluster.
Warning: in a cluster where not all users are trusted, a malicious user could create pods at the highest possible priorities, causing other pods to be evicted or to remain pending. Also, improper use of PriorityClass may lead to a cascading failure that eventually results in a production outage, like the one shared by the Grafana community.
How Pod PriorityClass, QoS Class, and Eviction Policy Are Linked
The PriorityClass and QoS class of a pod are two independent, unrelated features; there are no rules linking the QoS class of a pod to its priority. Hence, to schedule a high-priority pod, the scheduler can preempt even a Guaranteed QoS class pod if that pod's priority is lower.
The only component that considers both QoS and Pod priority is kubelet out-of-resource eviction. The kubelet ranks Pods for eviction first by whether or not their usage of the starved resource exceeds requests, then by priority, and then by the consumption of the starved compute resource relative to the Pods’ scheduling requests.
Putting it All Together
Pod priority, QoS class, and eviction policy together form a balancing act in a K8s cluster. Adding new objects without considering their effect on the others can destabilize the cluster state and lead to catastrophe. In another post, I will share some best practices that help manage the cluster state with fewer evictions.