How kubelet Eviction Policies Impact Cluster Rebalancing
Familiarize yourself with how exactly kubelet's evictions impact performance and how you can utilize them to lower bills and effectively commit resources.
In this post, we’ll help you understand the automatic pod eviction and rescheduling that occurs when a particular host resource is being depleted.
The “kubelet” agent daemon is installed on all Kubernetes hosts to manage container creation and termination. By default, this daemon has the following eviction rule: memory.available<100Mi. So, when a host is low on memory, kubelet will evict one of the running pods to free memory on the host machine. (This is not a random decision, and we’ll describe its logic a bit later in this post.)
You can pass special flags to kubelet at startup to specify how pods are evicted when memory or disk space runs low. Disk space has separate settings for the filesystem that stores container images and for the filesystem used by the running containers themselves.
Thresholds can be set on the following eviction signals:
memory.available: Free memory on host server
nodefs.available: Containers filesystem free space (docker volumes, logs, etc.)
nodefs.inodesFree: Containers filesystem available inodes
imagefs.available: Images filesystem free space (docker images and container writable layers)
imagefs.inodesFree: Images filesystem available inodes
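To make the threshold syntax concrete, here is a minimal Python sketch that parses a value such as memory.available<100Mi and checks it against observed numbers. It is only an illustration, not kubelet's own parser: the helper names (`parse_quantity`, `parse_thresholds`, `crossed`) are hypothetical, and only the Ki/Mi/Gi suffixes and plain integers are handled.

```python
import re

# Binary suffixes handled by this sketch (an assumed subset, not the full syntax).
UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}

def parse_quantity(text):
    """Convert '100Mi' or '500' to a plain number (bytes or an inode count)."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi)?", text)
    if match is None:
        raise ValueError(f"unsupported quantity: {text!r}")
    number, unit = match.groups()
    return int(number) * UNITS.get(unit, 1)

def parse_thresholds(flag_value):
    """Split 'signal<quantity,signal<quantity' into {signal: threshold}."""
    thresholds = {}
    for item in flag_value.split(","):
        signal, quantity = item.split("<")
        thresholds[signal.strip()] = parse_quantity(quantity.strip())
    return thresholds

def crossed(thresholds, observed):
    """Return the signals whose observed value is below its threshold."""
    return sorted(
        signal
        for signal, limit in thresholds.items()
        if observed.get(signal, limit) < limit  # missing signals never trigger
    )
```

For instance, `crossed(parse_thresholds("memory.available<100Mi"), {"memory.available": 80 * 2**20})` reports memory.available as crossed, because 80Mi of free memory is below the 100Mi threshold.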
And these are the flags to apply to kubelet eviction policies:
1. The “--eviction-soft” and “--eviction-soft-grace-period” flags must be used together. If you don’t specify a grace period for a soft limit, kubelet will fail to start, displaying an error like:
error: failed to run Kubelet: failed to create kubelet: grace period must be specified for the soft eviction threshold nodefs.available
So, with “--eviction-soft”, you specify thresholds per resource, for example nodefs.available<1Gi and nodefs.inodesFree<500.
With “--eviction-soft-grace-period”, you set the grace period (in time units) that must pass before eviction starts, for example nodefs.available=1h.
In this example, if we have less than 1Gi of disk space for more than an hour, or fewer than 500 free inodes, on the nodefs filesystem (container volumes and logs), kubelet will select a pod to terminate.
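The way a soft threshold combines with its grace period can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not kubelet's implementation: the `SoftThreshold` class and its field names are hypothetical, time is passed in as plain seconds, and the timer resets whenever the resource recovers.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SoftThreshold:
    limit: int                           # signal must stay below this value
    grace_seconds: int                   # how long it must stay below it
    below_since: Optional[float] = None  # time the value first dropped below

    def observe(self, value, now):
        """Record one observation; return True once the grace period has elapsed."""
        if value >= self.limit:
            self.below_since = None      # recovered: reset the timer
            return False
        if self.below_since is None:
            self.below_since = now       # first observation below the limit
        return now - self.below_since >= self.grace_seconds
```

With `SoftThreshold(limit=2**30, grace_seconds=3600)`, a brief dip below 1Gi that recovers within the hour never triggers an eviction; only a dip that lasts the full grace period does.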
To allow your pods time for a clean process shutdown, set the “--eviction-max-pod-grace-period” time. This allows kubelet to signal your pod containers for graceful shutdown.
2. “--eviction-hard” does not allow any time for graceful shutdown of a container. If the limit is exceeded, kubelet will take immediate action and terminate a chosen pod to free the resource. This flag is used exactly like “--eviction-soft”, and you can choose from the five signals listed earlier.
3. “--eviction-minimum-reclaim” helps to avoid flapping. Flapping is when hosts frequently reach eviction thresholds and, after a minimal cleanup by kubelet, quickly become full again. You can set the amount of “extra” space/memory/inodes that must be cleaned when the eviction signal occurs, in addition to the minimum required. Let’s look at the following scenario.
Your host has 8Gi of memory, with a hard limit set at 200Mi. You run tens of small pods with an average consumption of 60–80Mi. After some time, 180Mi free memory is left and kubelet kills one pod, freeing up 60Mi. The host now has 240Mi free.
The next pod is scheduled, because it requests only 20Mi. (Without a limit set, or with a 500Mi limit set, it can still be scheduled on this node because the Kubernetes scheduler makes decisions based on “request”, not limit.) This 20Mi pod quickly eats up 100Mi, leaving 140Mi free and triggering another eviction. But if you’d set a minimum reclaim of 500Mi, upon reaching <200Mi, kubelet would free much more memory, and the host would operate with no evictions for a longer time period.
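The arithmetic of this scenario can be checked with a small Python sketch. It assumes a simplified model in which kubelet keeps evicting until free memory reaches the threshold plus the minimum reclaim; the function `evictions_needed` and its parameters are hypothetical, and all values are in Mi.

```python
def evictions_needed(free_mi, threshold_mi, min_reclaim_mi, pod_sizes_mi):
    """Evict pods (in the given order) until free memory reaches
    threshold + minimum reclaim; return (evicted_count, free_after)."""
    if free_mi >= threshold_mi:
        return 0, free_mi                # threshold not crossed, nothing to do
    target = threshold_mi + min_reclaim_mi
    evicted = 0
    for size in pod_sizes_mi:
        if free_mi >= target:
            break                        # enough memory has been reclaimed
        free_mi += size                  # evicting a pod frees its memory
        evicted += 1
    return evicted, free_mi

# Without a minimum reclaim: one 60Mi pod is evicted, leaving 240Mi free.
print(evictions_needed(180, 200, 0, [60] * 10))
# With a 500Mi minimum reclaim: eviction continues until at least 700Mi is free.
print(evictions_needed(180, 200, 500, [60] * 10))
```

In the second call, nine 60Mi pods are evicted in one pass, so the node ends up with 720Mi free instead of hovering just above the 200Mi threshold.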
When an eviction signal is received by Kubernetes (indicating a hard or soft limit is exceeded), it will switch one of the “MemoryPressure” or “DiskPressure” node conditions to true. No new “BestEffort” “quality of service” (QoS) pods will be sent to a node that has “MemoryPressure.” No new pods will be scheduled on a node with “DiskPressure.”
After kubelet does its cleanup, the node’s transition back to a normal state is delayed for the duration of the “--eviction-pressure-transition-period” kubelet setting. You should use this option to avoid oscillation of node conditions, which may happen if pods exceed the soft limits frequently but never for long enough to outlast the soft-limit grace period (which is what gets pods terminated).
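The damping effect of the transition period can be illustrated with a short Python sketch. It is a simplified model, not kubelet code: the `PressureCondition` class is hypothetical, time is given in plain seconds, and the condition is held for the whole transition period after the last moment a threshold was seen exceeded.

```python
class PressureCondition:
    """Tracks a node condition such as MemoryPressure (hypothetical helper)."""

    def __init__(self, transition_seconds):
        self.transition_seconds = transition_seconds
        self.last_exceeded = None        # last time a threshold was exceeded

    def update(self, threshold_exceeded, now):
        """Return True while the node should report the pressure condition."""
        if threshold_exceeded:
            self.last_exceeded = now
            return True
        if self.last_exceeded is None:
            return False                 # never under pressure so far
        # Keep reporting pressure until the transition period has fully passed.
        return now - self.last_exceeded < self.transition_seconds
```

With `PressureCondition(300)`, a threshold that is exceeded at second 10 keeps the condition true until second 310, even if the resource recovers right away, so short spikes do not flip the node condition back and forth.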
The logic behind the decision to terminate a particular pod is based on the QoS class of the pod and its current resource usage relative to what was requested during pod start. There are three QoS classes: Guaranteed, Burstable, and BestEffort.
QoS classes are determined automatically, based on requests and limits specified in the pod manifest.
If a pod has “requests” set (either CPU or memory) but no “limits” (or its limits are higher than its requests), it is assigned the “burstable” class, which means it will be scheduled based on its requests and can utilize more resources (provided other pods don’t need them). On the other hand, “guaranteed” pods (those that specify the same numbers in both requests and limits, or have only limits specified) are considered top-priority and are guaranteed not to be killed until they exceed their limits or until the system is under memory pressure and there are no lower-priority containers that can be evicted.
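The classification and selection logic described above can be sketched in Python. This is a deliberately simplified model: the function names and the dictionary layout are hypothetical, pods are treated as having a single container, and the ranking only considers QoS class and memory usage above the request. One assumption to note is that the sketch requires both CPU and memory limits for the “guaranteed” class, which is how current Kubernetes versions behave.

```python
QOS_ORDER = {"BestEffort": 0, "Burstable": 1, "Guaranteed": 2}

def qos_class(requests, limits):
    """Classify a single-container pod from its resource dicts,
    e.g. requests={'cpu': 0.5, 'memory': 256}."""
    if not requests and not limits:
        return "BestEffort"
    # With only limits given, requests default to the limits.
    effective_requests = {**limits, **requests}
    has_both = all(resource in limits for resource in ("cpu", "memory"))
    if has_both and effective_requests == limits:
        return "Guaranteed"
    return "Burstable"

def eviction_order(pods):
    """Sort pods so the first entry is the first eviction candidate:
    lowest QoS class first, then largest memory usage above the request."""
    def key(pod):
        over_request = pod["usage"] - pod["request"]
        return (QOS_ORDER[pod["qos"]], -over_request)
    return sorted(pods, key=key)
```

For example, a pod with only `limits={'cpu': 1, 'memory': 100}` is classified as Guaranteed, while a pod with a memory request and no limits is Burstable and would be picked for eviction before any Guaranteed pod.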
Important: Pods will not be killed if they exceed their CPU limit.
Instead, their container processes will be throttled, and the CPU time they receive will depend on other pods’ consumption and limits. This also takes into account the running system processes on a host.
When a pod is terminated by kubelet, Kubernetes will reschedule the pod immediately on another host where enough resources are available. Kubernetes may reschedule the pod again to the same machine if you have an incorrect combination of settings on kubelet and pod requests/limits.
Fine-tuning cluster rebalancing mechanisms is not an easy task and requires a good understanding of all your options. But finding the combination of settings that matches the needs of your workload is definitely worth the effort because you will be able to overcommit. You’ll also achieve good utilization of cluster resources, leading to fewer servers running and a lower monthly bill (assuming you’re in the cloud) for computing and storage.
Published at DZone with permission of Oleg Chunikhin. See the original article here.