
Beyond Kube-Scheduler, a Need for a K8s Cluster Balancer


In this article, we take a quick look at Kube-scheduler and the Kubernetes scheduling framework, and see why a cluster also needs a balancer such as the Descheduler project.


Before writing about Kube-scheduler and the premise of a cluster balancer, it is worth saying a few lines about Kubernetes. Kubernetes' popularity reaches new heights every day, and it is becoming the de facto solution for many usual and unusual distributed-systems problems.

Kubernetes (K8s) is an open-source distributed system for automating deployments, scaling them to match traffic and other business-availability needs, and managing containerized applications.

I want to share my initial experience with a containerized application, the problems I faced managing it, and how Marathon, and later Kubernetes, became a savior for me. Today, however, I am picking a slightly more advanced, although not complicated, topic related to Kubernetes cluster management. Okay, let's dive into Kube-scheduler and the K8s scheduling framework, and finally I'll introduce a not-very-popular though elegant project for balancing a Kubernetes cluster, named Descheduler.

Concepts in Kube-scheduler and Kubernetes Scheduling Framework

We know that every distributed system needs a process or program to schedule a job or task on the cluster for execution; Kubernetes is no different, and, as the name suggests, Kube-scheduler performs that role for Kubernetes. Kube-scheduler runs as part of the control plane and watches for newly created pods that have no node assigned to them.

In simple words, Kube-scheduler is responsible for discovering newly created pods and selecting an optimal node for them to run on. You might be wondering how Kube-scheduler determines the best available node. As per the K8s official documentation, node selection for a pod is a two-step process:

  1. Filtering: Find the set of nodes where it is feasible to schedule the pod. Nodes that meet the pod's scheduling criteria are known as feasible nodes. If the filtering step yields an empty set of feasible nodes, the pod remains unscheduled until the scheduler can find a feasible node for it. Please follow the filtering documentation for more information about the filtering criteria.
  2. Scoring: In this step, the scheduler ranks the feasible nodes to choose the most suitable one for pod placement. The scheduler follows the active scoring rules to determine the rank of each node. (A pod spec that drives both steps is sketched just after this list.)
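
For instance, here is a minimal, hypothetical pod spec (the pod name, label, and image are illustrative) showing the fields the scheduler looks at: the nodeSelector and resource requests drive filtering, since only nodes carrying the label and having enough free CPU and memory are feasible, and scoring then ranks those feasible nodes.

YAML

apiVersion: v1
kind: Pod
metadata:
  name: web-frontend              # hypothetical pod name
spec:
  nodeSelector:
    disktype: ssd                 # filtering: only nodes with this label are feasible
  containers:
    - name: app
      image: nginx:1.19           # illustrative image
      resources:
        requests:
          cpu: "500m"             # filtering: nodes without 500m free CPU are excluded
          memory: "256Mi"         # filtering: nodes without 256Mi free memory are excluded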

Once Kube-scheduler determines the most suitable node for pod placement, it notifies the API server of the decision in a process called binding. As per the K8s scheduling-framework documentation, the node-selection and binding processes are known as the scheduling cycle and the binding cycle, respectively. Together, a scheduling cycle and a binding cycle are known as a scheduling context.

The scheduling framework is a new pluggable architecture for the K8s scheduler that makes scheduler customizations easy. The scheduling cycle runs serially, while binding cycles may run concurrently. The following picture depicts the scheduling framework.

[Figure: Pod scheduling context]

I am planning to write another post detailing each extension point in the scheduling context and how to introduce custom scheduling logic.
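
In the meantime, here is a minimal sketch of what such a customization looks like: a scheduler profile that disables one built-in score plugin and gives another extra weight. It assumes the kubescheduler.config.k8s.io/v1beta1 API; the exact API version and the available plugin names depend on your Kubernetes release, so check the scheduling-framework documentation for your cluster.

YAML

apiVersion: kubescheduler.config.k8s.io/v1beta1   # API version is an assumption; it varies by release
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    plugins:
      score:
        disabled:
          - name: ImageLocality                   # switch off a built-in score plugin
        enabled:
          - name: NodeResourcesBalancedAllocation
            weight: 2                             # give a built-in score plugin extra weight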

Why Do We Need a Kubernetes Cluster Balancer?

From Kube-scheduler's perspective, it did a perfect job by assigning each pod to an optimal node. But the scheduler takes its decision based on its view of the Kubernetes cluster at the point in time when a new pod appears for scheduling. Kubernetes clusters are very dynamic, and their state changes over time because of cluster-wide changes, like the addition or removal of a node, or the addition or removal of constraints on nodes, for example, tainting a node.

Because of these changes, over time a Kubernetes cluster drifts into an unbalanced state, and there is a need for rebalancing the cluster.
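
Tainting is a good illustration of this drift. A NoSchedule taint only filters the node out for new pods; pods already running on it stay put, so the cluster slowly moves away from the placement the scheduler would choose today. A sketch, with a hypothetical node name and taint key:

YAML

apiVersion: v1
kind: Node
metadata:
  name: worker-1                  # hypothetical node name
spec:
  taints:
    - key: dedicated              # hypothetical taint key and value
      value: batch
      effect: NoSchedule          # keeps new pods without a matching toleration away,
                                  # but does not evict pods already running on the node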

Approach for the Solution

Manually balancing a cluster is a tedious process because, first, we need to determine which pods are wrongly placed as per the new cluster state, and then we need to figure out a strategy for moving them onto optimal nodes. But the second part of the problem is already taken care of by Kube-scheduler: once a wrongly placed pod is evicted, its replacement goes through scheduling again and lands on the best node available now. We just need a process that determines wrongly placed pods and removes them. Correct? Okay, that's it; our problem is elementary now. We need a rule-based descheduler.

Time for the Solution

So, it is high time to introduce the benefactor, tada, drum roll please. The hero is named Descheduler. Without any delay, let me first introduce the GitHub repo of the project. As I said, we need rule-based descheduling of pods, and Descheduler can help us with the following five configurable strategies.

  1. RemoveDuplicates: Makes sure that only one pod associated with a ReplicaSet (RS), ReplicationController (RC), Deployment, or Job is running on the same node.
  2. LowNodeUtilization: Finds under-utilized nodes and evicts pods from other nodes so that Kube-scheduler can place them on the under-utilized ones.
  3. RemovePodsViolatingNodeTaints: Evicts pods violating NoSchedule taints on nodes.
  4. RemovePodsViolatingNodeAffinity: Evicts pods violating node affinity (see the policy snippet right after this list).
  5. RemovePodsViolatingInterPodAntiAffinity: Evicts pods violating inter-pod anti-affinity.
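
The ConfigMap shown later in this post enables the first, second, and fifth strategies. As a complement, here is a sketch of a policy that enables the taint and node-affinity strategies; the nodeAffinityType parameter follows the policy format documented in the project's README, but verify the exact spelling and supported values there.

YAML

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingNodeTaints":
    enabled: true
  "RemovePodsViolatingNodeAffinity":
    enabled: true
    params:
      nodeAffinityType:
      - "requiredDuringSchedulingIgnoredDuringExecution"   # the affinity type to check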

Next Question: It Looks Promising, so How Can I Use It?

From the README.md itself: the Descheduler can be run as a Job or CronJob inside a K8s cluster. It has the advantage of being able to run multiple times without needing user intervention. The Descheduler pod runs as a critical pod in the kube-system namespace to avoid being evicted by itself or by the kubelet.

Simple Kubernetes objects for running the Descheduler

RBAC Object
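
The ServiceAccount name descheduler-sa is what the Job below references; the ClusterRole and ClusterRoleBinding names and the exact rule list here are a minimal sketch based on the permissions the Descheduler needs (read nodes and pods, create evictions and events), so treat the project README as the authoritative manifest.

YAML

---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: descheduler-cluster-role          # name is an assumption
rules:
- apiGroups: [""]
  resources: ["events"]
  verbs: ["create", "update"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "watch", "list"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "watch", "list", "delete"]
- apiGroups: [""]
  resources: ["pods/eviction"]
  verbs: ["create"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: descheduler-sa
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: descheduler-cluster-role-binding  # name is an assumption
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: descheduler-cluster-role
subjects:
- kind: ServiceAccount
  name: descheduler-sa
  namespace: kube-system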
 


ConfigMap Object

YAML
 




---
apiVersion: v1
kind: ConfigMap
metadata:
  name: descheduler-policy-configmap
  namespace: kube-system
data:
  policy.yaml: |
    apiVersion: "descheduler/v1alpha1"
    kind: "DeschedulerPolicy"
    strategies:
      "RemoveDuplicates":
         enabled: true
      "RemovePodsViolatingInterPodAntiAffinity":
         enabled: true
      "LowNodeUtilization":
         enabled: true
         params:
           nodeResourceUtilizationThresholds:
             thresholds:
               "cpu" : 20
               "memory": 20
               "pods": 20
             targetThresholds:
               "cpu" : 50
               "memory": 50
               "pods": 50



Job Object

YAML
 




---
apiVersion: batch/v1
kind: Job
metadata:
  name: descheduler-job
  namespace: kube-system
spec:
  parallelism: 1
  completions: 1
  template:
    metadata:
      name: descheduler-pod
    spec:
      priorityClassName: system-cluster-critical
      containers:
        - name: descheduler
          image: us.gcr.io/k8s-artifacts-prod/descheduler/descheduler:v0.10.0
          volumeMounts:
          - mountPath: /policy-dir
            name: policy-volume
          command:
            - "/bin/descheduler"
          args:
            - "--policy-config-file"
            - "/policy-dir/policy.yaml"
            - "--v"
            - "3"
      restartPolicy: "Never"
      serviceAccountName: descheduler-sa
      volumes:
      - name: policy-volume
        configMap:
          name: descheduler-policy-configmap
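
The same pod spec can also be wrapped in a CronJob for periodic rebalancing, as the README suggests. A minimal sketch, assuming the batch/v1beta1 CronJob API (newer clusters use batch/v1) and an every-30-minutes schedule, both of which you should adjust to your environment:

YAML

---
apiVersion: batch/v1beta1                 # use batch/v1 on newer Kubernetes releases
kind: CronJob
metadata:
  name: descheduler-cronjob
  namespace: kube-system
spec:
  schedule: "*/30 * * * *"                # every 30 minutes; an assumption, tune as needed
  jobTemplate:
    spec:
      template:
        metadata:
          name: descheduler-pod
        spec:
          priorityClassName: system-cluster-critical
          serviceAccountName: descheduler-sa
          restartPolicy: "Never"
          containers:
            - name: descheduler
              image: us.gcr.io/k8s-artifacts-prod/descheduler/descheduler:v0.10.0
              volumeMounts:
                - mountPath: /policy-dir
                  name: policy-volume
              command:
                - "/bin/descheduler"
              args:
                - "--policy-config-file"
                - "/policy-dir/policy.yaml"
                - "--v"
                - "3"
          volumes:
            - name: policy-volume
              configMap:
                name: descheduler-policy-configmap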




Finally, I recommend the Descheduler project for balancing a Kubernetes cluster. You should give it a try if you are facing the same problem.

Topics:
autoscaling, distributed systems, kubectl, kubernetes, scheduler

