Kubernetes: Replication and Self-Healing
The benefit of using replication for your microservices and how the Kubernetes cluster can automatically recover from a service failure.
To start with, I would like to explain what self-healing means in terms of Kubernetes. Self-healing is a fantastic feature of Kubernetes that lets a cluster recover from service or node failures automatically. In the following article, we will consider the benefit of using replication for your microservices and how the Kubernetes cluster can automatically recover from a service failure.
One of the great features of Kubernetes is the ability to replicate pods and their underlying containers across the cluster. So, before we set up our self-healing demonstration, please make sure you have replication configured. Here is a simple example of a deployment file that will deploy an nginx container with a replication factor of 3:
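A minimal manifest along these lines should work; the deployment name and replica count come from the article, while the labels, image tag, and port are illustrative assumptions:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment-example
spec:
  replicas: 3                 # replication factor of 3, as used throughout the article
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:latest   # image tag is illustrative
        ports:
        - containerPort: 80
```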
All right, so let's create the deployment:
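Assuming the deployment file is saved as nginx-deployment-example.yaml (the filename here is an assumption), this would look like:

```shell
# Create the deployment from the manifest file
kubectl apply -f nginx-deployment-example.yaml
```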
Now, let's check whether our nginx-deployment-example deployment was created:
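The standard way to check is to list the deployments:

```shell
kubectl get deployments
```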
You should see your nginx-deployment-example deployment in the default namespace. If we want to see more details about those pods, run the following command:
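Listing the pods shows each replica created by the deployment:

```shell
kubectl get pods
```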
We will see our 3 nginx pods up and running.
Kubernetes ensures that the actual state of the cluster and the desired state of the cluster are always in sync. This is made possible through continuous monitoring within the Kubernetes cluster. Whenever the state of the cluster drifts from what has been defined, the various components of Kubernetes work to bring it back to its defined state. This automated recovery is often referred to as self-healing.
So, let's copy the name of one of the pods listed above and see what happens when we delete it:
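The pod name below is the one from the article's output; yours will differ:

```shell
kubectl delete pod nginx-deployment-example-f4cd8584-f494x
```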
And after a few seconds, we see that our pod was deleted:
pod "nginx-deployment-example-f4cd8584-f494x" deleted
Let's go ahead and list the pods one more time:
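```shell
kubectl get pods
```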
And we see that the pod nginx-deployment-example-f4cd8584-sgfqq was automatically created to replace our deleted pod nginx-deployment-example-f4cd8584-f494x. The reason is that the nginx deployment is set to have 3 replicas. So, even though one of them was deleted, our Kubernetes cluster works to make sure that the actual state matches the desired state.
So, now let's consider the case when there is an actual node failure in your cluster.
First, let's check our nodes:
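```shell
kubectl get nodes
```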
You will see your Master and Worker nodes. Now, let's figure out which server each of our pods is running on. To do that, we have to describe the pods:
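Something like the following, repeated for each pod (the pod name is the one from the article's earlier output; yours will differ):

```shell
kubectl describe pod nginx-deployment-example-f4cd8584-sgfqq
```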
Under Events, you can see which server the pod was assigned to. We can also scroll up and, under Node, see where it has been assigned, which will of course be the same server. Once you've identified all the servers that pods are running on, pick one server and simulate a node failure by shutting down that server.
Once the node has been shut down, let's head back to the Master and check on the status of the nodes:
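```shell
kubectl get nodes
```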
We see that the cluster knows that one node is down. Let's also list our pods:
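```shell
kubectl get pods
```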
You see that the state of the pod that was running on the "failed" node is Unknown, but another pod has taken its place. Let's go ahead and list the deployments:
And as we expected we have 3 pods available which are now in sync with our Desired amount of pods.
Now, let's try to describe the Unknown pod:
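Using a placeholder for the pod name, since the name of the Unknown pod depends on your cluster:

```shell
kubectl describe pod <unknown-pod-name>
```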
You will see that the Status of the Unknown pod is Terminating, that the Termination Grace Period is 30s by default, and that the Reason is NodeLost. You will also see a message specifying that the node that was running our pod is unresponsive.
"failed" node and wait till it successfully rejoins the cluster.
"failed" node is up and running, let's go ahead and check the status of our deployment:
We will see that our pod count is in sync. So, let's list our pods out:
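```shell
kubectl get pods
```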
We will see that our Kubernetes cluster has finally terminated the old pod, and we are left with our desired count of 3 pods.
As we have seen, Kubernetes takes a self-healing approach to infrastructure that reduces the criticality of failures, making fire drills less common. Kubernetes heals itself when there is a discrepancy and ensures the cluster always matches the declared state. In other words, if a deviation is detected, Kubernetes kicks in and fixes it. For example, if a pod goes down, a new one will be deployed to match the desired state.
Opinions expressed by DZone contributors are their own.