Successful GKE Node Pool Updates With Cloud Deployment Manager
One of the few difficulties of the Cloud Deployment Manager. Fortunately, we've got you covered.
The Cloud Deployment Manager is actually a great tool and we love using it.
But as with any other technology, there are tricky tasks, and node pool updates are one of them. Basic operations like node version upgrades or node count changes are straightforward, but changing OAuth scopes or the machine type does not work out of the box, because it requires the creation of a new node pool.
Let’s see how to update a Google Kubernetes Engine (GKE) cluster with a new machine type.
Initial Setup
First, we create a GKE cluster with 2 nodes and machine type n1-standard-2. We use the following Deployment Manager configuration:
resources:
- name: np-playground
type: container.v1.cluster
properties:
zone: "europe-west3-a"
cluster:
initialClusterVersion: "1.10.9-gke.5"
    ## Can be used to update the master version, even though the official docs state that this field is read-only.
## ref: https://cloud.google.com/kubernetes-engine/docs/reference/rest/v1/projects.zones.clusters
currentMasterVersion: "1.10.9-gke.5"
## Initial NodePool config, change only for node count or node version changes.
nodePools:
- name: "np-playground-np"
initialNodeCount: 2
version: "1.10.9-gke.5"
config:
machineType: "n1-standard-2"
oauthScopes:
- https://www.googleapis.com/auth/logging.write
- https://www.googleapis.com/auth/monitoring
- https://www.googleapis.com/auth/ndev.clouddns.readwrite
preemptible: true
## Duplicates node pool config from v1.cluster section, to get it explicitly managed.
- name: np-playground-np
type: container.v1.nodePool
properties:
zone: europe-west3-a
  ## This is very important, as it actually controls the creation order by implicitly adding a dependsOn constraint.
## ref: https://cloud.google.com/deployment-manager/docs/configuration/use-references
## ref: https://cloud.google.com/deployment-manager/docs/configuration/create-explicit-dependencies
clusterId: $(ref.np-playground.name)
nodePool:
name: "np-playground-np"
The configuration file has a few special characteristics. First, the node pool configuration (the second resource block, of type container.v1.nodePool) works because of the default policy for adding resources, which is CREATE_OR_ACQUIRE. If the Deployment Manager finds a resource that matches on name, type, and zone or region, then the resource is acquired instead of created.
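If you ever want to make this behavior explicit, the create policy can also be passed on the command line when updating a deployment. This is only a sketch based on the gcloud documentation; the exact value casing (CREATE_OR_ACQUIRE vs. create-or-acquire) may vary between gcloud versions:
$ gcloud deployment-manager deployments update upgrade-test \
    --config dm.yaml \
    --create-policy CREATE_OR_ACQUIRE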
The second special characteristic is the use of a reference (clusterId: $(ref.np-playground.name)). Since a node pool cannot be created without an existing cluster, the Deployment Manager command would otherwise fail, so we have to ensure that the node pool is created (actually acquired) only after the cluster becomes available.
Without references, Deployment Manager creates all resources in parallel, so there is no guarantee that dependent resources are created in the correct order.
Using references would enforce the order in which resources are created. (source: GCP doc)
$ gcloud deployment-manager deployments create upgrade-test --config dm.yaml
NAME TYPE STATE ERRORS INTENT
np-playground container.v1.cluster COMPLETED []
np-playground-np container.v1.nodePool COMPLETED []
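The deployment output only confirms that the Deployment Manager resources were created. If you want to double-check on the GKE side, the node pools can also be listed directly; this is just an optional check, and the output columns depend on your gcloud version:
$ gcloud container node-pools list --cluster np-playground --zone europe-west3-a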
Adding a New Node Pool
At this point, we are going to create a new node pool with a different machine type, n1-highmem-2.
The following section must be appended to the configuration from the previous section:
## New NodePool with desired config
- name: np-playground-np-highmem
type: container.v1.nodePool
properties:
zone: europe-west3-a
clusterId: $(ref.np-playground.name)
nodePool:
name: "np-playground-np-highmem"
initialNodeCount: 2
version: "1.10.9-gke.5"
config:
## different machine type
machineType: "n1-highmem-2"
## scopes can be changed as well
oauthScopes:
- https://www.googleapis.com/auth/logging.write
- https://www.googleapis.com/auth/monitoring
- https://www.googleapis.com/auth/ndev.clouddns.readwrite
preemptible: true
Applying these changes results in the following:
$ gcloud deployment-manager deployments update upgrade-test --config dm.yaml
NAME TYPE STATE INTENT
np-playground container.v1.cluster COMPLETED
np-playground-np container.v1.nodePool COMPLETED
np-playground-np-highmem container.v1.nodePool COMPLETED
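To confirm that the new pool really uses the desired machine type, it can be inspected with gcloud. This is an optional check; the --format expression is just one way to extract the field and should print n1-highmem-2:
$ gcloud container node-pools describe np-playground-np-highmem \
    --cluster np-playground --zone europe-west3-a \
    --format="value(config.machineType)"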
Migrate the Workloads
After creating the new node pool, the workloads are still running on the old node pool.
Kubernetes does not reschedule Pods as long as they are running and available. (source: GCP docs)
To migrate these Pods to the new node pool:
Cordon the existing node pool: This operation marks the nodes in the old node pool as unschedulable. Kubernetes stops scheduling new Pods to these nodes once you mark them as unschedulable.
Drain the existing node pool: This operation gracefully evicts the workloads running on the nodes of the old node pool.
Cordon Old Node Pool
Connect to the K8s cluster:
gcloud container clusters get-credentials np-playground --zone europe-west3-a
List all nodes from the old node pool:
kubectl get nodes -l cloud.google.com/gke-nodepool=np-playground-np
Cordon all nodes from the old node pool:
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=np-playground-np -o=name); do
kubectl cordon "$node";
done
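After cordoning, the old nodes should report SchedulingDisabled in their STATUS column. A quick, optional check:
kubectl get nodes -l cloud.google.com/gke-nodepool=np-playground-np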
Drain Old Node Pool
The following command iterates over each node in the old node pool and drains it by evicting Pods with an allotted graceful termination period of 10 seconds:
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=np-playground-np -o=name); do
kubectl drain --force --ignore-daemonsets --delete-local-data --grace-period=10 "$node";
done
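Once the drain has finished, the evicted Pods should have been rescheduled onto the highmem nodes. An optional way to verify this is to look at the NODE column of the Pod listing:
kubectl get pods --all-namespaces -o wide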
Delete Old Node Pool
Finally, we only have to delete the old node pool. This can be achieved by deleting or commenting out the old node pool configuration.
Find the full YAML below:
resources:
- name: np-playground
type: container.v1.cluster
properties:
zone: "europe-west3-a"
cluster:
initialClusterVersion: "1.10.9-gke.5"
currentMasterVersion: "1.10.9-gke.5"
## Initial NodePool config, change only for node count or node version changes.
nodePools:
- name: "np-playground-np"
initialNodeCount: 2
version: "1.10.9-gke.5"
config:
machineType: "n1-standard-2"
oauthScopes:
- https://www.googleapis.com/auth/logging.write
- https://www.googleapis.com/auth/monitoring
- https://www.googleapis.com/auth/ndev.clouddns.readwrite
preemptible: true
## Duplicates node pool config from v1.cluster section, to get it explicitly managed.
#- name: np-playground-np
# type: container.v1.nodePool
# properties:
# zone: europe-west3-a
# clusterId: $(ref.np-playground.name)
# nodePool:
# name: "np-playground-np"
## New NodePool with desired config
- name: np-playground-np-highmem
type: container.v1.nodePool
properties:
zone: europe-west3-a
clusterId: $(ref.np-playground.name)
nodePool:
name: "np-playground-np-highmem"
initialNodeCount: 2
version: "1.10.9-gke.5"
config:
machineType: "n1-highmem-2"
oauthScopes:
- https://www.googleapis.com/auth/logging.write
- https://www.googleapis.com/auth/monitoring
- https://www.googleapis.com/auth/ndev.clouddns.readwrite
preemptible: true
And apply:
$ gcloud deployment-manager deployments update upgrade-test --config dm.yaml
NAME TYPE STATE INTENT
np-playground container.v1.cluster COMPLETED
np-playground-np-highmem container.v1.nodePool COMPLETED
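As a final, optional check, only the highmem nodes should remain in the cluster. The -L flag adds the node pool label as an extra column:
kubectl get nodes -L cloud.google.com/gke-nodepool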
Conclusion
As you can see, it's actually not a big deal to perform node pool updates like this with the Deployment Manager, even if it is to be hoped that this will become a built-in feature in future Deployment Manager versions (as node version upgrades already are).
But this procedure has two main drawbacks. First, the update cannot be done with a single Deployment Manager run and needs manual actions. This prevents a fully automated cluster upgrade (at least for machine type changes).
The second issue: when the configuration file from the last step (delete old node pool) is used to create a new cluster (e.g. re-creation in case of disaster recovery), a cluster with two node pools is created: the one configured in the nodePools section beneath cluster, and the new node pool with highmem machines.