How We Diagnosed a Hidden Scheduler Failure in a Docker Swarm Cluster Serving 2 Million Users
A real production incident in a Docker Swarm cluster — how a routine service update triggered a silent scheduler failure, and how we uncovered it.
Context: 120 Nodes, Strict SLAs, and Legacy Infrastructure
Our team is responsible for the mobile backend infrastructure serving over 2 million registered users. The Docker Swarm cluster consists of 120 nodes: 5 manager nodes, 40 worker nodes, and the rest are infrastructure servers. The cluster runs about 50 services, totaling hundreds of replicas.
We inherited Swarm from the previous contractor. The client is not yet ready to migrate to Kubernetes, and Swarm is sufficient at the current scale. Services are distributed across nodes in groups and bound to them by labels: up to 4 worker nodes are allocated to heavier services, 2 to less loaded ones, and 1 to non-critical services. A node can host replicas of multiple services.
Our SLAs are strict: If any part of the mobile app is completely unavailable, we have 30 minutes to resolve the issue, after which penalties begin to accrue.
What Happened
A monitoring alert flagged unavailable service replicas. While investigating the incident in the manager-node logs, we found the following warning:
Mar 03 07:46:32 swarm3 dockerd[875]:
time="2025-03-03T07:46:32.123554337Z"
level=warning
msg="underweighting node nt98wn9he8my6tsuasgkhrrjp
for service 86jgkc35ctasmu8ubpnilsrqo because it experienced
5 failures or rejections within 5m0s"
module=scheduler
node.id=gaip86ri06jyrdwxcogl9j2p5
This message indicates that Swarm's internal scheduler is lowering the priority (weight) of a specific worker node when scheduling tasks for this service. The reason is 5 failures or rejections in the last 5 minutes. In practice, the node drops to the back of the candidate list for that service's replicas.
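To see how many node and service pairs are affected, the warnings can be tallied from the manager logs. The following is a minimal sketch, assuming each warning arrives on a single line in the format shown above; the function name summarize_underweighting is hypothetical, and the journalctl invocation assumes dockerd logs to the systemd journal under the docker unit.

```shell
# Hypothetical helper: reads dockerd log lines on stdin and prints
# "<count> <node_id> <service_id>" for each underweighted node/service pair.
# Assumes each warning is on one line and matches the sample message format.
summarize_underweighting() {
  sed -n -E 's/.*underweighting node ([a-z0-9]+) for service ([a-z0-9]+) .*/\1 \2/p' \
    | sort | uniq -c | awk '{print $1, $2, $3}'
}

# Example (assumption: dockerd logs go to the journal under the "docker" unit):
#   journalctl -u docker --since "1 hour ago" | summarize_underweighting
```

A pair with a high count is a good starting point for the checks that follow.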
There was no critical downtime: Several replicas of the problematic services were running, and traffic was routed to the live instances. However, some replicas could not start — meaning the cluster was operating with reduced fault tolerance. With this SLA, that's a ticking time bomb.
Why Swarm Lowers a Node's Weight
Before describing our diagnosis, it's worth understanding the mechanics. Swarm lowers a node's weight for several reasons:
- Resource constraints. A container requires more CPU, memory, or disk space than is available on the node. Swarm cannot place the task and records a failure.
- Network issues. The node is unresponsive, or the connection is unstable. The manager loses contact with the worker and marks it as unreliable.
- Previous failed launches. If a container fails to start on a specific node several times in a row, Swarm temporarily excludes it from the list of candidates.
- Docker Daemon or hardware issues. Unstable Docker daemon operation or hardware failures lead to a cascade of failures when launching tasks.
- Mismatch between the number of replicas and the number of nodes with the required labels. This turned out to be our case. The service is bound to specific nodes via placement constraints with labels. If the number of replicas in the service configuration exceeds the number of nodes with the required label, the scheduler enters a cycle of failed placement attempts — even if there are enough free worker nodes in the cluster without that label.
- Service errors. The container starts but immediately terminates with an error or fails the health check. Swarm attempts to restart it, incrementing the failure count.
What We Tried First
The initial response to such errors is the standard set of steps:
- Rebuilding the service. We recreated the service using docker service update --force. The replicas restarted, but the problem returned after a few minutes.
- Changing the number of replicas. We reduced and then increased the number of replicas again. It didn't help.
- Reading container logs. The container logs themselves didn't show anything meaningful — the service was fine when it managed to start.
None of this yielded a consistent result. It became clear that the problem wasn't with the service, but at the infrastructure level — specifically, in how the scheduler makes placement decisions.
Troubleshooting: Identifying the Root Cause
Step 1: Checking Node Status
docker node ls
If any node has a status of Down or Unreachable, it is the first candidate. We look for the specific node mentioned in the error message:
docker node ls | grep nt98wn9he8my6tsuasgkhrrjp
In our case, all nodes were in the Ready state — the issue wasn't related to availability.
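On a cluster of this size, scanning the full node list by eye is error-prone. A small filter can print only the nodes that deserve a closer look. This is a sketch rather than a definitive check: the function name nodes_needing_attention is hypothetical, it must run on a manager, and it also lists drained or paused nodes because those do not accept new tasks.

```shell
# Hypothetical helper: prints nodes that are not Ready or not Active.
# Uses the Hostname, Status and Availability placeholders of
# `docker node ls --format`; must run on a manager node.
nodes_needing_attention() {
  docker node ls --format '{{.Hostname}} {{.Status}} {{.Availability}}' \
    | awk '$2 != "Ready" || $3 != "Active" {print}'
}

# Example:
#   nodes_needing_attention
```

Empty output corresponds to what we saw: every node Ready and Active.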
Step 2: Identifying the Problematic Service
Using the first 12 characters of the service ID from the log, we find its name:
docker service ls | grep 86jgkc35ctas
Next, check the status of the tasks:
docker service ps 86jgkc35ctasmu8ubpnilsrqo
Here you can see on which node the task failed to start and why: Rejected, Shutdown, No suitable node.
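The default table truncates error messages, so it can help to print only the tasks that carry an error, with the full text. A minimal sketch, assuming the error text itself contains no "|" character; the function name failed_tasks is hypothetical.

```shell
# Hypothetical helper: prints "<node> | <current state> | <error>" for every
# task of the given service that has a non-empty error message.
# The node column is empty for tasks that were never placed.
failed_tasks() {
  docker service ps --no-trunc \
    --format '{{.Node}}|{{.CurrentState}}|{{.Error}}' "$1" \
    | awk -F'|' '$3 != "" {print $1 " | " $2 " | " $3}'
}

# Example:
#   failed_tasks 86jgkc35ctasmu8ubpnilsrqo
```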
Step 3: Checking Placement Constraints
This is where we found the cause. Let's see what placement constraints are configured for the service:
docker service inspect 86jgkc35ctasmu8ubpnilsrqo \
--format '{{json .Spec.TaskTemplate.Placement}}' | jq .
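The service was bound to nodes with a specific label. Let's check how many nodes have this label. Note that the node.label filter matches Swarm node labels (the ones used by node.labels constraints), while the plain label filter matches engine labels:
docker node ls --filter "node.label=cli=1"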
And then it became clear: The number of replicas in the service configuration exceeded the number of nodes with the required label. Most likely, the mismatch occurred during a routine service update, when the number of replicas was set higher than the number of available labeled nodes during reconfiguration. Replicas for which suitable nodes were found started normally, while for the rest, the scheduler repeatedly attempted to find a suitable node, received a rejection, and logged a failure.
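The comparison we did by hand can be sketched as a small function. It assumes a replicated (not global) service and a constraint on a Swarm node label, which the node.label filter of docker node ls matches; the function name check_replicas_vs_label and its arguments are hypothetical.

```shell
# Hypothetical helper: compares a service's desired replica count with the
# number of nodes carrying a given Swarm node label.
# Usage: check_replicas_vs_label <service> <key=value>
# Returns 0 when replicas <= labeled nodes, 1 when replicas exceed them,
# 2 when the replica count cannot be read (for example, a global service).
check_replicas_vs_label() {
  service="$1"
  label="$2"
  replicas=$(docker service inspect \
    --format '{{.Spec.Mode.Replicated.Replicas}}' "$service") || return 2
  nodes=$(docker node ls --filter "node.label=$label" -q | wc -l | tr -d ' ')
  echo "service=$service replicas=$replicas labeled_nodes=$nodes"
  [ "$replicas" -le "$nodes" ]
}

# Example:
#   check_replicas_vs_label 86jgkc35ctasmu8ubpnilsrqo cli=1 || echo "mismatch"
```

A non-zero exit status flags the condition described above: more replicas requested than labeled nodes available.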
Step 4: Checking Resources (for a Complete Picture)
Even after identifying the root cause, we checked the resources on the problematic nodes to rule out a combined issue:
docker node inspect nt98wn9he8my6tsuasgkhrrjp \
--format '{{json .Description.Resources}}' | jq .
And also the load directly:
top -o %CPU
free -m
df -h
The resources were fine, which confirmed that the issue was a configuration mismatch.
Solution
Main action: We reduced the number of service replicas in the .yml configuration file to match the number of available nodes with the required label:
deploy:
  replicas: 2 # Match the number of nodes with the label
After applying the updated configuration, the error disappeared — the scheduler no longer attempted to place replicas on non-existent nodes.
Additionally, we reviewed the configuration of the remaining services, verifying that the number of replicas matched the number of nodes with the required labels. We found several more services with a similar potential issue and fixed them proactively.
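Part of such a review can be scripted. One building block is turning a placement constraint into a filter for docker node ls, so that labeled nodes can be counted per service. The sketch below handles only equality constraints on node.labels and rejects anything else; the function name constraint_to_filter is hypothetical and this is not how we necessarily ran the review.

```shell
# Hypothetical helper: converts a placement constraint such as
# "node.labels.cli == 1" into a `docker node ls` filter, "node.label=cli=1".
# Only the equality form on node.labels is handled; for any other
# constraint it prints nothing and returns non-zero.
constraint_to_filter() {
  printf '%s\n' "$1" \
    | sed -n -E 's/^node\.labels\.([^ =!]+) *== *(.+)$/node.label=\1=\2/p' \
    | grep .
}

# Example: count the nodes that satisfy one constraint of a service.
#   f=$(constraint_to_filter 'node.labels.cli == 1') &&
#     docker node ls --filter "$f" -q | wc -l
```

Constraints that the helper rejects, such as those on node.role, still need to be reviewed by hand.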
If the Cause Is Different, Additional Solutions
Our specific case was related to a configuration error, but there are other scenarios that can cause the same error:
Resource shortage. Free up space and clean up unused images:
docker system prune -a
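Or lower the limits for the service. Note that the scheduler places tasks based on reservations (--reserve-cpu, --reserve-memory) rather than limits, so lower those as well if they are set: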
docker service update --limit-cpu 0.5 --limit-memory 512M <SERVICE_ID>
Issues with the Docker Daemon on the node. Restart the daemon:
systemctl restart docker
Temporarily excluding a problematic node. Switch it to drain mode so that all of its tasks migrate to other nodes:
docker node update --availability drain <NODE_ID>
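Draining is asynchronous, so before working on the node it is worth waiting until no tasks are still meant to run there. A minimal sketch under stated assumptions: the function name drain_and_wait and the MAX_TRIES and POLL_SECONDS variables are hypothetical, and the polling relies on the desired-state filter of docker node ps.

```shell
# Hypothetical helper: drains a node, then polls until no task on it has a
# desired state of "running". Gives up after MAX_TRIES polls (default 30),
# sleeping POLL_SECONDS (default 2) between polls.
drain_and_wait() {
  node="$1"
  docker node update --availability drain "$node" >/dev/null || return 1
  tries=0
  while [ -n "$(docker node ps "$node" --filter desired-state=running -q)" ]; do
    tries=$((tries + 1))
    [ "$tries" -ge "${MAX_TRIES:-30}" ] && return 1
    sleep "${POLL_SECONDS:-2}"
  done
  return 0
}

# Example: drain, do the maintenance, then put the node back into rotation.
#   drain_and_wait <NODE_ID> && echo "node is empty"
#   docker node update --availability active <NODE_ID>
```

Remember to set the node back to active afterwards; a drained node receives no new tasks.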
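Reconnecting the node to the cluster. If nothing else works, remove the node and add it again by running the following on the affected node (its stale entry can then be removed from a manager with docker node rm <NODE_ID>):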
docker swarm leave --force
docker swarm join --token <TOKEN> <MANAGER_IP>:2377
Conclusion
This situation taught us a few things:
The underweighting node error is a symptom, not a diagnosis. The same warning in the logs can stem from a wide variety of causes, ranging from a lack of resources to a configuration error.
Configuration errors are the most insidious cause. In a cluster with dozens of services and labels, it's easy to introduce a mismatch between the number of replicas and available nodes during a routine update.
The absence of downtime does not mean there is no problem. The cluster continued to operate thanks to live replicas, but it was running with reduced fault tolerance. One more failure, and the SLA would have been violated.