Welcome back! Go back and check out Part I of the series to learn how to install the environment we’ll be using!
Redis Server With Swarm Rescheduling On-Node-Failure
This portion of the post will focus on deploying a Redis container and testing out the current state of the experimental rescheduling feature.
Note: rescheduling is very experimental and has bugs. We will walk through an example and review known bugs as we go through.
If you want your container to be rescheduled when a Swarm host fails, you need to deploy that container with certain flags. One way is to pass --restart=always -e reschedule:on-node-failure; another is to use a label such as -l 'com.docker.swarm.reschedule-policies=["on-node-failure"]'. The example below will use the environment variable method.
First, let’s deploy a Redis container with the rescheduling flags and a volume managed by Flocker.
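As a rough sketch, the deployment could look like the function below. The container name redis-server, the volume name redis-data, and the published port are hypothetical choices for illustration, and the sketch assumes your Docker client is already pointed at the Swarm manager and that the Flocker volume plugin is installed on the hosts.

```shell
# Sketch only: redis-server and redis-data are hypothetical names.
# --restart=always plus the reschedule env var opt the container into rescheduling.
# --volume-driver flocker asks Flocker to manage the volume mounted at /data.
# --appendonly yes tells Redis to persist writes to appendonly.aof in /data.
deploy_redis() {
  docker run -d \
    --name redis-server \
    --restart=always \
    -e reschedule:on-node-failure \
    --volume-driver flocker \
    -v redis-data:/data \
    -p 6379:6379 \
    redis redis-server --appendonly yes
}
```

Running deploy_redis would then start the container on whichever node Swarm picks. The label form of the policy could be swapped in for the -e flag if you prefer it.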
Next, SSH into the Docker host where the Redis container is running and take a look at the contents of the appendonly.aof file we instructed Redis to use for persistence. The file should be located on the Flocker volume mount-point for the container and contain no data.
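One way to find the file without guessing the mount-point is to ask Docker where the volume lives on the host. The sketch below assumes the container is named redis-server (a hypothetical name) and that the volume is mounted at /data inside the container; run it on the Docker host itself, possibly as root if the mount-point is not world-readable.

```shell
# Sketch: print the AOF file from the host-side mount-point of the /data volume.
# The default container name redis-server is a hypothetical placeholder.
show_aof() {
  container="${1:-redis-server}"
  # Ask Docker which host directory backs the mount at /data.
  src="$(docker inspect -f '{{range .Mounts}}{{if eq .Destination "/data"}}{{.Source}}{{end}}{{end}}' "$container")"
  cat "$src/appendonly.aof"
}
```

Before any keys are written, show_aof should print nothing because the file is empty.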
Next, let’s connect to the Redis server and add some key/values. After, look at the contents of the appendonly.aof file again to show that Redis is storing the data correctly.
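Adding a couple of keys could look something like this. The key names and values are made up for illustration, and the sketch assumes redis-cli is installed on your machine and that Redis is reachable on port 6379 of the Docker host.

```shell
# Sketch: write two illustrative keys to the Redis server on the given host.
# mykey1/mykey2 and their values are hypothetical; port 6379 is assumed to be published.
add_keys() {
  host="$1"
  redis-cli -h "$host" -p 6379 SET mykey1 hello
  redis-cli -h "$host" -p 6379 SET mykey2 world
}
```

Call it with the IP address of the Docker host running the container, for example add_keys followed by that address. Each successful SET should be reflected as new entries in appendonly.aof.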
View the data within our Flocker volume to verify that Redis is working correctly.
Now we want to test the fail-over scenario and make sure our Flocker volume moves the data stored in Redis to the new Docker host where Swarm reschedules the container.
To do this, let’s point Docker at our Swarm manager and monitor the events using the docker events command.
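A minimal sketch of that step is below. The manager address is passed in as an argument because the right host and port depend on how your Swarm manager was set up; the value in the comment is only a placeholder.

```shell
# Sketch: stream cluster-wide events from the Swarm manager.
# The manager endpoint (e.g. tcp://<manager-ip>:<port>) is an assumption you must supply.
watch_swarm_events() {
  manager="$1"
  docker -H "$manager" events
}
```

Leave this running in one terminal while you trigger the failure from another, so the events show up as they happen.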
To initiate the test, run shutdown -h now on the Docker host that is running the Redis container to simulate a node failure. You should see events (below) that correlate to the node and container dying.
What the events tell us is that the container and its resources (network, volume) need to be removed, disconnected or unmounted because the host is failing. The events you see below are:
- Container Kill
- Container Die
- Network Disconnect
- Swarm Engine Disconnect
- Volume Unmount
- Container Stop
Then, some time after the Docker host dies, you should see an event for the same container being rescheduled (created again). There is still some work to be done here: as of 1.1.3, in our testing, Swarm had an issue running Start on the container after it had been Created on the new Docker host.
You should see the Create event logged while watching docker events; this does initiate the re-creation of the container and the movement of the Flocker volume it was using.
We found that you may need to manually Start the container on the new host after it has been rescheduled.
Note: Some of these issues, including the container being created but not started, are tracked in this Docker Swarm issue.
This is the event we see when the container is rescheduled and created on a new Docker host automatically. Notice that the IP address differs from the one in the last message; this is because the container was rescheduled onto a new Docker host.
Here is what happened so far:
If we run docker ps -a against Swarm, we can see the Redis container listed as Created. So, in this case, we can start it manually and Redis is back up and running on a new node!
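The manual start can be sketched as a small check-then-start helper. It assumes the container is named redis-server (a hypothetical name), that your Docker client is pointed at the Swarm manager, and that your Docker version reports a State.Status field in docker inspect.

```shell
# Sketch: start the rescheduled container only if it is stuck in the "created" state.
# The default container name redis-server is a hypothetical placeholder.
start_if_created() {
  container="${1:-redis-server}"
  status="$(docker inspect -f '{{.State.Status}}' "$container")"
  if [ "$status" = "created" ]; then
    docker start "$container"
  else
    echo "$container is in state '$status'; not starting it"
  fi
}
```

Checking the state first avoids issuing a start against a container that Swarm did manage to start on its own.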
Let’s connect to the Redis server and make sure the data we added still remains.
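A quick way to check is to read the keys back from the new host. The helper below is a sketch that assumes redis-cli is installed locally and that Redis is published on port 6379; pass the new host's IP address followed by whichever key names you wrote earlier.

```shell
# Sketch: read back one or more keys from the Redis server on the given host.
# Port 6379 is assumed to be published on the Docker host.
check_keys() {
  host="$1"
  shift
  for key in "$@"; do
    redis-cli -h "$host" -p 6379 GET "$key"
  done
}
```

If the Flocker volume moved with the container, each GET should return the value stored before the node went down.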
The data is still there! That said, given the current state of rescheduling, we don’t recommend relying on it yet.
During our tests, we came across users who said the container did start on its own. We also came across users who said rescheduling didn’t work at all, or who wound up with two identical containers once the Docker host came back.
Either way, there are certainly kinks to work out, and it’s part of the community’s job to help test, report and fix these issues so that rescheduling can work reliably. We will update this post along the way to show you how rescheduling works in the future!
Happy Swarming! Be sure to check out Part III!
We’d love to hear your feedback!