Docker Swarm Part II: Rescheduling Redis
In this second of three examples of deploying distributed databases using Docker Swarm, the subject is Redis.
Welcome back! Go back and check out Part I of the series to learn how to install the environment we’ll be using!
Redis Server With Swarm Rescheduling On-Node-Failure
As you may have noticed in Part I, we deployed Swarm with the --experimental flag. This enables the feature for rescheduling containers on node failure, available as of Swarm 1.1.0.
This portion of the post will focus on deploying a Redis container and testing out the current state of the experimental rescheduling feature.
Note: rescheduling is very experimental and still has bugs. We will walk through an example and review the known bugs as we go.
If you want your container to be rescheduled when a Swarm host fails, you need to deploy that container with certain flags. One way to do this is with the flags --restart=always -e reschedule:on-node-failure, or with a label such as -l 'com.docker.swarm.reschedule-policies=["on-node-failure"]'. The example below uses the environment variable method.
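For reference, the label form of the same deployment would look something like the sketch below (the remaining flags mirror the full command that follows):
# Sketch only: label form of the reschedule policy; flags otherwise as below.
$ docker run -d --restart=always -l 'com.docker.swarm.reschedule-policies=["on-node-failure"]' redis redis-server --appendonly yes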
First, let’s deploy a Redis container with the rescheduling flags and a volume managed by Flocker.
$ docker volume create -d flocker --name testfailover -o size=10G
# note: overlay-net was created in Part 1.
# note: `--appendonly yes` tells Redis to persist data to disk.
$ docker run -d --net=overlay-net --name=redis-server --volume-driver=flocker -v testfailover:/data --restart=always -e reschedule:on-node-failure redis redis-server --appendonly yes
465f490d8a80bb53af4189dfec0c595490ebb454f91ded65b9da2edcb4264c2d
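Swarm picks a node for the container, so if you’re not sure which host it landed on, a docker ps against the Swarm manager shows the node name prefixed to the container name (the node name below is illustrative):
$ docker ps --filter name=redis-server --format '{{.Names}}'
ip-10-0-57-22/redis-server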
Next, SSH into the Docker host where the Redis container is running and take a look at the contents of the appendonly.aof file we instructed Redis to use for persistence. The file should be located on the Flocker volume mount-point for the container and contain no data.
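The Flocker dataset ID in the path below will differ on your system; one way to find the exact mount-point is to inspect the container’s mounts from that host:
# The mount source shows where the Flocker volume lives on this host.
$ docker inspect -f '{{ range .Mounts }}{{ .Source }}{{ end }}' redis-server
/flocker/9a0d5942-882c-4545-8314-4693a93fde19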
$ cat /flocker/9a0d5942-882c-4545-8314-4693a93fde19/appendonly.aof
# there should be NO data here yet :)
Next, let’s connect to the Redis server and add some key/values. After, look at the contents of the appendonly.aof file again to show that Redis is storing the data correctly.
$ docker run -it --net=overlay-net --rm redis sh -c 'exec redis-cli -h "redis-server" -p "6379"'
redis-server:6379>
redis-server:6379> SET mykey "Hello"
OK
redis-server:6379> GET mykey
"Hello"
View the data within our Flocker volume to verify that Redis is persisting correctly. The append-only file uses Redis’s RESP wire format, in which *N introduces an array of N elements and $N a bulk string of N bytes, so the entries below decode to SELECT 0 followed by SET mykey Hello.
$ cat /flocker/9a0d5942-882c-4545-8314-4693a93fde19/appendonly.aof
*2
$6
SELECT
$1
0
*3
$3
SET
$5
mykey
$5
Hello
Testing Failover
Now we want to test the failover scenario, making sure our Flocker volume moves the data stored in Redis to the new Docker host where Swarm reschedules the container.
To do this, let’s point Docker at our Swarm manager and monitor the events using the docker events command.
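For example (the manager address and port are placeholders for whatever your Part I setup exposes):
# Point the Docker client at the Swarm manager, then stream cluster events.
$ export DOCKER_HOST=tcp://<swarm-manager-ip>:<port>
$ docker events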
To initiate the test, run shutdown -h now on the Docker host that is running the Redis container to simulate a node failure. You should see events (below) that correlate to the node and container dying.
The events tell us that the container and its resources (network, volume) need to be removed, disconnected, or unmounted because the host is failing. The events you see below are:
- Container Kill
- Container Die
- Network Disconnect
- Swarm Engine Disconnect
- Volume Unmount
- Container Stop
2016-03-08T21:25:45.832049184Z container kill f3d724a37bdf040baac5a06616e39956610d5409acf9ed62508d43b84f79410d (com.docker.swarm.id=48d1d26490c4943fdb98e6f08be9b62003cd5105e06d9bad94dde2e5913c374b, com.docker.swarm.reschedule-policies=["on-node-failure"], image=redis, name=redis-server, node.addr=10.0.57.22:2375, node.id=256G:KRWJ:D5D4:IZHE:TJUO:FHAO:6HII:ET3F:EULJ:NYFT:LBIX:4HBS, node.ip=10.0.57.22, node.name=ip-10-0-57-22, signal=15)
2016-03-08T21:25:45.975166262Z container die f3d724a37bdf040baac5a06616e39956610d5409acf9ed62508d43b84f79410d (com.docker.swarm.id=48d1d26490c4943fdb98e6f08be9b62003cd5105e06d9bad94dde2e5913c374b, com.docker.swarm.reschedule-policies=["on-node-failure"], image=redis, name=redis-server, node.addr=10.0.57.22:2375, node.id=256G:KRWJ:D5D4:IZHE:TJUO:FHAO:6HII:ET3F:EULJ:NYFT:LBIX:4HBS, node.ip=10.0.57.22, node.name=ip-10-0-57-22)
2016-03-08T21:25:46.422580116Z network disconnect 26c449979ac794350e3a3939742d770446494a9d17adc44650e298297b70704c (container=f3d724a37bdf040baac5a06616e39956610d5409acf9ed62508d43b84f79410d, name=overlay-net, node.addr=10.0.57.22:2375, node.id=256G:KRWJ:D5D4:IZHE:TJUO:FHAO:6HII:ET3F:EULJ:NYFT:LBIX:4HBS, node.ip=10.0.57.22, node.name=ip-10-0-57-22, type=overlay)
2016-03-08T21:25:49.818851785Z swarm engine_disconnect (node.addr=10.0.57.22:2375, node.id=256G:KRWJ:D5D4:IZHE:TJUO:FHAO:6HII:ET3F:EULJ:NYFT:LBIX:4HBS, node.ip=10.0.57.22, node.name=ip-10-0-57-22)
2016-03-08T21:25:46.447565262Z volume unmount testfailover (container=f3d724a37bdf040baac5a06616e39956610d5409acf9ed62508d43b84f79410d, driver=flocker, node.addr=10.0.57.22:2375, node.id=256G:KRWJ:D5D4:IZHE:TJUO:FHAO:6HII:ET3F:EULJ:NYFT:LBIX:4HBS, node.ip=10.0.57.22, node.name=ip-10-0-57-22)
2016-03-08T21:25:46.448129059Z container stop f3d724a37bdf040baac5a06616e39956610d5409acf9ed62508d43b84f79410d (com.docker.swarm.id=48d1d26490c4943fdb98e6f08be9b62003cd5105e06d9bad94dde2e5913c374b, com.docker.swarm.reschedule-policies=["on-node-failure"], image=redis, name=redis-server, node.addr=10.0.57.22:2375, node.id=256G:KRWJ:D5D4:IZHE:TJUO:FHAO:6HII:ET3F:EULJ:NYFT:LBIX:4HBS, node.ip=10.0.57.22, node.name=ip-10-0-57-22)
Then, some time after the Docker host dies, you should see an event for the same container being rescheduled (created again). This is where there is still some work to be done: as of 1.1.3, our testing showed that Swarm has an issue running Start on the container after it has been Created on the new Docker host.
You should see the Create event logged while watching docker events, and this does initiate the re-creation of the container and the movement of the Flocker volume it was using. We found that you may need to manually start the container on the new host after it has been rescheduled.
Note: Some of the issues, including the container being created but not started, are tracked in this Docker Swarm issue.
This is the event we see when the container is rescheduled and created on a new Docker host automatically. Notice that the node address changed from the previous messages; this is because the container was rescheduled onto a new Docker host.
# The node:addr was `node.addr=10.0.57.22` before being rescheduled
2016-03-08T21:27:32.010368912Z container create 1cee6bff97a4d86995dd9b126130d40be9b02c6d373e6c0faa32c05110f98475 (com.docker.swarm.id=48d1d26490c4943fdb98e6f08be9b62003cd5105e06d9bad94dde2e5913c374b, com.docker.swarm.reschedule-policies=["on-node-failure"], image=redis, name=redis-server, node.addr=10.0.195.84:2375, node.id=DWJS:ES2T:EH6C:TLMP:VQMU:4UHP:IBEX:WTVE:VTFO:E5IZ:UBVJ:ITWW, node.ip=10.0.195.84, node.name=ip-10-0-195-84)
Review
Here is what happened so far: the node running Redis died, Swarm re-created the container on a surviving host, and the Flocker volume moved along with it; the container now just needs to be started.
If we run docker ps -a against Swarm, we can see the Redis container in the Created state. So, in this case, we can start it manually and Redis is back up and running on a new node!
$ docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
1cee6bff97a4        redis               "/entryp.."         4 minutes ago       Created                                 ip-10-0-195-84/redis-server
root@ip-10-0-204-4:~# docker start redis-server
redis-server
Let’s connect to the Redis server and make sure the data we added still remains.
$ docker run -it --net=overlay-net --rm redis sh -c 'exec redis-cli -h "redis-server" -p "6379"'
redis-server:6379> GET mykey
"Hello"
The data is still there! That said, given the current state of rescheduling, it’s not recommended to rely on it yet.
During our tests, we did come across users who said the container did start automatically. We also came across users who said rescheduling didn’t work at all, or who wound up with two identical containers when the failed Docker host came back.
Either way, there are certainly kinks to work out, and it’s part of the community’s job to help test, report, and fix these issues so that rescheduling works reliably. We will update this post along the way to show how rescheduling improves!
Happy Swarming! Be sure to check out Part III!
We’d love to hear your feedback!