The reason we used GlusterFS was to be able to have a shared storage between each node of the cluster, so we can spin an instance of any Docker image on any node without issues, as the container will use the shared storage for their business data (mounted as volume).
I first installed GlusterFS across the ocean, with one server in France and another one in Canada. It looked fine but when I started using it, my first Git clone on a GlusterFS mount point took so long that I had time to make coffee, drink a cup, and then drink a second one! So, it was not usable in production.
I decided to test the mount point by copying a big file just to see how fast it would be and whether the speed was okay. It reminds me of one good exercise by Kirk Pepperdine for optimizing a website that was way too slow because of too many connections to the database. We suffered the same issue, lots of files mean lots of connections that need to cross the ocean, and this delay in the handshake delays the copy.
I decided to benchmark the FS. Since I didn’t find a lot of frameworks to test a filesystem, I just wrote a small script in bash to generate a file with dd and /dev/urandom. I would then run this script on every partition: one ext4 partition, one Gluster partition in the same datacenter, and one Gluster partition across the ocean.
for i in $(seq 1 $NUMBER); do dd if=/dev/urandom of=$TARGET/file_$i bs=$SIZE count=$COUNT 2>&1 | grep -v records done
Then I called this generation script with another one, to create 1GB in different settings.
# Creating 10240 files of 100k export NUMBER=10240 export TARGET=`pwd`/100k export SIZE=100K sh generate.sh > 100k.log # Creating 1024 files of 1M export NUMBER=1024 export TARGET=`pwd`/1M export SIZE=1M sh generate.sh > 1M.log # Creating 100 files of 10M export NUMBER=100 export TARGET=`pwd`/10M export SIZE=10M sh generate.sh > 10M.log # Creating 10 files of 100M export NUMBER=10 export COUNT=100 export TARGET=`pwd`/100M export SIZE=1M sh generate.sh > 100M.log # Creating 1 file of 1G export NUMBER=1 export TARGET=`pwd`/1G export SIZE=1M export COUNT=1024 sh generate.sh > 1G.log
Let’s see what the results were.
In the graphs below:
- The Y-Axis is expressed in seconds.
- The ext4 partition is represented by ext4 (blue color)
- The Gluster partition in the same datacenter is represented by gluster (orange color)
- The Gluster partition across the ocean is represented by gluster-atlantic. (grey color)
Here, only one file is copied. We can see that gluster-atlantic is 1.5 times slower, and the difference between ext4 and gluster is about 30%. However, to get the replication and the security—it is worth it.
For 100 million files we have pretty much the same kind of result–one file took more time but nothing too abnormal.
For 10 million files we can see that ext4 is getting ahead of gluster by 2.2 times, and the difference between gluster and gluster-atlantic is not as much this time. We can also see some spikes that seem to appear for the same amount of data. The tests were run in different timings so we can suppose that GlusterFS triggers some work when the cache is full.
Now, gluster is closer to ext4 and we can see that crossing the Atlantic seems to take at least 0.4s!
Here is the final one. We can round the gluster-atlantic to 0.3s.
To better see the difference between ext4 and gluster, I’ve created a logarithmic graph (keep in mind that the following graph is not a linear Y axis).
The above graph shows how problematic the small files are. Unfortunately, as I’m using my own git server (gist) and since any website or app is basically now a git clone, it makes it unusable in production.
This benchmark confirmed what I learned a long time ago: establishing a connection does take time; you have incompressible latency and this latency is directly related to the distance between objects (sadly the speed of light is not enough). That’s why CDN are so useful and are used frequently.
It would have been fun to test with data centers on the West Coast and the East Coast of the United States, and then between the West Coast and Europe.
Geo-replication was the Solution
In the GlusterFS documentation, the following table lists the differences between replicated volumes and geo-replication:
With geo-replication, we can have fault tolerance in the data centers. We might still lose a really small amount of data in the last operations, but it won’t end up with a total loss of data as it’s unlikely that a data center will go down completely. If the data is crucial (like data for banks or other financial institutions) then I would probably create a replica in a nearby data center, but definitely not thousands of miles away. This is also why AWS recommends using two availability zones to ensure speed and reliability, and replicate across regions for recovery.
The Final Solution
For my cluster, I finally took another server in Canada to handle the replication and added the server in France as a geo-replication. The final architecture looks like this:
You enter the cluster by any of the NGINX servers that will redirect you to the right Kubernetes Service, which maps to the right containers.
The shared partition is handled by GlusterFS, replicated between the two servers in Canada and another safety geo-replication in the server in France.
The containers can then be shifted from one server to another without any trouble.
If I want to migrate a server, I just have to install a server, make it join the cluster, move containers around, and then stop the other server.