At the end of my Jenkins World 2016 talk, “So you want to build the world's largest Jenkins cluster,” I gave a brief demonstration of a Jenkins cluster with 100,000 concurrent builds to give people an idea of just how far Jenkins clusters can scale.
My talk did not have anywhere near the budget of Sacha’s keynote, where they were able to fire up a PSE cluster of over 2,000 masters with 8,000+ concurrent builds. The idle reader might be wondering how exactly I was able to achieve 100,000 concurrent builds and what exactly were the tricks I was playing to get there.
Okay, let’s get this over with. I did cheat a little, but only in the places where it was safe to do so.
If you want to have a Jenkins cluster with 100,000 concurrent builds, you need to ask yourself, “what is it exactly that we want to show?”
I can think of two answers to that question:
- We are really good at burning money.
- The Jenkins masters can handle that level of workload.
Given my constrained budget, I can only really try to demonstrate the second answer.
Can a Jenkins cluster handle the workload of 100,000 concurrent builds?
Most of the work that a Jenkins master has to do when a build is running on the agent can be broken down as follows:
- Streaming the console log from the agent over the remoting channel and writing that log to disk.
- Copying any archived artifacts over the remoting channel onto the master’s disk when the build is completed.
- Fingerprinting files on the remote agent.
- Copying any test reports over the remoting channel onto the master’s disk when the build is completed.
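To make the fingerprinting item concrete: Jenkins records a fingerprint as an MD5 checksum of a file's contents. The following is a minimal sketch of that idea only, not Jenkins's own code; the function name `fingerprint` and the chunk size are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def fingerprint(path, chunk_size=64 * 1024):
    """Return the MD5 hex digest of a file, reading it in chunks.

    Hypothetical helper: it mirrors the idea of a content fingerprint,
    not the actual Jenkins implementation.
    """
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        # Read fixed-size chunks so large artifacts never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Because the digest depends only on content, two builds that archive byte-identical files produce the same fingerprint, which is why a load test needs artifacts with fresh content on every build.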
A well-integrated Jenkins cluster might also include:
- Copying artifacts from upstream jobs into the build agent’s workspace (potentially from the disk of a different master in the cluster).
- Triggering any downstream jobs (potentially on a different master in the cluster).
The rest of the workload of the build is actually compiling tests, running tests, etc. These all take place on the build agent and do not have any effect on the master.
So as long as:
- the agent streams back a console log (at more than 60 lines per minute based on my survey of typical builds), potentially with I/O flushes for every line output,
- there are new files (with random content to defeat remoting stream compression) on the agent workspace to be archived and fingerprinted, and
- there are new test results with different content each build written to the agent workspace...
...then we don’t actually have to do a real build.
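The three conditions above can be sketched as a small script that produces the same kinds of output a real build would. This is an illustration only and not the Mock Load Builder plugin's implementation: the function name `mock_build`, the file names, and the default sizes are all hypothetical.

```python
import os
import sys
import time
from pathlib import Path


def mock_build(workspace, build_number, log_lines=60, line_interval=1.0,
               artifact_bytes=64 * 1024, out=sys.stdout):
    """Produce console output, an artifact and a test report (hypothetical)."""
    workspace = Path(workspace)
    workspace.mkdir(parents=True, exist_ok=True)

    # 1. Console log: one line at a time, flushed after every line.
    for n in range(log_lines):
        out.write(f"[mock] build {build_number} step {n}\n")
        out.flush()
        time.sleep(line_interval)

    # 2. Artifact: random bytes, so stream compression cannot shrink it.
    artifact = workspace / "artifact.bin"
    artifact.write_bytes(os.urandom(artifact_bytes))

    # 3. Test report: JUnit-style XML whose content changes with each build.
    report = workspace / "TEST-mock.xml"
    report.write_text(
        '<testsuite name="mock" tests="1" failures="0" errors="0">\n'
        f'  <testcase classname="mock.Build{build_number}" name="run"/>\n'
        '</testsuite>\n',
        encoding="utf-8",
    )
    return artifact, report
```

With the defaults this emits 60 lines at one line per second; lowering `line_interval` raises the log rate above the 60 lines per minute mentioned above.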
So in April 2014, I created the Mock Load Builder plugin. This plugin allows you to define a build step that will appear to the Jenkins master just like a regular build, but without generating nearly as much of a CPU requirement on the build agent.
However, when you are aiming for 100,000 concurrent builds, even the Mock Load Builder plugin is not enough, as each build will fork a JVM to perform the “mock” build. Now, okay, we don’t need lots of memory in that JVM, but it’s still at least 128 MB, and that will add up to quite a lot of RAM when we have 100,000 of them running at the same time.
So I added another layer of mocking to the Mock Load Builder plugin: fakeMockLoad. With this system property set, the mock load will actually be generated directly on the agent JVM instead of in a JVM forked from the agent JVM.
We are still generating all of the same console logs, build artifacts, test reports, etc., only now we are not paying the cost of forking another JVM. Phew, that was 13 TB of RAM saved.
But hang on a second. Each build agent is going to use at least 512 MB of RAM. That’s over 50 TB of RAM, or 25 x1.32xlarge AWS instances: almost $350/hr for On Demand instances just for the agents (and these are not exactly doing real work). We won’t have much to show other than a headline number.
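The RAM figures quoted above are back-of-the-envelope numbers, and the arithmetic can be reproduced roughly as follows. The per-JVM sizes and the build count come from the text. The x1.32xlarge memory size (1,952 GiB) and the hourly price are assumptions filled in for illustration and should be checked against current AWS documentation.

```python
import math

BUILDS = 100_000                      # concurrent builds, from the text

# Assumptions, not taken from the article:
X1_32XLARGE_GIB = 1952                # memory per x1.32xlarge instance
X1_32XLARGE_USD_PER_HOUR = 13.338     # assumed On Demand price


def total_ram_gib(jvm_count, mib_per_jvm):
    """Total RAM in GiB for jvm_count JVMs of mib_per_jvm MiB each."""
    return jvm_count * mib_per_jvm / 1024


def instances_needed(total_gib, gib_per_instance):
    """Smallest whole number of instances whose RAM covers total_gib."""
    return math.ceil(total_gib / gib_per_instance)


forked_jvm_gib = total_ram_gib(BUILDS, 128)   # saved by not forking a JVM
agent_gib = total_ram_gib(BUILDS, 512)        # one agent JVM per build
count = instances_needed(agent_gib, X1_32XLARGE_GIB)

print(f"forked JVMs avoided: {forked_jvm_gib:,.0f} GiB")
print(f"agent JVMs:          {agent_gib:,.0f} GiB")
print(f"x1.32xlarge needed:  {count}")
print(f"hourly cost (USD):   {count * X1_32XLARGE_USD_PER_HOUR:,.2f}")
```

Depending on rounding and on whether decimal or binary units are used, the instance count lands within one or two of the figure quoted above; the point is the order of magnitude, not the exact number.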
Well, as part of my load testing for the JNLP4 protocol, I wrote a test client that can set up at least 4,000 JNLP connections from the same JVM. Maybe we could use a modified version of that to multi-tenant the JNLP build agents on the same JVM. The workload on the master is a function of how many remoting channels there are and how much data is being sent over those channels.
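The idea of hosting many agent connections in one process can be illustrated with a toy example. This is not the JNLP4 protocol and not the modified remoting.jar: it is a minimal sketch using plain TCP on localhost, with a stand-in "master" that only counts channels and bytes. All names here (`run_demo`, `fake_agent`, `master_side`) are hypothetical.

```python
import asyncio


async def run_demo(n_agents, lines_per_agent):
    """Open n_agents connections from one process; return (channels, bytes)."""
    stats = {"channels": 0, "bytes": 0, "done": 0}

    async def master_side(reader, writer):
        # Stand-in for the master: count each channel and the data it carries.
        stats["channels"] += 1
        while data := await reader.read(4096):
            stats["bytes"] += len(data)
        writer.close()
        await writer.wait_closed()
        stats["done"] += 1

    server = await asyncio.start_server(master_side, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]

    async def fake_agent(agent_id):
        # Each "agent" is a coroutine and a socket, not a separate JVM.
        _, writer = await asyncio.open_connection("127.0.0.1", port)
        for n in range(lines_per_agent):
            writer.write(f"agent-{agent_id:05d} line {n:05d}\n".encode())
            await writer.drain()
        writer.close()
        await writer.wait_closed()

    await asyncio.gather(*(fake_agent(i) for i in range(n_agents)))

    # Wait until the server side has drained every channel before reporting.
    while stats["done"] < n_agents:
        await asyncio.sleep(0.01)
    server.close()
    await server.wait_closed()
    return stats["channels"], stats["bytes"]


if __name__ == "__main__":
    print(asyncio.run(run_demo(n_agents=50, lines_per_agent=10)))
```

From the server's point of view, the load depends only on the number of channels and the bytes sent over them, which is the property the multi-tenant approach relies on. Scaling this toy to thousands of connections would also require raising the process's open-file limit.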
It turns out that with a special multi-tenant remoting.jar I can run nearly 10,000 build agents using a c4.8xlarge. At $1.675/hr, that is much more reasonable than $16/hr. Plus, even better, we have fewer machines to set up.
Everything else in my cluster is real: 500 real masters (running in Docker containers divided between an x1.32xlarge and a pair of c4.8xlarge) and a CloudBees Jenkins Operations Center (running naked on a dedicated
I was somewhat constrained by disk space packing all those masters into a small space. If I had divided the masters across a larger number of physical machines rather than trying to cram 400 masters onto the same x1.32xlarge, I could probably have had the cluster run for more than 90 minutes.
There is a video I remembered to capture while spinning up the cluster just before my talk. Two of the build agent machines were running out of disk space at the time, which is why the masters I checked are running about 160 concurrent builds each.
I had (for all of 90 minutes) a Jenkins cluster of 500 masters, each with 200 build agents, for a combined total of 100,000 concurrent builds. Yes, there were issues keeping that cluster running within the budget I had available. Yes, there are challenges maintaining a system with that number of concurrent builds. Yes, I did make some cheats to get there. But Jenkins masters and Jenkins clusters can handle that workload, provided you have the hardware to actually support the workload in the first place!