1,000 Jenkins Jobs: A DevOps Journey to Nirvana
As the head of DevOps, my frustration was approaching Will Smith slapping Chris Rock levels. Here’s what happened.
Living the DevOps Dream (Almost)
My life heading a DevOps team theoretically should have been a dream. We all were in high demand and paid incredibly well for simply being better at Googling stuff than most people. And the best part is that no one really knew we existed... at least until things didn’t work.
And therein lay the rub. Nothing ever works all the time. So our Slack on-call channel was like a booby-trapped war zone, constantly exploding with repetitive messages such as:
- “Metrics aren’t reporting to…”
- “We need a new server”
- “My deployment is stuck”
- “Where are my logs”
- “I need admin access to…”
- “How much did we spend on EC2”
- “VPN is not working…”
- “Why did my pipeline fail”
- “Can I get a firewall rule for..”
- “What does this error mean”
- “Help me onboard”
- “Need to update this manifest file in the repo”
- “My dog is stuck in a tree” (OK, we didn’t get that one, but I wanted to make sure you are still reading this)
Sound familiar? If you are a DevOps engineer (or an SRE, platform engineer, or the like), then I’m guessing that, unfortunately, it does. But fear not, this tale eventually does get better.
Days of Toil and Zero Innovation
As a DevOps practitioner, my primary job is to drive organizational innovation and efficiency. So, of course, I had big plans for improving infrastructure, replacing legacy systems with more modern ones, designing sophisticated alerting to know what was happening in our systems at any given time, and more.
Instead, my team and I found ourselves drowning in a sea of “help me.” Everyone — from R&D to Finance and HR — wanted a piece of us to provision cloud resources, trigger and track complex workflows, generate cost reports, onboard new employees, and more. And of course, we always needed to investigate and grant the appropriate permission levels for each request.
This left us with zero bandwidth, as our days, nights, and weekends were endlessly overloaded with context switching, repetitive requests, and “super-urgent” tasks.
But What Happened to “You Build It, You Run It”?
“You build it, you run it.” This DevOps approach, attributed to Amazon CTO Werner Vogels, describes how Amazon improved the quality of its services and the speed with which they were released by erasing the separation between developers and operations. This also should have eased the workload on ops teams, allowing them to focus on innovation.
However, the DevOps reality for many companies is far different, as evidenced by countless articles and Reddit threads. I’ve personally spoken to dozens of DevOps professionals in different organizations, and the story is the same. While developers are keen on coding and creating the next best thing, they are much less interested in learning and managing all the underlying infrastructure their applications run on.
Did You Try Automating It All?!
If that’s what you are wondering, then yes: like any self-respecting DevOps engineers or SREs, we created automations for practically everything we could. But we found that even end users like the R&D team, who often had domain expertise in the tools we were using (and we used many, including Airflow, Argo, GitHub Actions, Helm charts, Jenkins, Pulumi, and Terraform), often could not easily navigate their way around our automations.
Despite being fairly common tools, the unique way they were set up in our organization made it nearly impossible for the end user to figure out whether the automated workflows we had created were actually the right ones for what they needed.
Additionally, developers were usually not granted the requisite permissions to directly access many cloud resources, so even if they knew exactly what to do, they still could not do it without getting permission from our team.
Bottom line: we were right back where we started, with an endless stream of requests to help trigger the right automated workflows.
All this grief led to me creating a slackbot with around one thousand hard-coded workflows that end users could choose from. I first made sure to give each flow a very clear and descriptive name so end users would be able to figure out what it was meant for.
I then connected the slackbot to all our tools and associated workflows, and end users could use a simple command to list all workflows.
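To make the idea concrete, here is a minimal sketch of a hard-coded workflow registry with a "list" command. It is not the original bot: the workflow names, the `workflow` decorator, and `handle_command` are all hypothetical, and the handlers only return placeholder strings instead of calling real tools.

```python
from typing import Callable, Dict

# Hypothetical registry: descriptive workflow name -> handler function.
WORKFLOWS: Dict[str, Callable[..., str]] = {}


def workflow(name: str, description: str):
    """Register a handler under a clear, descriptive name."""
    def register(func):
        func.description = description  # shown by the "list" command
        WORKFLOWS[name] = func
        return func
    return register


@workflow("ecs-find-service-name", "Look up the full name of an ECS service")
def find_service_name(partial: str) -> str:
    # Placeholder: a real handler would query the cloud provider here.
    return f"(would search for services matching '{partial}')"


@workflow("ecs-restart-service", "Restart an ECS service by its full name")
def restart_service(full_name: str) -> str:
    # Placeholder: a real handler would trigger the restart here.
    return f"(would restart '{full_name}')"


def handle_command(text: str) -> str:
    """Parse 'list' or 'run <workflow> [args...]' and dispatch."""
    parts = text.split()
    if parts == ["list"]:
        return "\n".join(
            f"{name}: {fn.description}" for name, fn in sorted(WORKFLOWS.items())
        )
    if len(parts) >= 2 and parts[0] == "run":
        name = parts[1]
        fn = WORKFLOWS.get(name)
        if fn is None:
            return f"Unknown workflow '{name}'. Try 'list'."
        try:
            return fn(*parts[2:])
        except TypeError:
            return f"Usage error: '{name}' was called with the wrong arguments."
    return "Usage: list | run <workflow> [args...]"


if __name__ == "__main__":
    print(handle_command("list"))
    print(handle_command("run ecs-restart-service payments-api-prod"))
```

With a thousand entries, the registry itself stays simple. The descriptive names and the per-handler code are where the real effort goes.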
To illustrate how the slackbot was used, say one particular ECS service had a memory leak. The responsible developer could now trigger a workflow to query AWS and find out exactly what the full name of the service was (the developers usually would not know that sort of thing). They could then use the full name (which was required) in a different workflow that would restart the specific service. (Of course we “rarely” would restart services to patch memory leaks and only did this when we didn’t have time for a proper root cause analysis.)
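The two-step flow in this example (look up the full name, then restart by full name) might look roughly like the sketch below. Everything here is an assumption made for illustration: `FakeEcs` is an in-memory stand-in rather than a real AWS client, the service names are invented, and the prefix-based allow list is just one possible guardrail. A real version would likely go through boto3's ECS client, for example `list_services` and `update_service` with `forceNewDeployment=True`.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence


@dataclass
class FakeEcs:
    """In-memory stand-in for an ECS client (hypothetical, for illustration)."""
    restarts: Dict[str, int] = field(default_factory=dict)  # full name -> count

    def list_service_names(self) -> List[str]:
        return sorted(self.restarts)

    def restart(self, full_name: str) -> None:
        self.restarts[full_name] += 1


def find_full_names(ecs: FakeEcs, partial: str) -> List[str]:
    """Workflow 1: resolve a partial name to the matching full service names."""
    needle = partial.lower()
    return [n for n in ecs.list_service_names() if needle in n.lower()]


def restart_by_full_name(
    ecs: FakeEcs, full_name: str, allowed_prefixes: Sequence[str]
) -> str:
    """Workflow 2: restart one service, but only by exact full name and only
    if the caller's allow list covers it (the bot acts as the guardrail)."""
    if full_name not in ecs.restarts:
        raise ValueError(f"No service named '{full_name}'; the full name is required")
    if not full_name.startswith(tuple(allowed_prefixes)):
        raise PermissionError(f"Not allowed to restart '{full_name}'")
    ecs.restart(full_name)
    return f"Restarted {full_name}"


if __name__ == "__main__":
    ecs = FakeEcs({"payments-api-prod-7f3a": 0, "search-api-prod-02de": 0})
    matches = find_full_names(ecs, "payments")
    print(matches)
    print(restart_by_full_name(ecs, matches[0], allowed_prefixes=["payments-"]))
```

The developer never holds cloud credentials in this arrangement. The bot does, and it exposes only the narrow operations each workflow allows.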
All in all, the slackbot both compensated for the developers’ lack of domain expertise and acted as a proxy with guardrails, allowing them to do what they needed without over-permissioning. And, most importantly, it helped reduce my team’s toil by 70% while eliminating the long delays end users typically experienced.
However, there were many drawbacks to that slackbot.
Chatbots Are Kinda… Robotic (and Hard to Maintain)
The slackbot I created was not exactly user-friendly. It forced users to learn and choose from a static list of workflows and use predetermined words or slash commands. It certainly could not handle anything not already included in its rule-based, canned interactions. These “out of scope” requests would leave end users empty-handed until, of course, they came knocking on our DevOps door.
But what was far worse was the maintenance. I tried to enforce a standard programming language for each workflow, but with so many to create and many DevOps cooks in the kitchen, this proved to be impossible. If one workflow broke, figuring out all the dependencies and how to fix it took way too much sweat, blood, and tears. If I wanted to add a brand new workflow, it also required a very significant effort.
Conversational AI and DevOps Nirvana
In summary, my personal experience (along with first-hand reports from many DevOps professionals I’ve spoken to) has driven me to explore the use of conversational AI for solving DevOps toil. I’m setting the bar high: my dream solution would allow me to innovate all day, meditate all night, and keep end users happy, without having to spend insufferable amounts of time on slackbot maintenance or, worse, manually handling all the repetitive requests I used to get.
So far things are looking promising.