Creating Self-Serve DevOps Without Creating More Toil
My team was mouthing off RTFM every time a dev would ask them a question. It needed to end.
My journey to co-founding Kubiya.ai was triggered by the very real pain of being a DevOps leader supporting both broader organizational goals along with the day-to-day support of software engineers and others. This is my story (or at least the somewhat interesting parts) of what it was like and how I found a productive approach to managing it all.
DevOps Opening Hours: 2:45 p.m. until a Quarter to 3:00 p.m.
It’s really not a joke. DevOps have no time. We need to make sure everything is running smoothly from development to production and often beyond.
We are busy enhancing CI/CD processes, upgrading and configuring monitoring and alerting systems, and optimizing infrastructure for both cost and security, as well as sitting in lots of meetings. But on top of all that, we need to deal with endless and oftentimes repetitive requests from developers (and others).
This was exactly my experience as the head of DevOps at my previous company.
The repetitive requests were not only taking up around 70% of my team’s time and energy; they also had the team mouthing off “RTFM” every time a dev asked a question.
In Search of DevOps Self-Serve that Works
So I started exploring different solutions for enabling a self-service culture in the organization to reduce the burden of the DevOps team while empowering developers to do more on the operational side.
But before describing the solutions I explored, I want to mention several things that were constantly on my plate as head of DevOps:
- Planning, building, and managing multiple types of pipelines for the R&D teams, which included CI/CD pipelines, self-service automation, and infrastructure as code (IaC)
- Dealing with permissions requests from different personas in the organization while keeping the security team in the loop
- Taking care of the onboarding and offboarding process of engineers in the company
- And of course the maintenance of all the different tools to accomplish those tasks
While doing all of this, my team had to keep an eye on several Slack channels to answer questions from the tech organization, such as:
- “Where are my logs?”
- “Why did my pipeline fail?”
- “Why did my service stop responding?”
- Kubernetes questions, cloud questions, and more
So we tried several different approaches.
Internal Developer Platforms: Loads of Development
Internal developer platforms, typically created and maintained by a dedicated team, are a combination of common tools and technologies that can be used by software developers for easier access and management of complex infrastructure and processes.
While we considered building a developer platform, there simply was too much planning required to make this a reasonable endeavor.
For starters, we had many tools in use, and in such a dynamic organization the number and types of tools were constantly changing, which made ongoing curation a real challenge. Beyond bringing all these tools together into a single platform, training and adoption were not realistic with a software development team singularly focused on coding the next best thing. We found ourselves struggling to decide how to create such a platform and what features it should include.
Naturally we worried about creating a new kind of toil. Even if we reduced the workload coming from developer requests, embracing an internal developer platform looked like it would bring fresh pain with managing the platform lifecycle, adding new features, and as always, supporting different users.
Workflow Automation Is Not So Automatic
Of course our industry’s standard solution — and one that our tech organization already understood — was to automate everything possible.
So, utilizing tools that our developers were already familiar with, such as Jenkins, we created automation for nearly any issue you can think of. As soon as an issue occurred, we were able to create a dedicated pipeline to solve it.
The main problem was that developers didn't know where to find these workflows. And even if they knew where to find the workflow, they did not usually know which parameters to fill in. So we were back to the same routine of devs approaching us for help.
Permissions were another issue. It was very risky for us to allow large groups to trigger workflows. With so much potential for chaos, deciding who should have authority to run workflows was no easy task.
Even granting permissions and approvals on an ad hoc basis with the security team took a lot of time and effort. Permission lifecycles were also problematic, as offboarding a user who left the company required special handling.
New workflows required adding permissions and defining who would receive those permissions.
Finally, each time an existing workflow was edited to include more actions, a fresh review of both the workflow and associated permissions was required.
Chatbot to the Rescue
Because developer requests came to us via a chat interface (Slack, in our case), whether through direct messages to me or my team members or via the on-call channel, I decided to create a chatbot, specifically a Slackbot. For permissions management, I connected our chatbot to our company’s identity provider.
This allowed the Slackbot to know the role of anyone who approached it. This made it easy to create unique policies for different user roles (defined by the identity provider) in terms of consuming the operational workflows via the Slackbot.
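To make the idea of role-based policies concrete, here is a minimal sketch, not the actual bot code. The role names, workflow names, and the `can_run` helper are all hypothetical; it assumes the identity provider has already told us the user's role.

```python
from dataclasses import dataclass

# Hypothetical policy table: which workflows each identity-provider role may trigger.
ROLE_POLICIES: dict[str, set[str]] = {
    "developer": {"fetch-logs", "restart-service"},
    "devops": {"fetch-logs", "restart-service", "scale-cluster"},
}


@dataclass(frozen=True)
class User:
    name: str
    role: str  # assumed to be supplied by the identity provider


def can_run(user: User, workflow: str) -> bool:
    """Return True if the user's role is allowed to trigger the workflow."""
    # Unknown roles get an empty set, so they are denied by default.
    return workflow in ROLE_POLICIES.get(user.role, set())
```

Denying unknown roles by default keeps the bot from becoming a back door when a new role appears in the identity provider before anyone has written a policy for it.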
For context gathering, the Slackbot would ask the relevant questions, guiding users to provide the details needed to fill in as parameters of the different workflows that already existed for CI/CD tools like Jenkins, cloud infrastructure, and more.
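The context-gathering step can be sketched roughly as follows. This is an illustration under assumptions, not our production bot: the `WORKFLOW_PARAMS` table, the question wording, and the `ask` callback (which would stand in for sending a Slack message and waiting for the reply) are all made up for the example.

```python
from typing import Callable

# Hypothetical workflow definitions: for each workflow, the parameters it
# needs and the question the bot asks when a parameter is missing.
WORKFLOW_PARAMS: dict[str, dict[str, str]] = {
    "fetch-logs": {
        "service": "Which service do you need logs for?",
        "environment": "Which environment (e.g. staging, production)?",
    },
}


def gather_params(
    workflow: str,
    known: dict[str, str],
    ask: Callable[[str], str],
) -> dict[str, str]:
    """Fill in a workflow's parameters, asking only for the ones still missing."""
    params = dict(known)  # copy so the caller's dict is not modified
    for name, question in WORKFLOW_PARAMS[workflow].items():
        if not params.get(name):
            params[name] = ask(question).strip()
    return params
```

In a terminal you could try it with `gather_params("fetch-logs", {"service": "billing"}, input)`, which would prompt only for the environment.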
Besides solving the lack of domain expertise and lack of interest in operations, our Slackbot acted as a proxy with guardrails for developers. This allowed them to do what they needed without over-permissioning.
Most importantly, it reduced my team’s workload by 70% while sparing end users long waits in a virtual queue.
Trouble in Chatbot Paradise
While this was amazing, our Slackbot was still not 100% user-friendly. Users had to choose from a static list of workflows and use predetermined words or slash commands.
Our Slackbot was also unable to handle anything not included in its rule-based, canned interactions. As a result, our Dev team would be left empty-handed in cases of "out of scope" requests.
Maintaining the Slackbot, however, was far worse. With so many workflows to create and so many DevOps cooks in the kitchen, I could not enforce a standard programming language. Whenever one workflow broke, figuring out all the dependencies to find a fix took way too much effort. Adding a brand-new workflow also required significant effort and domain expertise.
Which brought us all the way back to the same problem of more toil in managing the Slackbot.
AI-Driven Virtual Assistants
Exploring and experiencing the pros and cons of the various solutions led me to understand that the key to success is finding a solution that benefits both the developers AND DevOps.
That means a system that can ask the developer questions if context is missing, transforming context gathering (which DevOps would normally have to handle) into simple conversations.
Using NLU to infer user intent from a phrase fragment and offer to execute the relevant workflows is another area where AI can improve the end-user experience — so even if a developer only knows, for example, a part of the name of a cluster or service, the virtual assistant can figure out what it is that they need.
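Real NLU is well beyond a short snippet, but the "partial name" case can be approximated with plain string matching. The sketch below is a deliberately simple stand-in, not NLU: it tries a substring match first and then falls back to fuzzy matching with Python's standard `difflib`. The `resolve_name` helper and the service names are hypothetical.

```python
import difflib


def resolve_name(fragment: str, known_names: list[str], limit: int = 3) -> list[str]:
    """Suggest known cluster/service names that match a partial or misspelled name."""
    fragment = fragment.lower().strip()
    # First pass: the fragment appears verbatim inside a known name.
    hits = [name for name in known_names if fragment in name.lower()]
    if hits:
        return hits[:limit]
    # Fallback: fuzzy matching to tolerate typos (best match first).
    # Note: difflib compares case-sensitively, so this assumes lowercase names.
    return difflib.get_close_matches(fragment, known_names, n=limit, cutoff=0.5)
```

A bot could then reply with something like "Did you mean one of these?" and list the candidates, rather than failing on an inexact name.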
This combination of all of the features of a chatbot — plus the ability to understand or learn the user's intentions (on the fly) even if it’s something new and unfamiliar — keeps workflows flowing.
In addition to all this, my conversations with Kubiya.ai customers made it clear that a self-service approach needs to be tied in with security as well. Being able to easily manage user permissions both upfront in the form of policy creation for different users and groups as well as with just-in-time, ad hoc approvals is key to a successful self-serve solution.
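One way to picture the combination of upfront policies and just-in-time approvals is the sketch below. It is an assumption-laden illustration rather than how any particular product works: the `AccessPolicy` class, the group names, and the time-to-live mechanism are all invented for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class AccessPolicy:
    # Upfront policy: group name -> workflows its members may always run.
    standing: dict[str, set[str]]
    # Just-in-time grants: (user, workflow) -> moment the grant expires.
    grants: dict[tuple[str, str], datetime] = field(default_factory=dict)

    def approve(self, user: str, workflow: str, ttl: timedelta, now: datetime) -> None:
        """Record an ad hoc approval that expires after `ttl`."""
        self.grants[(user, workflow)] = now + ttl

    def is_allowed(self, user: str, groups: list[str], workflow: str, now: datetime) -> bool:
        """Allow if a group policy covers the workflow or a grant is still valid."""
        if any(workflow in self.standing.get(group, set()) for group in groups):
            return True
        expiry = self.grants.get((user, workflow))
        return expiry is not None and now < expiry
```

Because each ad hoc grant carries its own expiry, temporary access lapses on its own instead of needing to be revoked later, which also eases the offboarding concern raised earlier.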
In summary, my experience building a self-serve culture has shown me that having an efficient system in place is essential for companies that want to move fast, as it ensures that all parties (operations, development, and security teams) can get their work done with the least amount of toil and friction.