Faced with limited financing and a high burn rate, many startups focus on product development and application coding at the expense of operations engineering. The reasons for this focus are understandable to some extent: companies need to ship product, and unseasoned CEOs don’t always see the value in investing in IT Ops. Some call this movement toward operating without IT Ops “NoOps” or “serverless.”
Yet there are “dusty old concepts,” as Charity Majors calls them, that resurface when companies assume that things like scalability and graceful degradation will take care of themselves. The problem becomes even more significant when developers try to compensate for the missing Ops team by creating DIY tools to fill the gap. To paraphrase the poet Robert Browning, their reach exceeds their grasp. And with that reach comes technical debt.
What Is Technical Debt?
Martin Fowler offers a useful definition of technical debt. He describes it as follows:
You have a piece of functionality that you need to add to your system. You see two ways to do it, one is quick to do but is messy – you are sure that it will make further changes harder in the future. The other results in a cleaner design, but will take longer to put in place.
In this explanation, you can see the tradeoff being made. The quick and messy way creates technical debt, which, like financial debt, has implications for the future. If we choose not to pay down the technical debt, we continue to pay interest on it. In development, that means repeatedly going back to the quick and dirty piece of technology and paying with extra effort that wouldn’t otherwise have been necessary.
Alternatively, developers can invest in better design and, in the case of this argument, bring in IT operations to handle the things Ops does best: scalability, graceful degradation, queries, and availability.
Unfortunately, DIY practitioners often don’t invest the time in thinking through this tradeoff. Given financial circumstances, rushed deadlines, short-sightedness, or some combination thereof, they choose the quick and dirty option.
A Cautionary DIY Tale
My colleague Andrew Ben, OnPage’s VP of Research and Development, spoke with Nick Simmonds, the former Lead Operations Engineer at Datarista. Nick described his experience at a previous company, where one of his first jobs was to gain control over a DIY scaling tool that had been developed in-house. The tool was created before any operations engineers had been hired; as such, it was designed as a “quick and dirty” method of provisioning servers.
According to Simmonds, the tool’s faults were significant. It was designed to eliminate the need for manual scaling of microservices, but it simply spun up new instances with no code on them at all. When servers were spun down, the tool never checked whether the code was working on the newest servers before it destroyed the old ones. And because the tool didn’t reliably push code to the new instances, the company was left with new servers running nothing.
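The failure mode Simmonds describes, destroying old servers before verifying that the new ones actually run code, is avoidable with a simple gate. Here is a minimal sketch, assuming a hypothetical /healthz endpoint on each instance and a terminate callback supplied by whatever provisioning layer is in use:

```python
import time
import urllib.request

def is_healthy(host, path="/healthz", timeout=5):
    """Return True if the instance answers its health endpoint with HTTP 200."""
    try:
        resp = urllib.request.urlopen(f"http://{host}{path}", timeout=timeout)
        return resp.status == 200
    except OSError:
        return False

def safe_rotate(new_hosts, old_hosts, terminate, healthy=is_healthy,
                retries=10, wait=6):
    """Destroy old instances only after every new instance passes its health check."""
    for _ in range(retries):
        if all(healthy(h) for h in new_hosts):
            for host in old_hosts:
                terminate(host)      # old capacity is removed only now
            return True
        time.sleep(wait)             # give the new servers time to come up
    return False                     # new servers never got working code; keep the old ones
```

If the new instances never come up with working code, the rotation refuses to terminate anything, which is precisely the check the DIY tool was missing.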
A significant part of the problem with the tool Nick’s colleagues built is that it came with no alerting component. His team only recognized a failure once it was live in production. No monitoring, no alerting. Nothing was in place to let Simmonds’ team know a deployment had failed.
Never Try DIY on DevOps Alerting
I don’t want this cautionary tale to cause nightmares for any young DevOps engineers out there. I wouldn’t want that on my resume. Instead, I want to impress upon application developers the need to be mindful of how “quick and dirty” impacts future operations and releases. Tools shouldn’t be created as a temporary hack until Ops gets on board.
Teams should invest the time to create a robust piece of code or tool. Alternatively, if they don’t have the time, they should invest in existing tools that accomplish the desired result. Furthermore, and this is something Nick brought out in his interview with Andrew, never try to hack together a tool for monitoring and alerting.
Alerting is too important and complex to leave to a hack or to technical debt. Here are some of the main requirements a robust alerting tool needs to meet:
Doesn’t Rely on Email for Alerting
Alerting through email is a good way for alerts to get lost. Just remember that Target’s engineers received an email alert indicating anomalous traffic several days before they recognized the full extent of the theft of their users’ credit card information. The email alert got downplayed.
Alerts need to continue in a loud and persistent manner until they are responded to.
If the person singled out for the alert cannot answer it, the alert needs to escalate to the next person on-call.
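The two requirements above boil down to a loop: keep paging, and walk down the on-call list when no one acknowledges. Here is a minimal sketch; the notify and acknowledged callbacks are hypothetical stand-ins for a real paging integration:

```python
import time

def page_until_acknowledged(alert, on_call, notify, acknowledged,
                            attempts_per_person=3, retry_wait=60):
    """Notify each on-call engineer in turn until someone acknowledges the alert."""
    for engineer in on_call:                  # escalation order
        for _ in range(attempts_per_person):  # loud and persistent
            notify(engineer, alert)
            if acknowledged(alert):
                return engineer               # someone took ownership
            time.sleep(retry_wait)
    raise RuntimeError(f"Alert unacknowledged by entire on-call chain: {alert!r}")
```

The key design point is that the loop never silently gives up: if the whole chain stays quiet, it fails loudly rather than letting the alert die.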
Elevates Above the Noise
There is so much noise from alerts in IT that an alert needs to grab your attention. This elevation can come through redundancies that alert you via phone, email, and app (or by being loud enough to wake you up at 2 a.m.).
Creates Actionable Alerts
When alerts are sent to the engineer, make sure the alerts come with actions that the recipient should follow. Simple "alert" statements don’t help rectify the situation.
For reporting and service improvement reasons, you want to know when an alert was sent and at what time it was received.
You want to be able to attach text files or images to alerts to amplify the amount of useful data sent in the messages.
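The three points in this section can be carried in the alert payload itself. Here is a sketch of such a message; the field names are illustrative, not any particular product’s schema:

```python
import time

def build_alert(summary, actions, attachments=()):
    """Build an actionable alert: what happened, what to do, and supporting data."""
    return {
        "summary": summary,                # what went wrong
        "actions": list(actions),          # concrete steps for the recipient
        "attachments": list(attachments),  # log excerpts, images, graphs
        "sent_at": time.time(),            # for reporting: when it was sent
        "received_at": None,               # filled in on delivery
    }

alert = build_alert(
    "New instances came up with no code running",
    actions=["Check the provisioning logs", "Route traffic back to the old servers"],
    attachments=["deploy-2047.log"],
)
```

An alert shaped like this tells the recipient what to do, carries its supporting evidence, and records the timestamps needed for later reporting.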
Uses Web Hooks or APIs So It Can Grow
Enable your alerting tool to grow as your application grows so that it can integrate with software that enhances your capabilities.
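As one illustration of that flexibility, most tools that expose webhooks accept a JSON POST. Here is a minimal sketch using only the standard library; the endpoint URL is hypothetical:

```python
import json
import urllib.request

def build_webhook_request(url, alert):
    """Package an alert as a JSON POST request for a webhook endpoint."""
    return urllib.request.Request(
        url,
        data=json.dumps(alert).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send_alert(url, alert):
    """Deliver the alert; any tool that accepts webhooks can now receive it."""
    with urllib.request.urlopen(build_webhook_request(url, alert), timeout=10) as resp:
        return resp.status
```

Because the integration surface is just a URL and a JSON body, the same alert can later be routed to new chat, ticketing, or on-call tools without touching the application code that raises it.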