A more apt title for this article could have been "A Checklist to Build a Production Engineering Organization," but I wanted to stress the importance of automation in every aspect of production engineering operations, and these days DevOps is a good buzzword to invoke that theme. Without automation built into its processes as a core value, a newly built team will find it hard to support the business when the business is ready to scale up. This is especially true at a time when development groups want to focus on new product features and are eager to hand off responsibilities that are not essential to building and fine-tuning those features.
An automated operations environment helps businesses make quick changes with minimum defects and downtime. An earlier attempt to summarize my thoughts on the subject can be found here. Typically, DevOps teams will be responsible for tasks of the following kind in an organization.
Infrastructure as Code
In virtualized, cloud-based environments, computing resources can be provisioned as needed. When resources are allocated on demand and elastically, building such environments manually is not practical. Your team should know how to automate provisioning and integrate those steps with provisioning tools and configuration management systems.
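As a rough illustration of the idea, the sketch below describes the desired servers as data and computes what has to change to get there. The `Server` type, the `plan` function, and the server names are hypothetical and not tied to any particular provisioning tool; a real tool would also carry out the plan through a cloud API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    """Hypothetical description of one machine: a name and an instance size."""
    name: str
    size: str


def plan(desired, actual):
    """Compare desired and actual server lists.

    Returns three lists: servers to create, servers to resize (listed with
    their desired size), and servers to destroy.
    """
    desired_by_name = {s.name: s for s in desired}
    actual_by_name = {s.name: s for s in actual}

    # Declared but not running yet.
    to_create = [s for n, s in desired_by_name.items() if n not in actual_by_name]
    # Running, but with a different size than declared.
    to_resize = [
        s for n, s in desired_by_name.items()
        if n in actual_by_name and actual_by_name[n].size != s.size
    ]
    # Running but no longer declared.
    to_destroy = [s for n, s in actual_by_name.items() if n not in desired_by_name]
    return to_create, to_resize, to_destroy
```

Because the environment is declared as data, the same definition can be reviewed, versioned, and applied repeatedly, which is what makes elastic provisioning manageable.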
Platform as Code
Identify software roles in an application stack and automate the steps to build them. Using those roles as building blocks, large-scale application environments can be stood up with the help of configuration management tools. That is the only way you can scale up operations if the consumer app you support, the internal storage service you manage, or the company's newly released SaaS offering becomes an instant hit. If you ask your prospective customers to wait, you will lose them to the competition or lose your credibility as an internal infrastructure service provider, depending on what you have been supporting.
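A minimal sketch of roles as building blocks follows. The role names, build steps, and host names are made up for illustration; in practice a configuration management tool would hold these definitions and execute the steps.

```python
# Hypothetical roles: each role is an ordered list of build steps.
ROLES = {
    "base": ["install ntp", "create deploy user"],
    "web": ["install nginx", "render nginx.conf"],
    "app": ["install python runtime", "deploy application package"],
    "db": ["install mysql", "apply schema migrations"],
}

# Hypothetical environment: each host is assembled from one or more roles.
ENVIRONMENT = {
    "web-1": ["base", "web"],
    "app-1": ["base", "app"],
    "db-1": ["base", "db"],
}


def build_steps(host_roles, roles=ROLES):
    """Expand a host's roles into an ordered list of steps without duplicates."""
    steps = []
    for role in host_roles:
        if role not in roles:
            raise KeyError(f"unknown role: {role}")
        for step in roles[role]:
            if step not in steps:
                steps.append(step)
    return steps


def build_environment(environment=ENVIRONMENT, roles=ROLES):
    """Return the build steps for every host in the environment."""
    return {host: build_steps(r, roles) for host, r in environment.items()}
```

Adding ten more web servers then means adding ten entries to the environment definition rather than repeating a manual build ten times.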
A well-tested feature should be deployed to production with minimum delay. Continuous integration, extended with automated deployment, is an effective way to both test and deploy code, but in real life most companies sit somewhere between a manual code push and a fully automated deployment process.
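To make the idea of a gated pipeline concrete, here is a small sketch in which each stage must succeed before the next one runs. The stage names and the `run_pipeline` function are hypothetical; real pipelines are normally defined in a CI server's own configuration rather than hand-written like this.

```python
def run_pipeline(stages):
    """Run (name, step) pairs in order and stop at the first failing step.

    Each step is a callable returning True on success. Returns a tuple of
    (names of completed stages, name of the failed stage or None).
    """
    completed = []
    for name, step in stages:
        if not step():
            return completed, name
        completed.append(name)
    return completed, None


# Hypothetical stages; each lambda stands in for a real build, test, or deploy job.
example_stages = [
    ("build", lambda: True),
    ("unit tests", lambda: True),
    ("deploy to staging", lambda: True),
    ("deploy to production", lambda: True),
]
```

A team that still pushes code manually can start by automating only the first stages and add the deployment stages as confidence in the tests grows.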
Monitoring and Operational Intelligence
Don't settle for only the out-of-the-box features of your favorite monitoring tool. To monitor the application stacks effectively, custom plugins have to be developed, and the team should have the skills to write them. The advent of log aggregation tools such as Logstash and Splunk has made it possible to dig errors and insights out of server logs. Again, to make these tools more useful, code has to be written to instrument, mine, and present operational data.
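As one example of what a custom check might look like, the sketch below computes the share of error lines in a batch of log lines and maps it to the exit codes used by Nagios-style plugins (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The log format, the `" ERROR "` marker, and the thresholds are assumptions chosen for illustration.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_error_rate(lines, warn=0.05, crit=0.20):
    """Return (status_code, message) for the share of ERROR lines.

    Assumes each log line contains its level as a space-delimited word,
    e.g. "2024-01-01 12:00:00 ERROR something failed".
    """
    lines = [line for line in lines if line.strip()]
    if not lines:
        return UNKNOWN, "UNKNOWN - no log lines to inspect"

    errors = sum(1 for line in lines if " ERROR " in line)
    rate = errors / len(lines)

    if rate >= crit:
        code, label = CRITICAL, "CRITICAL"
    elif rate >= warn:
        code, label = WARNING, "WARNING"
    else:
        code, label = OK, "OK"
    # The text after "|" follows the common plugin convention for performance data.
    message = f"{label} - {errors} of {len(lines)} lines are errors | error_rate={rate:.4f}"
    return code, message
```

To use it as a plugin, a small wrapper would read the log lines, print the message, and exit with the returned status code so the monitoring tool can act on it.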
Automation of Routine Tasks
The production engineering team in any company will have a long list of items to carry out periodically that tend to defy any classification. Some of the things I have done or have been responsible for in the past from this category are the following:
- Weekly reports containing aggregates from business systems.
- Operational performance and computational usage data.
- Data extracts generated for both internal and external customers.
- Updating of various metadata used by applications.
- Reprocessing of data to correct issues with aggregation done earlier.
- Security audits to meet internal and regulatory requirements, such as SOX compliance.
These chores are normally handed over to the production engineering team along with lengthy procedures. They should be automated as much as possible, both to spare team members the rote work and to avoid the mistakes a bored worker is likely to make while carrying out a routine task.
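A weekly report is a typical candidate for this kind of automation. The sketch below turns daily records into weekly totals; the CSV layout with `date` and `amount` columns and both function names are hypothetical, and a real job would read from the business system and be run by a scheduler.

```python
import csv
import io
from collections import defaultdict
from datetime import date


def weekly_totals(csv_text):
    """Sum the 'amount' column per ISO week.

    Expects CSV text with 'date' (YYYY-MM-DD) and 'amount' columns.
    Returns a dict keyed by (iso_year, iso_week), sorted by week.
    """
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        iso = date.fromisoformat(row["date"]).isocalendar()
        totals[(iso[0], iso[1])] += float(row["amount"])
    return dict(sorted(totals.items()))


def render_report(totals):
    """Format weekly totals as CSV text, one line per week."""
    lines = ["week,total"]
    for (year, week), total in totals.items():
        lines.append(f"{year}-W{week:02d},{total:.2f}")
    return "\n".join(lines)
```

Once a chore like this is a script, it can be scheduled, tested, and rerun after a data correction without anyone following a written procedure by hand.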
Tools and processes alone will not solve any problem. You will always need talented people on a team to get things done in line with the larger goals of the company with the use of minimum resources. Complex tools in the hands of incompetent people will only result in creating more chaos. With that general warning, let’s see what we need in this area.
I have already stressed the importance of automation a few times, and that normally means writing code. Tool vendors will always argue that you can do everything from the dashboards of the products they peddle. However, a production engineering team with good coding skills can extend third-party tools, build custom tools if the situation demands it, and collaborate well with development teams to add features that improve the operability of the applications.
A traditional operations team during the data center era consisted of system and network admins, database and third-party application admins, and application support engineers. In the last such company where I worked, we had system admins, Oracle and MySQL admins, Tableau and MicroStrategy admins, and application support engineers. Very few new companies can afford such a division of labor, but they might still need people to cover similar job responsibilities.
The important thing is to find people who are not married to particular technologies and products, who are open to learning new ones, and who are comfortable using coding as one of the tools in their toolbox to solve problems.
It is important to have both system administration and coding skills available in a production engineering team. If the company can afford only a few people on the team, its members need to be versatile. A large team can still afford to have specialists. The exact composition of a team will ultimately depend on the specific requirements of the business, but a team without substantial coding skills as a group will not get much automation done. Failure to automate operational tasks will become a bottleneck as the company grows and the need to scale up becomes pressing.
An operations infrastructure must be in place to roll out the related processes in a new organization. Some components will be there already even before a production engineering group is formally set up because those things (such as a ticketing system) are essential to running a high-tech company. Parts of that infrastructure will be shared with other groups also, mainly development, as part of collaboration. Though it is hard to generalize, a production engineering organization would require some form of the tools and applications from the following list.
The infrastructure can be divided into two broad categories. The first is the set of processes that need to be rolled out and owned by the production engineering group. Examples are the release process, incident management, and on-call. The second is the set of tools needed to roll out automation projects and the production engineering processes.
This is the first of a four-part series on building a DevOps organization. Stay tuned for more!