A Checklist for Building a DevOps Organization: Part III
If team members resist automating routine tasks, your DevOps efforts will soon stall on their inability or lack of motivation.
Operations Intelligence and Management Reporting
The leadership team will be interested in summary data such as utilization of computing resources, application uptime, and performance indicators such as the percentage of SLAs (service level agreements) met. Core monitoring systems provide the basic information for such reporting, but further aggregation and presentation will be required, so custom batch jobs that collect and aggregate operational data will have to be designed and implemented. Presentation layers can be custom dashboards built with popular PHP or Node.js frameworks, or standard reporting tools such as Actuate, Tableau, or MicroStrategy. Once the operational data of interest has been collected, insights can be drawn from it using any BI tool; such tools might already be in use by business groups.
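As a rough illustration of the kind of batch aggregation described above, the following Python sketch rolls daily uptime records up into two summary figures per application. The record layout, the `summarize` function, and the idea of measuring SLA compliance as "share of days that met the target" are assumptions made for this example; a real job would read from whatever the monitoring system exports.

```python
from dataclasses import dataclass

MINUTES_PER_DAY = 24 * 60


@dataclass
class DailyUptime:
    """One day of uptime for one application (hypothetical record layout)."""
    app: str
    day: str          # ISO date string, e.g. "2024-01-01"
    minutes_up: int   # minutes the application was reachable that day


def summarize(records, sla_target_pct):
    """Return per-app overall uptime and the share of days that met the target."""
    by_app = {}
    for record in records:
        by_app.setdefault(record.app, []).append(record)

    summary = {}
    for app, rows in by_app.items():
        total_up = sum(r.minutes_up for r in rows)
        uptime_pct = 100.0 * total_up / (MINUTES_PER_DAY * len(rows))
        days_met = sum(
            1 for r in rows
            if 100.0 * r.minutes_up / MINUTES_PER_DAY >= sla_target_pct
        )
        summary[app] = {
            "uptime_pct": round(uptime_pct, 2),
            "sla_days_met_pct": round(100.0 * days_met / len(rows), 2),
        }
    return summary
```

The resulting dictionary is small enough to load into a dashboard or a BI tool as-is; how SLA compliance is actually defined will depend on the agreements in question.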
Popular log aggregation tools such as Logstash and Splunk provide another set of operational intelligence data by indexing the logs. In addition to mining the standard log files, operational data can be generated on computational nodes and these tools can be used to aggregate and index custom operational metrics for analysis.
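One common way to generate custom operational metrics on a node is to write them as structured log lines that the aggregation tool can index. The sketch below is a minimal example under that assumption: the field names, the `ops.metrics` logger name, and the helper functions are all hypothetical, and the output is simply one JSON object per line, a shape most log aggregators can be configured to parse.

```python
import json
import logging
import time

# Hypothetical logger name; route it to a file or stream that the
# log shipper on the node is configured to pick up.
logger = logging.getLogger("ops.metrics")


def format_metric(name, value, host, timestamp=None):
    """Serialize one metric as a single-line JSON document."""
    event = {
        "type": "metric",
        "name": name,
        "value": value,
        "host": host,
        "ts": timestamp if timestamp is not None else time.time(),
    }
    # sort_keys keeps the output stable, which makes lines easier to diff.
    return json.dumps(event, sort_keys=True)


def emit_metric(name, value, host):
    """Write the metric through the standard logging machinery."""
    logger.info(format_metric(name, value, host))
```

A call such as `emit_metric("queue_depth", 42, "node-1")` then produces one indexable line per measurement, provided the logger has a handler attached and its level allows INFO messages.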
Commercial products are available to help with this, but home-grown solutions, supported by reporting applications, tend to be the norm in this category.
Production Engineering Processes
The tools discussed above help roll out the standard production engineering processes that are essential to a mature organization. However, when such processes are introduced in a new organization, care must be taken to ensure that each new process adds value and does not slow things down.
Release Process and Change Management
The release process normally refers to code deployment in production, and change management refers to any change that would have an impact on systems in production. By definition, the change management process covers application releases. It also keeps track of changes in infrastructure, OS and third-party software upgrades, database changes, and even one-off jobs that may have an impact on the computing resources.
The main objectives of a change management process should be to track changes made in production and to document and socialize those changes for better visibility within the company.
It is important that proposed changes are reviewed and approved by a dedicated team, and that stakeholders and business owners are notified both before and after the changes are implemented.
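To make the flow concrete, here is a minimal Python sketch of a change record that enforces approval before implementation and records a notification to each stakeholder before and after the change. The `ChangeRequest` class, its fields, and the in-memory notification list are assumptions for illustration only; a real process would usually live in a ticketing or change management system.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRequest:
    """A tracked production change (hypothetical structure)."""
    summary: str
    category: str        # e.g. "release", "infrastructure", "database"
    requested_by: str
    stakeholders: list
    approved_by: str = ""
    implemented: bool = False
    notifications: list = field(default_factory=list)

    def _notify(self, phase):
        # A real system would send email or chat messages;
        # this sketch only records who would have been notified.
        for person in self.stakeholders:
            self.notifications.append((phase, person))

    def approve(self, reviewer):
        """Record the approval and notify stakeholders ahead of the change."""
        self.approved_by = reviewer
        self._notify("before")

    def implement(self):
        """Mark the change as done; refuse if nobody has approved it."""
        if not self.approved_by:
            raise RuntimeError("change has not been approved")
        self.implemented = True
        self._notify("after")
```

Keeping the notification history on the record itself gives the audit trail that the tracking objective calls for.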
Product Documentation and Runbooks
This is built on top of the documentation platform. In a new company, product documentation will be non-existent at first, and the documentation effort will be ongoing because the applications are enhanced in every release. It is important to create operational runbooks for the applications. Set up a process to maintain them and tie it to release management. One standard question to ask in a release review meeting is what changes are needed in the operational runbook.
Document the application errors that monitoring systems and log aggregation tools will send out as alerts. Even though a self-healing production environment is the ideal, some manual intervention may still be needed.
Document routine maintenance tasks. Generating reports for internal and external customers, updating metadata, taking backups, and purging data are examples; there could be several application-specific chores that you need to do routinely. Though these tasks are typically automated, some manual steps will still be needed to deliver the services to end customers.
Make sure that runbooks do not become an excuse for not automating repetitive tasks. Line management tends to throw manpower at maintenance tasks and handle them manually. As indicated earlier, such an expensive strategy will not scale in the long run, and it can drive away staff who do not want to perform rote work. If you have team members who are happy to do routine tasks and resist automating them, you will soon notice your DevOps efforts stalling on their inability or lack of motivation to implement automation.
Applications are expected to be available at all times. Even an application with only internal users may have a user base spread across multiple geographies. Downtime of consumer web or SaaS applications should be minimal, if there is any at all; the business cannot afford it. Outages and other incidents can happen in production in the most unexpected ways, and the response to such incidents should be quick.
To have a smooth on-call process, the following things have to be in place:
Contact information of members of both development and operations teams.
A vacation calendar with up-to-date information on who is available in a specific time window.
An on-call calendar that clearly indicates who is responsible for responding to critical alerts and incidents at any given point in time.
Escalation procedures specific to applications. Normally, the on-call person has to contact a point-of-contact (PoC) in the development group as the first escalation step.
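The pieces listed above can be combined into a simple lookup. The Python sketch below is one possible shape, with assumed data layouts: the on-call calendar is a list of `(start, end, person)` shifts in priority order, vacations map each person to a list of date ranges, and `dev_contacts` maps an application to its development PoC. All names here are hypothetical.

```python
from datetime import date


def who_is_on_call(on_call_calendar, vacations, day):
    """Return the first person scheduled for `day` who is not on vacation."""
    for start, end, person in on_call_calendar:
        if start <= day <= end:
            away = any(s <= day <= e for s, e in vacations.get(person, []))
            if not away:
                return person
    return None


def escalation_path(app, day, on_call_calendar, vacations, dev_contacts):
    """Build the ordered contact list: on-call responder, then the dev PoC."""
    path = []
    primary = who_is_on_call(on_call_calendar, vacations, day)
    if primary is not None:
        path.append(primary)
    poc = dev_contacts.get(app)
    if poc is not None:
        path.append(poc)
    return path


# Example data: bob is the backup for the same week as alice.
calendar = [
    (date(2024, 1, 1), date(2024, 1, 7), "alice"),
    (date(2024, 1, 1), date(2024, 1, 7), "bob"),
]
vacations = {"alice": [(date(2024, 1, 3), date(2024, 1, 4))]}
dev_contacts = {"billing": "carol"}
```

With this data, a lookup on a day when alice is on vacation falls through to bob, and the development PoC for the affected application is appended as the first escalation step.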
This is the third of a four-part series on building a DevOps organization. Stay tuned for more!