What is the number one sysadmin skill?
The ability to solve problems, right? We’re not talking about sudoku and crosswords here. Errors and delays can cost millions. With scale comes complexity, and an exponential increase in things that could go south. In production. At four in the morning.
And here lies the challenge. Sysadmins are not superhumans. They are susceptible to stress and fatigue just like everybody else.
We know that prolonged stress is detrimental to health. We also know that fatigue impairs our capacity for even basic problem solving. A diminished problem-solving capacity may not pose a problem in jobs measured by the traditional metric of productivity, i.e. output per hour. But for jobs that require ideas and innovative solutions, productivity is a rather poor measure of success.
It’s hard to shoehorn some of the most important things we do in life into the category of “being productive.”
Kevin Kelly, The Post Productive Economy
Sysadmin teams consist of problem-solving humans. And the pressing question is, how can those teams reach and sustain their potential?
We, too, have pondered this for years.
As more engineers joined the Server Density team, we’ve been reassessing how we handle production incidents, how we escalate issues, how we distribute our on-call workload, how we collaborate, and how we learn from failure. All those efforts were aimed at the same goal: to nurture the problem-solving capacity of the humans behind systems.
How do we minimize interruptions? How do we safeguard downtime and renewal? How do we minimize stress and fatigue? How do we build software that is more in line with how the human brain works?
HumanOps is a collection of principles that address those questions. It shifts our focus away from systems and towards humans. It starts from a basic conviction, namely that technology affects the wellbeing of humans just as humans affect the reliable operation of technology.
At Server Density we’ve observed a strong correlation between human and system metrics. Reduced stress leads to fewer errors and escalations. Reduction in incidents and alerts leads to better sleep and reduced stress. Better sleep leads to better time-to-resolution metrics.
What’s the average number of interruptions and wake-ups our engineers experience per month? How many late shifts and weekend calls do they get?
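Questions like these can be answered from an ordinary paging log. The sketch below is purely illustrative and is not how Server Density computes anything: the `PAGES` records, the `classify` helper, and the hour thresholds (before 07:00 counts as a wake-up, after 19:00 as a late shift) are all assumptions chosen for the example.

```python
from collections import Counter
from datetime import datetime

# Hypothetical paging log: (engineer, local time the page was delivered).
PAGES = [
    ("alice", datetime(2016, 5, 2, 3, 40)),   # Monday, middle of the night
    ("alice", datetime(2016, 5, 7, 14, 5)),   # Saturday afternoon
    ("bob",   datetime(2016, 5, 3, 11, 15)),  # Tuesday, working hours
    ("bob",   datetime(2016, 5, 4, 23, 50)),  # Wednesday, late evening
]

def classify(ts):
    """Label a page by when it landed. The thresholds are assumptions."""
    if ts.hour < 7:
        return "wake_up"
    if ts.weekday() >= 5:          # 5 = Saturday, 6 = Sunday
        return "weekend"
    if ts.hour >= 19:
        return "late"
    return "working_hours"

def monthly_summary(pages):
    """Count pages per (engineer, month, kind)."""
    summary = Counter()
    for engineer, ts in pages:
        summary[(engineer, ts.strftime("%Y-%m"), classify(ts))] += 1
    return summary

summary = monthly_summary(PAGES)
for key in sorted(summary):
    print(key, summary[key])
```

Tracking these counts per person rather than per team makes it easier to see when the on-call load falls unevenly.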
As software makers, we have significant opportunity and responsibility here. How do you spot issues before they cause downtime? How do you reduce incidents and mitigate stress? How do you present this data in a more intuitive way?
Here is a wireframe for an upcoming Server Density feature called alert history. Notice the Cost column? It measures the cost of incidents in actual human hours.
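To make the idea of a human-hours cost concrete, here is a minimal sketch of one way such a figure could be derived. The formula (incident duration times the number of people involved, plus an optional allowance for recovery time) and the function name `incident_cost_hours` are assumptions for illustration, not the calculation behind the Cost column.

```python
from datetime import datetime

def incident_cost_hours(opened, resolved, responders, recovery_hours=0.0):
    """Estimate the human cost of an incident in hours.

    Assumed formula: duration * responders + recovery_hours, where
    recovery_hours stands for time off granted after a night call.
    """
    duration = (resolved - opened).total_seconds() / 3600
    return duration * responders + recovery_hours

# A 90-minute incident at night, handled by two engineers,
# with one hour of recovery time added afterwards.
cost = incident_cost_hours(
    opened=datetime(2016, 5, 2, 3, 40),
    resolved=datetime(2016, 5, 2, 5, 10),
    responders=2,
    recovery_hours=1.0,
)
print(cost)  # 4.0
```

Even a rough estimate like this puts a two-line alert and a four-hour disruption on the same scale, which is the point of showing cost next to each incident.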
Below is a preview of an upcoming feature for iOS, called sparklines. Sparklines condense full-blown charts into smaller inline expressions that illustrate trends. Sparklines are a perfect match for the iPhone because they offer visual cues about “what’s happening?”, allowing sysadmins to quickly decide whether they need to reach for their laptop right away or whether they can finish dinner first.
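For readers unfamiliar with the technique, a sparkline can be approximated in plain text with Unicode block characters. This is a generic sketch of the idea, not the iOS implementation; the `sparkline` function and the sample load figures are made up for the example.

```python
BARS = "▁▂▃▄▅▆▇█"  # eight block heights, lowest to highest

def sparkline(values):
    """Map a series of numbers onto block characters, scaled to its own range."""
    if not values:
        return ""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A flat series has no trend to show; draw it at the lowest height.
        return BARS[0] * len(values)
    top = len(BARS) - 1
    return "".join(BARS[round((v - lo) / span * top)] for v in values)

# Hypothetical CPU load samples over the last few minutes.
load = [0.4, 0.5, 0.7, 1.9, 3.2, 2.8, 1.1, 0.6]
print(sparkline(load))
```

Because the series is scaled to its own minimum and maximum, the line shows the shape of the trend rather than absolute values, which is what makes it readable at a glance on a small screen.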
We will expand on these and many more HumanOps features in the near future. The important thing to remember is that HumanOps features create bridges between systems and humans. They also present information in a way that is easy for humans to pick up at a glance.
Another key area of focus for HumanOps is on-call work.
The anxiety associated with being available out of hours stems from the lack of control. It doesn’t matter if the phone rings or not. Being on-call and not being called is, in fact, more stressful than a “busy” shift. It is this non-stop vigilance, having to keep checking for possible “threats”, that is unhealthy.
How do you restore the feeling of control? How do you measure and track the human cost of out-of-hours incidents and escalations? All those considerations fall squarely under the HumanOps agenda.
We Want to Hear From You
HumanOps is a collection of questions, principles, and ideas aimed at improving the life of sysadmins.
- Humans build and fix systems.
- Humans get tired and stressed; they feel happy and sad.
- Systems don't have feelings yet. They only have SLAs.
- Humans need to switch off and on again.
- The wellbeing of human operators impacts the reliability of systems.
- Alert Fatigue == Human Fatigue
- Automate as much as possible, escalate to a human as a last resort.
- Document everything. Train everyone. Save time.
- Kill the shame game.
- Human issues are system issues.
- Human health impacts business health.
- Humans > systems
A challenge like this could never be tackled by one engineer, team, or company on their own. So we couldn’t be more excited about having Spotify, Barclays, Yelp, M&S, and Gov.uk join HumanOps. Even more teams are contributing their insights.
The first two HumanOps meetups (London and SF) were a tremendous success, and more worldwide events are coming soon. Stay tuned.