Log data can be an indispensable tool for doing an effective Retrospective following a technical disaster. Yet, often the data is misused. And many think that the entire Retrospective process is flawed altogether. More often than not Retrospectives, also known as Post Mortems, turn into technical autopsies. A bunch of people get together to find out how the system died and who is responsible. Data becomes nothing more than evidence of a crime. It’s not fun or particularly productive.
However, there is an opportunity to do things differently. You can have a learning Retrospective. A learning Retrospective uses data to learn the truth about disastrous incidents and to find effective ways to decrease the likelihood of such incidents happening again.
I have organized these ideas into a list of 7 Rules for Using Log Data Effectively in a Retrospective. The rules are:
- Learn, don’t blame
- Know the scope of the system
- Make sure you have all the relevant logs
- Make sure the logs line up with the timeline
- Separate the noise from the information
- Make sure the biases are known
- Make sure you deal in facts and not counterfacts
Allow me to elaborate.
1. Learn, don’t blame
Before we move on to the nuts and bolts of how to use log data effectively within a Retrospective, we need to understand the fundamental difference between a blame seeking Post Mortem and a learning based Retrospective. As the name implies, a blame seeking Post Mortem looks to find the person responsible for the disaster at hand and to inflict punishment on the identified culprit.
Dave Zwieback (@mindweather) does a good job of describing a blame seeking Post Mortem in his book, Beyond Blame. It goes like this: the network at a Wall Street investment firm goes down. Traders can’t work; money is lost. A network engineer is identified as the culprit. He is fired. The company believes that the problem is solved now that the guilty party has been removed from the scene. Yet, nobody really understands what happened. Why did an experienced engineer with no malevolent intent do what he did? Was there something about the system that contributed to the outage? Is the existing system likely to cause another outage? The questions never get answered mostly because the person who was on the front line battling the disaster and has the most information has been removed from the scene. It’s a typical scenario that produces no useful outcome.
Those in the know understand that assigning blame does not work. You can have all the data in the world. But, if it is used as a tool for inquisition, you’re going to have a hard time learning the truth. If the result of a Retrospective is someone losing his or her job, you perpetuate a Cover Your Ass culture. Yes, people will delete log files or reset machine clocks, if they think their jobs are on the line. When it comes to making the choice between eating and lying, eating usually wins out.
When you move from a punishment seeking post mortem to a learning focused retrospective, the process is to establish facts, analyze the facts in an unbiased manner and then to make changes based on rational determination. Also, it is explicitly understood that no one ever gets fired during an ongoing Retrospective for telling the truth or accurately reporting his or her behavior. Information shared during a Retrospective is privileged. Why? Because protecting people when they truthfully report facts at hand eliminates the cover your ass mentality that hinders organizational improvement.
2. Know the scope of the system
Data lives in a system. Yet often Retrospectives do not have enough data because the data provided falls short of the true scope of the system. For example, let’s imagine that an online movie site goes down. The usual practice is to assemble the System Admins, networking people and maybe the application team. These people get together in a room and go over machine logs trying to figure out what went wrong, yadda, yadda, yadda. But, there is an important group missing. What about Customer Service? When a site goes down, it is not unusual for the phones to light up. People want to know when the movie will continue. In fact, a lot of callers might want to know when a particular movie will continue. Wouldn’t it be interesting to have it known that a particular movie was the subject of a lot of calls when the system went down?
The point here is that you need to understand the scope of your system in order to gather all the data that is relevant to the incident at hand. If you have only a subset of the system represented in the Retrospective, you won’t get the whole picture. And, without all the information from all parties within the system, you run the risk of making determinations that are inaccurate and establishing new processes that are not effective.
3. Make sure you have all the relevant logs
Logs come in a variety of forms, no pun intended. Of course there are the ones that IT folks typically rely upon: machine logs, event logs, and application logs. Yet there are more logs to be had, chat for example.
Chat is a critical part of a customer service representative’s day. And, ChatOps is a growing practice in a number of development shops. In a ChatOps environment, teams use chat technology as a daily part of their work activity, particularly when doing software releases and system upgrades. For all intents and purposes, these chat conversations become relevant log data and need to be known during a Retrospective.
Another source for relevant log data is voice recording. Those in transportation and law enforcement have been using voice data for years. IT work is primarily computer based. Thus, the notion of using voice recordings might feel awkward. However, customer service representatives at banks such as Chase and CitiBank and tech sites such as GoDaddy and eHost have hundreds, if not thousands, of recorded voice conversations occurring on a given day. You’d better believe that when disaster occurs, the voice recordings within the scope of the incident become very relevant. If you don’t have such information on hand, you have a problem.
4. Make sure the logs line up with the timeline
Creating an accurate timeline is the first step in a learning based Retrospective. All members of the Retrospective need to know what happened and when it happened. Log files provide the concrete data needed in order to establish an accurate timeline, provided the timestamps on the log files are in sync with the actual events. An entry marked 1 AM in the log file needs to have actually happened at 1 AM, not 1:10 AM.
J Paul Reed (@SoberBuildEng) runs a consultancy, Release Engineering Approaches. His firm helps companies establish learning based Retrospectives. Reed recommends that companies make sure that all enterprise devices emanating data are synced to a common reference clock using NTP. Otherwise, when an incident does occur and a device’s time synchronization at the time of the incident is unknown, you are going to have to go to a great deal of labor to figure out the device’s time synchronization after the fact.
Timestamps in Logentries
Logentries captures data in real time and during this process associates a timestamp with each event. As a best practice, you should always synchronize devices and systems using NTP. However, when using Logentries you can ensure an accurate timeline by using the timestamp available in the Logentries Web User Interface.
The important thing to understand is that having an incident timeline is critical and the logs contributing to that timeline need to be synchronized. Otherwise, to some degree, you are guessing.
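To make the idea concrete, here is a minimal sketch of lining up events from several sources on one timeline. It assumes you have already worked out how far each source's clock was off from the reference clock. The Event type, the build_timeline function, the source names and the sample entries are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Event:
    source: str          # hypothetical source name, e.g. "web-01" or "chat"
    timestamp: datetime  # the time as written by the source's own clock
    message: str


def build_timeline(events, clock_offsets):
    """Return the events sorted by corrected time.

    clock_offsets maps a source name to how far its clock ran ahead of the
    reference clock (negative if it ran behind). Sources without an entry
    are assumed to be in sync.
    """
    corrected = []
    for event in events:
        offset = clock_offsets.get(event.source, timedelta(0))
        corrected.append(
            Event(event.source, event.timestamp - offset, event.message)
        )
    return sorted(corrected, key=lambda e: e.timestamp)


# Made-up sample data: the web server's clock ran 10 minutes behind, so an
# entry stamped 1:00 AM actually happened at 1:10 AM.
utc = timezone.utc
events = [
    Event("web-01", datetime(2016, 1, 1, 1, 0, tzinfo=utc), "HTTP 503 returned"),
    Event("chat", datetime(2016, 1, 1, 1, 5, tzinfo=utc), "Customer reports playback stopped"),
]
timeline = build_timeline(events, {"web-01": timedelta(minutes=-10)})
for event in timeline:
    print(event.timestamp.isoformat(), event.source, event.message)
```

Notice that the raw timestamps put the server error before the customer's chat message, while the corrected timeline puts it after. That is exactly the kind of difference that can send a Retrospective down the wrong path.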
5. Separate the noise from the information
To quote Beyond Blame’s author, Dave Zwieback, mentioned above, “It’s better to have log data available than to not have it. However, raw logs are not information, they are noise.”
This is an important concept to remember: Data is noise. Information is a layer above noise. It’s similar to a flock of birds sitting in a tree, chirping away. The chirps become information when you can determine a pattern. For example, the birds chirp at a greater frequency at sunrise. When a squirrel comes by, there is an increase in chirping frequency. There are times that one bird is chirping louder than the others. Or, at sunset the chirping stops. These correlations become information.
Just because you have a lot of log data, it does not necessarily follow that you have information. You need to be able to identify patterns. Thus, being able to segment the lines of log entries into useful data structures becomes important. The same holds true should you be working with chat data or voice media. You need to be able to structure and index the data in order to have the capabilities to extract information. Log collectors allow you to structure and index data into meaningful information. You’ll do well to take advantage of such capabilities.
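As a rough illustration of turning raw lines into something you can find patterns in, the sketch below parses each line into a small record and then counts errors per minute. The line layout (an ISO timestamp, an upper-case level, then a message) is an assumption made for the example, as are the function names and the sample lines. Real logs will need their own pattern, and a log collector will typically do this work for you.

```python
import re
from collections import Counter

# Assumed line layout: "<ISO timestamp> <LEVEL> <message>"
LINE_PATTERN = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<level>[A-Z]+)\s+(?P<message>.*)$"
)


def parse_line(line):
    """Return a dict with timestamp, level and message, or None if the
    line does not match the assumed layout."""
    match = LINE_PATTERN.match(line.strip())
    return match.groupdict() if match else None


def errors_per_minute(lines):
    """Count ERROR records, grouped by the minute in which they occurred."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record and record["level"] == "ERROR":
            minute = record["timestamp"][:16]  # keeps "YYYY-MM-DDTHH:MM"
            counts[minute] += 1
    return counts


# Made-up sample lines, for illustration only.
sample = [
    "2016-01-01T10:59:58Z INFO request served",
    "2016-01-01T11:00:03Z ERROR upstream timeout",
    "2016-01-01T11:00:41Z ERROR upstream timeout",
    "not a log line at all",
]
print(errors_per_minute(sample))
```

The individual lines are the chirps. The count per minute is the first hint of a pattern, such as a burst of errors that starts at a particular time.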
6. Make sure the biases are known
Bias is a natural part of information processing. For example, if you speak only English, your bias toward language is to filter all spoken sounds as English. When you meet a person for the first time and he or she speaks to you in Chinese, you are going to try to hear English. That’s your bias. You might become bewildered. But, should you be with a third person who speaks Chinese and English, that person can say, “of course you don’t understand, he’s speaking Chinese and here is what is being said.”
There is a wide variety of biases in play when people attempt to analyze data. And, most often individuals are unaware of their own biases. In order to achieve objective analysis, it becomes important for each member of the Retrospective to rely upon the other members of the team to point out when the individual is under the influence of a bias. Thus, the team will do well to have a common understanding of cognitive bias and have a list of the biases on hand always. Each team member will do well to be accepting when others point out that a bias is in play and be willing to adjust accordingly. Turning data into unbiased information is an important step when pursuing a learning focused Retrospective.
7. Make sure you deal in facts and not counterfacts
Once information is extracted from data, it can be analyzed in a blameless manner with the intent of making improvement. The trick in a learning Retrospective is to analyze facts, not counterfacts. A counterfact is a statement that describes what a person should have done, not what he or she did. For example, a fact is, “at 11 AM the CPU usage hit 100%”.
A counterfact is, “at 11 AM the system should have kicked off another node when the CPU usage hit 100%”.
Separating fact from counterfact is an important distinction. In terms of understanding an incident — what happened, when it happened and why it happened — counterfacts are useless.
Yes, any piece of log data can be turned into a counterfact. But, for what purpose? There’s a line of people around the block who have been haunted by an incident mishap to the point of becoming incapacitated when it comes to moving forward productively. Facts that are understood against a reliable timeline provide the information required to make systemic improvements. Creating a list of counterfacts that describe mistakes determined by hindsight bias offers little probability of long lasting, effective change. Working with facts can improve the system. Working with counterfacts offers nothing more than a list of indictments that punish the individual and maintain the status quo.
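If your team keeps written notes during a Retrospective, a crude check like the one below can help a facilitator spot counterfacts as they creep in. This is only a sketch. The phrase list is an assumption about how counterfacts tend to be worded, the function name is made up, and a simple pattern match will miss some counterfacts and flag some harmless sentences. It is a prompt for the team, not a judge.

```python
import re

# Assumed wording that often signals a counterfact, e.g. "should have",
# "could have", "would have", "shouldn't have", "should not have".
COUNTERFACT_PATTERN = re.compile(
    r"\b(?:should|could|would)(?:\s+not|n't)?\s+have\b",
    re.IGNORECASE,
)


def split_statements(statements):
    """Split statements into (facts, possible_counterfacts)."""
    facts, possible_counterfacts = [], []
    for statement in statements:
        if COUNTERFACT_PATTERN.search(statement):
            possible_counterfacts.append(statement)
        else:
            facts.append(statement)
    return facts, possible_counterfacts


notes = [
    "at 11 AM the CPU usage hit 100%",
    "at 11 AM the system should have kicked off another node "
    "when the CPU usage hit 100%",
]
facts, possible_counterfacts = split_statements(notes)
print("Facts:", facts)
print("Possible counterfacts:", possible_counterfacts)
```

Run against the two example statements from this section, the first lands in the facts list and the second is flagged for the team to restate as what actually happened.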
Putting it all together
Log data is a valuable asset when it comes to having the details required for an effective learning based Retrospective. However, log data is not information. We need to see the patterns in the data in order to extract information. Applying data to a reliable timeline is an important first step toward getting the information we need to think objectively.
Our biases can be obstacles to understanding facts clearly. Usually, we are not aware of our biases and thus, we need to rely upon our team members’ observations to identify when our biases come into play. Also, we need to avoid thinking in terms of counterfacts. We want to understand what happened, not what should have happened. When we avoid blame and focus on learning, we can determine what we want to happen moving forward. Once we understand what we want to happen, we can use our log data to make systemic improvements that are effective and long lasting.
Following the 7 Rules for Using Log Data Effectively in a Retrospective is an easy way to get your company on the way to having learning based Retrospectives that are constructive and make a difference.