This post was written by Karl Matthias at the New Relic blog.
All ops teams share the need to meld an interruptive work stream with a planned one, and it’s hard to get that right. In the Site Services team in Site Engineering at New Relic, we use a Kanban process to manage our workflow. We’re pretty happy with how it’s working for us, so in this post I’ll share what we did and why.
Oh, That Interruptive Work
The biggest challenge of running an effective ops team is dealing with interruptive issues while managing a planned work stream. It’s all too easy to prioritize interruptive work over ongoing projects or important work already in progress. This leads to long delays in delivering new projects, as well as a lot of wasted effort when work that was begun is never finished. For example, if you spend two days on a small project but are then pulled onto a new task and never get the time to return, those two days are wasted. And if everyone on your team is doing that often, the wasted time really adds up. In the longer term it can be demoralizing to a team because important or interesting projects are never completed or fall hugely behind schedule.
Switching tasks introduces overhead on everything the team does. When an engineer picks a task back up, it takes some time for their productivity to return to where it was when they put it down. Having multiple tasks in progress at a time for any engineer causes all of the tasks to take longer than necessary and introduces stress that further lowers morale.
Ops teams also need to provide expected delivery dates since they often serve many other parts of the company. Development groups may be waiting on new environments, or upgraded packages, or help troubleshooting. Without the ability to fairly accurately predict completion times, frustration grows among the team’s internal customers.
Any engineer working in ops for any length of time is familiar with these issues. I’ve worked on teams where we never overcame any of these challenges. It was hard to feel successful even when we did get things accomplished, and the focus of the team gradually became wholly reactive with little forward planning. But it doesn’t have to be like that.
A Light at The End of The Tunnel
Here’s one tool to help achieve an effective team. It’s no silver bullet, and without effective team management it won’t solve anything. But if the team buys into it and really works to use it effectively, it can make a huge difference.
Kanban (“signboard” in Japanese) has been a growing trend in development teams over the last decade or so. It’s based on using a physical representation of work in the form of a card detailing the task to be completed, and visually tracking the workflow on a board shared by the team. The benefits for development teams are strong, but we’ve found that it also works very well for ops teams. Kanban itself comes from Toyota’s manufacturing system and research on just-in-time production done in the late 1940s. The version we use in software and operations looks only a little like the original Toyota system, but it shares the most important focus: on-time delivery of a high-quality product. There are a few critical ideas:
- Limit work in progress: Only a set number of things can be in progress at any one time for the whole team.
- Prioritize completion of work in progress over new work: Anything already in progress should be completed before new work is taken into the system. Getting completed work approved should come before taking new work into the system.
- Manage the flow of work through the system: Actively monitor and identify hold-ups in the system by daily review.
- Visualize the workflow: Make a clear visual representation of the work so that progress can be obviously monitored.
How We Use It
Much of the original thinking about how to implement Kanban for ops as part of a development team came to me by way of a previous role at AboutUs in Portland, which I later brought to and helped refine at MyDrive Solutions in England. At New Relic we started from that foundation and we’re further evolving it as we go.
We track each task with a biggish, 4×6” card. The task is written with Sharpie in big, readable letters on the front of the card; the description should make the objective obvious. Any needed detail can be written on the back of the card.
Cards are estimated on a scale of “1” to “3”. We don’t try to estimate based on some fictional effort scale; we estimate based on how long we think it will take a member of our team to do a task. A “1” is up to one day of work, a “2” is more than one day of work, and a “3” is up to half a week of work. If the card would be higher than a “3”, it is deemed too large and needs to be broken into smaller tasks. We try very hard to estimate tasks as “1” or “2” because cards rated a “3” tend to be much less accurately estimated. We then write the estimate and circle it at the top left of the card where it’s clearly visible.
Crucially, the name of the stakeholder who requested the work is also written on the top of the card. This is the person who will be responsible for accepting that the work was completed, and to whom any questions about the work can be directed.
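If you want to mirror the physical cards in a script or spreadsheet export, the card described above can be modeled roughly as follows. This is only a sketch: the class name `Card` and its field names are made up for illustration, and the only rules it encodes are the ones stated above (an estimate of 1, 2, or 3, and a named stakeholder on every card).

```python
from dataclasses import dataclass

# The only estimates allowed on a card; anything bigger must be split up.
VALID_ESTIMATES = (1, 2, 3)


@dataclass
class Card:
    """One task card. Field names are hypothetical, chosen for this sketch."""
    title: str         # front of the card: the objective, in plain words
    estimate: int      # circled number at the top left: 1, 2, or 3
    stakeholder: str   # person who requested the work and signs it off
    details: str = ""  # back of the card: any needed detail

    def __post_init__(self):
        # Reject cards that are too large instead of silently accepting them.
        if self.estimate not in VALID_ESTIMATES:
            raise ValueError(
                f"estimate {self.estimate} is not 1, 2, or 3; "
                "break the task into smaller cards"
            )


card = Card("Upgrade staging packages", 2, "Dev team lead")
print(card)
```

Rejecting an oversized estimate at creation time is the software equivalent of handing the card back and asking for it to be broken up.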
We’re not yet using these two additions at New Relic, but they have worked well for me in the past: 1) writing the request date and completion dates on the bottom of the card to track delay in getting work accomplished, and 2) putting a check mark on the card for each day it remains on the board. This lets you review estimates in your retrospectives to determine where things went wrong and to learn from them. (We’ll be adding these at New Relic shortly.)
Cards come from a lot of sources. Various teams make requests of us, we write things up that we feel the team needs to do to be stewards of our infrastructure, and we also have ongoing projects.
Finally, we tape the cards to the board with a little piece of painter’s tape. It seems to work better than other solutions: cards don’t fall off and go missing.
We’re a small team so we currently operate from a foam core board which is nicely portable, but at MyDrive we had a large magnetic whiteboard, and at AboutUs we had an even larger plexiglass board.
Here’s our board:
On the left is the planned queue of new work, followed by work slots available to do work. Each team member should have an avatar showing who is doing what. For very small teams, engineers can each have a named slot. When work is completed, it moves to the acceptance queue. When it has been signed off, only then do we move it to done. It is the responsibility of the person(s) who worked on the card to get it signed off.
One of the critical things to get right in any workflow system is scheduling. We prioritize tasks each week and try to get them done in roughly that order. Because cards are different sizes, you need to pay attention to where they fall in the week if you truly want to accomplish them. For example, a “3” that falls on a Thursday is not likely to get completed. But front-loading the week with big tasks is not a great idea, either, because it risks all the smaller tasks getting stuck behind some incorrectly estimated large tasks. It takes a little practice to get the feel for what makes sense for your team, but here are some principles to help:
- Make sure there is at least one card scheduled each week for any ongoing projects in order to keep momentum going, even at the expense of pushing off some other important work to the next week. This has a noticeable effect on keeping things moving.
- Keep a good balance of customer-driven demand and team-driven needs. Make sure that team-sourced cards are scheduled each week so that you don’t fall into too much technical debt while trying to serve customers.
- Be ruthless about breaking up big tasks. Smaller cards flow through the system much better and are more likely to be predictable in size.
- Don’t schedule more work than you think you can take on. It doesn’t make anyone feel good to see a huge list of tasks that can’t be done, and it doesn’t help stakeholders get accurate estimates of when their work will be completed. If you are lucky enough to finish all your work early, have an extraordinary planning session and put a few more cards on the board. Then follow the normal weekly process afterward.
Here’s how we operate on a weekly basis. We have a daily stand-up focused around progress on the Kanban board. We plan one week of work at a time because this is the longest time period where it’s still possible to make fairly accurate predictions. In stand-up we try to clear things from the acceptance queue, then from work in progress, and only then do we pull in new work. We have no more than one work slot per person on the team. I’ve found in the past that with larger teams you should limit this to fewer than one per team member; teammates often work together or pair program, and having too many slots violates the principle of limiting work in progress. In that case, you tend to have things on the board that aren’t being worked on.
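For teams that keep a digital copy of the board, the flow described above can be sketched in a few lines. Everything here is illustrative rather than a description of any tool we use: the `Board` class and its method names are invented, and it assumes a work slot is freed as soon as a card moves to the acceptance queue.

```python
from collections import deque


class Board:
    """A toy Kanban board: queue -> in progress -> acceptance -> done."""

    def __init__(self, team_size):
        self.wip_limit = team_size  # at most one work slot per person
        self.queue = deque()        # planned work, highest priority first
        self.in_progress = []
        self.acceptance = []
        self.done = []

    def pull_next(self):
        """Start the next queued card, but only if a work slot is free."""
        if len(self.in_progress) >= self.wip_limit or not self.queue:
            return None
        card = self.queue.popleft()
        self.in_progress.append(card)
        return card

    def finish(self, card):
        """Work is complete; the card now waits for stakeholder sign-off."""
        self.in_progress.remove(card)
        self.acceptance.append(card)

    def sign_off(self, card):
        """Only a signed-off card counts as done."""
        self.acceptance.remove(card)
        self.done.append(card)

    def standup_agenda(self):
        """Cards in stand-up order: acceptance first, new work last."""
        return self.acceptance + self.in_progress + list(self.queue)


board = Board(team_size=2)
board.queue.extend(["upgrade packages", "new staging env", "rotate logs"])
board.pull_next()
board.pull_next()
print(board.pull_next())  # both slots are taken, so nothing is pulled
```

For a larger team, passing a number smaller than the head count as `team_size` gives the "fewer than one slot per person" limit mentioned above.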
Here’s what our schedule looks like:
- Monday morning: Schedule the work for the week. (We take on planning of one week at a time.)
- Daily: Stand-up near the beginning of the day, the same time every morning. Focus on completing tasks from the right to left on the board.
- Daily: Write up and estimate new cards as requests come in for new work. Put into the backlog if they aren’t immediately urgent. Prioritize into the queue as needed if they are. Notify affected customers if this changes the schedule of their work being delivered.
- Friday: Planning session for the next week. Everyone on the team presents the things they think need to be done. Those are added to the backlog and we then take on the amount of work we think we can complete. The planned work is then actually scheduled on Monday, and often with a few changes.
Moving things along
We track the velocity per week by totaling the number of points accomplished in the week and keeping track of how many person-days were available to the team that week (e.g. we reduce that if someone was out sick, or was on-call, etc). The average velocity for the previous few weeks is used in planning how much work to accept for a week. Because tasks tend to come up during the week, we only ever take on about the average velocity up front — it’s almost certain 4-5 points will get added during the week as important things come up (we’re an ops team after all).
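As a rough illustration of that arithmetic, here is one way the numbers could be combined. The exact formula is an assumption on my part: this sketch turns the last few weeks into a points-per-person-day rate and scales it by the person-days available in the coming week. The function name `planned_points`, the `window` parameter, and the sample figures are all made up.

```python
def planned_points(history, available_person_days, window=3):
    """Estimate how many points to take on for the coming week.

    history: list of (points_completed, person_days_available) tuples,
             oldest week first.
    window:  how many of the most recent weeks to average over.
    """
    recent = history[-window:]
    total_points = sum(points for points, _ in recent)
    total_days = sum(days for _, days in recent)
    if total_days == 0:
        return 0.0  # no history yet, so there is nothing to base a plan on
    # Points per person-day over the window, scaled to next week's capacity.
    return total_points * available_person_days / total_days


# Hypothetical history: three weeks of (points, person-days).
history = [(18, 20), (15, 15), (21, 25)]
print(planned_points(history, available_person_days=20))  # prints 18.0
```

Normalizing by person-days keeps a week with someone out sick or on call from dragging the average down unfairly.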
During the week, for those things that are truly urgent, we write a card and run it up to the top of the queue to be the next task. For things that really have to be interruptive (site down!) we select someone to work on it and their card sits until they come back, or someone else can finish it.
It’s important to note that we don’t expect that anyone “owns” a task. If they have to move to something else because of a particular expertise, or someone is ill, etc., we expect that someone else can complete it. It stays on the board, blocking a slot, until it is completed. We have a sense of interchangeability: the team owns the tasks and the product, but no individuals do. This helps facilitate the flow of work through the system and also has the benefit of exposing everyone in the team to different parts of the system. More eyes on different parts of the system also improve quality.
Managing this workflow works best if you schedule retrospectives every so often to take a look back at work completed, how the process is addressing the needs of the team, and how you can improve things. Each team will be a little different and this review lets you fine tune your methods for your particular needs.
What we see from using Kanban for our team is that we don’t leave things unfinished, we communicate more clearly with stakeholders about the work we’re delivering, we’re more accurate at predicting when something will be completed, and stakeholders feel reassured knowing that their card is on the board for the week or will be in the following week. Most people are willing to wait a little while to get work accomplished as long as they know it will be done, it will be done to a high standard, and you can accurately predict when it will be completed.
Team members are happy to see work getting completed, to have a visual representation of accomplishment, and to be able to accurately identify which weeks were highly productive. Even though we’re all still working really hard, the lowered stress of completing one thing at a time pays dividends in productivity.
That’s how we use Kanban. We think it does a lot to address the core struggle of managing an ops team, in particular by limiting the costs of interruptive work and making more accurate estimates of how long work will take to complete. Hopefully this gave you a good picture of how your team might use it too.