Oyster’s Underground Nightmare: When DevOps Kills Retail
How to avoid the mistakes and lost revenue suffered by Starbucks and a host of other retailers in 2015.
How great it must have been for more than 100,000 passengers who enjoyed free rail, bus and tube travel last week after London’s ticketing system failed and station barriers had to be left open.
How nice it was for the Starbucks customers who got free coffee on April 24 last year, and how annoying for the hundreds of thousands of customers at Co-op food stores in the UK who were double-charged for their shopping in July.
These are just a few examples of embarrassing errors made by large retailers in 2015. So what caused them and how can they be avoided? To answer this, we need to consider the bigger picture: It seems 2015 marked a turning point in the enterprise software world, one that presents retailers across the globe with both the threat of costly error and an exciting opportunity.
DevOps Done Badly Can Harm Your Business
Development and operations teams are embracing DevOps practices, which means increasing the pace of development. It also means reacting quickly to customer demands and competitive pressures by shipping updates to retail and customer-service software in smaller, more frequent batches. Indeed, speed and agility are key competitive advantages that every retailer needs these days.
The question is: do you really have the practices and tools in place to ensure that, as you make frequent, small updates, you don't end up killing your company, as in the now-infamous Knight Capital incident from the investment world?
In short, you can’t just do what you always did but faster. Instead, take these three key steps to ensure you can transform your software delivery chain with confidence:
1. Don’t rely on key people for production deployments

   Too many organizations rely on a few key people to do "sensitive" deployments. This is usually because our applications are complex and unique, so knowing how to change them safely is specialized knowledge that only a few engineers possess. The trouble is we no longer have enough contingency time to allow for rogue updates. With the growing spread of endpoints in large retail companies and the faster rate of change, human error becomes inevitable, no matter how good your key people are.

   Automation workflows take the human error out of deployments without sacrificing visibility and control. Over time, your experts can keep adding pieces of specialized knowledge to an automated deployment workflow that is continuously tested in lower environments, ensuring consistency and efficiency when it's time to execute in production. After all, a computer never mixes up the order of steps, never forgets to copy a file, and never gets tired and makes mistakes.
2. Enable efficient rollback or redeploy (automation is key for speed and control)

   If you've followed step 1, this stage is a natural evolution. While the automated workflows themselves are tested in lower environments and run consistently every time, things can still go wrong elsewhere, just as they did for Knight Capital. For this reason, automated rollback is an essential part of every workflow, even if it is rarely triggered. It provides safety and reduces MTTR (mean time to repair), which in many cases is even more critical than finding the root cause itself.
3. Investigate and improve

   Deployments to massive numbers of endpoints and complex systems usually span multiple technical touch points, which is often where failures and glitches originate. Understanding a failure comes second to restoring operations, but it is essential nonetheless. The way to understand root causes in hindsight is to recreate and replay the process. If your process is manual and performed by people, it is usually almost impossible to reconstruct the exact order of events as they played out during the failed deployment. An automated process is a great help here too, especially if the underlying platform collates outputs from all touch points into a single sequential "run log" that you can review later.