Leading Through the Chaos of Large-Scale Cloud Operations: 7 Best Practices

Following operational hygiene and best practices when handling large-scale incidents is critical to mitigating impact quickly and safely.

Feb. 17, 26 · Opinion

High-scale systems fail in unexpected ways that you would never have designed for. Over the past 14 years, I have navigated the layers of physical and virtual networking. I started as an individual contributor writing code for data plane services and later led global teams managing highly distributed services that owned millions of hosts. I have seen a wide range of incidents, including multi-service impacts, single-service impacts, cascading failures, single-customer issues, service failures during incident recovery, service failures post-recovery, and services that cannot auto-recover. I have also studied the root causes of major outages across the industry’s cloud leaders, and the same failure patterns recur across the industry.

While these events are inevitable, in my experience adhering to the best practices below will greatly improve your ability to handle them. These are the seven best practices I recommend to keep teams efficient during large-scale incidents and reduce the duration of impact.

1. Mitigate First, Root Cause Later

During an outage, the natural tendency of engineering teams is to hunt for the underlying cause of the issue. However, you should always steer the discussion toward how to mitigate the issue. I’ve seen teams spend critical time debugging code while customer impact was still ongoing. Most of the time, you don’t need to know the root cause to mitigate the issue or execute recovery steps.

If an incident correlates with an ongoing deployment and causes a spike in 5xx errors, you should roll back the deployment immediately rather than debugging the code to identify the bug. If a single host is experiencing failure, remove it from the fleet immediately. You can perform the deep-dive analysis later once the impact is mitigated. If you have enough hands on deck, you can divide and conquer by tasking one group with immediate mitigation and another with the root cause investigation.
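The decision rule above can be sketched in a few lines of Python. This is a minimal illustration, not a real incident tool: `IncidentSignals`, `choose_mitigation`, the action strings, and the `spike_factor` threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IncidentSignals:
    """Hypothetical snapshot of what the on-call engineer can see."""
    error_rate_5xx: float            # current fraction of requests returning 5xx
    baseline_5xx: float              # typical fraction before the incident
    deployment_in_progress: bool     # is a rollout active right now?
    failing_hosts: List[str] = field(default_factory=list)


def choose_mitigation(signals: IncidentSignals, spike_factor: float = 3.0) -> str:
    """Pick a mitigation without needing the root cause.

    spike_factor is an assumed threshold: errors count as a spike when they
    exceed the baseline by this multiple.
    """
    spiking = signals.error_rate_5xx > signals.baseline_5xx * spike_factor

    # A 5xx spike that coincides with a rollout: roll back first, debug later.
    if spiking and signals.deployment_in_progress:
        return "rollback_deployment"

    # Failure isolated to one host: take it out of the fleet.
    if len(signals.failing_hosts) == 1:
        return f"remove_host:{signals.failing_hosts[0]}"

    # No obvious safe mitigation: escalate rather than debug alone.
    return "escalate"
```

Note that nothing in the function inspects code or logs. It only looks at symptoms, which is the point: the root cause investigation can run in parallel or afterward.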

2. Don’t Be a Hero: Ask for Help When You Need It

Earlier in my career, I mistakenly thought reaching out for help would be seen as a sign of weakness. I was always tempted to try to solve every incident myself in order to prove my technical and operational ability.

Over time, I have realized that this approach is often wrong: it makes you a single point of failure and delays mitigation. A useful rule of thumb is to escalate the moment you are blocked. A peer who is a domain expert or senior tech lead brings more experience to the table and can correlate the incident with previous outages, something you won’t be able to do easily under pressure or when you’re blocked.

3. Test Your Tools Regularly

Reliable tooling is critical to handling an operational incident. Teams often rely on scripts or automation that are used only during rare events, and these are likely to fail because they aren’t exercised regularly. Not having tools available when you need them most delays mitigation further and pressures teams into manual, untested approaches. An untested approach can easily introduce errors, which may increase the impact or delay mitigation even more.

You should treat your operational tools with the same rigor and quality as your production code. One way to do this is to run unit and component-level tests in a pre-production environment whenever the tools are updated or their dependencies change. By catching breaking changes immediately rather than during a large-scale event, you will increase your team’s effectiveness and operational posture for handling service outages.
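As a rough illustration of testing an operational tool like production code, the sketch below pairs a small helper with unit tests. `plan_host_drain` and its 10 percent cap are hypothetical, invented for this example. The tests use plain `assert` statements, so a runner such as pytest could collect them on every tool or dependency update.

```python
from typing import List


def plan_host_drain(failing_hosts: List[str], fleet_size: int,
                    max_percent: int = 10) -> List[str]:
    """Hypothetical tool: pick hosts to drain, capped at max_percent of the fleet."""
    if fleet_size <= 0:
        raise ValueError("fleet_size must be positive")
    limit = fleet_size * max_percent // 100   # integer math avoids float rounding
    return sorted(failing_hosts)[:limit]


def test_drains_all_failing_hosts_when_under_limit():
    assert plan_host_drain(["b", "a"], fleet_size=100) == ["a", "b"]


def test_never_drains_more_than_limit():
    failing = [f"host-{i}" for i in range(50)]
    assert len(plan_host_drain(failing, fleet_size=100)) == 10


def test_rejects_empty_fleet():
    try:
        plan_host_drain(["a"], fleet_size=0)
    except ValueError:
        return
    raise AssertionError("expected ValueError for an empty fleet")
```

The second test is the kind that matters most during an incident: it pins down the safety limit, so a later change that silently removes the cap fails in pre-production instead of draining half the fleet.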

4. Verify and Validate

When making production changes in response to an ongoing incident, it is better to be slow and safe than to rush and break things. There are many examples of someone running the wrong command or making a manual change to a system, service, or database during an event and breaking production. Always verify production changes via tests, additional reviews, and approvals before executing them. You can enforce this by mandating that every production change go through a formal peer review or an "over-the-shoulder" review by a second pair of eyes. Taking an extra 30 seconds to verify your work can avert errors that would cause more impact during the incident. After executing a command or change, it is equally important to validate the result by querying or inspecting whether the change behaved as expected.
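One way to make the verify-then-validate habit hard to skip is to wrap changes in a small guard. The following is a sketch under assumptions: `ProductionChange`, `run_change`, and the returned status strings are hypothetical, and a real system would tie the reviewer check into its own approval workflow.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProductionChange:
    description: str
    author: str
    reviewer: Optional[str] = None   # the "second pair of eyes"


def run_change(change: ProductionChange,
               execute: Callable[[], None],
               validate: Callable[[], bool]) -> str:
    """Run a change only after review, then check that it worked."""
    # Verify: refuse to run unless someone other than the author reviewed it.
    if change.reviewer is None or change.reviewer == change.author:
        raise PermissionError(f"'{change.description}' needs a second reviewer")

    execute()

    # Validate: confirm the change actually had the intended effect.
    return "validated" if validate() else "validation_failed"
```

The caller supplies `validate` as a separate query against the system, for example re-reading the value that was just written, so success is observed rather than assumed.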

5. Avoid the "Context Tax"

One of the most common problems in large-scale event handling is the lack of a common understanding of the issue at hand. Every time a new person joins the bridge and asks the same questions about the event, operators have to context-switch from mitigation to explaining what they know about the issue. You can avoid this by keeping a clear written summary of the event in the event tracker and deferring questions to it.

A good summary must include the start time, nature of impact (latency vs. errors), magnitude of impact, scope (partition vs. zonal vs. regional), recovery metrics to track, and active threads with owners and estimated time of completion. This approach helps avoid losing valuable time and allows operators to stay focused on mitigation. Another good hygiene practice is to always post a blurb explaining the relevance of a graph instead of posting the graph without any context. This tells everyone what the data is showing so they can support you more effectively.
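The summary fields listed above map naturally onto a small template. The sketch below is one possible shape, assuming hypothetical class and field names (`IncidentSummary`, `Thread`, `render`). The values in the usage lines are placeholders, not data from a real incident.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Thread:
    topic: str
    owner: str
    eta: str


@dataclass
class IncidentSummary:
    start_time: str
    impact_nature: str            # e.g. latency vs. errors
    magnitude: str
    scope: str                    # e.g. partition, zonal, or regional
    recovery_metrics: List[str]
    threads: List[Thread] = field(default_factory=list)

    def render(self) -> str:
        """Format the summary as plain text for the event tracker."""
        lines = [
            f"Start time: {self.start_time}",
            f"Impact: {self.impact_nature}",
            f"Magnitude: {self.magnitude}",
            f"Scope: {self.scope}",
            "Recovery metrics: " + ", ".join(self.recovery_metrics),
            "Active threads:",
        ]
        lines += [f"  - {t.topic} (owner: {t.owner}, ETA: {t.eta})"
                  for t in self.threads]
        return "\n".join(lines)


# Placeholder values for illustration only.
summary = IncidentSummary(
    start_time="<start time>",
    impact_nature="errors",
    magnitude="<percent of requests failing>",
    scope="zonal",
    recovery_metrics=["<metric to watch>"],
    threads=[Thread("rollback", "<owner>", "<eta>")],
)
print(summary.render())
```

Because every field is required except the thread list, a summary with a missing start time or scope cannot be created at all, which nudges whoever writes it to fill in the gaps before people start asking.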

6. Aggressively Filter Distractions

It is important to stay focused on mitigation during a large-scale event. Large, complex events have many participants, and many of them will have their own theory of what the issue might be. While it’s good to hear different perspectives and consider various possibilities, entertaining every theory is often counterproductive and can make the call go in circles for hours, usually because the theories are not backed by data or evidence. An incident manager must keep the discussion on a logical, data-driven path and track associated investigation threads in a visible document. If a participant proposes a new theory that isn’t backed by data, it should be moved to the backlog of pending action items or investigated separately from the main threads.

7. Drills Over Documentation

Teams typically use documents, training videos, and standard operating procedure (SOP) guides to onboard new members to on-call rotations. While this sounds like a reasonable approach, I have found that new members are more effective if they get hands-on exposure. You can achieve this by shifting your onboarding process to include operational drills alongside training materials. The drills can be simulated in a pre-production environment. During these drills, new on-call members practice mitigating the issue by using the tools, following the SOPs, and executing the escalation process, as they would for production events. Being well-prepared through drills will help operators stay calm and be better equipped to handle real events.

Final Thoughts

Networking components and large-scale distributed systems relying on cloud infrastructure have become critical, foundational components for many software companies. It is essential to ensure high availability and resilience for these components.

We have all read about cloud outages that disrupt day-to-day operations and affect several sectors of the industry that rely on cloud companies. As these systems grow more complex over time, it is important to build discipline in operational hygiene to manage them, especially as the rapid rise in AI adoption multiplies the interdependencies between services. While it is not possible to completely avoid failures, it is critical to have processes and a culture in place to recover quickly and ensure minimal disruption for use cases dependent on cloud technologies. By prioritizing the best practices mentioned above, we move from a reactive mode to a proactive, well-prepared, and more disciplined operational culture.

Published at DZone with permission of Venkat Maithreya Paritala. See the original article here.

Opinions expressed by DZone contributors are their own.
