The Three Pillars of Kubernetes Troubleshooting
Diving into the three pillars of Kubernetes troubleshooting (understanding, management, and prevention) and how they help you conceive of what's needed to properly troubleshoot real-world Kubernetes stacks, the hallmark of complex, distributed systems.
With the ecosystem saturated with tools for monitoring, observability, tracing, logging, and much more, it's hard to understand how troubleshooting differs from these familiar categories.
When incidents occur today, it is increasingly difficult to know where to even start: first understanding what you're up against, then fixing the immediate issue, and finally remediating the root cause.
As a former engineer who worked with modern, complex, distributed systems, I often found that every time there was an issue or incident, it was no simple feat to understand the underlying cause, what triggered it, and who made the change. Even more difficult was figuring out what was going on under the hood, and how to prevent it from happening again.
As developers, we found ourselves putting on our detective hats and using 10-15 different tools to answer:
- What’s actually going on?
- What things are relevant?
- What correlates to this specific symptom we’re trying to troubleshoot?
- How do we identify the root cause?
After all this, once the root cause was identified, we then needed to figure out how to actually fix it. And ultimately, how can we make sure this issue, or a similar one, doesn't happen again in the future?
This is what brought us to think about troubleshooting in the context of three pillars: understanding, management, and prevention.
I'm going to dive into how we envision these three pillars, and how they helped us conceive of what's needed to properly troubleshoot real-world Kubernetes stacks that are the hallmark of complex, distributed systems. I'll also review which ecosystem tools fit into which pillar, to provide a better grasp of the jungle of tools available for seemingly similar needs, and where we saw the gap that led us to build a tool to bridge it at Komodor.
The First Pillar: Understanding
Not surprisingly, this is where 80% of the resources are typically invested. These resources usually enable you to get a grasp on what's happening, what went wrong, and what to do next.
To derive some understanding of what actually happened in the system to trigger the failure, developers start by analyzing recent changes to the system and asking which of them could have caused it.
Of course, this is much easier said than done. In complex distributed systems, and particularly Kubernetes-based systems, finding the needle in the haystack means, among other things:
- using kubectl extensively to troubleshoot deployment logs, traces, and metrics
- verifying pod health, resource caps, and service connections, along with other common pod errors
- checking YAML config files for possible misconfigurations
- validating third-party tooling and integrations
The trigger could be anything from a single line of code, to a configuration change, to a version update.
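As a rough illustration, the kubectl legwork above usually starts with a handful of inspection commands. This is a sketch, not an exhaustive list, and resource names such as `my-app` are placeholder assumptions:

```shell
# A printable cheat sheet of common kubectl inspection commands.
# Resource names (my-app) are placeholder assumptions.
CHEATSHEET='
# What changed? Inspect rollout history and the deployment spec
kubectl rollout history deployment/my-app
kubectl describe deployment my-app

# Pod health, restarts, and resource usage vs. caps
kubectl get pods -l app=my-app -o wide
kubectl top pods -l app=my-app

# Logs and cluster events around the time of the failure
kubectl logs deployment/my-app --tail=100
kubectl get events --sort-by=.metadata.creationTimestamp

# Validate the live YAML against what you intended to deploy
kubectl get deployment my-app -o yaml
'
printf '%s\n' "$CHEATSHEET"
```

Each of these answers one of the questions above: what changed, what is unhealthy, and what the cluster itself observed around the failure.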
[Image: an illustration of the many questions that arise, and the depth of the rabbit hole you need to go down, when troubleshooting K8s systems.]
Next, we'll look at the events, i.e., what's actually happening in the system: Is the system overloaded? Are we losing data? Is there a service breakdown? And how does this relate to the initial change in the system?
We then take a look at the fancy metrics, dashboards, and data that we created for just this very moment, to extract some kind of understanding of what is going wrong, based on tangible data sources. Is more than one system behaving the same way? Is there a dependency in one of the services affecting both systems? And finally, can we learn anything from seemingly similar, previous incidents that will give us some kind of understanding of what we’re going through right now?
Just for some context, below is a sample list of some of the tools you would need to employ just to get a basic understanding of what’s happening under the hood in your systems.
Monitoring Tools: Datadog, Dynatrace, Grafana Labs, New Relic
Observability Tools: Lightstep, Honeycomb
Live Debugging Tools: OzCode, Rookout
Logging Tools: Splunk, LogDNA, Logz.io
The Second Pillar: Management
With today's microservices architectures, interdependent services are often managed by different teams. When there's an incident, one of the primary keys to successful remediation is communication and collaboration between teams, to resolve the issue as quickly as possible. Missing something as basic as a feature flag being added or removed can make or break even a highly successful company. (If you want to sleep well at night, we suggest you DO NOT read the Knight Capital post-mortem.)
Depending on the underlying issue, the actions you may want to take range from simply restarting the system, to more drastic measures such as rolling back a version or reverting recent configurations until there's more clarity about the problem. Eventually, you may need to take proactive measures to increase capacity, in the form of raising memory caps or adding machines. None of this should be something you try to figure out in real time. Today there are plenty of tools, from Jenkins to ArgoCD, to cloud providers' proprietary tools, and even more kubectl, to take these actions and measures.
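To make these options concrete, here is a minimal sketch of a helper that maps each remediation action to the kubectl command that performs it. It prints the command rather than running it, and the deployment name and replica count are assumptions for illustration:

```shell
# Print (rather than execute) the kubectl command for a given remediation
# action. Deployment names and replica counts are placeholder assumptions.
remediate() {
  action="$1"; deploy="$2"
  case "$action" in
    restart)   echo "kubectl rollout restart deployment/$deploy" ;;
    rollback)  echo "kubectl rollout undo deployment/$deploy" ;;
    scale-up)  echo "kubectl scale deployment/$deploy --replicas=5" ;;
    *)         echo "unknown action: $action" >&2; return 1 ;;
  esac
}

remediate restart  my-app   # simplest first step
remediate rollback my-app   # more drastic: revert the last rollout
remediate scale-up my-app   # proactive: add capacity
```

The point of the indirection is exactly the one made above: the mapping from "kind of problem" to "command to run" should be decided ahead of time, not improvised mid-incident.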
Once the underlying issue is better understood, remediation shouldn't consist of ad hoc operations that are mostly trial and error, or partially documented playbooks that live only in the minds and practices of the current team. Based on the technologies in the stack and the probable root cause, customized runbooks should be used to manage any given incident, with concrete tasks and actions for each kind of alert.
This is also a good way to eliminate the single point of failure that is the one seasoned engineer who built the system from the ground up, and who knows how to troubleshoot it based on an unwritten oral tradition. A good runbook can be leveraged by every engineer on the team, senior or junior, to troubleshoot in real time.
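As a sketch of what such a runbook might look like for, say, a CrashLoopBackOff alert, here is a dry-run-by-default shell script. Every step, label, and resource name here is an assumption for illustration, not a prescription:

```shell
# Sketch of a runbook for a CrashLoopBackOff alert. With DRY_RUN=1 (the
# default) each step is printed instead of executed, so any engineer,
# senior or junior, can review the canonical order of actions first.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = "1" ]; then echo "[dry-run] $*"; else "$@"; fi
}

# Step 1: confirm which pods are crash-looping (label is a placeholder)
run kubectl get pods -l app=my-app
# Step 2: pull logs from the previous, crashed container
run kubectl logs deployment/my-app --previous --tail=100
# Step 3: check pod events for OOM kills or failed probes
run kubectl describe pod -l app=my-app
# Step 4: if the latest rollout introduced the crash, roll it back
run kubectl rollout undo deployment/my-app
```

Writing the steps down in an executable, reviewable form is what turns the oral tradition into something the whole team can run.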
This phase’s toolkit would include some of the following:
Incident Management: PagerDuty, Kintaba
Project Management: Jira, Monday, Trello
CI/CD Management: ArgoCD, Jenkins
The Last Pillar: Prevention
Prevention is likely the most important pillar, as it ensures similar incidents don't recur. The way to prevent similar issues is by creating well-defined policies and rules based on each and every incident. What actions should be taken in the "understanding" phase? How do we most quickly identify the issue and escalate its management to the relevant teams?
How do we delegate responsibility and ensure frictionless communication and collaboration between teams? This includes full transparency into the tasks and operations at hand, along with real-time updates on progress. What is the canonical order of tasks for each kind of alert and incident?
Once we've figured out all of the above, we can start to think about how to automate and orchestrate these incident responses, and get as close as possible to the fabled "self-healing" systems.
This pillar is characterized by tooling that serves to create systems that are more resilient and adaptive to change, by constantly pushing them to their limits. For example:
Chaos Engineering: Gremlin, Chaos Monkey, ChaosIQ
Auto Remediation: Shoreline, OpsGenie
Bringing It All Together
We believe that these three pillars combined are what differentiate troubleshooting from monitoring, observability, tracing, and the rest. Troubleshooting takes that visualization and understanding (as deep and comprehensive as it may be) to actual execution and remediation. But probably most important is embedding the learning into the systems and processes, to prevent similar incidents from happening again.
We've all seen the memes. The reason they're still so widely shared, and so popular, is that even with all of the progress we have made with "DevOps tools," when it comes to real-time issues and incidents, they many times still hold true.
Centralizing both application and operations data into one platform enables teams to gain a true understanding of their systems, and ultimately to act upon specific alerts that correlate with specific changes made in a very complex system. Pre-defined runbooks with concrete tasks and remediations alleviate the pressure of having to figure it all out in real time, in a high-pressure atmosphere. When we bring the best of dev and ops together, we can solve incidents much more rapidly, by being better together.
Opinions expressed by DZone contributors are their own.