Kubernetes #Fails
Typically, these failures are a function of a lack of knowledge and skill, highly complex technology, and insufficient planning for security and day-two operations.
To understand the current and future state of Kubernetes (K8s) in the enterprise, we gathered insights from IT executives at 22 companies. We asked, "What are the most common failures you see with K8s?" Here’s what we learned:
Skill
- Trying to do it all by themselves, at scale, in the enterprise. People start to do it on their own, and when they try to put it into production and scale it, they run into issues: thousands of containers across multiple geographic regions. Problems happen when you scale and cannot keep up with all of the necessary work, including day-two operations like upgrades. K8s talent is very difficult to find, and even when you can find it, it’s hard to retain. Hire a large platform team, train them, and keep them on board.
- When you go past the early adopters, you get to the next stage of enterprises coming from the vSphere or OpenStack world, and they do not have the skills to manage the complexity. Abstract away the infrastructure so they can orchestrate with languages they are familiar with.
- A lot of it is skill. Make sure you have the right configurations in place, and rely less on human engineers hand-editing configuration versus Helm Charts, which act like a scripting language within K8s to simplify config (see the sketch after this list). As more tools become available it will get easier. Operators will simplify it even more, but you’ll need orchestration for operators.
- 1) A common challenge is how to level up quickly so your team can be productive. 2) Underestimating the complexity of installing and operating K8s: what happens with new releases every three months, the operational impact, and keeping it current. 3) Taking a monolithic app and turning it into something that will resonate well with K8s.
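To make the Helm point above concrete, here is a minimal sketch of how a chart centralizes configuration so that engineers edit a small values file instead of hand-maintaining raw manifests. The chart layout, image name, and values are hypothetical and only illustrate the pattern.

```yaml
# values.yaml -- illustrative defaults that can be overridden per environment
replicaCount: 3
image:
  repository: registry.example.com/web   # hypothetical image
  tag: "1.4.2"
```

```yaml
# templates/deployment.yaml -- Helm renders this from the values above
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-web
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      app: {{ .Release.Name }}-web
  template:
    metadata:
      labels:
        app: {{ .Release.Name }}-web
    spec:
      containers:
        - name: web
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
```

Rolling out a new version then becomes a one-value change (for example, `helm upgrade <release> <chart> --set image.tag=1.4.3`) rather than an edit across hand-written manifests.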
Complexity
- People often think K8s can do more than it’s actually designed to do. The concept of easy elasticity inside an environment wasn’t part of the original K8s stack; when those pieces come into play, they become part of the monitored infrastructure and you need to keep track of them. The K8s ecosystem has a couple of hundred products in it, and people assume they work together seamlessly, but they do not. We help customers know which pieces work together and understand how they work together.
- People have a great first experience and are then surprised by the complexity when they put it into production and have to put security or monitoring in place, and they don’t have the tools to debug and fix. We provide guidance and a checklist for design and production: have the right monitoring and debugging in place. Another failure comes when attempting an upgrade without testing on a test cluster first. We provide guidance on an upgrade strategy and run workloads on our internal test clusters so customers have confidence when running their own.
- The most common failures I see are due to a lack of understanding of how K8s functions; K8s itself is pretty solid. Some examples of misuse include: 1) Incorrect network rules, which can cause communication issues between master and worker nodes and consequently cause your entire cluster to restart. 2) Not updating your kubectl tool along with your configs when K8s versions update, which will stop you from being able to manage your containers.
Security
- Complexity and the learning curve. People want to use the features of K8s, and the biggest failure is wanting to use K8s when they don’t need to. You should have a good understanding of the complexity and ongoing management, and you can’t do that if you're small. K8s adds overhead for the development cluster: creating a Dockerfile takes a while, and it’s hard to get a quick feedback loop. Skaffold helps developers, but it’s often a lot faster to just run services locally outside containers. The second area is security, where we see failures from new vulnerabilities weekly that require patches. The third is networking: the K8s networking story needs to be fleshed out, Ingresses are immature constructs, and you have to dig into the code to track down functionality errors.
- Securing access to the cluster. We solve security, authentication, authorization, and rate limiting. Check out these K8s failure stories.
- Enterprises can find it challenging to implement effective security solutions for K8s and container environments. Similarly, enterprises can find it difficult to recruit experts with the depth of K8s and DevOps knowledge required to create proper tools and implement application workflows in containerized environments. Unfortunately, this can leave enterprises unable to implement all the security measures needed to properly protect their K8s-orchestrated environments.
- 1) Most often, we see operational and security failures that arise because teams don’t implement any policy around the creation of external load balancers/ingresses. There is a lot of thought given to which images can be deployed (and to scanning for vulnerabilities before and during runtime), but many companies have a blind spot around just how easy it is to accidentally steal traffic from one workload and send it to another. The best-case scenario in this common failure is downtime; the worst case is that sensitive data gets exposed to the Internet and no one notices, because there is no alert and no notification. 2) Similarly, we see failures come from human error or accidental oversight when it comes to naming and labeling policies. K8s labels are extremely powerful for defining what can happen downstream. Things like network policy, privacy policy, limiting privilege, and authorization can all be keyed off of labels, which means that when a label is missing, things can break. 3) We also see surprising failures resulting from mundane issues: something as simple as workload images tagged :latest, which prevents effective rollback, or a lack of specified resources that allows workloads to overtake their pod or cluster and push neighboring workloads onto under-specified systems. K8s is immensely powerful, and as such it needs strong guardrails to direct and control that power (a minimal example follows this list).
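As a concrete illustration of that last point, here is a minimal, hypothetical Deployment that bakes in those guardrails: a pinned image tag instead of :latest so rollbacks are meaningful, explicit resource requests and limits so one workload cannot starve its neighbors, and labels that downstream network and authorization policy can key off. All names, labels, and sizes are illustrative.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api                 # hypothetical workload
  labels:
    app: payments-api
    team: payments                   # policy tooling can select on labels like these
spec:
  replicas: 2
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
        team: payments
    spec:
      containers:
        - name: api
          image: registry.example.com/payments-api:1.7.3   # pinned tag, never :latest
          resources:
            requests:                # what the scheduler reserves for this pod
              cpu: 250m
              memory: 256Mi
            limits:                  # hard ceiling that protects neighboring workloads
              cpu: 500m
              memory: 512Mi
```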
Day-Two Operations
- 1) Humans who roll out a change in code and cause cascading failures. 2) The complexity of the platform. K8s is pretty solid and applications are reliable, but the platform itself needs to start and stop in the right order, navigate the sea of TLS certificates, and manage logs from multiple services. It isn’t as easy as it could be to manage. You don’t need a dedicated ops team anymore, but the teams do need to be experts in managing with K8s. You can also build better tooling around it.
- A common K8s failure scenario is when the cluster infrastructure hardware fails to satisfy the container startup policies. Since K8s provides a declarative way to deploy applications and those policies are strictly enforced, it's critical that the declared and desired container states can be met by the allocated infrastructure; otherwise, a container will fail to start (see the sketch after this list). Other areas where failures can occur, or cause concern, are deploying persistent storage, properly monitoring and alerting on failure events, and deploying applications across multiple K8s clusters. While the promise of orchestrated container environments holds great potential, there are still several areas that require careful attention to reduce the occurrence of failures and issues when deploying these systems.
- The first big failure mode is not budgeting for maintenance. Just because K8s hides a lot of details from application developers doesn’t mean that those details aren’t there. Someone still needs to be allocating time for upgrades, setting up monitoring, and thinking about provisioning new nodes. The second big failure mode is that teams move so fast that they forget that adopting a new paradigm for orchestrating services means they need to rethink observability for those services as well. Not only is a move to K8s often coincident with a move to microservices (which necessitates new observability tools), but pods and other K8s abstractions are often shorter-lived than traditional VMs, meaning the way telemetry is gathered from the application also needs to change. The solution is to build in observability from day one, from instrumenting code in such a way that you can take both a user-centric and an infrastructure-centric view of requests, to how instrumentation data is transmitted and aggregated, to how that data is analyzed. Waiting until you’ve experienced an outage is too late to address these issues, as it will be too late to get the data you need to understand and remediate that outage.
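To illustrate the declared-versus-available mismatch described above: if a pod declares resources that no node can provide, the scheduler simply refuses to start it. The pod below is a hypothetical example sized beyond a small cluster's nodes.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: analytics-batch              # hypothetical pod
spec:
  containers:
    - name: worker
      image: registry.example.com/analytics:2.0.1
      resources:
        requests:
          memory: 128Gi              # more than any node in a small cluster can offer
          cpu: "32"
```

The pod will sit in Pending, and `kubectl describe pod analytics-batch` will show a FailedScheduling event (for example, "Insufficient memory") instead of a running container, which is exactly the kind of declared policy the infrastructure failed to satisfy.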
Other
- A fundamental issue is moving a container deployment from an experimental, pilot, or playground environment into an enterprise-ready, stable, operationalized environment where you measure availability and reliability and have to be concerned about compliance, backups, and disaster recovery. An environment created as a playground may not qualify for a mission-critical enterprise setting, which may lead to a requirement to rearchitect the container environment and the fundamental infrastructure behind it.
- 1) We are still at the phase in the industry where K8s and the supporting technologies are early enough in their lifecycle that it’s challenging to deploy and run them reliably at scale. People are struggling just to deploy K8s and make it work reliably; they are aware of the benefits, but they have trouble getting it to work and give up because it’s too hard for them. 2) It's problematic when K8s and containers are introduced to the organization without first making sure the software can be modularized enough to support a K8s workflow. An executive decision to use K8s without the dev team on board is sure to fail. The dev team needs to be structured in the proper way so that independent teams of developers will embrace K8s.
- We monitor K8s clusters and all the elements therein to make sure everything works and runs as intended. Number one is out-of-memory errors: under-provisioned containers and overused memory. You see container restarts all over the place without anything telling you it’s broken, and persistent volumes and failed jobs go unreported even though they are a problem K8s will not fix on its own (a sample alert rule is sketched after this list).
- The primary issues I see are related to storage and networking, especially DNS. You need to understand that K8s relies on having control of DNS to effectively handle rescheduling with failover for a hostname and some of its service discovery. While built-in DNS is fine for applications in the cluster, if you’re crossing a network boundary you should ensure you have appropriately configured DNS so that it delegates a routable zone within your network (one common pattern is sketched after this list).
- When migrating to K8s, the most serious consequence typically comes from a failed strategy rather than a technical one. Of course, there will be technical issues that might cause application downtime, with pods crashing or severe performance degradation due to an unoptimized pod placement strategy or incorrect load balancing. These are inevitable when adopting new and recent technology. But the goal should be to ensure minimal disruption to the business and your customers, which requires a proper risk assessment of the migration plan. Migrating your entire stack to K8s in one go will be an eventual failure. A better approach is to identify the most suitable component to migrate, with minimal impact if something goes wrong, learn from it, and migrate incrementally with a clear long-term architectural runway and objectives tied to an end goal.
- Fragmentation and cloud vendor lock-in. While some workloads, especially greenfield applications, may be running well in K8s, it may still be hard to run legacy applications. Creating a two-speed highway within the organization can cause fragmentation over time. Having a strategy to reconcile this fragmentation across traditional environments and containers simultaneously is very important. You must also avoid cloud vendor lock-in at all costs when adopting a multi-cloud strategy.
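For the monitoring gap described above (container restarts and OOM kills that nobody is told about), one common mitigation is an alert on restart counts. The rule below is a minimal sketch that assumes the Prometheus Operator and kube-state-metrics are installed; the names and thresholds are illustrative.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-restart-alerts           # hypothetical rule name
  namespace: monitoring
spec:
  groups:
    - name: kubernetes-pods
      rules:
        - alert: PodRestartingTooOften
          # kube-state-metrics exposes this restart counter per container
          expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.namespace }}/{{ $labels.pod }} is restarting repeatedly"
```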
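On the DNS point above, one common pattern when workloads need to resolve names across a network boundary is to have CoreDNS forward queries for an internal zone to the resolvers that own it. This is a sketch assuming a stock CoreDNS deployment; the zone name and resolver address are hypothetical, and the block would be added alongside the default `.:53` server block in the existing Corefile, not in place of it.

```yaml
# Excerpt of the coredns ConfigMap in kube-system: an extra server block
# that forwards a corporate zone to an internal resolver.
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    corp.example.internal:53 {
        errors
        cache 30
        forward . 10.0.0.53          # hypothetical corporate DNS server
    }
```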
Here’s who shared their insights:
- Dipti Borkar, V.P. Product Management & Marketing, Alluxio
- Matthew Barlocker, Founder & CEO, Blue Matador
- Carmine Rimi, Product Manager Kubernetes, Kubeflow, Canonical
- Phil Dougherty, Sr. Product Manager, DigitalOcean
- Tobi Knaup, Co-founder and CTO, D2iQ
- Tamas Cser, Founder & CEO, Functionize
- Kaushik Mysur, Director of Product Management, Instaclustr
- Niraj Tolia, CEO, Kasten
- Marco Palladino, CTO & Co-founder, Kong
- Daniel Spoonhower, Co-founder and CTO, LightStep
- Matt Creager, Co-founder, Manifold
- Ingo Fuchs, Chief Technologist, Cloud & DevOps, NetApp
- Glen Kosaka, VP of Product Management, NeuVector
- Joe Leslie, Senior Product Manager, NuoDB
- Tyler Duzan, Product Manager, Percona
- Kamesh Pemmaraju, Head of Product Marketing, Platform9
- Anurag Goel, Founder & CEO, Render
- Dave McAlister, Community Manager & Evangelist, Scalyr
- Idit Levine, Founder & CEO, Solo.io
- Edmond Cullen, Practice Principal Architect, SPR
- Tim Hinrichs, Co-founder & CTO, Styra
- Loris Degioanni, Founder & CTO, Sysdig