Fixing Kubernetes FailedAttachVolume and FailedMount Errors on EBS

Do you love the smell of FailedAttachVolume and FailedMount errors with Kubernetes on AWS EBS? No? Oh. Well, here's what's probably causing them (and how to fix them).

By Kai Davenport · Dec. 17, 17 · Tutorial · 41.4K Views


This blog is part of a new series on debugging Kubernetes in production. Portworx has worked with customers running all kinds of apps in production, and one of the most common classes of errors we see relates to failed attach and failed mount operations on an AWS EBS volume. This post will show you how to resolve FailedAttachVolume and FailedMount warnings in Kubernetes and how to avoid the issue in the future.

Background

As we described in our blog post about stuck EBS volumes and containers, using one EBS volume for each container creates a very fragile system. When we create a 1-to-1 relationship between our EBS drives and containers, a variety of problems can occur:

  • The API call to AWS fails
  • The EBS drive cannot be unmounted or detached from the old node
  • The new node already has too many EBS drives attached
  • The new node has run out of mountpoints

When there is a problem with this process, you will typically see errors like the following:

Warning FailedAttachVolume  Pod 109 Multi-Attach error for volume "pvc-6096fcbf-abc1-11e7-940f-06c399d05922" 
Volume is already exclusively attached to one node and can't be attached to another
Warning FailedMount Pod 1 AttachVolume.Attach failed for volume "pvc-6096fcbf-abc1-11e7-940f-06c399d05922" : 
Error attaching EBS volume "vol-03ea2cb51f21f9fac" to instance "i-0e8e0bbf7d97a15df": 
IncorrectState: vol-03ea2cb51f21f9fac is not 'available'.
  status code: 400, request id: 41707341-e239-4808-846f-8f9d19fd1563


Specifically, there are FailedAttachVolume and FailedMount errors; attaching and mounting are the two essential steps for properly starting up a stateful container that uses an EBS volume.

This post will look in detail at why these errors occur, how to resolve the error at present, and how to avoid the issue entirely in the future. At this point, though, we can summarize the issue as follows:

When something happens that requires a pod to be rescheduled to a different node in the cluster, and the unmount and detach operations were not possible before the host became unavailable, you will not be able to attach the volume to a new host.

In our experience, 90% of EBS issues with Kubernetes come down to this: you can’t start up a pod on an EC2 instance because its EBS volume is still attached to some other (potentially broken) host.
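
A quick way to see where things are stuck is to read the pod’s events and then ask AWS what state the volume is actually in. The pod name and volume ID below are taken from the output above; substitute your own:

$ kubectl describe pod mysql-app-397313424-9v0q6

$ aws ec2 describe-volumes \
 --volume-ids vol-03ea2cb51f21f9fac \
 --query 'Volumes[0].{State:State,Attachments:Attachments}'

If the volume reports in-use and is attached to an instance that is gone or broken, you are looking at exactly the failure mode described in this post.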

Manual Setup

Let’s simulate the Kubernetes process by doing the operations manually.

First, we create an EBS volume using the AWS CLI:

$ aws ec2 create-volume --size 100 --availability-zone eu-west-1a
vol-867g5kii


This gives us a VolumeId, and we have three EC2 instances we could use it on. But before we can do that, we must attach the EBS volume to a specific node; otherwise, it is unusable. We use the VolumeId, pick an instance from our pool, and perform an attach operation:

$ aws ec2 attach-volume \
 --device /dev/xvdf \
 --instance-id instance-3434f8f78 \
 --volume-id vol-867g5kii


Note that the attach-volume command can be run from any computer (even our laptop) – it’s only an AWS API call.

Now that AWS has attached the EBS volume to our node, it will be viewable on that node at /dev/xvdf (or whatever device path we gave in the attach-volume command).
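
To double-check that the attach completed, we can ask AWS for the volume state and list the block devices on the node (note that on NVMe-based instance types the device shows up under a different name than the one requested):

$ aws ec2 describe-volumes --volume-ids vol-867g5kii --query 'Volumes[0].State'
"in-use"

$ ssh admin@instance-3434f8f78 lsblk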

The next step is to mount the EBS volume so we can start to write files to it, so we SSH onto the node:

$ ssh admin@instance-3434f8f78
$ sudo mkfs.ext3 /dev/xvdf
$ sudo mount /dev/xvdf /data


Note that the mount command must be run from the node itself.

We can then start writing files to /data. Because we had not used this volume before, we first formatted the drive with an ext3 filesystem.
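
One caveat worth adding here: mkfs will happily wipe a volume that already holds data, so during a failover you only ever mount, never format. A quick check before formatting looks like this (file -s reports simply "data" for a device with no filesystem):

$ sudo file -s /dev/xvdf
/dev/xvdf: data

$ sudo mkfs.ext3 /dev/xvdf   # safe – the device has no filesystem yet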

Manual Failover

What happens if the node our EBS volume is attached to fails? Let’s walk through the steps we would need to perform:

  • Realize we cannot unmount the volume from the failed node
  • Force detach the volume using the AWS API: $ aws ec2 detach-volume --volume-id vol-867g5kii --force
  • Attach and mount the volume to a healthy node (a full sketch of this sequence follows below)
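
Putting those steps together, a rough sketch of the manual failover looks like this (the healthy instance ID is a placeholder; the wait command simply blocks until AWS reports the volume as available):

$ aws ec2 detach-volume --volume-id vol-867g5kii --force
$ aws ec2 wait volume-available --volume-ids vol-867g5kii
$ aws ec2 attach-volume \
 --device /dev/xvdf \
 --instance-id instance-healthy-node \
 --volume-id vol-867g5kii
$ ssh admin@instance-healthy-node
$ sudo mount /dev/xvdf /data   # no mkfs this time – the data is already on the volume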


Error Cases

Kubernetes automates a lot of this process – in a previous blog post we showed you how Kubernetes manages persistent storage for your cluster.

With that background out of the way, let’s look at some failures that can occur during daily operations.

Warning FailedAttachVolume

The Warning FailedAttachVolume error occurs when an EBS volume can’t be detached from an instance and thus cannot be attached to another. This happens because Kubernetes will not force detach EBS volumes from nodes – the EBS volume has to be in the available state to be attached to a new node.

In other words, Warning FailedAttachVolume is usually a symptom of an underlying failure to unmount and detach the volume from the failed node.

You can see in the Kubernetes codebase that this error is generated when Kubernetes attempts to attach the volume to a node but it is already attached to an existing node.
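
On reasonably recent versions of kubectl, you can also filter the event stream for exactly these warnings rather than scrolling through everything:

$ kubectl get events --field-selector reason=FailedAttachVolume
$ kubectl get events --field-selector reason=FailedMount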

Warning FailedMount

The FailedMount error follows from the first: because we were unable to attach the EBS volume to the new host, we are also, by definition, unable to mount that volume on the host.

You can see one of the examples where this error is generated in the Kubernetes codebase.

Common Failure Modes That Cause EBS Problems on Kubernetes

There are a number of scenarios that can cause these problems:

  • Network partition
  • Docker crashing
  • Forced cordon/reschedule
  • Failed EC2 node

Let’s take a look at these failure scenarios and see how Kubernetes, using EBS, copes with them.

Note: These errors are taken from Kubernetes version v1.7.4

Network Partition

The network is one of the major components of a distributed system that can go wrong, and a partition is a classic way to end up with the Warning FailedAttachVolume and Warning FailedMount messages after a failure.

We can simulate a network partition by using iptables on one of our nodes to drop traffic to the Kubernetes API server (a sketch follows below). When the network is cut, the kubelet is unable to communicate its status back to the Kubernetes master, the node is dropped from the cluster, and all of its pods are rescheduled to other hosts.
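
A rough sketch of that simulation, assuming the API server listens on port 6443 (adjust the port for your cluster):

# On the node we want to partition: drop outbound traffic to the API server
$ sudo iptables -A OUTPUT -p tcp --dport 6443 -j DROP

# From a working machine: watch the node go NotReady and the pods get rescheduled
$ kubectl get nodes -w
$ kubectl get pods -o wide -w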

Once our stateful pod is scheduled to another node, the controller will attempt to attach the EBS volume to this new, healthy node. However, as we have discussed, AWS views the volume as currently attached to the old node and Kubernetes will not force detach. This will lead to the following warning events (using $ kubectl get events):

$ kubectl get ev -o wide

LASTSEEN   FIRSTSEEN   COUNT     NAME                        KIND      SUBOBJECT   TYPE      REASON                  SOURCE
0s         11s         105       mysql-app-397313424-9v0q6   Pod                   Warning   FailedAttachVolume      attachdetach   
Multi-Attach error for volume "pvc-6096fcbf-abc1-11e7-940f-06c399d05922" 
Volume is already exclusively attached to one node and can't be attached to another
2m         2m          1         mysql-app-397313424-pq8m2   Pod                   Warning   FailedMount             attachdetach   
AttachVolume.Attach failed for volume "pvc-6096fcbf-abc1-11e7-940f-06c399d05922" : 
Error attaching EBS volume "vol-03ea2cb51f21f9fac" to instance "i-0e8e0bbf7d97a15df": 
IncorrectState: vol-03ea2cb51f21f9fac is not 'available'.
           status code: 400, request id: 41707341-e239-4808-846f-8f9d19fd1563


Docker Daemon Crash or Stop

When the Docker daemon crashes or stops, the pods running on the host don’t stop. Kubernetes will nevertheless reschedule them to other hosts because the kubelet cannot report the status of those containers.

This can be tested by running $ sudo systemctl stop docker on one of the nodes.

This leads to a similar sequence of errors as a network partition:

$ kubectl get ev -o wide

LASTSEEN   FIRSTSEEN   COUNT     NAME                        KIND      SUBOBJECT   TYPE      REASON        SOURCE
0s         8s        79          mysql-app-397313424-vtxk2   Pod                   Warning   FailedAttachVolume      attachdetach                                           
Multi-Attach error for volume "pvc-6fb2d9ff-b991-11e7-8e89-060bb7549b64" 
Volume is already exclusively attached to one node and can't be attached to another
2m         2m          1         mysql-app-397313424-8xk0h   Pod                   Warning   FailedMount   attachdetach        
AttachVolume.Attach failed for volume "pvc-6fb2d9ff-b991-11e7-8e89-060bb7549b64" : 
Error attaching EBS volume "vol-07c7135dc55da5c5f" to instance "i-0a0f9a906e1a2b8eb": 
IncorrectState: vol-07c7135dc55da5c5f is not 'available'.
           status code: 400, request id: ae905421-2fe7-44ae-accc-20cb0acd1b81


Update Affinity Settings, Forcing a Reschedule and Cordon

One of the most powerful features of Kubernetes is the ability to assign pods to particular nodes. This is typically referred to as affinity. A simple way to implement affinity is with nodeSelector – basically a key-value label that describes a node.

$ kubectl label nodes kubernetes-foo-node-1.c.a-robinson.internal disktype=ssd


If you place a nodeSelector in your pod spec, Kubernetes will respect that label when making scheduling decisions. When this affinity setting is updated, your pods are rescheduled onto nodes that fit the nodeSelector criteria.

This also applies to the cordon command, which marks a node as unschedulable so that no new pods will be placed on it.

In both cases, changing these values causes Kubernetes to reschedule the pod to another node.
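
A minimal sketch of both cases, reusing the disktype=ssd label from above (the pod name and image are placeholders):

$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: ssd-test
spec:
  nodeSelector:
    disktype: ssd
  containers:
  - name: app
    image: nginx
EOF

# Cordoning marks the node unschedulable; any rescheduled pod must land elsewhere
$ kubectl cordon kubernetes-foo-node-1.c.a-robinson.internal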

Manually Unpicking the Stuck EBS Problem

In our tests, Kubernetes v1.7.4 suffers from the problems described above: because Kubernetes will not force detach the volume, the volume gets stuck and cannot be attached to the new node.

If your Kubernetes cluster has this problem, the key to resolving it is to force detach the volume using the AWS CLI:

$ aws ec2 detach-volume --volume-id vol-867g5kii --force


This will move the volume into an available state, which will enable Kubernetes to proceed with the attach operation for the newly scheduled node.
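
It is worth confirming that the detach actually went through before expecting Kubernetes to recover; the wait command blocks until AWS reports the volume as available:

$ aws ec2 wait volume-available --volume-ids vol-867g5kii
$ kubectl get events -w   # the FailedAttachVolume warnings should stop and the pod should start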

EC2 Instance Failure

With a node failure, we have an interesting case to consider. We simulated the node failure with $ aws ec2 terminate-instances, which means AWS does two things:

  • Powers down and removes the EC2 instance.
  • Marks any attached EBS volumes to be force detached

So it’s important to note that, whilst our node failure test worked, it only worked because the test itself does not capture what would happen in a true node failure, such as:

  • power cut
  • kernel panic
  • reboot

In the situations described above, neither Kubernetes nor AWS can determine that the EBS volume should be forcibly detached. After all, it is better to err on the side of caution when it comes to production disks – force detach sounds scary!

Back to the instance failure test itself. Because the node is lost, Kubernetes will reschedule our pods to other nodes, the same as above. But because we simulated the node failure with $ aws ec2 terminate-instances, AWS issues what is essentially a force detach.

This means the EBS volume becomes available, which unblocks the controller code that is constantly checking the volume’s status, allowing it to proceed and attach the volume to the new node:

$ kubectl get ev -o wide

COUNT   NAME    TYPE      REASON                  SOURCE
1       p59jf   Normal    SuccessfulMountVolume   kubelet, ip-172-20-52-46.eu-west-1.compute.internal   
        MountVolume.SetUp succeeded for volume "pvc-5cacd749-abce-11e7-8ce1-0614bbc11108"
7       p59jf   Warning   FailedMount             attachdetach                                          
        AttachVolume.Attach failed for volume "pvc-5cacd749-abce-11e7-8ce1-0614bbc11108" : error finding instance ip-172-20-52-46.eu-west-1.compute.internal: instance not found
1       q0zt7   Normal    Scheduled               default-scheduler                                     
        Successfully assigned mysql-app-397313424-q0zt7 to ip-172-20-40-96.eu-west-1.compute.internal
1       q0zt7   Normal    SuccessfulMountVolume   kubelet, ip-172-20-40-96.eu-west-1.compute.internal   
        MountVolume.SetUp succeeded for volume "default-token-xwxx7"


What you can see here is:

  • The volume mounting successfully to 172-20-52-46
  • The call to AWS deleting the instance, which triggers a detach asynchronously
  • The volume controller failing repeatedly to confirm the attachment to 172-20-52-46
  • The scheduler moving the pod to 172-20-40-96
  • The volume mounting successfully to 172-20-40-96 because AWS has detached it for us

This only works if you have a readiness probe for your pod so that the backend AWS API has a chance to update the state of the volume to available — this test failed when we removed the readiness probe.
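
For reference, this is the kind of readiness probe we mean – a minimal, hypothetical MySQL pod with a TCP check on port 3306 (the claim name, credentials, and timings are placeholders):

$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: mysql-app
spec:
  containers:
  - name: mysql
    image: mysql:5.7
    env:
    - name: MYSQL_ROOT_PASSWORD
      value: password
    readinessProbe:
      tcpSocket:
        port: 3306
      initialDelaySeconds: 15
      periodSeconds: 10
    volumeMounts:
    - name: data
      mountPath: /var/lib/mysql
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: mysql-data
EOF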

Cloud-Native Storage Approach

There are some failure scenarios above that could possibly be mitigated by updating the Kubernetes EBS driver codebase. However, the problem is one of fundamental architecture.

In reality, EBS disks are not agile entities that should be moved around the cluster. Alongside the FailedAttachVolume and FailedMount errors, the following problems can occur:

  • The API call to AWS fails
  • The EBS drive cannot be unmounted or detached from the old node
  • The new node already has too many EBS drives attached
  • The new node has run out of mountpoints

The real utility of EBS is that we can fail over without losing data. However, we lose performance, because EBS is network-attached. AWS offers EC2 instances with very fast local SSDs, but we cannot use those in our failover scenario because instance-store data does not outlive the instance.

How Portworx Handles It

When you use Portworx as your Kubernetes storage driver on AWS, this problem is solved: once attached, an EBS volume stays attached and is never moved to another node.

If a node fails, the EBS volume fails with it and should be deleted. This really embraces the immutable infrastructure aspect of a cloud-native approach to a system.

The question remains – how can we ensure no data loss in the event of a failover? The answer lies in the entirely different way Portworx consumes the underlying EBS drives, using synchronous block-layer replication.

Decouple Storage From Containers

Portworx takes a different approach – it pools the underlying EBS drives into a storage fabric. This means containers get virtual slices of the underlying storage pool on demand. It is able to replicate data to multiple nodes and work alongside the Kubernetes scheduler to ensure containers and data converge efficiently.

With Portworx, EBS drives are created, attached, mounted and formatted once and then join the storage pool. Thousands of containers could be started using the same number of EBS drives because Portworx decouples the underlying storage from the container volumes.

Portworx consumes the underlying storage but decouples the actual drives from the volumes it presents to containers. Because the data is replicated and volumes are carved from a shared pool, the failover scenario we saw earlier becomes much simpler (and therefore less error-prone).
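
As a hedged illustration (the parameter names follow what we understand the in-tree Portworx provisioner to accept – check the Portworx docs for your version), a replicated volume class and a claim against it might look like:

$ cat <<EOF | kubectl apply -f -
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: portworx-repl-2
provisioner: kubernetes.io/portworx-volume
parameters:
  repl: "2"
  fs: "ext4"
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: mysql-data
spec:
  storageClassName: portworx-repl-2
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
EOF

Because the claim is backed by the replicated pool rather than by one specific EBS drive, the pod can start on any node where Portworx runs.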

We are no longer enforcing a 1-to-1 relationship between EBS drives and containers, so the sequence of...

  • detach the block device from unresponsive old node
  • attach the block device to the new node
  • mount the block device to the new container

...is no longer needed — the target EBS drive is already attached and already has the data!

Conclusion

To resolve FailedAttachVolume and FailedMount errors:

  • Don’t have a one-to-one mapping of EBS volume per Docker container
  • Instead, mount EBS volumes onto EC2 instances and leave them there
  • Carve up that volume into multiple virtual volumes using software-defined storage
  • These volumes can be instantly mounted to your Kubernetes pods and their containers

That way, you will avoid the “Warning FailedAttachVolume” and “Warning FailedMount” errors that come with Kubernetes and EBS.


Published at DZone with permission of Kai Davenport, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
