Today I received an email from Amazon of a kind I had never seen before. Here it is in its entirety:
We have noticed that one or more of your instances is running on a host degraded due to hardware failure.
The risk of your instances failing is increased at this point. We cannot determine the health of any applications running on the instances. We recommend that you take appropriate action.
For more options to stop and start your instance please see:
If your instance was launched from an instance store-backed AMI, you should launch a replacement instance from your most recent AMI and migrate all necessary data to the replacement instance.
Should you have any additional questions, we offer AWS Basic Support via our Community Forums for free, or Premium Support for one-on-one assistance direct from an AWS Developer Support Engineer at http://aws.amazon.com/support.
The Amazon EC2 Team
So I dutifully followed the instructions and stopped and then started (not just rebooted) the specified instance using the EC2 Web Management Console.
PROBLEM: The instance came back up as expected in the Web Management Console, but I could not ping it, SSH to it, or connect to it in any other way using my DNS name. I could, however, connect to it using the Amazon-assigned public DNS name. It took me a few minutes to figure out (all the while my site was down, of course), but I eventually noticed that the Elastic IP address assigned to that instance was no longer shown in the instance details view. I went over to the Elastic IP management screen and, sure enough, that Elastic IP address was shown as not associated with any instance. I reassociated the Elastic IP address with the instance and, a few moments later, everything was back up and running.
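The manual fix above (check whether the Elastic IP is still attached, and reattach it if not) can also be scripted. What follows is only a minimal sketch, not what I actually ran: the function name `ensure_elastic_ip` is made up for illustration, and it assumes you pass in an object that behaves like a boto3 EC2 client, meaning it provides `describe_addresses` and `associate_address`.

```python
def ensure_elastic_ip(ec2, instance_id, public_ip):
    """Re-associate public_ip with instance_id if the address has come loose.

    `ec2` is assumed to behave like a boto3 EC2 client.
    Returns True if an association call was made, False if the address
    was already attached to the instance.
    """
    addresses = ec2.describe_addresses(PublicIps=[public_ip])["Addresses"]
    if not addresses:
        raise LookupError("Elastic IP %s not found in this account/region" % public_ip)

    address = addresses[0]
    current = address.get("InstanceId")

    if current == instance_id:
        return False  # already attached, nothing to do

    if current:
        # Attached to some other instance: do not silently steal it.
        raise RuntimeError(
            "Elastic IP %s is associated with %s" % (public_ip, current)
        )

    if "AllocationId" in address:
        # VPC addresses are associated by allocation ID.
        ec2.associate_address(
            InstanceId=instance_id, AllocationId=address["AllocationId"]
        )
    else:
        # EC2-Classic addresses are associated by the public IP itself.
        ec2.associate_address(InstanceId=instance_id, PublicIp=public_ip)
    return True


# Example use (the instance ID and IP address are placeholders):
#   import boto3
#   ensure_elastic_ip(boto3.client("ec2"), "i-0123456789abcdef0", "203.0.113.10")
```

Running something like this right after a stop/start would have shortened the outage to a few seconds. It deliberately refuses to move an address that is attached to a different instance, since that is more likely a typo than something you want.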
CONCLUSION: This scenario is exactly why you need to use an Elastic Block Store (EBS) backed EC2 instance for any of your important servers: if the hardware fails, your actual server image is still safe and can be brought back up on other hardware. It also proves that while “the cloud” is awesome, it can fail, and you need to be prepared for that. One last curious point is the Elastic IP address becoming disassociated from the instance. I am not sure whether this was caused by the hardware failure or by the stop/start of the instance, but it is definitely something to keep an eye out for in the future.