Azure Team Thwarted by Problem Four Years in the Making
Back in 1999, NASA lost a $125 million Mars orbiter because of a units mismatch: a Lockheed Martin engineering team used English units of measurement while the agency's team used the metric system, which disrupted the processing of navigation information. Yesterday, users of Windows Azure, one of the leading cloud platforms, began experiencing major service problems around the world due to a time miscalculation. The culprit? Leap year.
Apparently the Azure team didn't anticipate the extra day, which led to disruptions for users in multiple regions across the globe. Bill Lang, a Microsoft exec, responded on the Windows Azure blog soon after the problem was recognized and (mostly) rectified:
Yesterday, February 28th, 2012 at 5:45 PM PST Windows Azure operations became aware of an issue impacting the compute service in a number of regions. The issue was quickly triaged and it was determined to be caused by a software bug. While final root cause analysis is in progress, this issue appears to be due to a time calculation that was incorrect for the leap year. --Bill Lang
The Azure team was able to restore service to most of the affected regions, although the problem has not been entirely fixed. Writes Lang:
However, some sub-regions and customers are still experiencing issues and as a result of these issues they may be experiencing a loss of application functionality. We are actively working to address these remaining issues. Customers should refer to the Windows Azure Service Dashboard for latest status. -- Bill Lang
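Microsoft has not yet said which calculation failed, so the following is only an illustration of the general class of bug, not a reconstruction of the Azure code. A minimal Python sketch, with hypothetical function names, shows how "same date next year" arithmetic breaks on February 29 and one way to guard against it:

```python
from datetime import date


def add_one_year_naive(d: date) -> date:
    # Hypothetical naive approach: bump the year and keep month and day.
    # Raises ValueError for Feb 29, because the following year has no such day.
    return d.replace(year=d.year + 1)


def add_one_year_safe(d: date) -> date:
    # Hypothetical guarded approach: fall back to Feb 28 when the
    # target year has no Feb 29.
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)


if __name__ == "__main__":
    leap_day = date(2012, 2, 29)
    print(add_one_year_safe(leap_day))
    try:
        add_one_year_naive(leap_day)
    except ValueError as exc:
        print("naive calculation failed:", exc)
```

Whether to fall back to February 28 or roll forward to March 1 is a design choice that depends on the application; the point is that the leap day needs an explicit decision rather than an assumption that every date exists in every year.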
This problem should come as no surprise to cloud PaaS users, as it's certainly not the first time (nor will it be the last) that a major cloud platform has been disrupted. Last April, Amazon Web Services, Azure's leading competitor, ran into similar trouble during a routine scaling activity in a single Availability Zone in the US East Region. This set off a chain reaction that affected not only that region, but other services as well. To many users, Amazon's problems may have seemed specific to that provider. But now that a leap-year time miscalculation has taken down parts of Windows Azure, users should recognize that such outages cannot simply be chalked up to one vendor's negligence, bad luck, or human error. Problems like this will persist for as long as the cloud is still maturing. It seems that the days of a truly reliable cloud have yet to arrive.