
GitHub's Massive Data Storage Failure


Yesterday, GitHub experienced a data storage system failure. Today, engineers are still working to restore access. So, what happened?


Yesterday, on October 21st, GitHub.com started receiving an increased number of error reports around 16:00 Eastern time.

Many users couldn't log in. Those who could found that their pushes were not showing up: they couldn't see the changes they were making in their repos, and they couldn't post comments.

By 19:09, the influx of errors was showing up on GitHub.com's system status page, and by 20:05, the page said, "we're failing over a data storage system in order to restore access to GitHub.com." The status "We are continuing work to repair a data storage system" was then repeated for hours into the next day.

GitHub's system status page.

At 12:24 today, GitHub, soon to be bought by Microsoft, said they had completed validation of data consistency and normal functions were being restored. Issues with webhooks, Pages, and background jobs continued through 14:00, as GitHub had paused them to protect users' repository data. The hashtag "#GitHubDown" was thriving on Twitter.

So what happened? GitHub's incident report defines the event as "a network partition and subsequent database failure."

During the breakdown, users who could log in at all were only able to access old versions of their repos, which led many to panic that they had lost their data. GitHub assured users that while data on the site might appear inconsistent, no git repository data was lost. As GitHub recovers, this appears to be true.

Many pointed out on Twitter, and sites like Packt reported, that there was no way to tell that the site was down, since backend functions were still normal.

Users who could log in were surprised by the outage when they were unable to submit bug reports, posts, and pushes, although they were receiving email confirmations. This failure to inform users of the outage was a point of contention.

As of this publication at 15:28 EDT, webhook delivery, Pages builds, and other functions have been restored, and GitHub is continuing to monitor the situation as normal operation resumes. Companies as large as Twitter and Adobe, which use GitHub for their open-source projects, were affected, as were countless individual developers all over the world.

DZone will post updates as GitHub's data storage failure situation develops and resolves. 

UPDATE 10:44 EDT: GitHub reported that, as of 18:00 EDT last night, all GitHub.com services were back to normal after it worked through the backlog of webhooks and Pages builds. Its incident update states that the company will conduct a transparent investigation into the causes; we'll keep you posted as we learn more.
