The Biggest Software Failures in Recent Years
While in most cases the programmers’ mistakes are not too serious, some IT failures can have truly horrific consequences.
Join the DZone community and get the full member experience.Join For Free
Everyone who uses modern technologies has encountered errors and software failures. While in most cases the programmers’ mistakes are not too serious, some IT failures can have truly horrific consequences. The other aspect is the price the breached organizations pay.
According to the RiskIQ’s report, security breaches alone cost major companies as much as $25 per minute, while crypto-companies may lose almost $2000 a minute due to cybercrime. We have collected some of the most memorable examples of software failures from recent years (with many well-known brands involved) to show how severe the results can be and why preventive measures (such as extensive software testing) are truly required.
You may also like: What We Can Learn From Software Failures
Data Loss at Gitlab
Two years ago a well-known code collaboration platform GitLab experienced a severe data loss which appeared to be one of the major outages in the IT world. GitLab originally used only one database server but decided to test a solution using two servers. They planned to copy the data from the production environment to the test environment.
In the process, the automatic mechanisms began to remove accounts from the database which were identified as dangerous. As a result of increased traffic, the data copying process began to slow down and then stopped completely due to data discrepancies. To add insult to injury, information from the production database was removed during the copying process.
After several attempts to resume the process, one of the employees decided to delete the test base and start the process again but accidentally deleted the production base. What made things even worse is that the directory holding the copies was empty too — the backups had not been made for a long time due to a configuration error.
What meant to be a standard procedure resulted in an 18-hour outage while the 300 GB of customer data was lost. According to the GitLab’s estimates, the company has lost data on at least 5,000 new projects, 5,000 comments, and 700 users. The company approach to this failure deserves respect.
Gitlab explained in detail what happened, broadcasted the restoration procedure on YouTube and published a list of improvements to ensure that this trouble would never happen again. But as they say — the damage is done.
British Airways “Technical Issue”
This summer the flag carrier airline of the UK — British Airways — reported an IT system issue that resulted in the delay of hundreds of flights in the UK, while dozens of flights were canceled completely. This failure affected three British airports and thousands of passengers who had to rebook their flights or check-in by using manual systems. Despite the problem being solved, the airports still felt the effect of this failure for a long while before normal service was resumed.
This computer problem at British Airways is just the latest in a series of IT concerns of the airline. Last year British Airways was sentenced to a record fine of 200 million euros for a data breach. This happened because of the cyber-hack which resulted in a website failure compromising the data of 500 thousand customers. British Airways also experienced a massive system failure in 2017, which affected 75,000 passengers and cost the company nearly 80 million pounds.
British Airways is not the only airline that is struggling with programming issues. In 2013 American Airlines had to ground off all its flights because of the computer glitch. And in 2017 the company had over 1,000 flights at risk of cancellation. The plans of many travelers during the holiday season could be ruined because of a single error in the company’s internal scheduling system which gave too many pilots a day off.
Amazon AWS Outage
When it goes about IT failures, no one is safe. Amazon’s AWS, which is considered to be one of the most reliable hosting services, experienced a serious outage in the eastern coast of the U.S in 2017.
The AWS’s infrastructure supports millions of sites, meaning that when the company’s servers go down, it causes a lot of trouble across the internet. It wasn’t a surprise that “major technical difficulties” of ASW had led to unprecedented problems for hundreds of popular websites.
Many companies of different sizes and from different industries store their data in the data centers of AWS. This includes well-known names such as Netflix, Slack, Business Insider, IFTTT, Nest Trello, Quora, and Splitwise. Many of them were impacted by the outage mentioned above.
A lot of websites were completely offline, devices on the Internet of things such as IFTTT lighting controls or Nest thermostats refused to work, Amazon’s assistant Alexa was struggling to stay online, not even Amazon’s AWS status page worked anymore. This points to one thing – as more and more services rely on AWS's good reputation and move their websites to its servers, even small glitches in a single data center become a really big deal.
Google Plus Security Glitch
A vulnerability in Google+ exposed the private information of nearly 500 000 people using the social network between 2015 and March 2018. According to a report by the Wall Street Journal, the major part of the problem was a specific API that might be used to get access to non-public information.
The software glitch allowed outside developers to see the name, email address, employment status, gender, and age of the network’s users. The error had been discovered in March 2018 and rectified immediately.
The interesting part is — Google did not share the information about the bug in Google+ at once trying not to get into the limelight of the Cambridge Analytica scandal and become noticed by the regulators.
At the same time, the WSJ report states, although Google has no evidence of data misuse it also сan’t say there was none. In any case, the tech backlash ended sadly for Google+ – the consumer version of the network was shut down shortly afterward.
Facebook’s User Data-Leak
Last year Facebook, whose ability to handle the private information had been already questioned, confirmed that nearly 50 million accounts could be at risk. Hackers exploited a vulnerability in the system that allowed them to get access to the accounts and possibly to the personal information of Facebook’s users.
The attack was detected on September 25, 2018. According to The New York Times sources, 3 software flaws in the network’s systems allowed hackers to access user accounts, including Mark Zuckerberg’s, the CEO of Facebook.
The social network representatives stated that the hackers probably exploited a vulnerability in the “View as” code, the function that allows checking how a profile looks as seen by other people. This, in turn, resulted in the acquiring of authentication tokens, thanks to which the user does not have to log in to the site every time. 90 million users have been logged out of their accounts the day the vulnerability was discovered.
Facebook’s representatives explained that 40 additional million accounts had been logged out as a preventative measure. Back then this data breach was the largest in Facebook’s history. According to the new UpGuard’s report, over 540 million records on Facebook users were eventually exposed to Amazon cloud servers.
Can Software Testing Prevent Business Software Solutions Failures?
The cases listed above serve as a reminder of the importance of IT quality assurance of any type of software. They highlight the need for developing an effective approach to testing as a crucial part of the business processes.
The complexity of modern systems is so great that it is usually nearly impossible to perform one particular test and guarantee a perfect result. In most cases, only a combination of manual testing and automated testing allows you to bring a great product to the market.
It is important to stress; however, the test effort has to be adapted to the priorities of the business. Some modules of the software are often prone to error thus require greater attention of the QA specialists. Testing procedures must be also adapted to the system being tested. Because safety issues are much more critical in some systems than others. The tests must, therefore, be contextual and adapted to the environment.
The testing effort should start as early as possible in the software life cycle. No one will argue that the cost of resolving software bugs in the development process is significantly lower than the cost of resolving issues when the damage (to customer experience and the company’s reputation) is already done.
The detailed and effective testing strategy minimizes the likelihood of errors in the end product that can lead to negative consequences for your business.
Published at DZone with permission of Andrew Smith. See the original article here.
Opinions expressed by DZone contributors are their own.