How a New API Can Cause So Much Havoc
Zone Leader John Vester talks about a major microservice issue, which one would not expect to happen with internal-only facing APIs.
The following article is based upon factual events. To protect the innocent, a majority of the details were changed.
With nearly twenty Agile teams focused on replacing a monolithic application, everyone seemed motivated, challenged, and excited. The project had reached the point where everyone looked forward to the end-of-sprint demo days as a chance to show off the work that had been completed.
At first, the largest conference room at the corporation was packed with customers and Information Technology (IT) staff. Everyone was excited to see which features were slated for the next release and to hear about plans for the set of features that would be addressed after that.
After a few months, though, despite a strong flow of exciting features, attendance began to dwindle - mostly among the customers who planned to use the application. Some stated that corporate priorities and conflicting meetings caused the end-of-sprint demo attendance to drop. Others blamed additional workload as a result of added responsibilities or staff turnover.
Within a few months, the official end-of-sprint demos were canceled. Plans were made to have technical team meetings, to demo features across teams, but they never really gained traction. By this point of the project, the Agile teams had become more like dedicated silos focusing on specialized areas of the application.
Core Business Updates
One of the teams worked directly with their customer to solve some core business issues with the billing side of the application. The approach employed by the legacy application was no longer valid and revisions to the underlying data model and API were required.
While the changes were not trivial, having one team focus on the new business logic was far easier from a development and planning perspective. The team would retire the old APIs and introduce a new set of APIs since the business rules and data model had changed significantly.
Where Things Went Bad
With the amount of work placed on each team, and with teams unable to keep up with the work completed by their peers, management decided to have Pull Requests (code reviews) completed by the sister team working on the same segment of the application. You see, the Agile teams worked in pairs, where two complete teams were assigned a section of the monolithic application that was being replaced.
As one might expect, the sister team already knew about the changes, and the PR passed review without much effort. On the database side, the Shared Services reviewers did not notice anything that would raise a caution flag, and they approved the requests as well.
The migration for the code was scheduled and completed without any issues. The data transformation required to handle legacy data also completed without any unexpected issues. It wasn't until the following morning that everything started to go off the rails.
The first to call were the mobile customers, who were no longer able to access the financial side (billing) of the mobile app. As those calls were routed to the Help Center, more calls started flowing in from those who supported the remote customers. Still other customers called their Product Owners when their daily reports stopped working.
The Help Center and Product Owners struggled to figure out the root cause of the issue. In the previous twelve hours, their systems had not been updated in any way, and no maintenance tasks had been performed on the underlying systems.
It wasn't until the developers got pulled into the situation that the cause was identified. By that point, nearly half a business day of transactions had been processed against the new data model - which contained elements that could not be rolled back to the former state.
Instead, both the mobile application and reporting teams had to work quickly on a fix. In the meantime, the impacted reports and mobile functionality were disabled in both applications.
How This Could Have Been Avoided
I am certain that disbanding the end-of-sprint demos played a role in this internal disaster. After all, had the feature team demonstrated the new billing functionality, team members from the mobile application and reporting teams would have noticed these major changes and likely spoken up during the meeting about the underlying data design. Even if the end-of-sprint demos did need to be canceled, the planned technical team meetings would also have caught this issue ahead of time - but they never happened.
Another factor that hampered the situation is that no one had a handle on who was using the API. External APIs often employ API keys in order to track usage. While this might be considered an over-the-top approach for an internal API, it would have provided the necessary insight ahead of time. If nothing else, some way to track the API calls should have been considered - even if that only meant recording the caller's internal IP address.
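To make the idea of tracking API usage concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the UsageTracker class, the endpoint path, and the caller names are hypothetical and are not taken from the project described in this article. It simply records which caller (an API key or an internal IP address would both work as the identifier) hit which endpoint, so that someone can ask "who depends on this?" before retiring it.

```python
from collections import Counter, defaultdict


class UsageTracker:
    """Hypothetical helper: counts calls per endpoint and per caller."""

    def __init__(self):
        # endpoint -> Counter of caller identifiers (API key or internal IP)
        self._calls = defaultdict(Counter)

    def record(self, endpoint, caller):
        """Note one call to `endpoint` made by `caller`."""
        self._calls[endpoint][caller] += 1

    def consumers(self, endpoint):
        """Return every caller seen for `endpoint`, sorted by name."""
        return sorted(self._calls.get(endpoint, {}))

    def safe_to_retire(self, endpoint, owner):
        """True only if no caller other than `owner` has used `endpoint`."""
        return all(caller == owner for caller in self.consumers(endpoint))


# Illustrative data only: endpoint and caller names are made up.
tracker = UsageTracker()
tracker.record("/billing/invoices", "billing-web")
tracker.record("/billing/invoices", "mobile-app")
tracker.record("/billing/invoices", "reporting")

print(tracker.consumers("/billing/invoices"))
print(tracker.safe_to_retire("/billing/invoices", "billing-web"))
```

In a real service, the record call would likely live in request middleware or an API gateway rather than in application code. The point of the sketch is only that a simple lookup of this kind would have surfaced the mobile and reporting consumers before the old APIs were retired.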
On the database side, having an understanding of who/what is making calls to a particular table might have helped identify the usage. Of course, this falls into a "no man's land" situation where the DBAs cannot be expected to know who is submitting queries into a database table and application developers likely do not see it as their responsibility either.
While the sister-team PR process did not expose the issue, it is highly unlikely that any other team would have had insight into the needs of the mobile and reporting teams - especially if neither of those teams was involved in the PR process.
In the end, better communication would have prevented this scenario. Of course, that in itself is a challenge, since not everyone will take the time to stop and read this level of communication. I can't imagine trying to sort through communications from nearly twenty teams...and keep up with my daily workload.
From the team's perspective, one might suggest that API versioning would have prevented the situation. However, in this case, the prior version could not have kept running, since the data model had changed significantly. At the same time, those calling the database directly were bypassing the API altogether.
Knowing who is using your API is crucial in today's API-driven world. One might expect that internal APIs are not as prone to this vulnerability, but seeing this scenario play out firsthand quickly disproved that assumption.
The issue incurred significant costs, which could have been prevented if one or more of the elements noted above had been put (or kept) in place. With every decision, there is always a degree of risk. In this case, the risk certainly outweighed the benefits of changing the existing strategy.
Have a really great day!