Most businesses depend on third parties to reliably deliver products or services to their customers. E-commerce sites rely on delivery services. Broadcasters rely on cable and satellite providers. And web platforms rely on cloud infrastructure to keep their systems accessible.
The third-party provider ecosystem maximizes efficiency. Imagine if O.co took responsibility for shipping their own packages. They wouldn’t be able to completely focus on their core business, which could weaken their competitive position.
But relying on business partners does introduce a measure of risk: if Amazon Web Services suffers an outage, it’s not just Amazon.com that will experience downtime. Countless other businesses that run their sites on AWS will suffer an outage, too.
Because of the inherent risk we assume when using third parties, we need a way to work around and mitigate the impact a third party’s failure will have on our business.
Scrutinize Every Provider
You should continually assess all of your third-party services using end-to-end testing methods. When you proactively identify service delays or outages, engineers can quickly switch the order of your primary, secondary and tertiary providers so that your customers are only minimally affected by downtime. Sure, your partners may have a policy of notifying customers (you) of an outage, but that notice can arrive well after the outage begins, extending the amount of time your service is unavailable to your customers.
Having redundancy with your third-party services and testing them against each other will help you limit the risk of one provider’s failure affecting your operations and customer experience.
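To make the idea of ordered providers concrete, here is a minimal sketch of trying a primary, then a secondary, and so on. It is an illustration under assumptions, not any particular vendor's API: the `send_with_failover` function, the `AllProvidersFailed` exception and the two stand-in send functions are all hypothetical names.

```python
from typing import Callable, List, Sequence, Tuple


class AllProvidersFailed(Exception):
    """Raised when every provider in the ordered list fails."""


def send_with_failover(
    providers: Sequence[Tuple[str, Callable[[str], None]]],
    message: str,
) -> str:
    """Try each (name, send) pair in priority order.

    Returns the name of the first provider whose send call succeeds.
    """
    errors: List[str] = []
    for name, send in providers:
        try:
            send(message)
            return name
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical stand-ins for real provider clients.
def broken_send(message: str) -> None:
    raise ConnectionError("provider unreachable")


def working_send(message: str) -> None:
    pass  # pretend the message was accepted


used = send_with_failover(
    [("primary", broken_send), ("secondary", working_send)],
    "test alert",
)
print(used)
```

Because the priority order is just a list, reordering providers after a bad test result is a matter of passing the list in a different order.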
Plan for Third-Party Failure
Failure is inevitable. Taking this principle to heart by designing around failure is the only logical response. Because system downtime is all but assured, you should make every effort to understand how your systems will respond when things break.
Test and validate your failover plans by manually shutting down your third-party services (and do this often). Knowing how your systems will function when one of your partners is down will help you proactively alleviate pain when it happens in the wild.
Failure testing will:
Provide your engineering team with a holistic, top-down view of your operations.
Put on display the dependencies in your system architecture (i.e., the third-party services you rely on to keep things up and running).
Let you probe those dependencies in a structured, rigorous way.
End-to-End Testing at PagerDuty
At PagerDuty, end-to-end testing is a pivotal piece of our reliability story. We rely on third-party carriers to help us deliver SMS alerts to our users. If a provider is down, our customers may not receive an alert to know that there is a critical incident in their system. We developed an End-to-End SMS Provider Testing method to ensure all our SMS alerts are delivered, and delivered quickly.
How We Automate End-to-End SMS Provider Testing
We have three Android phones with different mobile carriers: AT&T, Verizon and T-Mobile (we’ll be adding Sprint soon).
Internally, we built an Android app that allows us to automate testing and send test SMS alerts from our system to each of our Android phones in a round-robin rotation.
We use Datadog to collect and calculate the time taken for each SMS to reach the designated testing phone and how long our testing app takes to reply back to us.
Based on the collected data, we can determine if a third-party provider is down or performing poorly. If our primary provider is jeopardized, we take action and switch the order of our primary, secondary and tertiary providers.
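The steps above can be sketched roughly as follows. This is not PagerDuty's actual implementation: the thresholds, the `classify` and `rank_providers` helpers and the sample numbers are assumptions made up for illustration, and the latency samples would in practice come from your metrics system rather than a hard-coded dictionary.

```python
from itertools import cycle
from statistics import median
from typing import Dict, List, Optional

# Assumed thresholds; the real values are not stated in this post.
SLOW_SECONDS = 30.0   # median delivery time above this counts as degraded
MAX_LOSS_RATE = 0.5   # more than half of test messages lost counts as down


def classify(samples: List[Optional[float]]) -> str:
    """Label a provider from recent delivery times (None = never arrived)."""
    if not samples:
        return "unknown"
    delivered = [s for s in samples if s is not None]
    loss_rate = 1 - len(delivered) / len(samples)
    if not delivered or loss_rate > MAX_LOSS_RATE:
        return "down"
    if median(delivered) > SLOW_SECONDS:
        return "degraded"
    return "healthy"


def rank_providers(history: Dict[str, List[Optional[float]]]) -> List[str]:
    """Order providers by status first, then by median delivery time."""
    order = {"healthy": 0, "degraded": 1, "unknown": 2, "down": 3}

    def sort_key(name: str):
        delivered = [s for s in history[name] if s is not None]
        latency = median(delivered) if delivered else float("inf")
        return (order[classify(history[name])], latency)

    return sorted(history, key=sort_key)


# Round-robin rotation over the test phones.
phones = cycle(["AT&T", "Verizon", "T-Mobile"])
targets = [next(phones) for _ in range(4)]

# Hypothetical delivery times in seconds for three SMS providers.
history = {
    "provider_a": [4.0, None, None, None],
    "provider_b": [45.0, 50.0, 40.0],
    "provider_c": [3.0, 5.0, 4.0],
}
print(targets)
print(rank_providers(history))
```

Using the median rather than the mean keeps a single slow test message from demoting an otherwise healthy provider.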
As we scale our systems, we are looking ahead to creating a third-party probabilistic provider management system. This will allow us to be even more proactive around end-to-end testing: instead of manually reacting to our SMS providers’ performance, our system will automatically switch providers when one is down or degraded.
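One way to read "probabilistic" here is that each message goes to a provider chosen at random, weighted by how well that provider has performed recently. The sketch below shows that interpretation only; the post does not describe the planned design, and the `health_score` and `pick_provider` functions and the smoothing formula are assumptions.

```python
import random
from typing import Dict, Optional, Tuple


def health_score(successes: int, failures: int) -> float:
    """Smoothed success rate, so an untested provider starts at 0.5."""
    return (successes + 1) / (successes + failures + 2)


def pick_provider(
    stats: Dict[str, Tuple[int, int]],
    rng: Optional[random.Random] = None,
) -> str:
    """Pick a provider at random, weighted by its smoothed success rate.

    `stats` maps a provider name to (successes, failures) from recent tests.
    """
    rng = rng or random.Random()
    names = list(stats)
    weights = [health_score(*stats[name]) for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


# Hypothetical test results: one healthy provider, one failing provider.
stats = {"provider_a": (98, 0), "provider_b": (0, 98)}
rng = random.Random(0)  # seeded so the run is repeatable
picks = [pick_provider(stats, rng) for _ in range(1000)]
print(picks.count("provider_a"), picks.count("provider_b"))
```

A failing provider still receives a small share of traffic under this scheme, which lets the system notice when it recovers without a separate probe.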
Of course, this technique is not limited to SMS providers and can be adapted to any third-party system you rely on. Finding a way to verify the reliability of your third-party dependencies will save you a massive headache and let you sleep soundly, because you’ll get fewer 2:00 AM alerts.