We Went Multi-Cloud and Almost Drowned: Lessons From Running Across AWS, GCP, and Azure

Multi-cloud sounds strategic, but usually happens by accident. Networking, IAM, and observability all break at boundaries. Only attempt it if you have no choice.

May. 18, 26 · Analysis

Likes (0)

Comment

Save

1.8K Views

It started, as most bad architectural decisions do, with a PowerPoint slide from a VP who had just returned from a conference. “We need to avoid vendor lock-in,” he declared, and suddenly our platform engineer team had a mandate to distribute workloads across three public clouds. Eighteen months later, we had something that technically ran on three major public clouds (AWS, GCP, and Azure). We also had a Terraform code that made people cry and an on-call rotation nobody wanted.

This is what I learned about multi-cloud strategy, not the vendor pitch but the messy reality of keeping production alive across multi-cloud boundaries.

The Most Common Reason vs. The Reason That Logically Matters

Vendor lock-in avoidance is the stated justification for most of the multi-cloud initiatives I have seen. It sounds sensible on paper: spread your bets, negotiate better pricing, avoid being held hostage. In practice, the switching cost argument is often considered theoretical. Nobody is ripping a mature workload off AWS and replanting it on GCP over a cost dispute. The expense of moving everything would have exceeded whatever we stood to gain.

The real reasons multicloud makes sense are less glamorous if you acquire a company running on a different cloud. A specific managed service, say BigQuery for analytics or Azure AD for identity, is genuinely better than what your primary cloud offers. Sometimes the decision is made for you. A customer's regulatory or compliance obligations may demand data residency in a region that only one specific provider serves, or, in our case, a situation that required all three simultaneously.

Those are valid reasons to proceed. If you cannot articulate a concrete scenario where you would actually migrate between clouds, you are paying an operational complexity, data egress cost, and management overhead that act as a tax for insurance you will never claim.

Where It Truly Became Painful

Initially, things seemed under control. We chose Kubernetes, spun up clusters on EKS, GKE, and AKS, and told ourselves the hard part was over. It wasn't.

Networking hit its first wall. Each cloud provider handles virtual networks, traffic routes, and DNS completely differently. We spent three frustrating weeks chasing random connection failures between services split across GCP and AWS, eventually discovering our Transit Gateway was silently discarding packets in an obscure NAT scenario. Nobody warns you that there's no such thing as a common networking standard across clouds, and anyone claiming otherwise is really just pitching you their middleware.

Identity and access management became our second major headache. AWS, Google Cloud, and Azure each handle permissions through completely different systems. We attempted to maintain matching role definitions across all three to ensure consistency. That approach collapsed within two months as the configurations slowly diverged into chaos. We eventually consolidated around a single identity provider and built a translation layer between them — messy, but at least we could audit it properly.

Observability turned out to be our most stubborn ongoing problem. The moment a request travels across cloud boundaries, distributed tracing falls apart completely. Your metrics end up scattered across three separate platforms, each speaking its own query language, making unified monitoring feel nearly impossible. We consolidated into Datadog, which helped, but the bill was eye-watering, and we still had unknown areas at the edges.

Here's what working day-to-day actually feels like. Writing a single Terraform module to spin up a managed Postgres database meant wrestling with three completely different provider APIs simultaneously.

That module was 800 lines long for what is theoretically a single resource type. Multiply that by every piece of resource in the infrastructure you run, and you see the maintenance burden.

What Indeed Worked

After twelve months of struggle, one decision turned everything around. We stopped chasing cloud-agnosticism and started being cloud-intentional instead. That distinction is everything.

Cloud-agnostic thinking pushes you toward lowest-common-denominator solutions, wrapping every service in abstraction layers, pretending all clouds are essentially identical. It's a trap that leaves you using none of them effectively.

Cloud-native means deliberately matching specific workloads to the right provider, embracing each platform's native strengths, and only building abstractions at the actual boundaries where systems talk to each other. Machine learning training lived on GCP because Vertex AI and TPU access justified it. Transactional workloads stayed on AWS, where our team had years of accumulated expertise. Azure handled enterprise identity because that's simply where our customers' Active Directory already lived.

The crucial mindset shift was simply accepting that each environment would naturally look different. We stopped forcing our GCP infrastructure to mirror our AWS setup. Instead, we focused on standardizing the connections between them, API contracts, event schemas, and a shared service mesh — only where cross-cloud communication was genuinely necessary.

The Cost Nobody Mentions

Multicloud isn't purely a technology challenge. It's fundamentally a people problem. Finding engineers who truly master even one cloud platform is already difficult. Expecting a single team to maintain deep production-level expertise across three simultaneously is simply unrealistic. We naturally drifted toward informal specialization: two people owned GCP, three focused on AWS, and one managed Azure, and those knowledge silos made being on-call genuinely miserable.

Training costs are very real, too. The mental overhead of constantly switching between three different consoles, three separate command-line tools, and three completely different mental models of networking and storage is a hidden tax that never appears in any architecture review document.

When You Shouldn't Attempt This

If you are a startup or a team of under fifty engineers, multicloud is almost a mistake. Pick the cloud your team knows the best, use its managed services aggressively, and ship. The theoretical risk of vendor lock-in is less dangerous than the very real risk of moving slowly because your infrastructure is too complex to operate in different clouds.

Even at scale, if your workloads lack a concrete reason to span multi-cloud boundaries, resist the pressure. A well-architected single-cloud deployment with proper DR will serve you better than a disintegrated multi-cloud setup held together by duct tape and YAML.

Lessons Learned

Treat multicloud as a practical response to specific constraints, never as a default stance. Go cloud-intentional rather than cloud-agnostic; use each provider’s strengths natively and standardize only at cloud integration boundaries. Invest generously in monitoring and networking; that's where multi-cloud complexity bites hardest. Be honest about your team’s capacity before committing to overhead that scales with the number of providers that you support.

One Final Reflection

In hindsight, the most valuable thing about our multi-cloud journey was not the architecture; it was the discipline forced on our service boundaries. When crossing a cloud boundary is expensive and painful, you think harder about what actually needs to be crossed. That forced intentionality made our systems design better overall, even parts that never left a single cloud.

Could we have achieved the same architectural improvements simply by designing cleaner interfaces, without literally spanning multiple clouds? Probably yes. But nobody approves a budget for "Let's build better boundaries." They do approve a budget for a "multi-cloud strategy." Sometimes solid engineering quietly sneaks in through a dubious PowerPoint presentation.

AWS azure Cloud

Opinions expressed by DZone contributors are their own.

Related

Trending