Lessons From the Birth of Microservices at Google
There are five great lessons to be learned from Google's innovative adoption of microservices.
Our story begins in the early 2000s when Google's Search and Ads products were really taking off. At the time, the conventional wisdom suggested that building a reliable internet service meant buying reliable — and expensive — servers. To keep costs down, Google engineering teams switched to commodity hardware. While this approach saved on cost, these servers were (as you might expect) highly unreliable. As hardware failure became the common case rather than the exception, Google engineers needed to design accordingly.
Today, we might start by going to GitHub to look for code that others have developed to tackle similar problems, but in 2001 this simply wasn't an option. While there were strong communities around projects like Linux and the Apache Web Server, these projects were mostly focused on the software running on a single machine. As a result, Google had no choice but to build things from scratch.
What followed was an explosion of now-legendary infrastructure projects, including GFS, BigTable, and MapReduce. These systems, along with widely used products like Google Search and Ads, had several characteristics that we now associate with microservice architectures:
Horizontal scalability that allowed the software to run across thousands of physical machines.
Reusable code for RPC, service discovery, load balancing, distributed tracing, and authentication.
Frequent and independent releases.
However, even though Google arrived at the results we hope to achieve with microservices, should we follow the same path?
Lesson 1: Know Why (You're Adopting Microservices)
The most important reason to adopt microservices has nothing to do with how your software scales: microservices are useful because they address engineering management challenges. Specifically, when service boundaries are drawn between teams and each team is free to work independently behind them, teams need to communicate less and can move more quickly.
The architecture developed for Google Search and Ads, while superficially similar to microservices, was really there to support the scale needed to index and search the web while serving billions of requests every second. That scale came at a cost in features, and there were hundreds of other, smaller teams at Google who needed tools that were easy to use and who didn't care about planetary scale or tiny gains in efficiency.
As you are designing your infrastructure, consider the scale of what you need to build. Does it need to support billions of requests per second... or "just" millions?
Lesson 2: Serverless Still Runs on Servers
Many organizations are considering skipping microservices and moving straight to a "serverless" architecture. Of course, serverless still runs on servers, and that has important consequences for performance. As Google built out its infrastructure, measuring performance was critical. Among the many things that Google engineers measured and used are two that are still relevant to serverless:
A main memory reference is about 100 nanoseconds.
A roundtrip RPC (even within a single data center) is about 500,000 nanoseconds.
This means that a function call within a process is roughly 5,000 times faster than a remote procedure call to another process. Stateless services can be a great fit for serverless, but even stateless functions need access to data as part of their implementation. If some of that data must be fetched from other services (which almost goes without saying for something that's stateless!), the performance difference quickly adds up. The usual answer to this problem is (of course) a cache, but an in-process cache needs a long-lived process to live in, which is exactly what a serverless model doesn't give you.
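To make that arithmetic concrete, here's a minimal sketch of a read-through, in-process cache in front of a remote call. The service name, latency figure, and data are illustrative assumptions rather than details of any real system; the point is that a long-lived service can hold this kind of state while an ephemeral serverless function cannot.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fetchFromUserService stands in for a cross-service RPC. In a real system
// the round trip costs on the order of 500,000 ns; the map lookup in the
// cache below costs on the order of 100 ns.
func fetchFromUserService(id string) string {
	time.Sleep(500 * time.Microsecond) // simulate the round-trip cost
	return "profile-for-" + id
}

// profileCache is a read-through, in-process cache: it needs a long-lived
// process to be useful, which is exactly what serverless doesn't provide.
type profileCache struct {
	mu      sync.RWMutex
	entries map[string]string
}

func (c *profileCache) Get(id string) string {
	c.mu.RLock()
	v, ok := c.entries[id]
	c.mu.RUnlock()
	if ok {
		return v // hit: no network round trip
	}
	v = fetchFromUserService(id) // miss: pay the RPC cost once
	c.mu.Lock()
	c.entries[id] = v
	c.mu.Unlock()
	return v
}

func main() {
	cache := &profileCache{entries: make(map[string]string)}

	start := time.Now()
	cache.Get("42") // cold: dominated by the remote call
	fmt.Println("cold:", time.Since(start))

	start = time.Now()
	cache.Get("42") // warm: served from process memory
	fmt.Println("warm:", time.Since(start))
}
```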
Serverless has its place, especially for offline processing where latency is less of a concern, but often a small amount of context (in the form of a cache) can make all the difference. "Skipping" right to a serverless-based architecture might leave you in a situation where you need to step back to a more performant, services-based one.
Lesson 3: What Independence Should Mean
Above, I argued that organizations should adopt microservices to enable teams to work more independently. But how much independence should we give teams? Can they be too independent? The answer is definitely yes, they can.
In adopting microservices, many organizations allow each team to make every decision independently. However, the right way to adopt microservices is to establish organization-wide solutions for common needs like authentication, load balancing, CI/CD, and monitoring. Failing to do so will result in lots of redundant work, as each team evaluates or even builds tools to solve these problems. Worse, without standards, it will become impossible to measure or reason about the performance or security of the application as a whole.
Google got this one right. They had a small number of approved languages and a single RPC framework, and all of these supported standard solutions in the areas listed above. As a result, product teams got the benefits of shared infrastructure, and infrastructure teams could quickly roll out new tools that would benefit the entire organization.
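One common way to encode such standards today is to wrap every service's handlers in shared middleware so that authentication, request-ID propagation, and basic metrics come for free. The sketch below is illustrative, not Google's actual framework; the header names, handler, and port are assumptions.

```go
package main

import (
	"log"
	"net/http"
	"time"
)

// withStandards wraps any team's handler with organization-wide concerns:
// authentication, request-ID propagation, and basic latency logging.
// Individual teams never reimplement these.
func withStandards(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("Authorization") == "" {
			http.Error(w, "unauthenticated", http.StatusUnauthorized)
			return
		}
		// Propagate (or mint) a request ID so a call can be followed end to end.
		// A timestamp is a placeholder; a real framework would use a proper ID.
		reqID := r.Header.Get("X-Request-Id")
		if reqID == "" {
			reqID = time.Now().Format("20060102150405.000000000")
		}
		w.Header().Set("X-Request-Id", reqID)

		start := time.Now()
		next.ServeHTTP(w, r)
		log.Printf("req=%s path=%s latency=%v", reqID, r.URL.Path, time.Since(start))
	})
}

func main() {
	// A product team writes only its business logic...
	orders := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("order accepted\n"))
	})
	// ...and gets the shared standards by construction.
	http.Handle("/orders", withStandards(orders))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

With a shared wrapper like this, an infrastructure team can change how authentication or metrics work in one place, and every product team picks up the change on its next deploy.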
Lesson 4: Beware of Giant Dashboards
Time series data is great for determining when there is a regression, but it's nearly impossible to use it to determine which service caused the problem. It's tempting to build dashboards with dozens or even hundreds of graphs, one for each possible root cause. You aren't going to be able to enumerate all possible root causes, however, let alone build graphs to measure them.
The best practice is to choose fewer than a dozen metrics that will give you the best signal that something has gone wrong. Even products as large as Google Search were able to identify a small number of signals that really mattered to their users. In this case, perhaps unlike others, what worked for Google Search can work for you.
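As a sketch of what a small set of signals might look like in practice (the metric choices and the sample data here are illustrative assumptions, not Google's), the following reduces a window of traffic to just two numbers worth alerting on: error rate and tail latency.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// requestRecord is one completed request: how long it took and whether it failed.
type requestRecord struct {
	latency time.Duration
	failed  bool
}

// criticalSignals reduces a window of traffic to two numbers worth alerting on:
// the error rate and the 99th-percentile (tail) latency.
func criticalSignals(window []requestRecord) (errorRate float64, p99 time.Duration) {
	if len(window) == 0 {
		return 0, 0
	}
	latencies := make([]time.Duration, 0, len(window))
	failures := 0
	for _, r := range window {
		latencies = append(latencies, r.latency)
		if r.failed {
			failures++
		}
	}
	sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })
	idx := (99*len(latencies) - 1) / 100 // index of the 99th percentile
	return float64(failures) / float64(len(window)), latencies[idx]
}

func main() {
	window := []requestRecord{
		{latency: 20 * time.Millisecond},
		{latency: 35 * time.Millisecond},
		{latency: 40 * time.Millisecond, failed: true},
		{latency: 900 * time.Millisecond},
	}
	errRate, p99 := criticalSignals(window)
	fmt.Printf("error rate: %.2f, p99 latency: %v\n", errRate, p99)
}
```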
Ultimately, observability boils down to two activities: measuring these critical signals and then refining the search space of possible root causes. I use the word "refining" because root cause analysis is usually an iterative process: establishing a theory for what went wrong, then digging in more to validate or refute that theory. With microservices, the search space can be huge: it must encompass not only any changes in the behavior of individual services but also the connections between them. To reduce the size of the search space effectively, you'll need tools to help test these theories, including tracing, as described next.
Lesson 5: You Can't Trace Everything... or Can You?
Like anyone who has tried to understand a large distributed system, Google engineers needed tools that would cut across service boundaries and give them a view of performance centered around end-to-end transactions. They built Dapper, Google's distributed tracing system, to measure the time it takes to handle a single transaction and to break down and assign that time to each of the services that played a role in that transaction. Using Dapper, engineers can quickly rule out which services aren't relevant and focus on those that are.
Tracing, like logging, grows in cost as the volume of data increases. When that volume increases because of an increased number of business transactions, that's fine, as presumably you can justify additional costs as your business grows. But the volume of tracing data also grows with the number of microservices each transaction touches, so new microservices mean your margins will take a hit.
Google's solution to this problem was simple: 99.99% of traces were discarded before they were recorded in long-term durable storage. While this may seem like a lot to throw out, at the scale of services like Google Search, even as few as 0.01% of requests can provide ample data for understanding latency. To keep things simple, the choice about which traces to keep was made at the very beginning of the request. Subsequent work, like Zipkin, copied this technique.
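Here is a minimal sketch of that head-based approach, assuming a simple random sampler (the rate matches the 0.01% figure above, but the code is illustrative rather than how Dapper is actually implemented): the keep-or-discard decision is made once, at the start of the request, and travels with the trace so every downstream service honors it.

```go
package main

import (
	"fmt"
	"math/rand"
)

// sampleRate of 0.0001 keeps roughly 0.01% of traces; the rest are discarded
// before anything is written to durable storage.
const sampleRate = 0.0001

// startTrace makes the sampling decision once, at the edge of the system.
// The decision travels with the trace context, so every downstream service
// either records all of its spans for this request or none of them.
func startTrace() (traceID uint64, sampled bool) {
	return rand.Uint64(), rand.Float64() < sampleRate
}

func main() {
	kept := 0
	const requests = 1_000_000
	for i := 0; i < requests; i++ {
		if _, sampled := startTrace(); sampled {
			kept++
		}
	}
	fmt.Printf("kept %d of %d traces\n", kept, requests)
}
```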
Unfortunately, this approach misses many rare yet interesting transactions. And for most organizations, it's not necessary to take such a simplistic view of trace selection. They can make the selection using additional signals that aren't available until after the request has completed. This means we can put more traces, and more meaningful ones, in front of developers without increasing costs. Doing so requires rethinking the data collection and analysis architecture, but the payoff, in terms of reduced mean time-to-resolution and improved performance, is well worth the investment.
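For contrast, here is a sketch of deferring that decision until the request has finished, so the outcome itself can drive selection. The latency threshold and keep rules are illustrative assumptions; a real tail-based sampler (and the rethought collection architecture mentioned above) would buffer spans until each trace completes.

```go
package main

import (
	"fmt"
	"time"
)

// completedTrace is buffered until its request has finished, so the
// keep-or-discard decision can use what actually happened.
type completedTrace struct {
	traceID  uint64
	duration time.Duration
	hadError bool
}

// keepTrace decides after the fact: always keep errors and slow requests,
// plus a small deterministic share of ordinary traffic as a baseline.
func keepTrace(t completedTrace, slowThreshold time.Duration) bool {
	if t.hadError || t.duration > slowThreshold {
		return true
	}
	return t.traceID%10000 == 0 // roughly 0.01% of unremarkable traces
}

func main() {
	traces := []completedTrace{
		{traceID: 1, duration: 30 * time.Millisecond},
		{traceID: 2, duration: 2 * time.Second},                       // slow: kept
		{traceID: 3, duration: 40 * time.Millisecond, hadError: true}, // failed: kept
		{traceID: 10000, duration: 25 * time.Millisecond},             // baseline sample: kept
	}
	for _, t := range traces {
		fmt.Printf("trace %d kept=%v\n", t.traceID, keepTrace(t, time.Second))
	}
}
```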
While Google built systems that share many characteristics with microservices as they exist today (as well as a powerful infrastructure that has since been replicated by a number of open-source projects), not every design choice that Google engineers made should be duplicated. It's critical to understand what led Google engineers down these paths — and what's changed in the last 15 years — so you can make the best choices for your own organizations.