Thanks to Ben Sigelman, who built and wrote a paper on Dapper, a core monitoring platform at Google. Dapper is an always-on distributed tracing project that helps Google make sense of its distributed systems in production, processing billions of distinct transactions every second. “Distributed tracing” allows DevOps teams to automatically follow the path that any and every request takes across microservices. With ordinary transactions involving hundreds or thousands of distinct services in the span of a quarter second, there’s no other way to understand and explain the behavior of today’s intricate distributed systems.
Dapper inspired a number of open-source projects, such as Zipkin by Twitter (now maintained by Pivotal) and Jaeger by Uber. Sigelman is also a co-author of the OpenTracing project, part of the Cloud Native Computing Foundation (CNCF).
Sigelman is now the co-founder of LightStep, which is still in stealth. While he can’t talk about the company just yet, he did share his thoughts with me about:
- Distributed systems break down toolchains used by the enterprise. Companies are decoupling services to build faster and better. The result is a distributed system that is complex and always in flux. However, the solutions that engineering departments have depended on for the last decade (logging, metrics and conventional APM, and even how they deploy and do security) depend on built-in assumptions about system architecture that no longer hold. There is a path forward, and it represents a major new market opportunity. Engineers and DevOps teams will work more efficiently when services have as few interdependencies as possible, which in turn increases the velocity and frequency of deployment.
- Dapper. With the widespread industry transition to microservices, distributed tracing has gone from being esoteric to being essential. Google built Dapper to follow requests across its own microservice architecture, as it’s impossible to understand how an application or business is performing without explaining the full lifecycle of these application requests. DevOps needs to know what’s breaking, and why, preferably before users are impacted, since problems are more time-consuming and expensive to fix once users have noticed them.
- The rise of SaaS changes table stakes for APM. Traditional APM, a big budget category for most businesses, measures application performance in averages. However, SaaS companies tend to have a small number of top customers that disproportionately contribute to revenue. These customers use the product differently and have larger and richer datasets than the “long tail” customers. The ability to measure and optimize for top-customer performance is critical in such a scenario, and companies are looking for APM vendors to rise to the challenge.
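To make the point about averages concrete, here is a minimal sketch with made-up numbers. The customer names and latency values are hypothetical and do not come from Sigelman or any real system. It only illustrates how an aggregate average can look healthy while one large customer has a much worse experience.

```python
from statistics import mean

# Hypothetical request latencies in milliseconds, keyed by customer.
# "acme" stands in for a top customer with a larger, richer dataset.
latencies_ms = {
    "acme": [900, 1100, 1000],
    "small-1": [100, 120, 80],
    "small-2": [90, 110, 100],
    "small-3": [100, 100, 100],
}


def overall_average(data):
    """Average across every request, the way an aggregate view would report it."""
    all_values = [value for values in data.values() for value in values]
    return mean(all_values)


def per_customer_average(data):
    """Average latency computed separately for each customer."""
    return {customer: mean(values) for customer, values in data.items()}


overall = overall_average(latencies_ms)
by_customer = per_customer_average(latencies_ms)
```

With these illustrative numbers the overall average is 325 ms, while the top customer averages 1000 ms. That gap is the kind of detail an averages-only view hides.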
With the adoption of DevOps, distributed tracing needs to be automated so companies are able to explain any and every request, sort the signal from the noise, and pinpoint the problem that needs to be fixed.
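As a rough illustration of what following a request across services involves, the sketch below passes a trace identifier from one service to the next and then reconstructs the path. It is not Dapper's design or the OpenTracing API. The header names, service names, and durations are all assumptions chosen for the example, and the simulation assumes each service appears only once per trace.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, List, Optional

TRACE_HEADER = "x-trace-id"         # hypothetical header name
PARENT_HEADER = "x-parent-span-id"  # hypothetical header name


@dataclass
class Span:
    """One unit of work done by one service on behalf of a request."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str
    duration_ms: float = 0.0


def start_span(service: str, headers: Dict[str, str]) -> Span:
    """Continue the caller's trace if the headers carry one, else start a new trace."""
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    return Span(trace_id, uuid.uuid4().hex, headers.get(PARENT_HEADER), service)


def outgoing_headers(span: Span) -> Dict[str, str]:
    """Headers a service attaches to its downstream calls."""
    return {TRACE_HEADER: span.trace_id, PARENT_HEADER: span.span_id}


def ancestry(spans: List[Span], span: Span) -> List[str]:
    """Service names from the root of the trace down to the given span."""
    by_id = {s.span_id: s for s in spans}
    chain = [span.service]
    while span.parent_id is not None:
        span = by_id[span.parent_id]
        chain.append(span.service)
    return list(reversed(chain))


def self_times(spans: List[Span]) -> Dict[str, float]:
    """Time spent in each service itself, excluding its direct children."""
    remaining = {s.span_id: s.duration_ms for s in spans}
    for s in spans:
        if s.parent_id in remaining:
            remaining[s.parent_id] -= s.duration_ms
    return {s.service: remaining[s.span_id] for s in spans}


# Simulate one request: frontend -> checkout -> database (made-up durations).
frontend = start_span("frontend", {})
frontend.duration_ms = 250.0
checkout = start_span("checkout", outgoing_headers(frontend))
checkout.duration_ms = 220.0
database = start_span("database", outgoing_headers(checkout))
database.duration_ms = 200.0
collected = [frontend, checkout, database]

path = ancestry(collected, database)
times = self_times(collected)
slowest = max(times, key=times.get)
```

In this simulated trace, all three spans share one trace ID, the reconstructed path runs from `frontend` through `checkout` to `database`, and subtracting each child's duration from its parent shows that most of the time is spent in `database`. A real tracing system would also have to handle sampling, clock skew, and parallel calls.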
For microservices deployment to continue to scale with the business, there needs to be greater standardization. There’s already an OpenTracing API standard that reduces the time-to-value for distributed tracing, and service mesh technology like linkerd and Envoy speeds up integration further. Standardization promotes sanity through improved workflows, reduced instrumentation burden, and operational efficiency. Standards that directly benefit developers are the ones that survive, and OpenTracing has been growing of its own accord.