Talking Uber-Level Monitoring With Martin Mao of M3 and Chronosphere


Find out why Uber engineers decided to leave Uber and focus on building Chronosphere, the company that enables enterprises to take advantage of M3.


Chronosphere is the brainchild of two ex-Uber engineers, Martin Mao (CEO) and Rob Skillington (CTO).

I recently spoke with Martin Mao, and you can hear the full interview below.

While working at Uber and undertaking a Kubernetes and cloud-native architecture adoption, the duo realized that there were no tools available that could handle and store all the monitoring data produced by such a setup, let alone do anything useful with it. With this realization, they built their own open-source solution known as M3 and scaled it to one of the largest monitoring systems.

In 2019, the pair decided to leave Uber and focus on building Chronosphere, the company that enables enterprises to take advantage of M3.

How M3 Differs From Other Monitoring Solutions

While plenty of monitoring solutions claim to be enterprise-grade and able to handle tens of millions of time series, many fall short of the data storage requirements of large organizations. As technology stacks become more complex, even that scale isn't enough, which is where M3 comes in, with its ability to handle tens of billions of time series.

M3 is also reliable. While most other monitoring solutions run on a single cloud provider, Chronosphere runs across multiple regions and multiple cloud providers for ultimate reliability.

How Is M3 Able to Handle so Much Data?

The first monitoring solution implemented at Uber used off-the-shelf open source technologies, but these weren't tailor-made for time-series use cases. The team built something from the ground up as there was nothing available that was up to the task.

The team started with a new time-series database as the underlying storage engine, then layered on a highly reliable ingestion pipeline and a scalable query engine capable of querying those billions of time series.

It's hard to know precisely how M3 achieves this without diving into code. On initial inspection, it may be due to the component-based architecture, with M3 split into four components:

  • A distributed time-series database, M3DB, that provides scalable storage for time-series data and a reverse index.
  • A sidecar process, M3Coordinator, that allows M3DB to act as the long-term storage for Prometheus.
  • A distributed query engine, M3Query, with native support for PromQL and Graphite (M3QL coming soon).
  • An aggregation tier, M3Aggregator, that runs as a dedicated metrics aggregator/downsampler, allowing metrics to be stored at various retentions and at different resolutions.
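
To make the idea of downsampling concrete, here is a minimal Python sketch of what storing a metric at a coarser resolution means: raw samples are grouped into fixed-width time buckets and each bucket is reduced to one value. This is an illustration only, not M3Aggregator's implementation; the `downsample` function, the averaging strategy, and the sample data are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def downsample(samples, resolution_s):
    """Average (timestamp_s, value) samples into fixed-width time buckets.

    Hypothetical helper for illustration; a real aggregator would also
    support other reductions (sum, max, percentiles) and streaming input.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        # Align each timestamp to the start of its bucket.
        bucket_start = ts - (ts % resolution_s)
        buckets[bucket_start].append(value)
    # One output point per bucket, ordered by time.
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Five raw samples taken at 10-second intervals (made-up data).
raw = [(0, 1.0), (10, 3.0), (20, 5.0), (60, 10.0), (70, 20.0)]
one_minute = downsample(raw, 60)
print(one_minute)  # [(0, 3.0), (60, 15.0)]
```

Keeping the raw series for a short retention and the one-minute series for a longer one is the kind of trade-off the "various retentions and different resolutions" wording describes.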

You can run M3 as a single node, using Docker or a manual install, or as a cluster with Kubernetes (or a similar orchestrator).

Once it is running, you can interact with M3 through REST APIs or gRPC endpoints.
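
As a rough sketch of the REST side, the Python snippet below builds a Prometheus-style range query URL and fetches the JSON response. It assumes a local coordinator listening on `http://localhost:7201` and exposing a Prometheus-compatible `/api/v1/query_range` endpoint; check both against your own deployment. The helper names `build_query_range_url` and `fetch_json`, the metric `up`, and the timestamps are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_range_url(base_url, query, start, end, step):
    """Build a Prometheus-style range query URL (endpoint path is an assumption)."""
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


def fetch_json(url, timeout_s=5.0):
    """Perform a GET request and decode the JSON body."""
    with urlopen(url, timeout=timeout_s) as response:
        return json.load(response)


# Assumed local endpoint and a made-up metric name and time window.
url = build_query_range_url("http://localhost:7201", "up", 1700000000, 1700000600, "15s")
print(url)
# With a node actually running, the result could be fetched with:
# result = fetch_json(url)
```

Separating URL construction from the network call keeps the query logic easy to test without a running cluster.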

Fit for Enterprises

Out of the box, M3 provides bare-bones monitoring essentials, like dashboards and graphs for tracking time-series data. There's also an alerting engine that sends notifications based on configured thresholds and ties into standard notification engines.

Chronosphere takes M3 and adds a bunch of proprietary enterprise use case features, including:

  • Multi-tenancy controls mapped to organizational structure - adds resource allocation for improved stability and enables teams to omit metrics from other parties.
  • Visualizations, an alerting engine and analytics tooling to make the data useful.
  • Fully hosted, run and managed across multiple cloud providers for ultimate reliability.
  • One-click agent deployment — goes out and discovers every endpoint metric and ingests them automatically.
  • Pre-generated dashboards based on endpoint metrics.
  • Anomaly detection — runs in the background and automatically generates alert thresholds based on historical data. Great for eCommerce businesses that experience abnormal traffic on events like Black Friday.
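
The interview does not describe how Chronosphere derives thresholds from history, so the Python sketch below shows only one common textbook approach as an illustration: set the alert threshold at the historical mean plus a multiple of the standard deviation. The function names, the multiplier `k`, and the request counts are assumptions for the example, not the product's algorithm.

```python
from statistics import mean, stdev


def threshold_from_history(history, k=3.0):
    """Return mean + k * sample standard deviation of past values.

    Illustrative rule only; production systems usually also account for
    seasonality (time of day, day of week) and trend.
    """
    return mean(history) + k * stdev(history)


def is_anomalous(value, history, k=3.0):
    """Flag a new value that exceeds the history-derived threshold."""
    return value > threshold_from_history(history, k)


# Made-up requests-per-second readings from a quiet period.
history = [100.0, 102.0, 98.0, 101.0, 99.0, 100.0, 103.0, 97.0]
print(threshold_from_history(history))  # 106.0
print(is_anomalous(150.0, history))     # True
print(is_anomalous(104.0, history))     # False
```

A static rule like this would fire on a Black Friday spike, which is why a threshold that adapts to historical patterns for the same kind of event is more useful than one fixed number.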

What's Next for the $11m Startup?

Chronosphere's growth was initially kick-started because the team already had M3, and people were using it, speeding their time to market. As a result, they were able to raise $11 million through Series A funding.

With more and more businesses migrating to technologies like Kubernetes and containers, Chronosphere is looking to step in and relieve two major pain points: unreliability and cost at scale.

Even in its Uber days, M3 was used to monitor all of the organization's products across multiple cities, supporting real-time monitoring of not just the technology but also the operations side of the business, so there's plenty of potential to do the same for other enterprises.

While the team has no true validation of just how widely Chronosphere is being used, it is currently running in production at 15 of the Forbes Global 2000 companies (that they know of).

Over the next six months, the Chronosphere team wants to upstream many features from the paid solution (Chronosphere) back into the open-source M3 community.

More Information

The Chronosphere team will be at KubeCon Europe, plus they'll be holding some DevOps days in New York and Seattle this year. Stay tuned to their social channels for more info.

