The Cost of Knowing: When Observability Becomes the Outage

Observability costs spiral when teams optimize for visibility, not cost. Fix it by making spend visible, sampling aggressively, and cutting low-value data.

May. 13, 26 · Opinion

There's a particular kind of smugness that infects teams with mature observability stacks. Dashboards everywhere. Latency percentiles, error budgets, trace waterfalls with microsecond resolution — the whole cathedral. And then the AWS bill arrives and someone in finance flags a line item that's larger than the EC2 spend, and suddenly the cathedral looks less like engineering excellence and more like a money furnace nobody was watching.

This is not a hypothetical.

The core failure isn't technical. It's epistemic. Teams build instrumentation to answer operational questions — is this service healthy, where is the latency, what broke at 3am — and they never ask the parallel question that finance is silently screaming: what does it cost to know all of this? Cost is treated as someone else's ledger entry. FinOps is a separate team. The engineers ship features, the accountants reconcile invoices. That boundary is where the money goes to die.

How the Pile Grows

Walk through the mechanics of a typical ingestion pipeline and the inefficiency becomes almost tragicomic. A service emits logs. Each log line hits a collector — Fluentd, Vector, the Datadog agent, whatever — which buffers and ships to a managed backend. That backend charges per ingested gigabyte. So far, so ordinary.

Now add carelessness. A developer is debugging a gnarly race condition in staging, sets the log level to DEBUG, ships it. Nobody changes it back. Production is now emitting stack traces for successful health checks. Every 15-second Kubernetes liveness probe gets a multi-line log entry. Multiply that by 200 pods. Multiply that by 30 days. What was a 50GB/month ingestion budget becomes 800GB without a single feature change, without a single bug introduced. Just entropy and inattention.

This is the DEBUG-in-prod failure mode and it is embarrassingly common. One Reddit thread surfaced an engineer describing their Datadog bill as "watching a slow-motion bank robbery that you authorized yourself" — which is honestly accurate. Datadog's per-host pricing is the visible line; the log ingestion overage is the one that blindsides you.

High-cardinality metrics are a subtler variant. Prometheus's data model is beautiful in theory — labels compose arbitrarily, every combination becomes a series, queries over millions of series are fast. What no one warns you about is that custom metrics with high-cardinality label dimensions — say, user_id or request_id as a label value — can explode the number of active time series into the tens of millions. Managed Prometheus (Grafana Cloud, Amazon Managed Service for Prometheus) charges by active series. An intern adds a metrics label. Six million new series. Invoice goes up by $3,000/month. Nobody notices for two billing cycles because the alert threshold was set in absolute dollars, not as a rate-of-change.
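
The arithmetic behind that explosion is just multiplication. Here is a minimal sketch, assuming each label varies independently (an upper bound, since real label combinations are usually sparser); the label names, the counts, and the 25% growth threshold are made up for illustration. It also shows the rate-of-change check that would have caught the jump within one evaluation instead of two billing cycles.

```python
from math import prod


def estimated_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on active series for one metric:
    the product of the number of distinct values per label."""
    return prod(label_cardinalities.values())


def growth_alert(previous: int, current: int, max_growth: float = 0.25) -> bool:
    """Fire on relative growth between two measurements, not on an absolute level.
    max_growth=0.25 is an illustrative threshold, not a recommendation."""
    if previous <= 0:
        return current > 0
    return (current - previous) / previous > max_growth


# Hypothetical label cardinalities for a single metric.
before = estimated_series({"method": 5, "status": 6, "pod": 200})  # 6,000
after = estimated_series(
    {"method": 5, "status": 6, "pod": 200, "user_id": 1_000}
)  # 6,000,000

print(before, after, growth_alert(before, after))
```

The same check works on dollars as well as on series counts: feed it yesterday's and today's spend and alert on the ratio.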

Distributed tracing at full fidelity is the most expensive habit of all. Take a service at 1,000 RPS producing traces with 30 spans each, retained for 30 days. At Jaeger-on-S3 rates the storage alone is manageable, but if you're shipping those traces to Honeycomb or Datadog APM, you're paying per span. Do the math: 30,000 spans/second, 86,400 seconds/day, $0.000001 per span (rough order-of-magnitude vendor pricing). That's about $2,592/day, or roughly $77,760 over a 30-day month, for one service. A microservices shop with 40 services and comparable traffic has a tracing bill that starts eclipsing the application infrastructure it's supposed to monitor.
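
If you want to rerun that estimate with your own traffic, a small sketch is enough. It assumes flat per-span pricing with no free tier, no volume discounts, and no sampling, which real vendor contracts rarely match exactly; the function name and parameters are illustrative.

```python
SECONDS_PER_DAY = 86_400


def daily_span_cost(
    requests_per_second: float, spans_per_trace: int, price_per_span: float
) -> float:
    """Cost of one day of unsampled tracing under flat per-span pricing.
    Spans per second is requests per second times spans per trace."""
    spans_per_day = requests_per_second * spans_per_trace * SECONDS_PER_DAY
    return spans_per_day * price_per_span


# Inputs from the paragraph above: 1,000 RPS, 30 spans per trace,
# and an order-of-magnitude price of $0.000001 per span.
daily = daily_span_cost(1_000, 30, 0.000001)
monthly = daily * 30  # 30-day month
print(f"${daily:,.2f}/day, ${monthly:,.2f}/month")
```

The spans-per-trace multiplier is the term people forget: cost tracks spans per second, not requests per second.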

Gartner's finding that over half of observability spend goes to logs isn't surprising if you've sat inside any enterprise. Log data is undisciplined by nature — it's the catch-all for everything developers couldn't predict at design time. Metrics have schemas; logs have vibes.

The Architecture of Accidental Spend

Some organizations run Splunk, Datadog, and CloudWatch simultaneously. Sometimes this is intentional: different teams, different tools, organic growth. Often it's just that nobody ever decommissioned the old stack when the new one arrived. Now you're ingesting the same log streams into three backends, paying three ingestion costs, and maintaining three sets of dashboards that almost certainly diverge over time and tell different stories about the same events.

This is the "10+ tools simultaneously" problem Gartner keeps documenting. It's not stupidity. It's the accumulated residue of organizational change — an acquisition, a re-org, a new VP of Platform who liked a different vendor. The engineers who understand the full topology of telemetry pipelines are rarely the ones who control procurement. The people who control procurement have a dashboard of invoices, not a map of data flows.

What makes this especially pernicious is that observability systems are self-concealing. The logs from your log aggregator don't usually appear in your log aggregator. The metrics from your metrics system don't feed back into your dashboards unless someone deliberately wires them up. You cannot accidentally discover that your observability stack costs 12% of total infrastructure — you have to go looking, and looking requires someone to first care.

What a Careful Builder Changes on Monday

Pragmatically, the first thing is just inventory. Not glamorous, but essential. Enumerate every telemetry pipeline. Where does data originate, what agent collects it, where does it land, who queries it, what does each destination cost? This is painful because the answer usually involves spreadsheets, Slack messages to five different teams, and at least one system that nobody is entirely sure still exists.

Once you have the map, the second thing is tagging — not at the vendor level, but at the emission level. OpenTelemetry gives you resource attributes. Use them. Tag every metric, trace, and log with the owning team, the service name, the environment, and if possible the business domain. This is the prerequisite for showback: the practice of showing teams what their telemetry is actually costing. Showback is not chargeback — you're not billing internal teams — but when a team sees that their service's logs cost $4,200/month, behavior changes. Engineers start questioning whether every SQL query needs to emit a full parameter trace in production.
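
Once the tags exist, the showback number itself is a group-by. This is a rough sketch, assuming you can export per-service ingestion volumes along with their resource attributes; the record shape, the "team" key, and the flat price per gigabyte are all hypothetical and will differ by vendor.

```python
from collections import defaultdict


def showback(records: list[dict], price_per_gb: float) -> dict[str, float]:
    """Sum ingestion cost per owning team.
    Records without a team tag land in an 'untagged' bucket, which is
    itself a useful number: it measures how much spend has no owner."""
    totals: dict[str, float] = defaultdict(float)
    for rec in records:
        team = rec.get("team", "untagged")
        totals[team] += rec["gb_ingested"] * price_per_gb
    return dict(totals)


# Hypothetical monthly export, one row per service.
records = [
    {"team": "payments", "service": "checkout", "gb_ingested": 120.0},
    {"team": "payments", "service": "ledger", "gb_ingested": 40.0},
    {"service": "legacy-cron", "gb_ingested": 75.0},  # nobody tagged this one
]
for team, dollars in sorted(showback(records, price_per_gb=0.5).items()):
    print(f"{team}: ${dollars:,.2f}")
```

The untagged bucket tends to be where the inventory from the previous step pays off first.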

Sampling is the highest-leverage intervention, and it's under-used because the cultural instinct is "what if we miss something important?", which is understandable but poorly calibrated. Head-based sampling (deciding at the start of a trace, before any spans are recorded, whether to keep it) is cheap to implement but blunt: you sample 10% and hope nothing interesting happened in the other 90%. Tail-based sampling is smarter and architecturally heavier: you buffer spans in a collector, wait to see if the trace ends with an error or high latency, and then decide whether to keep it. Errors and outliers get full fidelity. The unremarkable 200-OK in 12ms gets dropped. This is how you can reach 90%+ cost reduction on tracing while retaining most of what you'd actually investigate.
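
The tail-based decision is simple once the whole trace is in hand; the hard part is the buffering, which this sketch skips. It is not any collector's actual API, only the keep-or-drop logic. The Span type, the 500 ms threshold, and the 1% baseline for unremarkable traces are assumptions for illustration (set the baseline to 0 to drop every boring trace, as described above).

```python
import random
from dataclasses import dataclass


@dataclass
class Span:
    duration_ms: float
    is_error: bool = False


def keep_trace(
    spans: list[Span],
    latency_threshold_ms: float = 500.0,
    baseline_rate: float = 0.01,
    rng=random.random,
) -> bool:
    """Decide after the trace has completed.
    Errors and slow traces are always kept; the rest is sampled at baseline_rate."""
    if any(s.is_error for s in spans):
        return True
    # The longest span stands in for end-to-end latency here; a real
    # implementation would look at the root span's duration.
    slowest = max((s.duration_ms for s in spans), default=0.0)
    if slowest >= latency_threshold_ms:
        return True
    return rng() < baseline_rate


print(keep_trace([Span(12.0), Span(3.0, is_error=True)]))  # error: kept
print(keep_trace([Span(900.0)]))                           # slow: kept
print(keep_trace([Span(12.0)], baseline_rate=0.0))         # boring: dropped
```

Passing the random source in as a parameter keeps the decision testable, which matters when the function decides what evidence survives an incident.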

For logs, the decision hierarchy should be explicit: what log lines would an engineer actually look at during an incident? Health check successes? Almost certainly not. Authentication failures? Yes. Slow database queries exceeding a threshold? Yes. The rest is probably informational logging that belongs to a cheaper tier — ship to S3, query with Athena when you need it, not to the real-time expensive backend. There's a 5x to 20x cost difference between hot storage (indexed, queryable in milliseconds) and cold storage (Parquet on S3, query in seconds). Most "just in case" logs only get queried during incidents, and incidents can tolerate a few seconds of query latency.
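
Writing the hierarchy down as code forces the explicitness. The sketch below assumes structured log events with hypothetical "kind", "status", and "duration_ms" fields, and it assumes successful health checks are dropped outright instead of archived; in practice this logic would usually live in the collector's routing config, not in application code.

```python
def route_log(event: dict, slow_query_ms: float = 1_000.0) -> str:
    """Return 'drop', 'hot', or 'cold' for one structured log event.
    'hot' is the indexed real-time backend; 'cold' is cheap object storage."""
    kind = event.get("kind")
    if kind == "health_check" and event.get("status") == "ok":
        return "drop"  # nobody reads these during an incident
    if kind == "auth_failure":
        return "hot"
    if kind == "db_query" and event.get("duration_ms", 0.0) > slow_query_ms:
        return "hot"
    return "cold"  # everything else: queryable later, just not instantly


print(route_log({"kind": "health_check", "status": "ok"}))
print(route_log({"kind": "auth_failure", "user": "alice"}))
print(route_log({"kind": "db_query", "duration_ms": 2_500.0}))
print(route_log({"kind": "db_query", "duration_ms": 8.0}))
```

Defaulting to the cold tier means a forgotten log type costs little, where defaulting to hot is how the pile grows.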

Prometheus remote write costs money when you're pushing to a managed backend. Not all metrics are equally valuable. Before remote-writing everything, run a cardinality analysis. promtool tsdb analyze and similar tools will show you which metric names and label combinations are generating the most series. Frequently the top five metrics by cardinality account for 60-70% of total series. Dropping or summarizing those (recording a histogram instead of storing every individual user_id-labeled gauge) can halve the managed metrics bill without touching dashboards.
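
To check the concentration claim against your own numbers, a few lines over the analysis output will do. This sketch assumes you have already reduced that output to a mapping of metric name to active series count; the metric names and counts below are invented.

```python
def top_share(
    series_by_metric: dict[str, int], n: int = 5
) -> tuple[list[tuple[str, int]], float]:
    """Return the n metrics with the most series and their share of all series."""
    total = sum(series_by_metric.values())
    top = sorted(series_by_metric.items(), key=lambda kv: kv[1], reverse=True)[:n]
    share = sum(count for _, count in top) / total if total else 0.0
    return top, share


# Hypothetical series counts per metric name.
counts = {
    "http_request_duration_seconds_bucket": 420_000,
    "user_session_gauge": 310_000,
    "queue_depth": 12_000,
    "process_cpu_seconds_total": 800,
    "build_info": 40,
}
top, share = top_share(counts, n=2)
print(top, f"{share:.0%}")
```

If the share comes back small, cardinality is not your problem and the money is somewhere else, most likely in logs.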

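The Prometheus config snippet in the canonical version of this article is technically correct but a bit toothless on its own. What matters is the write_relabel_configs block under remote_write, which filters series before they are sent to the backend (metric_relabel_configs is the scrape-time equivalent and belongs under scrape_configs):

    remote_write:
      - url: https://your-managed-backend/write
        write_relabel_configs:
          # Drop every series whose metric name matches the regex
          # before it is shipped to the managed backend.
          - source_labels: [__name__]
            regex: 'some_high_cardinality_metric.*'
            action: drop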

Drop the metrics you've determined are low-value. Not "might be low-value." Are demonstrably, historically low-value — nobody has opened a dashboard querying them in six months, there are no alerts dependent on them, the runbook doesn't reference them. Delete aggressively. You can always re-add them.
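
"Demonstrably low-value" can be computed as a set difference, at least as a first pass. The sketch assumes you can extract the metric names referenced by dashboards, alert rules, and runbooks into sets; how you extract them depends on your tooling and is not shown, and the names below are hypothetical.

```python
def drop_candidates(
    emitted: set[str],
    dashboard_metrics: set[str],
    alert_metrics: set[str],
    runbook_metrics: set[str],
) -> list[str]:
    """Metrics that are emitted but referenced nowhere anyone looks."""
    referenced = dashboard_metrics | alert_metrics | runbook_metrics
    return sorted(emitted - referenced)


candidates = drop_candidates(
    emitted={"http_requests_total", "user_session_gauge", "gc_pause_seconds"},
    dashboard_metrics={"http_requests_total"},
    alert_metrics={"http_requests_total", "gc_pause_seconds"},
    runbook_metrics=set(),
)
print(candidates)
```

Treat the result as a review list, not a delete list: ad hoc queries leave no trace in dashboards, so cross-check against query history before dropping anything.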

The SLO Nobody Writes

Most SRE literature focuses on reliability SLOs — availability, latency, error rate. There's almost no discussion of observability SLOs, which is strange given the cost exposure. The right framing is:

Observability spend as a percentage of total infrastructure spend should not exceed X%.

What's X? Depends on the company. A startup with complex distributed systems and aggressive debugging needs might tolerate 20%. A mature org with stable systems and well-understood failure modes should probably be under 10%. If you're at 30%+, some of your services probably cost more to watch than to run, and that is a policy failure.

Expressing this as a Prometheus metric — cost_observability_dollars_total / cost_infrastructure_dollars_total — requires feeding billing data into your metrics system, which is a thing most teams haven't done and most vendors don't make easy. AWS Cost Explorer has an API. Azure Cost Management has an API. GCP Billing Export goes to BigQuery. None of these are plug-and-play integrations with Grafana, but the engineering work is days, not weeks, and the result is a dashboard where finance and platform engineering look at the same numbers in the same place. That alignment alone is worth the effort.
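
The check itself is trivial once the two numbers exist; the work is in getting them. As a sketch, assuming the billing export has already been reduced to two dollar totals for the same period, and using the 10% figure above as an example budget:

```python
def observability_ratio(
    observability_dollars: float, infrastructure_dollars: float
) -> float:
    """Observability spend as a fraction of total infrastructure spend."""
    if infrastructure_dollars <= 0:
        raise ValueError("infrastructure spend must be positive")
    return observability_dollars / infrastructure_dollars


def slo_breached(
    observability_dollars: float,
    infrastructure_dollars: float,
    max_ratio: float = 0.10,
) -> bool:
    """True when the ratio exceeds the agreed budget (X in the SLO above)."""
    return observability_ratio(observability_dollars, infrastructure_dollars) > max_ratio


# Hypothetical monthly totals pulled from a billing export.
print(f"{observability_ratio(12_000, 100_000):.0%}", slo_breached(12_000, 100_000))
```

Whether the denominator includes the observability spend itself is a definitional choice; pick one and keep it consistent, or the trend line means nothing.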

The Uncomfortable Trade-Off

Here is the honest version of the trade-off:

Full-fidelity telemetry gives you the best chance of answering novel questions during incidents — questions you didn't anticipate when you designed your dashboards. That's genuinely valuable. The argument for keeping everything is not wrong. Debugging a subtle distributed transaction bug where you needed a span that got sampled away is a real experience that creates a real incentive to keep everything.

But that argument only holds if you're actually using the data. Pull the query logs from your managed observability backend. How many unique queries ran against data older than 7 days in the last 90 days? Usually the answer is "very few, mostly compliance-related." The 30-day retention you're paying for is largely insurance you never collect on.
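
That question can be answered mechanically if your backend exposes a query audit log. The sketch assumes each entry carries the query text, when it ran, and the start of the time range it asked for; those field names are hypothetical, and many backends expose this differently or not at all.

```python
from datetime import datetime, timedelta


def old_data_queries(
    query_log: list[dict],
    now: datetime,
    lookback_days: int = 90,
    age_days: int = 7,
) -> int:
    """Count unique queries run within the lookback window whose time
    range reached back further than age_days before the query ran."""
    seen: set[str] = set()
    for q in query_log:
        if now - q["ran_at"] > timedelta(days=lookback_days):
            continue  # outside the window we are auditing
        if q["ran_at"] - q["range_start"] > timedelta(days=age_days):
            seen.add(q["text"])
    return len(seen)


now = datetime(2026, 5, 1)
log = [
    {"text": "errors by service", "ran_at": now - timedelta(days=1),
     "range_start": now - timedelta(days=2)},
    {"text": "audit export", "ran_at": now - timedelta(days=3),
     "range_start": now - timedelta(days=28)},
    {"text": "audit export", "ran_at": now - timedelta(days=10),
     "range_start": now - timedelta(days=35)},
]
print(old_data_queries(log, now))
```

If the count is small and mostly compliance-driven, that is the case for moving older data to a cheaper tier instead of paying hot-storage rates for it.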

Self-hosted OSS (Prometheus, Loki, Grafana, Tempo) shifts the cost from SaaS invoices to engineering time and cloud storage. S3 is cheap. Maintaining a Loki cluster is not free, but if you already have Kubernetes competency, the marginal operational overhead is bounded. The meaningful risk is the team rotation problem: the engineer who built the stack leaves, the replacement doesn't know Loki's chunk cache configuration, and 18 months later nobody can explain why the query performance degraded or how the retention is actually configured. Writing documentation and runbooks for the observability stack itself, not just the applications, is a discipline most teams skip.

Managed SaaS is expensive and operationally pleasant. Self-hosted is cheaper and operationally demanding. There's no free lunch. The correct choice depends on team size, platform maturity, and whether the observability stack is a core competency or a supporting function.

What Actually Happens

Companies that get this right tend to have done one specific thing: they made telemetry cost visible to engineers before the invoice arrives. Not as an annual budget conversation. Not as a postmortem after an $80,000 quarter. As a live dashboard, refreshed daily, visible on the same screen as error rates and latency percentiles. When the cost of logging a query is as visible as the latency of executing it, the engineering judgment changes.

That's it, really. The Prometheus configs, the sampling strategies, the cold storage tiers — those are implementations. The prerequisite is just seeing the cost while you're making the decision, not six weeks later when someone in finance exports a spreadsheet.

The companies that don't get this right keep treating observability as infrastructure — a background cost, a utility, something you pay and forget. Then the bill becomes the incident.

Engineering Observability Telemetry

Opinions expressed by DZone contributors are their own.
