Architecting Observability in Kubernetes with OpenTelemetry and Fluent Bit

Microservices solve scalability problems but introduce troubleshooting nightmares. Here is a practical architectural pattern to unify logs, metrics, and traces.

Dippu Kumar Singh

Jan. 13, 26 · Analysis

In the era of monolithic architectures, troubleshooting was relatively straightforward: SSH into the server, grep the log files, and check CPU usage with top.

In the cloud-native world — specifically within Kubernetes — this approach is obsolete. Applications are split into dozens of microservices, pods are ephemeral (spinning up and terminating automatically), and a single user request might traverse ten different nodes. When a transaction fails, where do you look?

Without a robust observability strategy, you are essentially flying blind. This article outlines an architectural pattern for establishing observability with OpenTelemetry, Fluent Bit, and structured logging.

The Problem: The Distributed Blind Spot

Moving to a microservices architecture on Kubernetes introduces three distinct visibility challenges:

Ephemeral Data: When a pod is deleted, evicted, or scaled down, its container logs are removed from the node along with it. You cannot rely on local disk storage.
Fragmented Context: A transaction flows through Service A, Service B, and Service C. If Service C fails, Service A may only log a generic "500 Error." There is no easy way to link the error in Service C back to the request in Service A.
Siloed Telemetry: Logs, metrics, and traces often live in different tools, making correlation nearly impossible during a 3 a.m. outage.

To solve this, we must move from simple monitoring (Is the system up?) to observability (Why is the system behaving this way?).

The Architecture: A Unified Telemetry Pipeline

The solution involves building a pipeline that standardizes data generation at the source and centralizes it for analysis.

The stack:

Logs: Fluent Bit (collection) → Object storage (S3/Athena) or a log aggregator
Metrics: OpenTelemetry Collector → Prometheus / CloudWatch
Tracing: OpenTelemetry SDK → Distributed tracing backend

[Figure: high-level data flow of the telemetry pipeline]

Step 1: Structured Logging and Context Propagation

The first step is not infrastructure but code. Free-form text logs are difficult to parse programmatically, so applications should emit logs as JSON, one object per line.

More importantly, we must inject trace context into those logs. By adopting the W3C Trace Context standard, every service passes a traceparent header to the next service.
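
To make the header concrete, here is a minimal sketch of reading and forwarding traceparent in plain Python. It assumes version 00 of the W3C format (version, trace ID, parent span ID, and flags as lowercase hex joined by hyphens). The function names parse_traceparent and child_traceparent are hypothetical, and in practice an OpenTelemetry SDK would normally handle this step.

```python
import re
import secrets

# W3C Trace Context, version 00: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Return (trace_id, parent_span_id, flags), or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec treats version ff and all-zero IDs as invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id, parent_id, flags


def child_traceparent(incoming):
    """Build the header for an outgoing call: same trace ID, fresh span ID."""
    parsed = parse_traceparent(incoming) if incoming else None
    if parsed is None:
        # No valid upstream context: start a new trace, marked as sampled.
        trace_id, flags = secrets.token_hex(16), "01"
    else:
        trace_id, _, flags = parsed
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-{flags}"
```

Calling child_traceparent on an incoming header keeps the 32-character trace ID unchanged and swaps in a new 16-character span ID, which is what lets every hop be tied back to the same request.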

The “Golden Key”: Trace IDs

When an application logs an event, it must include the current trace_id and span_id. This is the glue that binds logs to traces.
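
One way to do this with only the Python standard library is a custom logging formatter, sketched below. The current_trace context variable is a hypothetical stand-in for whatever your tracing SDK exposes as the active span, and JsonTraceFormatter and build_logger are illustrative names, not part of any library.

```python
import contextvars
import json
import logging
from datetime import datetime, timezone

# Hypothetical holder for the active (trace_id, span_id) pair; a tracing SDK
# would normally supply this.
current_trace = contextvars.ContextVar("current_trace", default=None)


class JsonTraceFormatter(logging.Formatter):
    """Render each record as one JSON object carrying trace_id and span_id."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        trace_id, span_id = current_trace.get() or (None, None)
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
            "trace_id": trace_id,
            "span_id": span_id,
        }
        return json.dumps(entry)


def build_logger(service, stream):
    """Return a logger that writes one JSON line per event to the given stream."""
    logger = logging.getLogger(service)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonTraceFormatter(service))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

With this in place, a request handler would set current_trace once when the request arrives, and every later logger.error(...) call in that context carries the same IDs without the caller passing them explicitly.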

Example of a structured log entry:

    {
      "timestamp": "2024-10-27T10:00:00Z",
      "level": "ERROR",
      "service": "payment-service",
      "message": "Database connection timeout",
      "trace_id": "0af7651916cd43dd8448eb211c80319c",
      "span_id": "b7ad6b7169203331",
      "http_status": 503
    }

By querying for trace_id: 0af7651916cd43dd8448eb211c80319c, you can instantly retrieve every log line generated by that specific user request across all microservices.
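
The query itself is simple once every line is JSON. The sketch below filters raw log lines in memory, as a stand-in for whatever query engine you use. The function name logs_for_trace is hypothetical, and the sort assumes all services write UTC timestamps in the same ISO 8601 format.

```python
import json


def logs_for_trace(lines, trace_id):
    """Return the parsed log entries for one trace, oldest first."""
    matches = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # Skip unstructured lines rather than fail the query.
        if isinstance(entry, dict) and entry.get("trace_id") == trace_id:
            matches.append(entry)
    # ISO 8601 UTC timestamps in a single format sort correctly as strings.
    return sorted(matches, key=lambda e: e.get("timestamp", ""))
```

Because the filter key is the trace ID rather than a service name or host, the result spans every service that touched the request.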

Step 2: The Log Shipping Layer (Fluent Bit)

Because a pod's logs disappear with the pod, they must be shipped off the node continuously. Fluent Bit is a widely used choice here because of its small footprint and high throughput.

Deployed as a DaemonSet (one agent per node), Fluent Bit tails container log files, parses the JSON, and flushes them to an external destination.
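
To illustrate what "parses the JSON" involves, the sketch below unwraps one container log line in plain Python. It assumes the Docker json-file format, where the runtime wraps each line in an outer object with log, stream, and time fields; containerd and CRI-O write a different text format. This is an illustration of the idea, not Fluent Bit's implementation, and flatten_container_line is a hypothetical name.

```python
import json


def flatten_container_line(raw_line):
    """Unwrap a Docker json-file log line and merge the app's JSON payload."""
    outer = json.loads(raw_line)
    record = {"stream": outer.get("stream"), "time": outer.get("time")}
    payload = outer.get("log", "").rstrip("\n")
    try:
        inner = json.loads(payload)
    except json.JSONDecodeError:
        inner = None
    if isinstance(inner, dict):
        record.update(inner)  # Structured log: promote its fields.
    else:
        record["log"] = payload  # Plain text: keep the raw message.
    return record
```

After this step, fields such as trace_id sit at the top level of the record, so the storage layer can index or filter on them directly.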

Why Object Storage (S3)?

While sending logs to Elasticsearch or Splunk is common, high-volume systems can generate terabytes of logs per day. A cost-effective pattern is shipping logs to Amazon S3 (or similar object storage) and using a query engine such as Amazon Athena to analyze them on demand.

Sample Fluent Bit configuration:

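    [INPUT]
        # Tail every container log file on the node.
        Name              tail
        Path              /var/log/containers/*.log
        # "docker" parses the Docker json-file format; on containerd or
        # CRI-O nodes the "cri" parser is the usual choice.
        Parser            docker
        Tag               kube.*

    [FILTER]
        # Enrich records with pod metadata from the Kubernetes API.
        Name              kubernetes
        Match             kube.*
        Kube_URL          https://kubernetes.default.svc:443
        # Parse the application's JSON payload into structured fields.
        Merge_Log         On

    [OUTPUT]
        Name              s3
        Match             *
        Bucket            my-app-logs-bucket
        Region            us-east-1
        # Upload once a buffered file reaches this size or this age.
        Total_File_Size   100M
        Upload_Timeout    1m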

Step 3: Unifying Metrics with OpenTelemetry

Historically, metrics were often collected by vendor-specific agents, each with its own format. OpenTelemetry (OTel) standardizes how telemetry is produced and transported.

By deploying the OpenTelemetry Collector, we create a vendor-agnostic way to receive metrics from applications. The pipeline consists of three stages:

Receivers: Ingest data (e.g., OTLP, Prometheus receiver)
Processors: Filter, batch, or enrich data (e.g., adding Kubernetes metadata such as namespace or pod name)
Exporters: Send data to the backend of choice (Prometheus, Datadog, CloudWatch, etc.)

This decoupling allows you to change observability vendors without rewriting a single line of application code.
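
The receiver, processor, and exporter stages can be modeled in a few lines of Python to show why the decoupling works. This is a toy model under stated assumptions, not the Collector's code or API: Metric, enrich, batch, InMemoryExporter, and run_pipeline are all hypothetical names. The attribute keys follow the OpenTelemetry semantic conventions for Kubernetes.

```python
from dataclasses import dataclass, field


@dataclass
class Metric:
    name: str
    value: float
    attributes: dict = field(default_factory=dict)


def enrich(metrics, namespace, pod):
    """Processor stage: attach Kubernetes metadata to every metric."""
    for m in metrics:
        extra = {"k8s.namespace.name": namespace, "k8s.pod.name": pod}
        yield Metric(m.name, m.value, {**m.attributes, **extra})


def batch(metrics, size):
    """Processor stage: group metrics so exporters send fewer requests."""
    chunk = []
    for m in metrics:
        chunk.append(m)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


class InMemoryExporter:
    """Stand-in for a real backend exporter; it only records what it receives."""

    def __init__(self):
        self.batches = []

    def export(self, chunk):
        self.batches.append(chunk)


def run_pipeline(received, exporters, namespace, pod, batch_size=2):
    """Push received metrics through the processors to every exporter."""
    for chunk in batch(enrich(received, namespace, pod), batch_size):
        for exporter in exporters:
            exporter.export(chunk)
```

The application only produces Metric objects. Adding or replacing a backend means changing the exporters list passed to run_pipeline, which mirrors how a Collector configuration change leaves application code untouched.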

The Result: Correlation and Root Cause Analysis

Implementing this architecture fundamentally transforms the troubleshooting workflow.

The "Before" Workflow

1. User reports an error
2. Engineer guesses which service failed
3. Engineer SSHs into nodes or checks scattered log groups
4. Engineer realizes the error is upstream and repeats step 3

Time elapsed: ~4 hours

The "After" Workflow

1. Engineer receives an alert containing a trace_id
2. Engineer queries the trace ID in the central dashboard
3. The distributed trace visualizes the request flow, highlighting the exact span that failed (e.g., “SQL Timeout” in Service C)
4. Engineer clicks the span to view correlated JSON logs for context

Time elapsed: ~10 minutes
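
The last two steps of the "after" workflow amount to a small lookup, sketched below. The span dictionaries (with span_id, parent_id, and status keys) are a hypothetical simplified shape, not the format of any particular tracing backend, and the sketch assumes parent links form a tree with no cycles.

```python
def failed_spans(spans):
    """Return spans marked as errors, deepest in the call chain first."""
    by_id = {s["span_id"]: s for s in spans}

    def depth(span):
        # Count how many ancestors sit between this span and the root.
        levels = 0
        while span.get("parent_id") in by_id:
            span = by_id[span["parent_id"]]
            levels += 1
        return levels

    errors = [s for s in spans if s.get("status") == "ERROR"]
    return sorted(errors, key=depth, reverse=True)


def logs_for_span(log_entries, span):
    """Return the structured log entries emitted inside one span."""
    return [e for e in log_entries if e.get("span_id") == span["span_id"]]
```

Errors usually propagate up the call chain, so the deepest failing span is a reasonable first suspect. Its span_id then selects the log entries that explain what went wrong.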

Conclusion

Observability in cloud-native environments is not about collecting more data — it’s about collecting connected data.

By enforcing structured logging with trace IDs at the application level, using Fluent Bit for efficient log transport, and leveraging OpenTelemetry for standardized metrics, engineering teams can regain control over complex distributed systems. This approach not only reduces Mean Time to Recovery (MTTR) but also establishes a scalable foundation for future growth.

Opinions expressed by DZone contributors are their own.
