Implementing Observability in Distributed Systems Using OpenTelemetry
Instrument a Python Flask service with OpenTelemetry to auto-trace requests, export metrics to Prometheus, and inject trace IDs into logs, all in one setup.
Modern distributed systems demand observability: the ability to understand a system's internal state from its external outputs. Observability is achieved by collecting traces, logs, and metrics to improve performance, reliability, and availability. No single signal is sufficient; it is the combination and correlation of these signals that forms a narrative for root cause analysis.
In monolithic applications, debugging was easier because a single process handled each request. In microservices, a request is spread across many services, which makes it hard to follow a transaction's path. This is where distributed tracing in OpenTelemetry (OTel) shines: it propagates context with each request, so you can trace a transaction across service boundaries.
This means when Service A calls Service B, they share a common trace ID, allowing you to view a single trace spanning multiple services. Similarly, OpenTelemetry can attach unique identifiers to logs, making it easier to correlate log events across services. Overall, OTel provides a unified API for instrumenting code and an ecosystem of instrumentation libraries for frameworks that can automatically capture common operations. It focuses on data generation and collection, while the actual storage and querying of telemetry is handled by backend tools.
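To make the shared trace ID concrete, here is a minimal sketch of the W3C Trace Context `traceparent` header, the HTTP header format OpenTelemetry propagates by default. It uses only the standard library and is not OpenTelemetry's own code; the helper names (`new_traceparent`, `parse_traceparent`, `child_traceparent`) are hypothetical and exist only for illustration.

```python
import re
import secrets

# traceparent layout: version-traceid-parentid-flags, all lowercase hex
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_traceparent(sampled: bool = True) -> str:
    """Start a brand-new trace, as Service A would for an incoming request."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its fields, rejecting malformed values."""
    match = _TRACEPARENT_RE.match(header.strip().lower())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        raise ValueError("all-zero trace or span IDs are invalid")
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(incoming: str) -> str:
    """Continue a trace in Service B: keep the trace ID, mint a new span ID."""
    ctx = parse_traceparent(incoming)
    flags = "01" if ctx["sampled"] else "00"
    return f"00-{ctx['trace_id']}-{secrets.token_hex(8)}-{flags}"
```

Service A would send the result of `new_traceparent()` with its outgoing request, and Service B would pass the received header through `child_traceparent()` before calling the next service. Every hop keeps the same trace ID, which is what lets a backend stitch the spans into one trace.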
Setting Up OpenTelemetry in a Python Microservice
Installation
To get started, install the OpenTelemetry libraries for Python. At minimum, you'll need the API and SDK, plus exporters/instrumentation for your use case. For example:
pip install opentelemetry-api opentelemetry-sdk \
opentelemetry-exporter-prometheus \
opentelemetry-instrumentation-flask
This installs the core OTel API and SDK, the Prometheus metrics exporter, and the Flask instrumentation. You might also install the OTLP exporter (opentelemetry-exporter-otlp), a generic exporter that sends data to an OpenTelemetry Collector or any other OTLP-capable backend. It is also recommended to set a service name so that telemetry from this service is identifiable. This can be done in code or through the OTEL_SERVICE_NAME environment variable. Below, we attach the service name in code as a resource attribute so that traces and metrics are tagged with service.name.
Distributed Tracing With OpenTelemetry
Tracing involves capturing spans that represent units of work in the system. In a microservice, a span could represent an incoming HTTP request, a database query, or an external API call. Spans form a trace when linked together via context propagation. Using OpenTelemetry, we can instrument our Python service to create spans for critical operations and automatically propagate the trace context to downstream services.
First, let's initialize OpenTelemetry tracing in our Python microservice. We create a tracer provider, configure an exporter, and obtain a tracer instance:
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, BatchSpanProcessor
from opentelemetry.sdk.resources import Resource, SERVICE_NAME

# Set up the tracer provider with a service name for identification
trace.set_tracer_provider(TracerProvider(resource=Resource.create({SERVICE_NAME: "order-service"})))
tracer = trace.get_tracer(__name__)

# Configure a span processor with a console exporter (prints trace data to stdout)
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))

# Example: instrument a code block with a span
with tracer.start_as_current_span("process_order"):
    # Simulate processing (e.g., calling another service or performing work)
    time.sleep(0.1)
    # Outgoing HTTP calls made here carry the trace context in their headers
    # only if the HTTP client is instrumented (for example with
    # opentelemetry-instrumentation-requests) or the context is injected manually.
In this snippet, we configured a TracerProvider with a resource attribute service.name="order-service" so that all spans from this service are labeled. We added a BatchSpanProcessor with a ConsoleSpanExporter, which batches spans and prints them to the console as JSON for demonstration. In a real system, you would replace it with an exporter that ships spans to a tracing backend such as Jaeger, for example the OTLP exporter mentioned earlier. The call tracer = trace.get_tracer(__name__) gives us a tracer we can use to start spans. We then start a span named "process_order" using a context manager (start_as_current_span), which automatically ends the span when the block exits. Inside that span, you would put the operation you want to measure.
Metrics Collection and Export (Prometheus Integration)
While tracing shows the path of individual requests, metrics provide aggregated insights into system behavior. OpenTelemetry’s metrics API allows you to define instruments like counters and histograms to record these values.
First, ensure the Prometheus client/exporter is set up. We’ll use OTel’s Prometheus exporter, which works by exposing a /metrics HTTP endpoint that Prometheus will scrape. In code, this is done by creating a PrometheusMetricReader and starting an HTTP server for metrics. Here’s how you can integrate metrics in a Flask microservice:
import time

from flask import Flask, request, g
from prometheus_client import start_http_server
from opentelemetry import metrics
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.resources import Resource, SERVICE_NAME
from opentelemetry.instrumentation.flask import FlaskInstrumentor

# Initialize the metrics provider with the Prometheus exporter (reader)
resource = Resource(attributes={SERVICE_NAME: "order-service"})
reader = PrometheusMetricReader()  # exposes metrics in Prometheus format
provider = MeterProvider(resource=resource, metric_readers=[reader])
metrics.set_meter_provider(provider)
meter = metrics.get_meter(__name__)

# Define metric instruments
request_counter = meter.create_counter(
    name="app_requests_total",
    description="Total number of requests processed",
    unit="1",
)
request_latency = meter.create_histogram(
    name="app_request_latency_ms",
    description="Request latency in milliseconds",
    unit="ms",
)

# Start the Prometheus client HTTP server (e.g., port 8000) for scraping
start_http_server(port=8000, addr="0.0.0.0")

# Flask app and instrumentation
app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)  # auto-instrument Flask for tracing


@app.before_request
def before_request():
    # perf_counter() is monotonic, so it is safer than time.time() for durations
    g.start_time = time.perf_counter()
    # Increment the counter for each incoming request, with the route path as a label
    request_counter.add(1, {"endpoint": request.path})


@app.after_request
def after_request(response):
    # Record the request duration in milliseconds
    duration_ms = (time.perf_counter() - g.start_time) * 1000
    request_latency.record(duration_ms, {"endpoint": request.path})
    return response


# Example route
@app.route("/hello")
def hello():
    return "Hello, World!", 200
In the setup above, we configured a MeterProvider with a PrometheusMetricReader, which makes our OpenTelemetry metrics available to the Prometheus client library in Prometheus format. The HTTP endpoint itself comes from the explicit start_http_server(port=8000) call, which starts the metrics server on port 8000 for Prometheus to scrape. We created two metric instruments: a counter to count the number of requests, and a histogram to track the distribution of request durations.
In the Flask hooks, we use these instruments: at the beginning of each request, we note the start time and increment the counter. After the request is handled, we compute the elapsed time and record it in the histogram, again labeled by the endpoint path. These labels let us break down metrics per route. Keep label cardinality in mind: if paths contain IDs, every distinct path becomes its own time series, so a route template is usually a better label value than the raw path.
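It can help to see how a histogram turns individual latencies into the cumulative "less than or equal" buckets that Prometheus scrapes. The sketch below is a standard-library illustration of that bucketing idea, not the SDK's aggregation code; `ToyHistogram` and the bucket bounds of 10, 50, and 100 ms are assumptions chosen for the example.

```python
import bisect


class ToyHistogram:
    """Counts observations into buckets with inclusive upper bounds."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        # One slot per bound, plus a final overflow slot for +Inf
        self.counts = [0] * (len(self.bounds) + 1)
        self.total = 0.0
        self.count = 0

    def record(self, value):
        # bisect_left returns the index of the first bound >= value,
        # which is the smallest bucket whose upper bound includes it
        idx = bisect.bisect_left(self.bounds, value)
        self.counts[idx] += 1
        self.total += value
        self.count += 1

    def cumulative(self):
        """Return (upper_bound, cumulative_count) pairs, Prometheus style."""
        pairs = []
        running = 0
        for bound, bucket_count in zip(self.bounds + [float("inf")], self.counts):
            running += bucket_count
            pairs.append((bound, running))
        return pairs


latency = ToyHistogram(bounds=[10, 50, 100])
for duration_ms in (4, 12, 48, 51, 250):
    latency.record(duration_ms)

buckets = latency.cumulative()
# buckets == [(10, 1), (50, 3), (100, 4), (inf, 5)]
```

Because each bucket includes everything below it, the last bucket always equals the total observation count. Prometheus estimates quantiles from these cumulative counts, so the choice of bounds determines how precise latency percentiles can be.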
Log Correlation With OpenTelemetry
Logs are the third pillar of observability. They provide detailed event information and error messages. OpenTelemetry can augment logging by injecting trace context into logs, so that you know which trace/span a log entry is associated with.
In Python, the package opentelemetry-instrumentation-logging can automatically enrich Python logging records with trace context. After installing it (pip install opentelemetry-instrumentation-logging), you can enable it with:
from opentelemetry.instrumentation.logging import LoggingInstrumentor
LoggingInstrumentor().instrument(set_logging_format=True)
This will ensure that whenever you call the standard logging functions, if a trace is currently active, the log record will contain the trace and span IDs. For instance, you might see logs like:
INFO [trace_id=0xf4a3b...] Order 123 processed successfully
indicating that the log was emitted during a specific trace. To fully centralize logs, you would forward them to a log backend. One approach is using the OpenTelemetry Collector to collect and export logs.
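The mechanism behind log correlation can be sketched with the standard library alone: look up the active trace ID when a record is created and expose it to the log format. This is an illustration of the idea, not how LoggingInstrumentor is implemented; `current_trace_id`, `TraceIdFilter`, and `build_logger` are hypothetical names, and the trace ID value is made up for the example.

```python
import contextvars
import io
import logging

# Stand-in for the tracing library's notion of the active trace
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


class TraceIdFilter(logging.Filter):
    """Attach the active trace ID (or "-" if none) to every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get() or "-"
        return True  # never drop the record, only enrich it


def build_logger(stream):
    logger = logging.getLogger("order-service-demo")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s [trace_id=%(trace_id)s] %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()
log = build_logger(stream)

token = current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
log.info("Order %s processed successfully", 123)  # emitted inside a "trace"
current_trace_id.reset(token)
log.info("No active trace here")                  # emitted outside any trace
```

With the trace ID in every line, a log backend can filter by `trace_id` to pull together all messages that belong to one request, across services.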
Conclusion
Implementing observability in a microservice architecture is no small feat, but OpenTelemetry greatly simplifies the process by providing a one-stop solution for instrumentation. We have shown how to set up distributed tracing to follow requests across services, how to collect metrics and export them to Prometheus for monitoring, and how to correlate logs with trace context. With these in place, you gain deep visibility into your system. You can monitor performance and identify latency bottlenecks, get alerted on anomalies via metrics, trace requests end-to-end to see where failures occur, and dive into logs for detailed errors. This comprehensive observability is crucial for engineers to effectively maintain and optimize distributed systems.
In summary, OpenTelemetry enables a consistent, portable way to implement observability across distributed systems. Embracing it in your microservices will lead to faster debugging, better performance insights, and more resilient applications. With traces, metrics, and logs at your fingertips, you are no longer flying blind in your distributed architecture; instead, you have the data to understand and improve your system continually.