Debugging Distributed ML Systems
My ML model misclassified groceries as entertainment. Distributed tracing with OpenTelemetry and Jaeger helped me quickly find a caching bug causing it.
Join the DZone community and get the full member experience.
Join For FreeMy ML model for categorizing suddenly started classifying groceries as entertainment expenses. But why? What happened?
I was looking at my personal finance dashboard and noticed something was completely off. The logs from each service looked normal. The health checks were green. Yet somehow, my grocery store purchases were being flagged as entertainment, and my restaurant bills were showing up as utilities.
For some background, I had recently broken my monolith finance tracker into multiple microservices. What used to be a single Flask app with traceable execution had become a distributed puzzle of API calls, Redis caches, and background jobs. When something went wrong, figuring out where and why had become a nightmare.
After spending many late nights debugging issues that seemed to appear from nowhere, I finally implemented tracing using OpenTelemetry and Jaeger. It has honestly transformed my personal project from a debugging headache into something I actually enjoy maintaining.
If you're running your own distributed ML project and finding yourself lost in a sea of logs, this guide will show you how to set up tracing infrastructure that makes debugging manageable instead of purely frustrating.
The Personal Project Debugging Nightmare
Let me give you a picture of what debugging used to look like in my finance tracker. After decomposing my monolith, I had five core services:
- Transaction Processor: Ingests and cleans raw transaction data
- Category Predictor: ML service for transaction categorization
- Spending Analyzer: Computes spending patterns and insights
- Fraud Detector: Identifies suspicious transactions
- Budget Manager: Tracks budgets and sends alerts
A typical transaction processing flow would look like this:
- New transaction comes in via CSV upload or bank API
- Transaction Processor cleans and validates the data
- Category Predictor calls the ML model to classify the transaction
- Fraud Detector checks for suspicious patterns
- Spending Analyzer updates monthly statistics
- Budget Manager checks if any budget limits were exceeded
When this worked smoothly, it felt like magic. When it didn't, debugging was pure pain.
The categorization disaster I mentioned earlier started with nonsensical expense categories. Looking at individual service logs, everything seemed fine:
- Transaction Processor: "Processed transaction ID 12845 ✓"
- Category Predictor: "Classified 'Wholefoods#123' as 'entertainment' ✓"
- Fraud Detector: "No suspicious patterns detected ✓"
- Spending Analyzer: "Updated monthly totals ✓"
But clearly something was off as my grocery shopping was being classified as entertainment. Without a way to trace a single transaction through all these services, I was basically guessing where the problem might be.
OpenTelemetry: The Game Changer for Personal Projects
OpenTelemetry might sound like a overkill for a personal project, but it's worth the setup effort. The basic idea is simple: every request gets a unique trace ID that follows it through your entire system, and each service operation becomes a "span" within that trace.
Here's how I implemented it across my Python services running in Docker containers:
# requirements.txt additions for all services
opentelemetry-api==1.21.0
opentelemetry-sdk==1.21.0
opentelemetry-instrumentation-flask==0.42b0
opentelemetry-instrumentation-requests==0.42b0
opentelemetry-exporter-jaeger-thrift==1.21.0
# shared/tracing.py - Common tracing setup
import os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
def setup_tracing(service_name):
# Set up tracer provider
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)
# Configure Jaeger exporter
jaeger_exporter = JaegerExporter(
agent_host_name=os.getenv("JAEGER_HOST", "jaeger"),
agent_port=int(os.getenv("JAEGER_PORT", "6831")),
)
# Add span processor
span_processor = BatchSpanProcessor(jaeger_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)
# Auto-instrument Flask and requests
FlaskInstrumentor().instrument()
RequestsInstrumentor().instrument()
return tracer
# category_predictor/app.py - Example service implementation
from flask import Flask, request, jsonify
import joblib
import requests
from shared.tracing import setup_tracing
app = Flask(__name__)
tracer = setup_tracing("category-predictor")
# Load the categorization model
model = joblib.load('models/category_classifier.pkl')
@app.route('/categorize', methods=['POST'])
def categorize_transaction():
transaction_data = request.json
with tracer.start_as_current_span("categorize_transaction") as span:
span.set_attribute("transaction.id", transaction_data.get('id'))
span.set_attribute("transaction.amount", transaction_data.get('amount'))
span.set_attribute("merchant", transaction_data.get('description', '')[:50])
try:
# Get user spending patterns for context
user_id = transaction_data.get('user_id')
with tracer.start_as_current_span("fetch_user_context") as context_span:
context_response = requests.get(
f"http://spending-analyzer:5000/patterns/{user_id}"
)
context_span.set_attribute("context.status_code", context_response.status_code)
user_context = context_response.json() if context_response.ok else {}
# Prepare features for ML model
with tracer.start_as_current_span("prepare_features") as feature_span:
features = prepare_transaction_features(transaction_data, user_context)
feature_span.set_attribute("features.count", len(features))
feature_span.set_attribute("features.has_context", len(user_context) > 0)
# Run ML model prediction
with tracer.start_as_current_span("model_prediction") as model_span:
category = model.predict([features])[0]
confidence = model.predict_proba([features]).max()
model_span.set_attribute("prediction.category", category)
model_span.set_attribute("prediction.confidence", float(confidence))
model_span.set_attribute("model.version", "v1.2.3")
span.set_attribute("result.category", category)
span.set_attribute("result.confidence", float(confidence))
return jsonify({
'category': category,
'confidence': float(confidence)
})
except Exception as e:
span.record_exception(e)
span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
return jsonify({'error': str(e)}), 500
def prepare_transaction_features(transaction, user_context):
"""Extract features for the ML model"""
description = transaction.get('description', '').lower()
amount = float(transaction.get('amount', 0))
features = [
len(description),
amount,
1 if 'grocery' in description or 'kroger' in description else 0,
1 if 'restaurant' in description or 'cafe' in description else 0,
user_context.get('avg_grocery_amount', 50.0), # User's average grocery spending
user_context.get('grocery_frequency', 0.1), # How often they buy groceries
]
return features
if __name__ == '__main__':
app.run(host='0.0.0.0', port=5000)
The beauty of this setup is that OpenTelemetry automatically traces HTTP requests between services. I only needed to add manual spans for ML-specific operations like feature preparation and model inference.
Setting Up Jaeger: Trace Dashboard
Getting Jaeger running alongside my finance tracker services was straightforward with Docker Compose:
# docker-compose.yml
version: '3.8'
services:
jaeger:
image: jaegertracing/all-in-one:1.50
ports:
- "16686:16686" # Jaeger UI
- "6831:6831/udp" # Jaeger agent
environment:
- COLLECTOR_OTLP_ENABLED=true
networks:
- finance-network
transaction-processor:
build: ./transaction-processor
environment:
- JAEGER_HOST=jaeger
- JAEGER_PORT=6831
depends_on:
- jaeger
- redis
networks:
- finance-network
category-predictor:
build: ./category-predictor
environment:
- JAEGER_HOST=jaeger
- JAEGER_PORT=6831
depends_on:
- jaeger
networks:
- finance-network
spending-analyzer:
build: ./spending-analyzer
environment:
- JAEGER_HOST=jaeger
- JAEGER_PORT=6831
depends_on:
- jaeger
- redis
networks:
- finance-network
fraud-detector:
build: ./fraud-detector
environment:
- JAEGER_HOST=jaeger
- JAEGER_PORT=6831
depends_on:
- jaeger
networks:
- finance-network
redis:
image: redis:alpine
networks:
- finance-network
networks:
finance-network:
driver: bridge
After running docker-compose up, I could access the Jaeger UI at http://localhost:16686 and see traces from all my services. The first time I saw a complete transaction flow visualized end-to-end was a good feeling.
Solving the Great Categorization Bug
Remember that mysterious categorization bug I mentioned? Here's exactly how distributed tracing helped me solve it.
With all tracing added, I could finally trace individual problematic transactions through the entire pipeline. I found a recent transaction that had been miscategorized and pulled up its trace in Jaeger.
The trace looked normal at first glance: all services responded successfully, and the flow proceeded as expected. But when I examined the spans more carefully, I noticed something weird in the "fetch_user_context" span:
span: fetch_user_context
transaction.id = "12845"
user_id = "user123"
context.status_code = 200
context.cache_key = "patterns:user12" // Missing digit!
context.cache_hit = true
patterns.user_id = "user12" // Wrong user!
The smoking gun was right there in the span attributes. My Redis caching key was getting truncated due to a string formatting bug, so user123's categorization was using spending patterns from user12. No wonder the predictions were nonsensical—the model was getting context from someone with completely different spending habits.
The bug was in my caching code:
# The buggy version
def get_user_patterns(user_id):
cache_key = f"patterns:user{user_id[:4]}" # BUG: Truncating user ID!
cached = redis.get(cache_key)
# ... rest of the logic
# The fixed version
def get_user_patterns(user_id):
cache_key = f"patterns:{user_id}" # Fixed: Use full user ID
cached = redis.get(cache_key)
# ... rest of the logic
Without distributed tracing, this bug would have taken me days to track down. I would have suspected the ML model, the feature engineering, the data preprocessing—everything except a caching bug. With tracing, I found and fixed it in 20 minutes.
Opinions expressed by DZone contributors are their own.
Comments