Debugging Distributed ML Systems

My ML model misclassified groceries as entertainment. Distributed tracing with OpenTelemetry and Jaeger helped me quickly find a caching bug causing it.

Ramya Boorugula

Aug. 25, 25 · Opinion

Likes (0)

Comment

Save

1.7K Views

My ML model for categorizing suddenly started classifying groceries as entertainment expenses. But why? What happened?

I was looking at my personal finance dashboard and noticed something was completely off. The logs from each service looked normal. The health checks were green. Yet somehow, my grocery store purchases were being flagged as entertainment, and my restaurant bills were showing up as utilities.

For some background, I had recently broken my monolith finance tracker into multiple microservices. What used to be a single Flask app with traceable execution had become a distributed puzzle of API calls, Redis caches, and background jobs. When something went wrong, figuring out where and why had become a nightmare.

After spending many late nights debugging issues that seemed to appear from nowhere, I finally implemented tracing using OpenTelemetry and Jaeger. It has honestly transformed my personal project from a debugging headache into something I actually enjoy maintaining.

If you're running your own distributed ML project and finding yourself lost in a sea of logs, this guide will show you how to set up tracing infrastructure that makes debugging manageable instead of purely frustrating.

The Personal Project Debugging Nightmare

Let me give you a picture of what debugging used to look like in my finance tracker. After decomposing my monolith, I had five core services:

Transaction Processor: Ingests and cleans raw transaction data
Category Predictor: ML service for transaction categorization
Spending Analyzer: Computes spending patterns and insights
Fraud Detector: Identifies suspicious transactions
Budget Manager: Tracks budgets and sends alerts

A typical transaction processing flow would look like this:

New transaction comes in via CSV upload or bank API
Transaction Processor cleans and validates the data
Category Predictor calls the ML model to classify the transaction
Fraud Detector checks for suspicious patterns
Spending Analyzer updates monthly statistics
Budget Manager checks if any budget limits were exceeded

When this worked smoothly, it felt like magic. When it didn't, debugging was pure pain.

The categorization disaster I mentioned earlier started with nonsensical expense categories. Looking at individual service logs, everything seemed fine:

Transaction Processor: "Processed transaction ID 12845 ✓"
Category Predictor: "Classified 'Wholefoods#123' as 'entertainment' ✓"
Fraud Detector: "No suspicious patterns detected ✓"
Spending Analyzer: "Updated monthly totals ✓"

But clearly something was off as my grocery shopping was being classified as entertainment. Without a way to trace a single transaction through all these services, I was basically guessing where the problem might be.

OpenTelemetry: The Game Changer for Personal Projects

OpenTelemetry might sound like a overkill for a personal project, but it's worth the setup effort. The basic idea is simple: every request gets a unique trace ID that follows it through your entire system, and each service operation becomes a "span" within that trace.

Here's how I implemented it across my Python services running in Docker containers:

    Python
   
 

   # requirements.txt additions for all services
opentelemetry-api==1.21.0
opentelemetry-sdk==1.21.0
opentelemetry-instrumentation-flask==0.42b0
opentelemetry-instrumentation-requests==0.42b0
opentelemetry-exporter-jaeger-thrift==1.21.0

# shared/tracing.py - Common tracing setup
import os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

def setup_tracing(service_name):
    # Set up tracer provider
    trace.set_tracer_provider(TracerProvider())
    tracer = trace.get_tracer(__name__)
    
    # Configure Jaeger exporter
    jaeger_exporter = JaegerExporter(
        agent_host_name=os.getenv("JAEGER_HOST", "jaeger"),
        agent_port=int(os.getenv("JAEGER_PORT", "6831")),
    )
    
    # Add span processor
    span_processor = BatchSpanProcessor(jaeger_exporter)
    trace.get_tracer_provider().add_span_processor(span_processor)
    
    # Auto-instrument Flask and requests
    FlaskInstrumentor().instrument()
    RequestsInstrumentor().instrument()
    
    return tracer

# category_predictor/app.py - Example service implementation
from flask import Flask, request, jsonify
import joblib
import requests
from shared.tracing import setup_tracing

app = Flask(__name__)
tracer = setup_tracing("category-predictor")

# Load the categorization model
model = joblib.load('models/category_classifier.pkl')

@app.route('/categorize', methods=['POST'])
def categorize_transaction():
    transaction_data = request.json
    
    with tracer.start_as_current_span("categorize_transaction") as span:
        span.set_attribute("transaction.id", transaction_data.get('id'))
        span.set_attribute("transaction.amount", transaction_data.get('amount'))
        span.set_attribute("merchant", transaction_data.get('description', '')[:50])
        
        try:
            # Get user spending patterns for context
            user_id = transaction_data.get('user_id')
            with tracer.start_as_current_span("fetch_user_context") as context_span:
                context_response = requests.get(
                    f"http://spending-analyzer:5000/patterns/{user_id}"
                )
                context_span.set_attribute("context.status_code", context_response.status_code)
                user_context = context_response.json() if context_response.ok else {}
            
            # Prepare features for ML model
            with tracer.start_as_current_span("prepare_features") as feature_span:
                features = prepare_transaction_features(transaction_data, user_context)
                feature_span.set_attribute("features.count", len(features))
                feature_span.set_attribute("features.has_context", len(user_context) > 0)
            
            # Run ML model prediction
            with tracer.start_as_current_span("model_prediction") as model_span:
                category = model.predict([features])[0]
                confidence = model.predict_proba([features]).max()
                
                model_span.set_attribute("prediction.category", category)
                model_span.set_attribute("prediction.confidence", float(confidence))
                model_span.set_attribute("model.version", "v1.2.3")
            
            span.set_attribute("result.category", category)
            span.set_attribute("result.confidence", float(confidence))
            
            return jsonify({
                'category': category,
                'confidence': float(confidence)
            })
            
        except Exception as e:
            span.record_exception(e)
            span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
            return jsonify({'error': str(e)}), 500

def prepare_transaction_features(transaction, user_context):
    """Extract features for the ML model"""
    description = transaction.get('description', '').lower()
    amount = float(transaction.get('amount', 0))
    
    features = [
        len(description),
        amount,
        1 if 'grocery' in description or 'kroger' in description else 0,
        1 if 'restaurant' in description or 'cafe' in description else 0,
        user_context.get('avg_grocery_amount', 50.0),  # User's average grocery spending
        user_context.get('grocery_frequency', 0.1),    # How often they buy groceries
    ]
    
    return features

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

  

The beauty of this setup is that OpenTelemetry automatically traces HTTP requests between services. I only needed to add manual spans for ML-specific operations like feature preparation and model inference.

Setting Up Jaeger: Trace Dashboard

Getting Jaeger running alongside my finance tracker services was straightforward with Docker Compose:

    YAML
   
 

   # docker-compose.yml

version: '3.8'

services:

  jaeger:
    image: jaegertracing/all-in-one:1.50
    ports:
      - "16686:16686"  # Jaeger UI
      - "6831:6831/udp"  # Jaeger agent
    environment:
      - COLLECTOR_OTLP_ENABLED=true
    networks:
      - finance-network

  transaction-processor:
    build: ./transaction-processor
    environment:
      - JAEGER_HOST=jaeger
      - JAEGER_PORT=6831
    depends_on:
      - jaeger
      - redis
    networks:
      - finance-network

  category-predictor:
    build: ./category-predictor
    environment:
      - JAEGER_HOST=jaeger
      - JAEGER_PORT=6831
    depends_on:
      - jaeger
    networks:
      - finance-network

  spending-analyzer:
    build: ./spending-analyzer
    environment:
      - JAEGER_HOST=jaeger
      - JAEGER_PORT=6831
    depends_on:
      - jaeger
      - redis
    networks:
      - finance-network

  fraud-detector:
    build: ./fraud-detector
    environment:
      - JAEGER_HOST=jaeger
      - JAEGER_PORT=6831
    depends_on:
      - jaeger
    networks:
      - finance-network

  redis:
    image: redis:alpine
    networks:
      - finance-network

networks:
  finance-network:
    driver: bridge

  

After running docker-compose up, I could access the Jaeger UI at http://localhost:16686 and see traces from all my services. The first time I saw a complete transaction flow visualized end-to-end was a good feeling.

Solving the Great Categorization Bug

Remember that mysterious categorization bug I mentioned? Here's exactly how distributed tracing helped me solve it.

With all tracing added, I could finally trace individual problematic transactions through the entire pipeline. I found a recent transaction that had been miscategorized and pulled up its trace in Jaeger.

The trace looked normal at first glance: all services responded successfully, and the flow proceeded as expected. But when I examined the spans more carefully, I noticed something weird in the "fetch_user_context" span:

    Plain Text
   
   span: fetch_user_context

  transaction.id = "12845"

  user_id = "user123"  

  context.status_code = 200

  context.cache_key = "patterns:user12"  // Missing digit!

  context.cache_hit = true

  patterns.user_id = "user12"  // Wrong user!

The smoking gun was right there in the span attributes. My Redis caching key was getting truncated due to a string formatting bug, so user123's categorization was using spending patterns from user12. No wonder the predictions were nonsensical—the model was getting context from someone with completely different spending habits.

The bug was in my caching code:

    Python
   
   # The buggy version

def get_user_patterns(user_id):
    cache_key = f"patterns:user{user_id[:4]}"  # BUG: Truncating user ID!
    cached = redis.get(cache_key)
    # ... rest of the logic

    Python
   
   # The fixed version

def get_user_patterns(user_id):
    cache_key = f"patterns:{user_id}"  # Fixed: Use full user ID
    cached = redis.get(cache_key)
    # ... rest of the logic

Without distributed tracing, this bug would have taken me days to track down. I would have suspected the ML model, the feature engineering, the data preprocessing—everything except a caching bug. With tracing, I found and fixed it in 20 minutes.

Network systems Data Types Debug (command)

Opinions expressed by DZone contributors are their own.

Related

Trending