DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Related

  • Observability in Spring Boot 4
  • Production-Ready Observability for Analytics Agents: An Open Telemetry Blueprint Across Retrieval, SQL, Redaction, and Tool Calls
  • Optimizing Prometheus Queries With PromQL
  • How To Use Geo-Partitioning to Comply With Data Regulations and Deliver Low Latency Globally

Trending

  • You Learned AI. So Why Are You Still Not Getting Hired?
  • AWS Kiro: The Agentic IDE That Makes Specs the Unit of Work
  • The Hidden Bottlenecks That Break Microservices in Production
  • Ten Years of Beam: From Google's Dataflow Paper to 4 Trillion Events at LinkedIn
  1. DZone
  2. Testing, Deployment, and Maintenance
  3. Monitoring and Observability
  4. One Query, Four GPUs: Tracing a Distributed Training Stall Across Nodes

One Query, Four GPUs: Tracing a Distributed Training Stall Across Nodes

One SQL query across 4 GPU nodes found a straggler in under a second using eBPF fleet fan-out, no central collector needed.

By 
Ingero Team user avatar
Ingero Team
·
May. 25, 26 · Analysis
Likes (0)
Comment
Save
Tweet
Share
248 Views

Join the DZone community and get the full member experience.

Join For Free

TL;DR

A single straggling node held up a 4-node distributed training job. We found it by fanning out one SQL query to all four nodes and getting the answer in under a second. This is distributed GPU training debugging with eBPF – no central service, no Prometheus, no time-series database, just the same single-binary agent already running on each machine.

The Problem We Kept Hitting

We’ve been building Ingero — an eBPF agent that traces CUDA API calls and host kernel events to explain GPU latency. Until v0.9, it was single-node only. Trace one machine, explain what happened on that machine. For single-GPU inference or training, that worked well.

But distributed training spreads the debugging surface across machines. When a 4-node DDP job slows down, the question is always: which node? And then: why? nvidia-smi on each machine reports healthy utilization. dstat shows nothing obvious. The typical workflow is SSH-ing into each box, eyeballing logs, diffing timestamps across terminals, and hoping the issue is still happening.

We wanted a cross-node investigation without adding infrastructure. The question was: what’s the simplest architecture that works?

What We Shipped in v0.9.1

Three features, all built on top of the existing per-node agent. No new services, no new daemons, no new ports.

1. Node Identity

Every event now carries a node tag. The agent stamps each event with a name from a --node flag, an ingero.yaml config value, or the hostname as fallback:

Shell
 
sudo ingero trace --node gpu-node-01


Event IDs become node-namespaced (gpu-node-01:4821) so databases from different nodes can merge without collisions. For torchrun workloads, rank and world size are auto-detected from environment variables (RANK, LOCAL_RANK, WORLD_SIZE) — no extra configuration needed.

2. Fleet Fan-Out Queries

Each Ingero agent already exposes a dashboard API over HTTPS (TLS 1.3, auto-generated ECDSA P-256 cert if no custom cert is provided). The new fleet client sends the same query to every node in parallel, collects the results, and concatenates them with a node column prepended. For production clusters, the client supports mTLS — --ca-cert, --client-cert, --client-key — so both sides authenticate. Plain HTTP is available via --no-tls but requires an explicit opt-in, and even then, it’s intended for trusted VPC networks only.

The --nodes flag works for ad-hoc queries, but for anything beyond a handful of nodes, the node list goes into ingero.yaml once and every command picks it up automatically:

YAML
 
fleet:
  nodes:
    - gpu-node-01:8080
    - gpu-node-02:8080
    - gpu-node-03:8080
    - gpu-node-04:8080


A full example config is in configs/ingero.yaml.

Here’s what it looked like when we ran it against a 4-node cluster where one node was misbehaving:

Shell
 
$ ingero query --nodes gpu-node-01:8080,gpu-node-02:8080,gpu-node-03:8080,gpu-node-04:8080 \
    "SELECT node, source, count(*) as cnt, avg(duration)/1000 as avg_us
     FROM events GROUP BY node, source"

node              source  cnt    avg_us
----------------  ------  -----  ------
gpu-node-01       4       11009  5.2
gpu-node-01       3       847    18400  # ← 9x higher than peers
gpu-node-02       4       10892  5.1
gpu-node-02       3       412    2100
gpu-node-03       4       10847  5.3
gpu-node-03       3       398    1900
gpu-node-04       4       10901  5.0
gpu-node-04       3       421    2200

  8 rows from 4 node(s)


Node 1 jumps out immediately: 847 host events at 18.4ms average, while the other three sit around 2ms. One more command to see the causal chains:

Shell
 
$ ingero explain --nodes gpu-node-01:8080,gpu-node-02:8080,gpu-node-03:8080,gpu-node-04:8080

FLEET CAUSAL CHAINS - 2 chain(s) from 4 node(s)

[HIGH] [gpu-node-01] cuLaunchKernel p99=843us (63.9x p50) - 847 sched_switch events + heavy block I/O
  Root cause: 847 sched_switch events + heavy block I/O
  Fix: Pin training process to dedicated cores with taskset; Add nice -n 19 to background jobs

[MEDIUM] [gpu-node-01] cuMemAlloc p99=932us (5.0x p50) - 855 sched_switch events + heavy block I/O
  Root cause: 855 sched_switch events + heavy block I/O
  Fix: Pin training process to dedicated cores with taskset


Both chains are on gpu-node-01. The other three nodes have zero issues. The root cause: CPU contention from block I/O — checkpoint writes preempting the training process.

Two commands to go from “distributed training is slow” to “pin the training process on node 1 and investigate the I/O source.”

3. Offline Merge and Perfetto Export

Not every environment allows live HTTP queries between nodes. Air-gapped clusters, locked-down VPCs, compliance constraints — there are real reasons the network path isn’t always available.

For those cases, ingero merge combines SQLite databases from each node into a single queryable file:

Shell
 
# 1. Collect traces from each node
scp gpu-node-01:~/.ingero/ingero.db node-01.db
scp gpu-node-02:~/.ingero/ingero.db node-02.db

# 2. Merge and analyze
ingero merge node-01.db node-02.db -o cluster.db
ingero explain -d cluster.db


Stack traces are deduplicated by hash. Events keep their node-namespaced IDs. Old databases that predate the node column work with --force-node.

For visual timeline analysis, ingero export --format perfetto produces a Chrome Trace Event Format JSON that opens in ui.perfetto.dev. Each node gets its own process track. Causal chains show up as severity-colored markers. The straggler is visible at a glance in the timeline.

Why We Built It This Way

The obvious approach to multi-node observability is a central collector: ship events to a time-series database, build dashboards, set up alerts. Prometheus, Datadog, Honeycomb — the well-trodden path.

We deliberately avoided that.

No new infrastructure. Ingero is a zero-config, single-binary agent with no dependencies. Adding a central collector contradicts that. The fleet client is 400 lines of Go in the existing binary. It reuses the HTTPS API the agent already exposes. Nothing new to deploy, nothing new to secure — the same TLS 1.3 + mTLS configuration that protects a single node’s dashboard protects the entire fleet.

Client-side fan-out is simple and sufficient. The CLI sends concurrent HTTP requests, collects results, and merges them locally. A sync.WaitGroup, some JSON decoding, column concatenation. No distributed query planning, no consensus protocol, no coordinator election. For 4-50 nodes, this is the right level of complexity.

Partial failure is first-class. If one node is unreachable, results from the others still come back, plus a warning. No all-or-nothing semantics. In practice, the unreachable node is often the one in trouble — and knowing which nodes failed is diagnostic information in itself.

Clock skew is measured, not ignored. eBPF timestamps come from bpf_ktime_get_ns() (CLOCK_MONOTONIC), which is per-machine. When correlating events across nodes, clock differences matter. The fleet client runs NTP-style offset estimation in parallel with the actual query — 3 samples per node, median filter. On a typical LAN with sub-millisecond RTT, precision should be well under 10ms. If skew exceeds a threshold, it warns. This adds zero latency since it runs concurrently with the data query.

Offline merge covers air-gapped environments. Some production GPU clusters have no internal HTTP connectivity between nodes. SCP the databases, merge locally, investigate. The merge path also serves as a permanent record of the cluster state at investigation time.

MCP: AI-Driven Fleet Investigation

The fleet is also accessible through Ingero’s MCP server via the query_fleet tool. Here’s what the raw tool output looks like for a chains query across the same 4-node cluster:

Python
 
query_fleet(action="chains", since="5m")

Fleet Chains: 2 chain(s)
[HIGH] gpu-node-01 | cuLaunchKernel p99=843us (63.9x p50) | 847 sched_switch events + heavy block I/O
[MEDIUM] gpu-node-01 | cuMemAlloc p99=932us (5.0x p50) | 855 sched_switch events + heavy block I/O


That’s the complete response — an AI assistant gets this back from one tool call, no SSH access to each node, no manual SQL. The tool supports four actions: chains (causal analysis), sql (arbitrary queries), ops (operation breakdown per node), and overview (event counts). Clock skew warnings are prepended automatically when detected.

Where This Stands

v0.9.1 is the initial step in cluster-level tracing, not the destination.

What we have now works well for the reactive investigation workflow: something went wrong, we need to find out what and where. Fan-out queries, offline merge, Perfetto export — these are diagnostic tools for after the fact.

We’re actively working on cross-node correlation and straggler detection — more updates coming soon. And since the instrumentation sits on host-level eBPF rather than vendor-specific hooks, none of this is limited to a specific GPU vendor.

The bet is that client-side fan-out scales to 50+ nodes before anything centralized is needed. When it doesn’t, the node-namespaced ID scheme and offline merge path ensure the architecture can evolve without breaking existing deployments.

We’re stress-testing the fan-out architecture against larger clusters and would welcome feedback from teams running multi-node training. Open an issue on GitHub.

The investigations/ directory has ready-to-query databases for trying this without a GPU cluster:

  • sample-gpu-node-01.db, sample-gpu-node-02.db, sample-gpu-node-03.db – individual node traces from a 3-node cluster
  • sample-cluster.db – all three merged into one (600 events, 6 chains, 9 stacks)

GitHub (give us a star!): github.com/ingero-io/ingero. No NVIDIA SDK, no code changes, production-safe by design.

If you are facing distributed training issues in your own workloads, we’d love to take a look. Drop an issue on GitHub, and we will gladly dive into it together.

Ingero is free & open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, <2% overhead.

Related Reading

  • GPU incident response in 60 seconds with eBPF – single-node investigation workflow that the fleet feature extends
  • 11-second time to first token on a healthy vLLM server – kernel-level scheduling contention causing hidden latency, similar to the straggler root cause in this post
  • GPU showing 97% utilization while training runs 3x slower – why nvidia-smi metrics alone miss the real story
cluster sql Observability

Opinions expressed by DZone contributors are their own.

Related

  • Observability in Spring Boot 4
  • Production-Ready Observability for Analytics Agents: An Open Telemetry Blueprint Across Retrieval, SQL, Redaction, and Tool Calls
  • Optimizing Prometheus Queries With PromQL
  • How To Use Geo-Partitioning to Comply With Data Regulations and Deliver Low Latency Globally

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook