Throughput vs Goodput: The Performance Metric You Are Probably Ignoring in LLM Testing

See the difference between throughput and goodput, and why throughput alone can give you a dangerously false sense of confidence.

NaveenKumar Namachivayam

May. 21, 26 · Analysis

In this blog post, we will see the difference between throughput and goodput, why throughput alone can give you a dangerously false sense of confidence, and how goodput, the metric championed by NVIDIA's AIPerf tool, tells you the truth about your LLM deployment.

If you have ever shipped a feature that looked perfectly healthy in your monitoring dashboard but fell apart under real user load, this post is for you.

What Is Throughput?

Throughput is one of the oldest and most familiar metrics in performance testing. Simply put, it answers the question: how much work can the system do in a given time window?

Depending on the context, throughput is expressed as:

Requests per second (req/s) – most common in API and web performance testing
Transactions per second (TPS) – common in database and payment system testing
Megabytes per second (MB/s) – common in file transfer and network testing
Tokens per second – specific to LLM inference workloads

In a JMeter test report, the throughput number is front and center. In a k6 summary, it shows up as http_reqs. In a Grafana dashboard, it is usually one of the first panels you look at.

Throughput tells you volume. It tells you nothing about quality.
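To make the definition concrete, here is a minimal sketch that computes request throughput and token throughput for a single measurement window. The `CompletedRequest` record and the sample numbers are made up for illustration; they are not output from any real tool.

```python
from dataclasses import dataclass


@dataclass
class CompletedRequest:
    """Hypothetical record of one completed request."""
    output_tokens: int


def throughput(completed: list[CompletedRequest], window_seconds: float) -> tuple[float, float]:
    """Return (requests per second, output tokens per second) for the window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    req_per_s = len(completed) / window_seconds
    tokens_per_s = sum(r.output_tokens for r in completed) / window_seconds
    return req_per_s, tokens_per_s


# 600 completed requests of 200 output tokens each, measured over 60 seconds
requests = [CompletedRequest(output_tokens=200) for _ in range(600)]
req_rate, token_rate = throughput(requests, window_seconds=60.0)
print(f"{req_rate:.1f} req/s, {token_rate:.1f} tokens/s")
```

Notice that nothing in this calculation looks at how long any individual request took.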

The Problem With Throughput Alone

Here is a scenario that should feel familiar.

You run a load test. Throughput looks great: 100 req/s, no errors. You ship. Real users start complaining that the app feels sluggish or unresponsive. You go back to your dashboard. Throughput is still 100 req/s. Green across the board.

What Happened?

The system was technically completing requests. But a large portion of those requests were taking 4 to 5 seconds to respond instead of the 500ms your users expect. The requests were counted as successful because they returned HTTP 200. Throughput does not care about latency. It just counts completions.

This is the gap. And in traditional web performance testing, experienced engineers close that gap by adding percentile latency checks (p95, p99) as assertions. But in LLM performance testing, the problem is deeper.
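A percentile assertion of that kind could look roughly like the sketch below. It uses a simple nearest-rank percentile, and the limits (500 ms for p95, 1000 ms for p99) and the latency samples are assumptions chosen to mirror the scenario above, not values from a real test.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_violations(latencies_ms: list[float], p95_limit_ms: float, p99_limit_ms: float) -> list[str]:
    """Return a list of human-readable SLO violations (empty means the check passed)."""
    failures = []
    p95 = percentile(latencies_ms, 95)
    p99 = percentile(latencies_ms, 99)
    if p95 > p95_limit_ms:
        failures.append(f"p95 {p95:.0f} ms exceeds {p95_limit_ms:.0f} ms")
    if p99 > p99_limit_ms:
        failures.append(f"p99 {p99:.0f} ms exceeds {p99_limit_ms:.0f} ms")
    return failures


# 100 requests that all returned HTTP 200: 60 fast ones and 40 slow ones
latencies = [300.0] * 60 + [4500.0] * 40
violations = latency_violations(latencies, p95_limit_ms=500, p99_limit_ms=1000)
print(violations)
```

Every request here would count toward throughput, yet both percentile checks fail.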

The Dosa Stall Analogy

Imagine a busy dosa stall in Coimbatore during the morning rush.

The stall owner proudly says, "We served 100 dosas this hour." That is throughput. 100 dosas per hour.

But here is the real picture:

28 dosas were served cold because the tawa was overcrowded
15 dosas arrived 20 minutes after the order because the batter queue was too long
5 dosas were undercooked

Only 52 dosas were served hot, crispy, and within the 5-minute promise. That is goodput. 52 dosas per hour.

The stall is technically operating at 100 dosas/hour. But only 52 of them actually met the quality standard the customer was promised.

Now imagine this stall is your LLM API, and each dosa is an inference request. The "hot and crispy within 5 minutes" rule is your SLO.
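The arithmetic behind the analogy is simple enough to write down. This small sketch assumes the three problem categories do not overlap, which is what the figure of 52 implies.

```python
served_per_hour = 100
slo_violations = {"cold": 28, "late": 15, "undercooked": 5}

# Goodput is what remains after removing every dosa that broke the promise
good_dosas = served_per_hour - sum(slo_violations.values())
compliance_pct = good_dosas * 100 / served_per_hour

print(f"throughput: {served_per_hour} dosas/hour")
print(f"goodput:    {good_dosas} dosas/hour ({compliance_pct:.0f}% met the promise)")
```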

What Is Goodput?

Goodput is the number of requests per second that completed and met all your defined SLO constraints.

This definition comes directly from NVIDIA's AIPerf tool (the successor to GenAI-Perf), a widely used tool for LLM inference benchmarking. In AIPerf, you define goodput constraints when you run a benchmark:

    Shell

    aiperf profile \
      --model "llama-3.1-70b" \
      --url http://inference-server:8000 \
      --goodput-ttft 500 \
      --goodput-itl 100

This tells the tool: only count a request toward goodput if:

Time to First Token (TTFT) was under 500ms, AND
Inter-Token Latency (ITL) was under 100ms

A request that completes but violates either constraint does not count. It is a failed request from the user's perspective, even if the HTTP status code was 200.

How Goodput Works in LLM Performance Testing

LLM inference has two latency metrics that users feel directly:

Time to First Token (TTFT) is how long the user waits before they see the first word of the response. This is what makes an LLM feel fast or laggy. A high TTFT means users are staring at a blank screen or a loading spinner.

Inter-Token Latency (ITL) is the delay between each token in the streamed response. A high ITL makes the text appear to stutter or pause mid-sentence, which breaks the feeling of a natural conversation.
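If you want to see how these two numbers fall out of a streamed response, here is a minimal sketch that derives them from timestamps. The function name and the sample timestamps are hypothetical, and averaging the gaps into one ITL value per request is an assumption; benchmarking tools may define ITL slightly differently.

```python
def ttft_and_itl_ms(request_sent_s: float, token_times_s: list[float]) -> tuple[float, float]:
    """Derive TTFT and mean ITL (both in milliseconds) from timestamps in seconds."""
    if not token_times_s:
        raise ValueError("no tokens were received")
    # TTFT: time from sending the request until the first token arrives
    ttft_ms = (token_times_s[0] - request_sent_s) * 1000
    # ITL: average gap between consecutive tokens (0.0 for a single-token response)
    gaps = [later - earlier for earlier, later in zip(token_times_s, token_times_s[1:])]
    itl_ms = (sum(gaps) / len(gaps)) * 1000 if gaps else 0.0
    return ttft_ms, itl_ms


# Request sent at t=0.0 s; four tokens arrive at these (made-up) times
ttft, itl = ttft_and_itl_ms(0.0, [0.40, 0.45, 0.50, 0.55])
print(f"TTFT: {ttft:.0f} ms, mean ITL: {itl:.0f} ms")
```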

Both of these metrics degrade under load. As concurrency increases, the inference server queue backs up. TTFT climbs first, as requests sit waiting to be processed. ITL can follow if GPU compute is saturated.

Throughput stays stable through all of this. The server is still completing requests. It is just that the user experience is becoming progressively worse.

Goodput captures that degradation directly. When TTFT crosses your SLO threshold, those requests stop contributing to goodput. The goodput number drops visibly, even while throughput holds steady.

As I showed in an earlier post, "99% of Requests Failed and My Dashboard Showed Green," you can have a request throughput of 0.91 req/s that looks reasonable while goodput sits at 0.01 req/s, meaning roughly 99% of requests were silently breaching the SLO.
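You can check that percentage yourself from the two rates. Because both are measured over the same time window, their ratio is the fraction of requests that met the SLO. The helper name below is my own, not part of any tool.

```python
def slo_compliance_pct(throughput_rps: float, goodput_rps: float) -> float:
    """Percentage of completed requests that met the SLO, given both rates for the same window."""
    if throughput_rps <= 0:
        raise ValueError("throughput_rps must be positive")
    return goodput_rps / throughput_rps * 100


compliance = slo_compliance_pct(throughput_rps=0.91, goodput_rps=0.01)
print(f"compliant: {compliance:.1f}%, breaching SLO: {100 - compliance:.1f}%")
```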

The Formula

Goodput is straightforward once you have your SLO thresholds defined:

    Plain Text
   
   Goodput (req/s) = Requests that met ALL SLO constraints / Total measurement time (seconds)

For an LLM workload with TTFT and ITL SLOs:

    Plain Text
   
   A request counts toward goodput if:
  TTFT < ttft_slo_ms  AND  ITL < itl_slo_ms

Notice that it uses AND, not OR. Both conditions must be satisfied. A request with excellent ITL but a TTFT of 3 seconds still fails. The user waited 3 seconds before seeing anything, which is a broken experience, regardless of how smooth the streaming was after that.

Pseudocode: Calculating Goodput

Here is a simplified Python sketch showing how goodput is computed behind the scenes. The send_llm_request callable is a placeholder for whatever client you use:

    Python

    import time

    # SLO thresholds in milliseconds
    TTFT_SLO_MS = 500
    ITL_SLO_MS = 100


    def run_benchmark(prompts, send_llm_request):
        """Send each prompt and report throughput, goodput, and SLO compliance.

        send_llm_request(prompt) is a placeholder: it must return an object with
        time_to_first_token_ms and inter_token_latency_ms attributes.
        """
        total_requests = 0
        compliant_requests = 0
        measurement_start = time.perf_counter()

        for prompt in prompts:
            result = send_llm_request(prompt)
            total_requests += 1

            # A request counts toward goodput only if it meets BOTH SLOs
            if (result.time_to_first_token_ms < TTFT_SLO_MS
                    and result.inter_token_latency_ms < ITL_SLO_MS):
                compliant_requests += 1

        duration_s = time.perf_counter() - measurement_start
        if total_requests == 0 or duration_s <= 0:
            raise ValueError("no requests were measured")

        throughput = total_requests / duration_s
        goodput = compliant_requests / duration_s
        compliance_pct = compliant_requests / total_requests * 100

        print(f"Request Throughput (req/s): {throughput:.2f}")
        print(f"Goodput            (req/s): {goodput:.2f}")
        print(f"SLO Compliance Rate (%):    {compliance_pct:.1f}")

When your system is healthy and under low load, throughput and goodput will be very close. As concurrency increases and the system starts to struggle, you will see goodput diverge downward from throughput. That divergence is your early warning signal.
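One way to turn that early warning into something you can act on is to scan the results of a concurrency sweep for the first level where the two numbers separate. The sweep data and the 0.95 ratio below are illustrative assumptions; pick a ratio that matches your own tolerance.

```python
from typing import Optional

# Hypothetical sweep results: (concurrency, throughput req/s, goodput req/s)
SWEEP = [
    (1, 2.0, 2.0),
    (4, 7.8, 7.6),
    (8, 14.9, 12.1),
    (16, 15.2, 4.3),
]


def first_divergence(sweep: list[tuple[int, float, float]], min_ratio: float = 0.95) -> Optional[int]:
    """Return the lowest concurrency where goodput/throughput falls below min_ratio, or None."""
    for concurrency, throughput_rps, goodput_rps in sorted(sweep):
        if throughput_rps > 0 and goodput_rps / throughput_rps < min_ratio:
            return concurrency
    return None


print(first_divergence(SWEEP))
```

In this made-up sweep, throughput keeps rising up to concurrency 16, but goodput has already started to fall away at concurrency 8.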

Throughput vs Goodput: Side-by-Side

Dimension	Throughput	Goodput
What it measures	All completed requests per second	Completed requests per second that met SLO
SLO-aware	No	Yes
Fails silently on latency degradation	Yes	No
Typical units	req/s, TPS, MB/s, tokens/s	req/s
Tool example	JMeter, k6, wrk	NVIDIA AIPerf
Use case	Capacity planning, raw volume	User experience validation, production readiness
Can look good while users suffer	Yes	No

When Should You Use Each Metric?

Use throughput when:

You are doing capacity planning and need to understand raw system limits
You are comparing infrastructure configurations (e.g., 2 GPUs vs 4 GPUs) at the same load level
You are generating a baseline before adding SLO constraints

Use goodput when:

You are validating the production readiness of an LLM endpoint
You want to know whether users are actually being served well, not just served
You are running a concurrency sweep to find the point where your SLO breaks
You are integrating LLM performance checks into your CI/CD pipeline
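For the CI/CD case, a gate can be as small as the sketch below. It assumes you have already extracted the throughput and goodput numbers from your benchmark output; the function name, the 95% threshold, and the sample values are all hypothetical.

```python
def ci_gate(throughput_rps: float, goodput_rps: float, min_compliance_pct: float = 99.0) -> int:
    """Return a process exit code: 0 if enough requests met the SLO, 1 otherwise."""
    if throughput_rps <= 0:
        print("FAIL: no completed requests")
        return 1
    compliance_pct = goodput_rps / throughput_rps * 100
    passed = compliance_pct >= min_compliance_pct
    status = "PASS" if passed else "FAIL"
    print(f"{status}: {compliance_pct:.1f}% of requests met the SLO "
          f"(required: {min_compliance_pct:.1f}%)")
    return 0 if passed else 1


# In a pipeline step, you would pass the returned value to sys.exit()
healthy_code = ci_gate(throughput_rps=12.0, goodput_rps=11.7, min_compliance_pct=95.0)
degraded_code = ci_gate(throughput_rps=12.0, goodput_rps=9.0, min_compliance_pct=95.0)
```

Returning an exit code rather than raising keeps the function easy to reuse in both a script and a test.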

A healthy practice is to report both numbers together. If goodput and throughput are close, your system is healthy. If they diverge significantly, you have a quality problem that raw throughput is hiding.

Key Takeaway

Throughput answers: Can the system handle the volume?

Goodput answers: Is the system actually serving users well at that volume?

In traditional performance testing, latency SLOs were enforced through assertions and percentile checks. In LLM performance testing, goodput formalizes this into a single metric that is directly comparable to throughput. NVIDIA's AIPerf makes this measurable out of the box with the --goodput-ttft and --goodput-itl flags.

Next time you look at a load test result, ask yourself: Do I know the goodput number? If the answer is no, you only have half the picture.

Happy testing!

Published at DZone with permission of NaveenKumar Namachivayam. See the original article here.
