Throughput vs Goodput: The Performance Metric You Are Probably Ignoring in LLM Testing

See the difference between throughput and goodput, and why throughput alone can give you a dangerously false sense of confidence.

By NaveenKumar Namachivayam
May. 21, 26 · Analysis

In this blog post, we will see the difference between throughput and goodput, why throughput alone can give you a dangerously false sense of confidence, and how goodput, the metric championed by NVIDIA's AIPerf tool, tells you the truth about your LLM deployment.

If you have ever shipped a feature that looked perfectly healthy in your monitoring dashboard but fell apart under real user load, this post is for you.

What Is Throughput?

Throughput is one of the oldest and most familiar metrics in performance testing. Simply put, it answers the question: how much work can the system do in a given time window?

Depending on the context, throughput is expressed as:

  • Requests per second (req/s) – most common in API and web performance testing
  • Transactions per second (TPS) – common in database and payment system testing
  • Megabytes per second (MB/s) – common in file transfer and network testing
  • Tokens per second – specific to LLM inference workloads

In a JMeter test report, the throughput number is front and center. In a k6 summary, it shows up as http_reqs. In a Grafana dashboard, it is usually one of the first panels you look at.

Throughput tells you volume. It does not tell you quality.

The Problem With Throughput Alone

Here is a scenario that should feel familiar.

You run a load test. Throughput looks great: 100 req/s, no errors. You ship. Then real users start complaining that the app feels sluggish or unresponsive. You go back to your dashboard. Throughput is still 100 req/s. Green across the board.

What Happened?

The system was technically completing requests. But a large portion of those requests were taking 4 to 5 seconds to respond instead of the 500ms your users expect. The requests were counted as successful because they returned HTTP 200. Throughput does not care about latency. It just counts completions.

This is the gap. And in traditional web performance testing, experienced engineers close that gap by adding percentile latency checks (p95, p99) as assertions. But in LLM performance testing, the problem is deeper.
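To make the percentile check concrete, here is a minimal sketch of a p95 assertion. The latency values and the 500 ms threshold are made-up numbers for illustration, and the `percentile` helper is my own wrapper around Python's `statistics.quantiles`, not something taken from JMeter or k6.

```python
import statistics


def percentile(samples, pct):
    """Return the pct-th percentile (1-99) using the inclusive method."""
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return cuts[pct - 1]


# Hypothetical latencies in ms: 80 fast requests and 20 slow ones, all HTTP 200.
latencies_ms = [300] * 80 + [4500] * 20
slo_ms = 500

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p95_within_slo = p95 <= slo_ms

print(f"p50={p50:.0f} ms, p95={p95:.0f} ms, p95 within SLO: {p95_within_slo}")
```

The median looks healthy here, and every request was counted as a success, yet the p95 check fails. That is the kind of assertion that catches what a raw throughput number misses.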

The Dosa Stall Analogy

Imagine a busy dosa stall in Coimbatore during the morning rush.

The stall owner proudly says, "We served 100 dosas this hour." That is throughput. 100 dosas per hour.

But here is the real picture:

  • 28 dosas were served cold because the tawa was overcrowded
  • 15 dosas arrived 20 minutes after the order because the batter queue was too long
  • 5 dosas were undercooked

Only 52 dosas were served hot, crispy, and within the 5-minute promise. That is goodput. 52 dosas per hour.

The stall is technically operating at 100 dosas/hour. But only 52 of them actually met the quality standard the customer was promised.

Now imagine this stall is your LLM API, and each dosa is an inference request. The "hot and crispy within 5 minutes" rule is your SLO.
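The dosa arithmetic can be written out in a few lines. This sketch assumes the three failure categories do not overlap, which is what the count of 52 implies.

```python
# Counts from the dosa stall example, per hour.
served_per_hour = 100
cold, late, undercooked = 28, 15, 5

# Assumption: no dosa falls into more than one failure category.
good_per_hour = served_per_hour - (cold + late + undercooked)
quality_ratio = good_per_hour / served_per_hour

print(f"Throughput: {served_per_hour} dosas/hour")
print(f"Goodput:    {good_per_hour} dosas/hour ({quality_ratio:.0%} of what was served)")
```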

What Is Goodput?

Goodput is the number of requests per second that completed and met all your defined SLO constraints.

This definition comes directly from NVIDIA's AIPerf (the successor to GenAI-Perf), a widely used tool for LLM inference benchmarking. In AIPerf, you define goodput constraints when you run a benchmark:

Shell
 
aiperf profile \
  --model "llama-3.1-70b" \
  --url http://inference-server:8000 \
  --goodput-ttft 500 \
  --goodput-itl 100


This tells the tool: only count a request toward goodput if:

  • Time to First Token (TTFT) was under 500ms, AND
  • Inter-Token Latency (ITL) was under 100ms

A request that completes but violates either constraint does not count. It is a failed request from the user's perspective, even if the HTTP status code was 200.

How Goodput Works in LLM Performance Testing

LLM inference has two latency metrics that users feel directly:

Time to First Token (TTFT) is how long the user waits before they see the first word of the response. This is what makes an LLM feel fast or laggy. A high TTFT means users are staring at a blank screen or a loading spinner.

Inter-Token Latency (ITL) is the delay between each token in the streamed response. A high ITL makes the text appear to stutter or pause mid-sentence, which breaks the feeling of a natural conversation.
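If you want to see where these two numbers come from, here is a rough sketch that derives them from token arrival timestamps. The function name and the choice to report ITL as the mean gap between consecutive tokens are my assumptions for illustration; a benchmarking tool may aggregate ITL differently.

```python
def ttft_and_itl_ms(request_sent_s, token_times_s):
    """Derive TTFT and mean ITL (both in ms) from timestamps given in seconds."""
    if not token_times_s:
        raise ValueError("no tokens were received")

    # TTFT: time from sending the request to receiving the first token.
    ttft_ms = (token_times_s[0] - request_sent_s) * 1000

    # ITL: average gap between consecutive tokens (0.0 for a single-token reply).
    gaps = [later - earlier for earlier, later in zip(token_times_s, token_times_s[1:])]
    itl_ms = (sum(gaps) / len(gaps)) * 1000 if gaps else 0.0
    return ttft_ms, itl_ms


# Hypothetical stream: first token after 0.5 s, then one token every 0.25 s.
ttft, itl = ttft_and_itl_ms(0.0, [0.5, 0.75, 1.0, 1.25])
print(f"TTFT={ttft:.0f} ms, ITL={itl:.0f} ms")
```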

Both of these metrics degrade under load. As concurrency increases, the inference server queue backs up. TTFT climbs first, as requests sit waiting to be processed. ITL can follow if GPU compute is saturated.

Throughput stays stable through all of this. The server is still completing requests. It is just that the user experience is becoming progressively worse.

Goodput captures that degradation directly. When TTFT crosses your SLO threshold, those requests stop contributing to goodput. The goodput number drops visibly, even while throughput holds steady.

As I showed in an earlier post, "99% of Requests Failed and My Dashboard Showed Green," you can have a request throughput of 0.91 req/s that looks reasonable, while goodput sits at 0.01 req/s, meaning roughly 99% of requests were silently breaching the SLO.
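The 99% figure falls out of the two rates directly, assuming both were measured over the same time window:

```python
throughput_rps = 0.91  # all completed requests per second
goodput_rps = 0.01     # completed requests per second that met the SLO

# Same measurement window, so the ratio of rates equals the ratio of counts.
compliance = goodput_rps / throughput_rps
breach_pct = (1 - compliance) * 100

print(f"{breach_pct:.1f}% of completed requests breached the SLO")
```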

The Formula

Goodput is straightforward once you have your SLO thresholds defined:

Plain Text
 
Goodput (req/s) = Requests that met ALL SLO constraints / Total measurement time (seconds)


For an LLM workload with TTFT and ITL SLOs:

Plain Text
 
A request counts toward goodput if:
  TTFT < ttft_slo_ms  AND  ITL < itl_slo_ms


Notice that it uses AND, not OR. Both conditions must be satisfied. A request with excellent ITL but a TTFT of 3 seconds still fails. The user waited 3 seconds before seeing anything, which is a broken experience, regardless of how smooth the streaming was after that.
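Because the rule is an AND, it can help to record which constraint a failing request violated. The sketch below is one way to do that. The `classify` function and the sample values are hypothetical, and the thresholds reuse the 500 ms and 100 ms SLOs from earlier in this post.

```python
from collections import Counter

TTFT_SLO_MS = 500
ITL_SLO_MS = 100


def classify(ttft_ms, itl_ms):
    """Label a request by which SLO constraints it met."""
    ttft_ok = ttft_ms < TTFT_SLO_MS
    itl_ok = itl_ms < ITL_SLO_MS
    if ttft_ok and itl_ok:
        return "compliant"
    if not ttft_ok and not itl_ok:
        return "both violated"
    return "ttft violated" if not ttft_ok else "itl violated"


# Hypothetical (ttft_ms, itl_ms) pairs for four requests.
samples = [(320, 40), (3000, 35), (410, 180), (2200, 150)]
breakdown = Counter(classify(ttft, itl) for ttft, itl in samples)
print(dict(breakdown))
```

Only the first request counts toward goodput. The second has smooth streaming but a 3-second wait for the first token, so it fails like the other two.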

Pseudocode: Calculating Goodput

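Here is a simplified sketch, written as Python, of how goodput can be computed. The send_llm_request function is a placeholder for your own inference client, and the loop sends requests one at a time to keep the logic easy to follow:

Python

import time

# SLO thresholds, in milliseconds
TTFT_SLO_MS = 500
ITL_SLO_MS = 100


def send_llm_request(prompt):
    """Placeholder: call your endpoint and return a dict with
    'ttft_ms' and 'itl_ms' (inter-token latency for the request)."""
    raise NotImplementedError("wire this up to your inference client")


def run_benchmark(prompts):
    total_requests = 0
    compliant_requests = 0
    measurement_start = time.perf_counter()

    for prompt in prompts:
        result = send_llm_request(prompt)
        total_requests += 1

        # A request counts toward goodput only if it meets ALL SLO constraints
        if result["ttft_ms"] < TTFT_SLO_MS and result["itl_ms"] < ITL_SLO_MS:
            compliant_requests += 1

    duration_seconds = time.perf_counter() - measurement_start

    throughput = total_requests / duration_seconds
    goodput = compliant_requests / duration_seconds
    # Guard against an empty run so the percentage never divides by zero
    compliance = 100 * compliant_requests / total_requests if total_requests else 0.0

    print(f"Request Throughput (req/s): {throughput:.2f}")
    print(f"Goodput            (req/s): {goodput:.2f}")
    print(f"SLO Compliance Rate (%):    {compliance:.1f}")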


When your system is healthy and under low load, throughput and goodput will be very close. As concurrency increases and the system starts to struggle, you will see goodput diverge downward from throughput. That divergence is your early warning signal.
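One way to turn that divergence into something you can act on is to scan a concurrency sweep for the first level where the goodput-to-throughput ratio drops below a cutoff. This is a sketch only: the sweep numbers are invented, and the 0.9 cutoff is an arbitrary assumption you would tune for your own SLO.

```python
def first_divergence(sweep, min_ratio=0.9):
    """Return the lowest concurrency whose goodput/throughput ratio is below
    min_ratio, or None if every level stays at or above it."""
    for concurrency, throughput, goodput in sorted(sweep):
        if throughput > 0 and goodput / throughput < min_ratio:
            return concurrency
    return None


# Hypothetical sweep results: (concurrency, throughput req/s, goodput req/s).
sweep = [(1, 0.5, 0.5), (4, 1.8, 1.75), (16, 3.9, 2.1), (64, 4.0, 0.3)]

print("SLO starts breaking at concurrency:", first_divergence(sweep))
```

In this made-up sweep, throughput keeps rising all the way to concurrency 64, but the ratio already falls to roughly 0.54 at concurrency 16, which is where the function stops.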

Throughput vs Goodput: Side-by-Side

| Dimension | Throughput | Goodput |
| --- | --- | --- |
| What it measures | All completed requests per second | Completed requests per second that met SLO |
| SLO-aware | No | Yes |
| Fails silently on latency degradation | Yes | No |
| Typical units | req/s, TPS, MB/s, tokens/s | req/s |
| Tool example | JMeter, k6, wrk | NVIDIA AIPerf |
| Use case | Capacity planning, raw volume | User experience validation, production readiness |
| Can look good while users suffer | Yes | No |


When Should You Use Each Metric?

Use throughput when:

  • You are doing capacity planning and need to understand raw system limits
  • You are comparing infrastructure configurations (e.g., 2 GPUs vs. 4 GPUs) at the same load level
  • You are generating a baseline before adding SLO constraints

Use goodput when:

  • You are validating the production readiness of an LLM endpoint
  • You want to know whether users are actually being served well, not just served
  • You are running a concurrency sweep to find the point where your SLO breaks
  • You are integrating LLM performance checks into your CI/CD pipeline

A healthy practice is to report both numbers together. If goodput and throughput are close, your system is healthy. If they diverge significantly, you have a quality problem that raw throughput is hiding.
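For the CI/CD case, reporting both numbers can be reduced to a small gate that fails the build when goodput is too low in absolute terms or relative to throughput. Everything in this sketch is an assumption for illustration: the `goodput_gate` name, the thresholds, and the input values, which you would read from your benchmark output.

```python
def goodput_gate(throughput_rps, goodput_rps, min_goodput_rps, min_ratio):
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    if goodput_rps < min_goodput_rps:
        failures.append(
            f"goodput {goodput_rps:.2f} req/s is below the floor of {min_goodput_rps:.2f} req/s"
        )
    ratio = goodput_rps / throughput_rps if throughput_rps > 0 else 0.0
    if ratio < min_ratio:
        failures.append(f"goodput/throughput ratio {ratio:.2f} is below {min_ratio:.2f}")
    return failures


# Hypothetical results from two benchmark runs.
healthy = goodput_gate(4.0, 3.8, min_goodput_rps=3.0, min_ratio=0.9)
degraded = goodput_gate(0.91, 0.01, min_goodput_rps=0.5, min_ratio=0.9)

for problem in degraded:
    print("FAIL:", problem)
```

In a pipeline step, you would exit with a non-zero status whenever the returned list is not empty.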

Key Takeaway

Throughput answers: Can the system handle the volume?

Goodput answers: Is the system actually serving users well at that volume?

In traditional performance testing, latency SLOs were enforced through assertions and percentile checks. In LLM performance testing, goodput formalizes this into a single metric that is directly comparable to throughput. NVIDIA's AIPerf makes this measurable out of the box with the --goodput-ttft and --goodput-itl flags.

Next time you look at a load test result, ask yourself: Do I know the goodput number? If the answer is no, you only have half the picture.

Happy testing!

Published at DZone with permission of NaveenKumar Namachivayam. See the original article here.

Opinions expressed by DZone contributors are their own.
