TTFT, TPOT, or Goodput? The LLM Benchmark Metric That Actually Matters for AI Agents

Key Takeaways

  • High throughput can still produce a poor agent experience if requests miss TTFT or TPOT targets.
  • TTFT should usually lead live chat benchmarks, while end-to-end latency matters more for background automations.
  • Prompt and output length distributions materially change benchmark results, so synthetic single-shape tests can mislead.
  • Goodput becomes the deciding metric once you have a real latency SLA to meet.

If you are benchmarking an LLM stack for an AI agent, the first question is not which server posts the highest tokens per second. It is which delay actually breaks the workflow. For live support, start with time to first token. For streaming assistants, watch time per output token or inter-token latency next. For long-running automations, total end-to-end latency and goodput under a service-level objective matter more than a flashy throughput peak.

Start with the moment the user feels

Current benchmarking guidance separates TTFT, inter-token latency, end-to-end latency, tokens per second, and requests per second because each one maps to a different failure mode. TTFT is the wait before the first visible answer. End-to-end latency is the full wall-clock time until the last token arrives. ITL and TPOT describe the pace of streaming after the first token, although tools do not always define them in exactly the same way. If you pick the wrong lead metric, you can optimize the wrong part of the experience.
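As a minimal sketch of how these definitions relate, the metrics can be derived from per-token arrival timestamps. The `RequestTrace` structure and function names are hypothetical, and TPOT is computed here under one common convention (decode time averaged over the tokens after the first); as noted above, some tools define it differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Timestamps in seconds for one streamed request (hypothetical structure)."""
    sent_at: float
    token_times: List[float]  # arrival time of each output token, in order


def ttft(trace: RequestTrace) -> float:
    # Wait between sending the request and seeing the first token.
    return trace.token_times[0] - trace.sent_at


def e2e_latency(trace: RequestTrace) -> float:
    # Full wall-clock time until the last token arrives.
    return trace.token_times[-1] - trace.sent_at


def inter_token_latencies(trace: RequestTrace) -> List[float]:
    # Gap between each pair of consecutive tokens.
    t = trace.token_times
    return [later - earlier for earlier, later in zip(t, t[1:])]


def tpot(trace: RequestTrace) -> float:
    # Assumed convention: decode time divided by the tokens after the first.
    n = len(trace.token_times)
    if n < 2:
        return 0.0
    return (e2e_latency(trace) - ttft(trace)) / (n - 1)


# Illustrative trace: four tokens, with a stall before the last one.
example = RequestTrace(sent_at=0.0, token_times=[0.20, 0.25, 0.30, 0.40])
example_ttft = ttft(example)
example_e2e = e2e_latency(example)
example_itl = inter_token_latencies(example)
example_tpot = tpot(example)
```

In this trace the average pace (TPOT) hides the one slow gap that the ITL list exposes, which is why the two are worth reading together for streaming workloads.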

Pick the first metric by workflow

| Workflow | Lead metric | Why it matters first |
| --- | --- | --- |
| Website support chatbot | P95 TTFT | Users judge responsiveness before they judge answer depth. |
| Streaming voice or live copilot | P95 ITL or TPOT | Once the reply starts, uneven token cadence makes the system feel laggy. |
| Research or document agent | E2E latency | The operator cares about total turnaround for a complete result. |
| Shared production serving stack | Goodput under SLO | High raw throughput is meaningless if too many requests miss latency targets. |

Why throughput alone breaks down

Throughput still matters because it shapes cost and capacity. But once a team has a latency target, throughput stops being the whole story. Goodput is the more practical metric because it measures completed requests that still satisfy the service objective. That is the missing filter on many leaderboard-style comparisons.
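To make the contrast concrete, here is a small sketch that computes throughput and goodput from the same set of results. The record fields (`ok`, `ttft_s`, `tpot_s`), the thresholds, and the numbers are all illustrative assumptions; real tools may define goodput over tokens rather than requests, or use different latency criteria.

```python
def throughput_rps(results, window_s):
    """Completed requests per second, regardless of latency."""
    completed = sum(1 for r in results if r["ok"])
    return completed / window_s


def goodput_rps(results, window_s, ttft_slo_s, tpot_slo_s):
    """Completed requests per second that also met both latency targets."""
    good = sum(
        1
        for r in results
        if r["ok"] and r["ttft_s"] <= ttft_slo_s and r["tpot_s"] <= tpot_slo_s
    )
    return good / window_s


# Illustrative results collected over a 2-second window.
results = [
    {"ok": True, "ttft_s": 0.20, "tpot_s": 0.03},  # meets both targets
    {"ok": True, "ttft_s": 0.90, "tpot_s": 0.03},  # misses TTFT target
    {"ok": True, "ttft_s": 0.30, "tpot_s": 0.08},  # misses TPOT target
    {"ok": True, "ttft_s": 0.25, "tpot_s": 0.04},  # meets both targets
]

raw = throughput_rps(results, window_s=2.0)
good = goodput_rps(results, window_s=2.0, ttft_slo_s=0.5, tpot_slo_s=0.05)
```

With these made-up numbers the stack completes every request, yet only half of them count once the latency targets are applied.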

This is especially true for mixed agent workloads. Prompt length drives prefill cost, output length drives decode cost, and concurrency changes the latency-throughput curve. A system can look excellent at unconstrained throughput and still feel worse in production because batching pushes first-token delay too high. For chat applications, a sub-250 ms average TTFT is often a reasonable starting point for responsiveness, but it should be treated as a workflow-specific target, not a universal rule.
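One way to avoid a single-shape test is to sample requests from a weighted mix of prompt and output lengths. The sketch below is only an illustration: the shape names, token counts, and weights are invented placeholders, and a real benchmark would derive them from production traffic.

```python
import random

# Hypothetical workload mix: (label, prompt_tokens, output_tokens, weight).
WORKLOAD_MIX = [
    ("short_chat", 200, 150, 0.6),
    ("rag_answer", 4000, 300, 0.3),
    ("long_report", 1500, 2000, 0.1),
]


def sample_requests(n, seed=0):
    """Draw n request shapes from the weighted mix, reproducibly."""
    rng = random.Random(seed)  # fixed seed so benchmark runs are comparable
    weights = [w for *_, w in WORKLOAD_MIX]
    picks = rng.choices(WORKLOAD_MIX, weights=weights, k=n)
    return [
        {"shape": label, "prompt_tokens": prompt, "output_tokens": output}
        for label, prompt, output, _ in picks
    ]


requests = sample_requests(1000, seed=42)
shape_counts = {}
for req in requests:
    shape_counts[req["shape"]] = shape_counts.get(req["shape"], 0) + 1
```

Fixing the seed keeps the mix identical across the systems being compared, so differences in the results come from the serving stack rather than from the sampled traffic.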

What the infrastructure shift is telling teams

TTFT and TPOT matter more in modern serving decisions because infrastructure is increasingly built around them. Disaggregated serving splits prefill and decode because they stress different resources. Prefill is typically compute-bound and decode is typically memory-bandwidth-bound, so teams can scale them independently and set separate TTFT and ITL objectives instead of compromising on one blended metric.

That matters for agent builders because one workflow can be front-end sensitive while another is back-office heavy:

  • A customer-support bot loses users when TTFT is slow, even if the eventual answer is good.
  • A coding or document agent can tolerate a slower first token if the full task finishes quickly and reliably.
  • A multi-tenant platform should compare systems by the request rate that still satisfies latency SLOs, not by the highest token count on an unconstrained run.
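The last comparison can be sketched as a lookup over a load sweep: for each system, find the highest tested request rate whose tail latencies still fit the SLO. The sweep numbers, field names, and thresholds below are invented for illustration, and the function simply takes the highest passing point rather than checking that every lower rate also passes.

```python
def max_rate_under_slo(sweep, p95_ttft_slo_s, p95_itl_slo_s):
    """Highest tested request rate whose P95 TTFT and P95 ITL meet the SLO."""
    passing = [
        point["rate_rps"]
        for point in sweep
        if point["p95_ttft_s"] <= p95_ttft_slo_s
        and point["p95_itl_s"] <= p95_itl_slo_s
    ]
    return max(passing, default=0.0)


# Hypothetical sweeps: system A reaches a higher peak rate, B degrades later.
sweep_a = [
    {"rate_rps": 2, "p95_ttft_s": 0.15, "p95_itl_s": 0.02},
    {"rate_rps": 4, "p95_ttft_s": 0.22, "p95_itl_s": 0.03},
    {"rate_rps": 8, "p95_ttft_s": 0.60, "p95_itl_s": 0.04},
    {"rate_rps": 16, "p95_ttft_s": 1.80, "p95_itl_s": 0.09},
]
sweep_b = [
    {"rate_rps": 2, "p95_ttft_s": 0.20, "p95_itl_s": 0.03},
    {"rate_rps": 4, "p95_ttft_s": 0.28, "p95_itl_s": 0.03},
    {"rate_rps": 8, "p95_ttft_s": 0.41, "p95_itl_s": 0.04},
    {"rate_rps": 12, "p95_ttft_s": 0.95, "p95_itl_s": 0.06},
]

rate_a = max_rate_under_slo(sweep_a, p95_ttft_slo_s=0.5, p95_itl_slo_s=0.05)
rate_b = max_rate_under_slo(sweep_b, p95_ttft_slo_s=0.5, p95_itl_slo_s=0.05)
```

In this made-up data, system A was pushed to a higher unconstrained rate, but system B sustains twice the request rate once the latency targets are enforced.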

How to benchmark without fooling yourself

Most bad LLM benchmarks fail before the chart is drawn. They use the wrong prompt mix, the wrong concurrency, or the wrong success definition. A better process is simple:

  1. Set the workflow SLA first. Decide whether the experience breaks on first-token delay, streaming cadence, or full completion time.
  2. Benchmark against realistic input and output length distributions. Sequence-length mix changes hardware utilization and changes the result.
  3. Measure percentiles, not just means. Tail latency is often what users notice first.
  4. Plot latency against throughput across concurrency levels. One best-case run does not tell you where the system stops feeling comfortable.
  5. When comparing serving stacks or deployment shapes, use goodput once an SLA exists. That is the point where raw requests per second stops being honest.
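Step 3 is easy to illustrate with a nearest-rank percentile, which is one of several accepted percentile definitions. The sample values below are invented; they show how a single slow request barely moves the median while dominating P95 and inflating the mean.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    # Multiply before dividing to avoid float rounding in the rank.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Illustrative TTFT samples in seconds: nine fast requests and one outlier.
ttft_samples = [0.18, 0.20, 0.21, 0.19, 0.22, 0.20, 0.23, 0.21, 0.20, 1.50]

summary = {
    "mean": mean(ttft_samples),
    "p50": percentile(ttft_samples, 50),
    "p95": percentile(ttft_samples, 95),
}
```

Computing the same summary at each concurrency level gives the points for the latency-throughput curve described in step 4.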

When benchmark winners still fail in production

A benchmark winner can still disappoint once you add long prompts, cache misses, retrieval, tool calls, or shared traffic. Queueing time grows. TTFT drifts. Streaming gets choppy. And a model-server pair that looks cheap in tokens per second can become expensive when retries, timeout buffers, or overprovisioning are needed to hold the SLA.

The practical rule is straightforward. Use TTFT to protect responsiveness, TPOT or ITL to protect streaming quality, end-to-end latency to protect completion time, and goodput to choose an architecture under load. If your team is arguing about a single benchmark number, it is probably compressing several different business requirements into one metric that cannot carry them.

Choose the first metric before you optimize the stack

Start with the workflow outcome that users or operators feel first, then add the secondary metric that catches the next failure mode.

| If your workflow is | Prioritize | Then check |
| --- | --- | --- |
| Customer support chatbot | P95 TTFT | P95 ITL and fallback rate |
| Voice or live assistant | P95 ITL or TPOT | Turn-taking interruptions and TTFT |
| Research or document agent | End-to-end latency | Cost per completed run |
| Shared multi-tenant serving layer | Goodput under SLA | Queue time and GPU utilization |

  • Write the latency SLA before you run the benchmark.
  • Test with realistic prompt and response length distributions.
  • Read the latency-throughput curve across concurrency, not one snapshot.
  • Compare architectures by goodput when user-facing latency matters.

Frequently Asked Questions

What is the difference between TTFT and end-to-end latency?

TTFT measures how long a user waits before the first token appears. End-to-end latency measures the total time from request submission until the final token arrives.

When should I care about TPOT or inter-token latency?

TPOT or inter-token latency matters most when a user experiences the answer as a stream, such as live chat, voice, or copilots. It affects how smooth the response feels after it starts.

What is goodput in LLM serving?

Goodput is the rate of completed requests that still meet the latency or service objective you set. It is more useful than raw throughput when a system must stay responsive under load.

Why can a high-throughput benchmark still lead to a bad user experience?

A system can process many tokens or requests overall while still delaying first-token response, increasing queue time, or producing choppy streaming under real concurrency. Users feel those delays even if aggregate throughput looks strong.

Does disaggregated serving only matter at very large scale?

It matters most when prompt-heavy and decode-heavy work create different bottlenecks, especially in shared production systems. Smaller teams may not need it immediately, but the same TTFT and ITL logic still helps them benchmark correctly.

Set the right latency target before you optimize the stack

If you are deciding whether your workflow should optimize TTFT, streaming cadence, end-to-end completion time, or architecture-level goodput, Scope can map the real bottleneck first and turn that into a practical rollout plan.
