If you are benchmarking an LLM stack for an AI agent, the first question is not which server posts the highest tokens per second. It is which delay actually breaks the workflow. For live support, start with time to first token (TTFT). For streaming assistants, watch time per output token (TPOT) or inter-token latency (ITL) next. For long-running automations, total end-to-end (E2E) latency and goodput under a service-level objective (SLO) matter more than a flashy throughput peak.
Start with the moment the user feels
Current benchmarking guidance separates TTFT, inter-token latency, end-to-end latency, tokens per second, and requests per second because each one maps to a different failure mode. TTFT is the wait before the first visible answer. End-to-end latency is the full wall-clock time until the last token arrives. ITL and TPOT describe the pace of streaming after the first token, although tools do not always define them in exactly the same way. If you pick the wrong lead metric, you can optimize the wrong part of the experience.
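To make these definitions concrete, here is a minimal sketch, assuming you have recorded the send time and the arrival time of every output token for a streamed request. The `RequestTrace` structure and the function names are hypothetical, not taken from any benchmarking tool. The TPOT convention used here (decode time divided by the number of tokens after the first) is one common choice; as noted above, tools differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Timestamps in seconds for one streamed request (hypothetical structure)."""
    sent_at: float            # when the request was sent
    token_times: List[float]  # arrival time of each output token, in order


def ttft(trace: RequestTrace) -> float:
    """Time to first token: wait before the first visible output."""
    return trace.token_times[0] - trace.sent_at


def e2e_latency(trace: RequestTrace) -> float:
    """End-to-end latency: wall-clock time until the last token arrives."""
    return trace.token_times[-1] - trace.sent_at


def inter_token_latencies(trace: RequestTrace) -> List[float]:
    """Gaps between consecutive tokens after the first one."""
    times = trace.token_times
    return [later - earlier for earlier, later in zip(times, times[1:])]


def tpot(trace: RequestTrace) -> float:
    """Time per output token, assumed here to be decode time / (tokens - 1)."""
    n_tokens = len(trace.token_times)
    if n_tokens < 2:
        return 0.0  # a single-token reply has no decode phase to average
    return (e2e_latency(trace) - ttft(trace)) / (n_tokens - 1)


example = RequestTrace(sent_at=0.0, token_times=[0.2, 0.25, 0.35, 0.4])
print(f"TTFT={ttft(example):.3f}s  E2E={e2e_latency(example):.3f}s  "
      f"TPOT={tpot(example):.3f}s")
```

Under this convention, TPOT equals the mean of the inter-token gaps for a single request, while the individual gaps from `inter_token_latencies` show how uneven the cadence was.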
Pick the first metric by workflow
| Workflow | Lead metric | Why it matters first |
|---|---|---|
| Website support chatbot | P95 TTFT | Users judge responsiveness before they judge answer depth. |
| Streaming voice or live copilot | P95 ITL or TPOT | Once the reply starts, uneven token cadence makes the system feel laggy. |
| Research or document agent | E2E latency | The operator cares about total turnaround for a complete result. |
| Shared production serving stack | Goodput under SLO | High raw throughput is meaningless if too many requests miss latency targets. |
Why throughput alone breaks down
Throughput still matters because it shapes cost and capacity. But once a team has a latency target, throughput stops being the whole story. Goodput is the more practical metric because it counts only the completed requests that also satisfy the service-level objective. That is the missing filter on many leaderboard-style comparisons.
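The difference between throughput and goodput can be sketched in a few lines. This is an illustration under stated assumptions, not a standard API: the `RequestResult` structure, the `throughput_and_goodput` function, and the choice of a TTFT limit plus an end-to-end limit as the SLO are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RequestResult:
    """Outcome of one benchmarked request (hypothetical structure)."""
    ttft_s: float     # time to first token, seconds
    e2e_s: float      # end-to-end latency, seconds
    completed: bool   # False for errors and timeouts


def throughput_and_goodput(
    results: List[RequestResult],
    window_s: float,
    ttft_slo_s: float,
    e2e_slo_s: float,
) -> Tuple[float, float]:
    """Return (throughput, goodput) in requests per second.

    Throughput counts every completed request. Goodput counts only the
    completed requests that also meet both latency limits.
    """
    completed = [r for r in results if r.completed]
    within_slo = [
        r for r in completed
        if r.ttft_s <= ttft_slo_s and r.e2e_s <= e2e_slo_s
    ]
    return len(completed) / window_s, len(within_slo) / window_s


results = [
    RequestResult(ttft_s=0.10, e2e_s=1.0, completed=True),   # meets the SLO
    RequestResult(ttft_s=0.50, e2e_s=1.0, completed=True),   # TTFT too slow
    RequestResult(ttft_s=0.10, e2e_s=3.0, completed=True),   # E2E too slow
    RequestResult(ttft_s=0.10, e2e_s=1.0, completed=False),  # failed request
]
rps, good_rps = throughput_and_goodput(
    results, window_s=2.0, ttft_slo_s=0.25, e2e_slo_s=2.0
)
print(f"throughput={rps} req/s, goodput={good_rps} req/s")
```

In this toy run, three requests complete in the two-second window but only one meets both limits, so the goodput is a third of the raw throughput.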
This is especially true for mixed agent workloads. Prompt length drives prefill cost, output length drives decode cost, and concurrency changes the latency-throughput curve. A system can look excellent at unconstrained throughput and still feel worse in production because batching pushes first-token delay too high. For chat applications, a sub-250 ms average TTFT is often a reasonable starting point for responsiveness, but it should be treated as a workflow-specific target, not a universal rule.
What the infrastructure shift is telling teams
The reason TTFT and TPOT matter more in modern serving decisions is that infrastructure is increasingly built around them. Disaggregated serving splits prefill and decode because they stress different resources. Prefill is typically compute-bound and decode is typically memory-bandwidth-bound, so teams can scale them independently and set separate TTFT and ITL objectives instead of compromising on one blended metric.
That matters for agent builders because one workflow can be front-end sensitive while another is back-office heavy:
- A customer-support bot fails fast when TTFT is slow, even if the eventual answer is good.
- A coding or document agent can tolerate a slower first token if the full task finishes quickly and reliably.
- A multi-tenant platform should compare systems by the request rate that still satisfies latency SLOs, not by the highest tokens-per-second figure from an unconstrained run.
How to benchmark without fooling yourself
Most bad LLM benchmarks fail before the chart is drawn. They use the wrong prompt mix, the wrong concurrency, or the wrong success definition. A better process is simple:
- Set the workflow SLO first. Decide whether the experience breaks on first-token delay, streaming cadence, or full completion time.
- Benchmark against realistic input and output length distributions. Sequence-length mix changes hardware utilization and changes the result.
- Measure percentiles, not just means. Tail latency is often what users notice first.
- Plot latency against throughput across concurrency levels. One best-case run does not tell you where the system stops feeling comfortable.
- When comparing serving stacks or deployment shapes, use goodput once an SLO exists. That is the point where raw requests per second stops being honest.
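The percentile and concurrency-sweep steps can be sketched as follows. This is a minimal illustration, assuming you already have TTFT samples collected at several concurrency levels. The `percentile` helper uses the nearest-rank method (one of several percentile definitions), and `max_concurrency_within_slo` is a hypothetical name, not part of any benchmarking tool.

```python
import math
from typing import Dict, List, Optional


def percentile(values: List[float], p: float) -> float:
    """Nearest-rank percentile for p in (0, 100]."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def max_concurrency_within_slo(
    sweep: Dict[int, List[float]], p95_ttft_slo_s: float
) -> Optional[int]:
    """Highest concurrency level whose P95 TTFT still meets the SLO.

    `sweep` maps a concurrency level to the TTFT samples (seconds)
    measured at that level. Returns None if no level passes.
    """
    passing = [
        level for level, samples in sweep.items()
        if percentile(samples, 95) <= p95_ttft_slo_s
    ]
    return max(passing) if passing else None


# Made-up samples for illustration only, not measurements.
sweep = {
    1: [0.10] * 20,
    8: [0.10] * 19 + [0.90],    # one slow request out of twenty
    32: [0.10] * 18 + [0.90] * 2,  # two slow requests out of twenty
}
for level, samples in sorted(sweep.items()):
    mean = sum(samples) / len(samples)
    print(f"concurrency={level:>2}  mean={mean:.3f}s  "
          f"p95={percentile(samples, 95):.3f}s")
print("highest level within SLO:", max_concurrency_within_slo(sweep, 0.25))
```

In the made-up data, the mean TTFT at concurrency 32 is still below 0.25 s while the P95 is not, which is the kind of gap a mean-only report hides.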
When benchmark winners still fail in production
A benchmark winner can still disappoint once you add long prompts, cache misses, retrieval, tool calls, or shared traffic. Queueing time grows. TTFT drifts. Streaming gets choppy. And a model-server pair that looks cheap in tokens per second can become expensive when retries, timeout buffers, or overprovisioning are needed to hold the SLO.
The practical rule is straightforward. Use TTFT to protect responsiveness, TPOT or ITL to protect streaming quality, end-to-end latency to protect completion time, and goodput to choose an architecture under load. If your team is arguing about a single benchmark number, it is probably compressing several different business requirements into one metric that cannot carry them.