What is the difference between TTFT and end-to-end latency?

TTFT measures how long a user waits before the first token appears. End-to-end latency measures the total time from request submission until the final token arrives.

When should I care about TPOT or inter-token latency?

TPOT or inter-token latency matters most when a user experiences the answer as a stream, such as live chat, voice, or copilots. It affects how smooth the response feels after it starts.

What is goodput in LLM serving?

Goodput is the rate of completed requests that still meet the latency or service objective you set. It is more useful than raw throughput when a system must stay responsive under load.

Why can a high-throughput benchmark still lead to a bad user experience?

A system can process many tokens or requests overall while still delaying first-token response, increasing queue time, or producing choppy streaming under real concurrency. Users feel those delays even if aggregate throughput looks strong.

Does disaggregated serving only matter at very large scale?

It matters most when prompt-heavy and decode-heavy work create different bottlenecks, especially in shared production systems. Smaller teams may not need it immediately, but the same TTFT and ITL logic still helps them benchmark correctly.

TTFT vs TPOT vs E2E Latency: Which Metric Actually Predicts AI Agent Performance?

If you are benchmarking an LLM stack for an AI agent, the first question is not which server posts the highest tokens per second. It is which delay actually breaks the workflow. For live support, start with time to first token. For streaming assistants, watch time per output token or inter-token latency next. For long-running automations, total end-to-end latency and goodput under a service-level objective matter more than a flashy throughput peak.

Start with the moment the user feels

Current benchmarking guidance separates TTFT, inter-token latency, end-to-end latency, tokens per second, and requests per second because each one maps to a different failure mode. TTFT is the wait before the first visible answer. End-to-end latency is the full wall-clock time until the last token arrives. ITL and TPOT describe the pace of streaming after the first token, although tools do not always define them in exactly the same way. If you pick the wrong lead metric, you can optimize the wrong part of the experience.
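As a rough sketch of how these metrics relate, assume you log the time a request was sent and the arrival time of each streamed token. The function and field names below are illustrative and not taken from any particular benchmarking tool. TPOT is computed here as mean decode time per token after the first, which is one common convention; as noted above, tools differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestMetrics:
    ttft: float        # seconds from send to first token
    e2e: float         # seconds from send to last token
    itls: List[float]  # gaps between consecutive tokens, in seconds
    tpot: float        # mean time per output token after the first


def latency_metrics(sent_at: float, token_times: List[float]) -> RequestMetrics:
    """Derive per-request latency metrics from raw timestamps.

    Assumes token_times is non-empty and sorted ascending.
    """
    if not token_times:
        raise ValueError("need at least one token timestamp")
    ttft = token_times[0] - sent_at
    e2e = token_times[-1] - sent_at
    itls = [b - a for a, b in zip(token_times, token_times[1:])]
    # Assumed convention: TPOT = decode time / (output tokens - 1).
    n_gaps = len(token_times) - 1
    tpot = (e2e - ttft) / n_gaps if n_gaps > 0 else 0.0
    return RequestMetrics(ttft=ttft, e2e=e2e, itls=itls, tpot=tpot)


# Hypothetical timestamps, in seconds since the request was sent.
m = latency_metrics(sent_at=0.0, token_times=[0.5, 0.75, 1.0, 1.5])
print(m.ttft, m.e2e, m.itls, m.tpot)
```

Under this convention, end-to-end latency is TTFT plus TPOT times the number of tokens after the first, which is why a slow first token and a slow stream show up as separate problems.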

Pick the first metric by workflow

Workflow	Lead metric	Why it matters first
Website support chatbot	P95 TTFT	Users judge responsiveness before they judge answer depth.
Streaming voice or live copilot	P95 ITL or TPOT	Once the reply starts, uneven token cadence makes the system feel laggy.
Research or document agent	E2E latency	The operator cares about total turnaround for a complete result.
Shared production serving stack	Goodput under SLO	High raw throughput is meaningless if too many requests miss latency targets.

Why throughput alone breaks down

Throughput still matters because it shapes cost and capacity. But once a team has a latency target, throughput stops being the whole story. Goodput is the more practical metric because it measures completed requests that still satisfy the service objective. That is the missing filter on many leaderboard-style comparisons.
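To make the distinction concrete, here is a minimal sketch of goodput next to raw throughput. It assumes each completed request is recorded as a (TTFT, end-to-end) pair in seconds and that the objective is a simple pair of thresholds; the function name, parameters, and sample numbers are hypothetical.

```python
from typing import Iterable, Tuple


def goodput(
    requests: Iterable[Tuple[float, float]],
    window_s: float,
    ttft_slo_s: float,
    e2e_slo_s: float,
) -> float:
    """Requests per second that completed within both latency objectives.

    Each request is a (ttft_seconds, e2e_seconds) pair for a completed request.
    """
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    good = sum(
        1 for ttft, e2e in requests if ttft <= ttft_slo_s and e2e <= e2e_slo_s
    )
    return good / window_s


# Hypothetical measurements collected over a 2-second window.
completed = [(0.2, 3.0), (0.3, 4.0), (0.9, 3.5), (0.25, 9.0)]
raw_throughput = len(completed) / 2.0
good = goodput(completed, window_s=2.0, ttft_slo_s=0.5, e2e_slo_s=5.0)
print(raw_throughput, good)
```

In this made-up sample, half of the completed requests miss one of the two thresholds, so goodput is half of raw throughput even though every request finished.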

This is especially true for mixed agent workloads. Prompt length drives prefill cost, output length drives decode cost, and concurrency changes the latency-throughput curve. A system can look excellent at unconstrained throughput and still feel worse in production because batching pushes first-token delay too high. For chat applications, a sub-250 ms average TTFT is often a reasonable starting point for responsiveness, but it should be treated as a workflow-specific target, not a universal rule.

What the infrastructure shift is telling teams

The reason TTFT and TPOT matter more in modern serving decisions is that infrastructure is increasingly built around them. Disaggregated serving splits prefill and decode because they stress different resources. Prefill is typically compute-bound and decode is typically memory-bandwidth-bound, so teams can scale them independently and set separate TTFT and ITL objectives instead of compromising on one blended metric.

That matters for agent builders because one workflow can be front-end sensitive while another is back-office heavy:

A customer-support bot fails fast when TTFT is slow, even if the eventual answer is good.
A coding or document agent can tolerate a slower first token if the full task finishes quickly and reliably.
A multi-tenant platform should compare systems by the request rate that still satisfies latency SLOs, not by the highest token count on an unconstrained run.
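The last comparison can be sketched in a few lines. This assumes you have already run a load sweep and recorded, for each offered request rate, the P95 TTFT observed at that rate. The function name and all numbers are hypothetical, for illustration only.

```python
from typing import List, Optional, Tuple


def max_rate_within_slo(
    sweep: List[Tuple[float, float]], p95_ttft_slo_s: float
) -> Optional[float]:
    """Highest measured request rate whose P95 TTFT meets the objective.

    sweep holds (requests_per_second, p95_ttft_seconds) pairs from separate runs.
    Returns None if no run meets the objective.
    """
    passing = [rate for rate, p95 in sweep if p95 <= p95_ttft_slo_s]
    return max(passing) if passing else None


# Hypothetical sweep results for two serving stacks.
stack_a = [(2.0, 0.20), (4.0, 0.35), (8.0, 1.20)]
stack_b = [(2.0, 0.25), (4.0, 0.30), (8.0, 0.45)]

print(max_rate_within_slo(stack_a, p95_ttft_slo_s=0.5))
print(max_rate_within_slo(stack_b, p95_ttft_slo_s=0.5))
```

In this invented data, stack A is faster at low load but stack B sustains a higher request rate inside the objective, which is the number a shared platform would compare.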

How to benchmark without fooling yourself

Most bad LLM benchmarks fail before the chart is drawn. They use the wrong prompt mix, the wrong concurrency, or the wrong success definition. A better process is simple:

Set the workflow SLO first. Decide whether the experience breaks on first-token delay, streaming cadence, or full completion time.
Benchmark against realistic input and output length distributions. The sequence-length mix changes hardware utilization, and with it the result.
Measure percentiles, not just means. Tail latency is often what users notice first.
Plot latency against throughput across concurrency levels. One best-case run does not tell you where the system stops feeling comfortable.
When comparing serving stacks or deployment shapes, use goodput once an SLO exists. That is the point where raw requests per second stops being honest.
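The point about percentiles can be shown with a small sketch. It uses the nearest-rank definition of a percentile, which is one of several in common use, and the sample values are hypothetical.

```python
import math
import statistics
from typing import List


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical TTFT samples in seconds: mostly fast, with two slow outliers.
ttfts = [0.2] * 18 + [2.0, 3.0]
print("mean:", statistics.mean(ttfts))
print("p50:", percentile(ttfts, 50))
print("p95:", percentile(ttfts, 95))
```

With a sample like this, the mean and median both look acceptable while P95 exposes the slow requests, which is why the lead metrics in this article are stated as P95 values.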

When benchmark winners still fail in production

A benchmark winner can still disappoint once you add long prompts, cache misses, retrieval, tool calls, or shared traffic. Queueing time grows. TTFT drifts. Streaming gets choppy. And a model-server pair that looks cheap in tokens per second can become expensive when retries, timeout buffers, or overprovisioning are needed to hold the SLO.

The practical rule is straightforward. Use TTFT to protect responsiveness, TPOT or ITL to protect streaming quality, end-to-end latency to protect completion time, and goodput to choose an architecture under load. If your team is arguing about a single benchmark number, it is probably compressing several different business requirements into one metric that cannot carry them.

If your workflow is	Prioritize	Then check
Customer support chatbot	P95 TTFT	P95 ITL and fallback rate
Voice or live assistant	P95 ITL or TPOT	Turn-taking interruptions and TTFT
Research or document agent	End-to-end latency	Cost per completed run
Shared multi-tenant serving layer	Goodput under SLO	Queue time and GPU utilization
