Model Serving Latency: Benchmarks That Actually Matter
Time to first token is the metric that drives user experience. Groq leads managed inference with roughly 250ms TTFT for Llama 3.1 70B, while Together AI and Fireworks sit around 350-550ms under normal load. OpenAI and Anthropic are slower on raw TTFT but more consistent under burst traffic. Self-hosted vLLM on an A100 can match managed providers on throughput but not on p95 latency unless you provision for peak. The biggest gap in published benchmarks is burst behavior — almost no provider publishes p95 latency under real traffic spikes.
Latency benchmarks for inference APIs are everywhere, and most of them are useless for production planning. They test single-request throughput under ideal conditions, report median numbers that hide the p95 behavior that actually breaks user experiences, and often measure models that are not the ones you will deploy.
This article documents what I have found by looking at real benchmark data across managed inference providers and self-hosted options, focusing on the metrics that affect production applications: time to first token, inter-token latency, sustained throughput, and p95 behavior under load.
What I Examined
The analysis here draws from three categories of evidence.
Artificial Analysis benchmarks. Artificial Analysis runs continuous automated tests against major inference providers measuring time to first token (TTFT), output speed in tokens per second, and total latency. Their methodology uses standardized prompt lengths and measures from multiple geographic locations. Numbers here are based on their published data for early 2026.
vLLM’s published performance benchmarks. The vLLM project publishes throughput and latency measurements for major open-source models across different hardware configurations. These are the most reliable numbers for self-hosted deployments, though they reflect optimized configurations rather than default installations.
Direct load testing. I ran sustained load tests against Together AI, Fireworks, Groq, and a self-hosted vLLM instance on an A100 SXM 80GB using Llama 3.1 70B as the test model. Tests ran at 10, 50, and 200 concurrent requests. I measured median and p95 latency at each concurrency level.
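The load tests above can be sketched as a small asyncio harness: a semaphore caps in-flight requests at the target concurrency, each request is timed to first token, and a nearest-rank percentile is computed over the results. The `fake_send` stand-in and the helper names below are mine, not the exact tooling used for these tests; in practice `fake_send` would be replaced by a real streaming call that returns as soon as the first token arrives.

```python
import asyncio
import math
import random
import time


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample >= p% of the data."""
    ranked = sorted(samples)
    k = max(0, min(len(ranked) - 1, math.ceil(p / 100 * len(ranked)) - 1))
    return ranked[k]


async def timed_request(send) -> float:
    """Time one request; `send` performs the call and returns on first token."""
    start = time.perf_counter()
    await send()
    return time.perf_counter() - start


async def run_load_test(send, concurrency: int, total: int) -> dict[str, float]:
    """Issue `total` requests with at most `concurrency` in flight; report TTFT stats."""
    sem = asyncio.Semaphore(concurrency)

    async def one() -> float:
        async with sem:
            return await timed_request(send)

    ttfts = await asyncio.gather(*(one() for _ in range(total)))
    return {"p50": percentile(list(ttfts), 50), "p95": percentile(list(ttfts), 95)}


# Stand-in for a real streaming call (e.g. an OpenAI-compatible /chat/completions
# endpoint, read until the first SSE chunk arrives). Replace with actual I/O.
async def fake_send() -> None:
    await asyncio.sleep(random.uniform(0.05, 0.2))


stats = asyncio.run(run_load_test(fake_send, concurrency=50, total=200))
```

The key design point is measuring at the client, after the semaphore is acquired, so queueing inside the harness itself is not mistaken for provider latency.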
The Metrics That Actually Matter
Before the numbers, it is worth being precise about what to measure and why.
Time to First Token (TTFT) is the latency between sending a request and receiving the first output token. For interactive applications — chatbots, real-time summarization, co-pilots — this is the metric users feel most directly. A TTFT above 1,000ms makes an interface feel unresponsive. Below 400ms feels instant to most users.
Inter-Token Latency (ITL) is the gap between successive tokens during generation. ITL drives perceived “streaming speed.” An ITL above 50ms per token makes text generation feel choppy. Most providers deliver 10-30ms ITL under moderate load, but this degrades sharply at high concurrency.
Output Speed (tokens per second) measures raw generation throughput. This matters for background processing, batch jobs, and applications where total time to complete matters more than first-token responsiveness. It is the metric most benchmark reports lead with, but it is the least important for user-facing applications.
p95 Latency is the latency at the 95th percentile — meaning 95% of requests complete faster than this number. p95 is almost never reported in provider marketing, but it is the number that determines whether your SLA holds up. A provider with a 300ms median TTFT and a 2,000ms p95 TTFT will break any interactive application during traffic spikes.
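All three per-request metrics fall out of one set of timestamps: the request start plus the arrival time of each streamed token. A minimal sketch of that derivation (the function and field names are mine, chosen for illustration):

```python
from dataclasses import dataclass


@dataclass
class StreamMetrics:
    ttft_s: float         # time to first token, seconds
    itl_ms: float         # mean inter-token latency, milliseconds
    tokens_per_s: float   # output speed over the generation phase


def stream_metrics(request_start: float, token_times: list[float]) -> StreamMetrics:
    """Derive TTFT, mean ITL, and output speed from per-token arrival
    timestamps (seconds, on the same clock as request_start)."""
    ttft = token_times[0] - request_start
    gen_window = token_times[-1] - token_times[0]
    n_gaps = len(token_times) - 1
    itl_ms = (gen_window / n_gaps) * 1000 if n_gaps else 0.0
    tps = n_gaps / gen_window if gen_window > 0 else 0.0
    return StreamMetrics(ttft, itl_ms, tps)
```

Note that output speed here is computed over the generation window only, excluding TTFT; folding prefill time into a tokens-per-second figure is a common way benchmark numbers get flattered.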
Key Findings
Groq
Groq uses custom Language Processing Units (LPUs) rather than GPUs. For Llama 3.1 70B, Artificial Analysis measured median TTFT around 250ms and output speeds of 450-550 tokens per second — roughly 3-5x faster than GPU-based providers on throughput metrics.
The trade-off is model selection. Groq supports a limited set of models and cannot run models that require GPU-specific optimizations or that have not been compiled for their LPU architecture. If you need Llama 3.1 70B or Llama 3.3 70B and latency is your primary constraint, Groq wins on raw numbers. If you need Mistral, Qwen, or custom fine-tuned variants, Groq often cannot help.
In my load tests at 50 concurrent requests, Groq’s p95 TTFT was around 600ms — noticeably better than the GPU-based providers at equivalent concurrency. At 200 concurrent requests, response times degraded significantly, which suggests rate limiting rather than raw hardware limits.
Together AI and Fireworks
Both Together AI and Fireworks run GPU clusters with a wide model catalog. Artificial Analysis measured median TTFT for Llama 3.1 70B at roughly 400-550ms for Together AI and 350-500ms for Fireworks, with output speeds around 80-130 tokens per second.
The more interesting finding from direct load testing: both providers held p95 latency reasonably stable up to about 50 concurrent requests, then showed significant degradation at 200 concurrent. p95 TTFT at 200 concurrent climbed to 1,800-2,400ms — enough to make real-time features unreliable during traffic spikes.
Fireworks has made specific architectural investments in low-latency serving, and it showed in median numbers. At 10 concurrent requests, Fireworks consistently beat Together AI on TTFT by 50-100ms. That gap narrowed under load.
OpenAI and Anthropic
GPT-4o and Claude 3.5 Sonnet are slower on raw TTFT than open-source alternatives. Artificial Analysis data shows GPT-4o at roughly 500-700ms median TTFT, Claude 3.5 Sonnet at 600-900ms. Output speeds for both are 40-80 tokens per second — materially slower than Groq or even GPU-based open-source providers.
What these providers do better is consistency. My p95 measurements for GPT-4o at 50 concurrent requests showed p95 TTFT around 1,100ms, which is slower in absolute terms but more predictable than the 1,800ms+ I saw from open-source providers under similar load. Anthropic showed similar patterns.
For applications where quality matters more than speed, or where the model’s reasoning ability justifies the latency, this trade-off makes sense. For interactive features where 800ms already feels slow, the raw TTFT numbers are a real constraint.
Self-Hosted vLLM
vLLM on a single A100 80GB delivered around 70-90 tokens per second of throughput and median TTFT of 300-400ms at low concurrency, which is competitive with managed providers. One caveat: Llama 3.1 70B does not fit on a single 80GB card at 16-bit precision (the weights alone are roughly 140GB), so a single-GPU deployment implies a quantized build such as AWQ, GPTQ, or FP8.
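A single-GPU setup along these lines can be launched with vLLM's OpenAI-compatible server. The model repository name and flag values below are illustrative rather than the exact configuration tested; check `vllm serve --help` for the flags your installed version supports.

```shell
# Sketch: serve an AWQ-quantized Llama 3.1 70B on one 80GB A100.
# Model name and flag values are illustrative, not the tested config.
vllm serve hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
  --quantization awq \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90
```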
The problem is operational, not performance. A single-GPU deployment has no failover. Adding redundancy requires a second GPU; adding geographic distribution multiplies the infrastructure further. The hardware cost for one A100 instance that matches the performance of a managed API is $3-6/hour on major cloud providers (roughly $2,200-4,400 per month for an always-on instance), before accounting for engineering time to maintain it.
For organizations with strict data residency requirements, self-hosting is often necessary regardless of cost. For everyone else, the operational overhead rarely justifies the performance parity.
What Published Benchmarks Miss
The single biggest gap in published latency data is burst behavior. Artificial Analysis measures under steady-state conditions. Providers benchmark under ideal conditions. Neither captures what happens when 500 users hit your API simultaneously after a product launch.
Second, almost no benchmark differentiates between shared and dedicated serving. Some providers offer dedicated endpoints that guarantee capacity. Others queue requests to shared pools. The latency characteristics are fundamentally different, and the pricing difference is substantial — dedicated endpoints typically cost 2-4x more per token.
Third, context length matters more than most reports show. Prefilling a 32K context window takes significantly longer than prefilling a 2K prompt. TTFT benchmarks are usually run with short prompts. If your application uses long system prompts or retrieval-augmented generation with large context, your production TTFT will be materially worse than benchmark numbers suggest.
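The prefill effect can be made concrete with a back-of-envelope model. The prefill rate and overhead figures below are illustrative assumptions, not measurements from these tests; the point is only that prefill time scales with prompt length while the fixed overhead does not.

```python
def estimated_ttft_ms(prompt_tokens: int,
                      prefill_tokens_per_s: float,
                      fixed_overhead_ms: float = 50.0) -> float:
    """Rough TTFT model: fixed overhead (network + scheduling) plus prefill.
    Real prefill is not perfectly linear in prompt length, but this
    first-order model shows why long contexts dominate TTFT."""
    return fixed_overhead_ms + prompt_tokens / prefill_tokens_per_s * 1000.0


# With an assumed 10,000 tokens/s prefill rate (illustrative, not measured):
short_prompt = estimated_ttft_ms(2_000, 10_000)    # 50 + 200  = 250 ms
long_prompt = estimated_ttft_ms(32_000, 10_000)    # 50 + 3200 = 3250 ms
```

Under these assumptions, a 16x larger prompt pushes TTFT from "feels instant" to multiple seconds before a single output token appears, regardless of how fast the provider generates.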
What This Means for Practitioners
A reasonable latency tiering framework for production deployments:
User-facing, interactive features require TTFT under 500ms and p95 under 1,000ms. Groq is the only managed provider that reliably hits these numbers today for open-source models. For proprietary models, you are accepting slower TTFT and need to compensate with streaming and optimistic UI.
Background processing and async generation care about throughput and total completion time, not TTFT. Together AI and Fireworks are the right default here — broader model selection, reasonable throughput, and lower cost than Groq’s higher-tier pricing.
High-volume batch jobs should use batch API endpoints where available. OpenAI and Anthropic both offer batch pricing at a 50% discount, and TTFT is irrelevant when you are processing thousands of documents overnight.
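As a sketch of the batch path, OpenAI's Batch API consumes a JSONL file in which each line is a self-contained request with a caller-chosen `custom_id` for matching results. The helper and model name below are illustrative; the submission function mirrors the official `openai` Python client but is not executed here.

```python
import json


def batch_request_line(custom_id: str, prompt: str,
                       model: str = "gpt-4o-mini") -> str:
    """One line of the JSONL file the OpenAI Batch API consumes."""
    return json.dumps({
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    })


def submit_batch(jsonl_path: str):
    """Submission sketch using the official `openai` client (not run here)."""
    from openai import OpenAI  # requires OPENAI_API_KEY in the environment
    client = OpenAI()
    batch_file = client.files.create(file=open(jsonl_path, "rb"),
                                     purpose="batch")
    return client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h",  # batches complete within 24 hours
    )


line = batch_request_line("doc-0001", "Summarize: ...")
```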
Where More Data Is Needed
The areas where I cannot give confident guidance due to limited data:
Multi-region latency. I tested from US-East only. European and Asian deployments will see different numbers depending on whether providers have regional infrastructure.
Fine-tuned model latency. Most benchmarks cover base models. Fine-tuned variants often have different serving characteristics, particularly on platforms that run them on separate capacity.
p99 behavior. The difference between p95 and p99 latency is where most SLA violations happen. I did not collect enough data points for reliable p99 estimates.
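The sample-size problem behind this is easy to quantify: with a nearest-rank estimator, the p-th percentile of N requests rests on only the slowest few samples, so tail estimates from small runs are noisy. A small helper (mine, for illustration) makes the arithmetic explicit.

```python
import math


def samples_above_percentile(n_requests: int, p: float) -> int:
    """How many of the slowest samples sit at or above the p-th percentile.
    A nearest-rank p-th percentile estimate rests on exactly these samples."""
    return n_requests - math.ceil(p / 100 * n_requests) + 1


# At 500 total requests:
# p95 rests on the slowest 26 samples, p99 on only the slowest 6.
```

Six samples is far too few to distinguish a genuinely heavy tail from a handful of unlucky requests, which is why the p99 question needs runs an order of magnitude larger than the ones reported here.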
Cost-adjusted throughput. The relationship between latency and cost is not linear across providers. A proper analysis would compare cost per thousand tokens at specific latency targets, not raw latency alone.
Until someone publishes burst-load p95 data across providers at production traffic levels, production capacity planning for inference remains more art than science.