Where RAG latency actually comes from.
After scaling a retrieval-augmented generation system past 10M daily requests, I found the bottleneck was almost never where I expected. Notes for anyone about to optimise the wrong layer.
Draft. Working notes from eighteen months of operating production RAG. The companion paper (Paper 003) will have the full benchmark grid; this is the operator’s view.
When I started running RAG in production, I assumed the latency budget would be dominated by two things: vector search and LLM generation. Everyone says so. Every blog post, every vendor deck, every conference talk. So that’s where I optimised first.
I was mostly wrong.
Here is roughly what 18 months of P50 latency looked like across the pipeline, after the obvious tuning was done:
| Stage | P50 contribution | What I assumed |
|---|---|---|
| Embedding the user query | 8% | “Negligible” |
| Vector search | 6% | “The whole story” |
| Re-ranking | 11% | “Optional” |
| Context assembly + token budget | 9% | “Cheap” |
| LLM generation (streaming first token) | 22% | “Most of it” |
| LLM generation (rest of response) | 31% | “Most of it” |
| Network and TLS to model provider | 13% | “Already accounted for” |
The largest controllable bucket on most days wasn’t the vector store. It was a tie between re-ranking and network overhead to the model provider. Both surprised me. Both had cheap fixes that I delayed for months because they weren’t where the literature pointed.
The five things that actually moved the needle
In rough order of impact, here is what cut our P50 from ~3.4s to ~1.9s. Note that none of these are revolutionary; they’re just where the time actually was.
1. Cache the embedding model in process
The embedding model — even a small local model in the text-embedding-3-small class — was being loaded from disk on cold workers. We were paying ~80ms per cold request for a model load that should have happened once per process. Pinning it at worker startup cut P50 by ~120ms, because we were absorbing more cold starts than I thought.
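A minimal sketch of the fix, assuming a local sentence-transformers model — the model name and the warmup hook are illustrative, not our actual stack:

```python
# Sketch: load the embedding model once per worker process, not per request.
from sentence_transformers import SentenceTransformer

_MODEL = None  # process-level singleton

def get_embedder() -> SentenceTransformer:
    """Return the process-wide embedding model, loading it on first use."""
    global _MODEL
    if _MODEL is None:
        # Paid once per worker, instead of ~80ms on every cold request.
        _MODEL = SentenceTransformer("all-MiniLM-L6-v2")
    return _MODEL

def warmup() -> None:
    """Call from the worker's startup hook (e.g. gunicorn post_fork)."""
    get_embedder().encode(["warmup"])
```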
2. Stop re-ranking everything
We were sending the top 50 vector hits to a cross-encoder for re-ranking. The cross-encoder cost ~25ms per pair. 50 pairs × 25ms = 1.25 seconds. We backed off to top 8 and the answer quality on our eval set was statistically indistinguishable. P50 dropped ~600ms overnight.
There’s a generalisation here: most teams over-retrieve and then pay the re-rank tax on documents the model would have ignored anyway. Measure the marginal value of each retrieved document on your task, not in the abstract.
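A sketch of the backed-off version — the cross-encoder name is illustrative and `rerank` is a hypothetical helper, not our production code:

```python
from sentence_transformers import CrossEncoder

# Model name is a placeholder; swap in whichever cross-encoder you run.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, hits: list[str], k: int = 8) -> list[str]:
    """Re-rank only the first k vector hits. At ~25ms per pair,
    every candidate beyond what the model will actually use is pure latency."""
    candidates = hits[:k]
    scores = reranker.predict([(query, doc) for doc in candidates])
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order]
```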
3. Co-locate the model call
We were calling our model provider’s US-East endpoint from a Frankfurt cluster. The TLS handshake plus round-trip was ~180ms before any token was generated. Moving inference to the same region as the cluster cut that to ~40ms. The fact that we had been doing this for nine months is professionally embarrassing.
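If you want to check whether you're making the same mistake, the pre-token cost is easy to measure. A sketch with placeholder hostnames — run it from the cluster that actually serves traffic:

```python
# Sketch: time TCP connect + TLS handshake to a provider endpoint --
# the cost paid before a single token is generated.
import socket
import ssl
import time

def handshake_ms(host: str, port: int = 443) -> float:
    """Measure connection setup (TCP + TLS) to host, in milliseconds."""
    ctx = ssl.create_default_context()
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=5) as raw:
        with ctx.wrap_socket(raw, server_hostname=host):
            pass  # handshake completes when the socket is wrapped
    return (time.perf_counter() - start) * 1000

# Hostnames are placeholders for your provider's regional endpoints.
for host in ("api.us-east.example.com", "api.eu-central.example.com"):
    print(host, f"{handshake_ms(host):.0f}ms")
```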
4. Pre-warm connection pools
The first request to the provider after a quiet minute paid the full TLS dance. With realistic traffic, ~3% of requests hit a cold connection. We solved it with a “keepalive ping” in the worker — one connection per worker, kept warm. P99 improved more than P50.
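A sketch of the keepalive, assuming an httpx connection pool; the ping URL is a placeholder for whatever cheap endpoint your provider exposes:

```python
# Sketch: one persistent client per worker, kept warm by a background ping,
# so real requests rarely pay the full TLS handshake.
import threading
import time

import httpx

client = httpx.Client()  # shared pool: connections survive between requests

def _keepalive(interval_s: float = 30.0) -> None:
    """Touch the provider every interval_s so the pooled connection
    never sits idle long enough to be torn down."""
    while True:
        try:
            client.head("https://api.example.com/v1/models")
        except httpx.HTTPError:
            pass  # the next real request will re-establish the connection
        time.sleep(interval_s)

threading.Thread(target=_keepalive, daemon=True).start()
```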
5. Stream the first token aggressively
Most of our latency budget is “time to first useful token,” not total response time. Starting the SSE response with a single byte the moment the model emits its first token (instead of buffering until punctuation) cut perceived latency by ~400ms in user studies, even though wall-clock latency was unchanged.
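A sketch of the unbuffered version using FastAPI-style SSE; `stream_model` is a hypothetical stand-in for the provider's streaming client:

```python
# Sketch: forward each model delta to the client the instant it arrives,
# instead of buffering until punctuation.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def stream_model(prompt: str):
    # Stand-in async generator: replace with your provider's streaming API.
    for delta in ("Hello", ",", " world"):
        yield delta

@app.get("/answer")
async def answer(q: str):
    async def sse():
        async for delta in stream_model(q):
            # One SSE event per delta -- no sentence or punctuation buffering.
            yield f"data: {delta}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(sse(), media_type="text/event-stream")
```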
The thing I haven’t been able to fix
Token generation itself. With a frontier model and a fixed prompt structure, we are at the mercy of the provider’s autoscaling. Some days P99 is 2.5 seconds; some days it’s 6.0. We’ve tried client-side timeouts and fallbacks; both make the worst case better and the average case worse.
If you have a clever solution to this that doesn’t involve self-hosting a smaller model, I’d love to hear it.
What I’d tell my past self
- Instrument the whole pipeline before optimising any part of it. A flame graph of one real production request would have saved me three months.
- Be suspicious of literature that doesn’t include the network. Most academic RAG papers benchmark against a model in the same Python process. Production never looks like that.
- The vector store is rarely the bottleneck. It’s the thing you see; it’s not where the time is.
— Segun