What We Learned Shipping RAG to Production at Scale

Retrieval-Augmented Generation (RAG) is one of those techniques that feels almost too simple when you first prototype it. Chunk your documents, embed them, stuff the top-k results into a prompt, call GPT-4o. The demo works. The stakeholders are impressed.

Then you put it in production.

Failure mode 1: The retrieval quality cliff

Top-k cosine similarity retrieval degrades badly when the query is ambiguous or the user doesn't phrase it the way the documents are written. A question about "how to cancel my subscription" won't retrieve a document titled "Account Management > Termination Process" unless your embedding model understands the semantic equivalence.

The fix that worked for us was hybrid search: we combine dense vector retrieval (pgvector or Pinecone) with sparse BM25 retrieval, then run a cross-encoder reranker over the merged candidates. This improved our query success rate from 87% to 96% on our DataPulse deployment.
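To make the merging step concrete, here is a minimal sketch of one common way to combine a dense and a sparse result list before reranking: reciprocal rank fusion. This is an assumption for illustration, not necessarily the fusion method we run in production, and the document IDs and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of document IDs into a single ranking.

    Each document earns 1 / (k + rank) from every list it appears in, so a
    document ranked well by both retrievers beats one ranked well by only one.
    k = 60 is a commonly used default, not a tuned value.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so the output is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for the query "how to cancel my subscription".
dense_hits = ["doc_termination", "doc_billing", "doc_refunds"]
sparse_hits = ["doc_cancel_faq", "doc_termination", "doc_billing"]

candidates = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(candidates)
```

The fused list is what would then be handed to the cross-encoder reranker, which scores each (query, document) pair directly and is too slow to run over the whole corpus.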

Failure mode 2: Context window stuffing

Early implementations grabbed the top-5 chunks and dumped them verbatim into the prompt. For long documents, this burned most of the context window on marginally relevant content. The model would sometimes ignore the most relevant chunk in favour of text that appeared earlier in the prompt.

Solution: map-reduce retrieval for long contexts. Summarise each retrieved chunk independently, then synthesise the summaries. This is slower, because it adds one model call per chunk, but it produces dramatically better answers for complex multi-step questions.
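The shape of the map-reduce step can be sketched as follows. The `llm` parameter is a hypothetical callable that takes a prompt string and returns the model's text, and the prompt wording is illustrative only; neither is taken from our production code.

```python
from typing import Callable, List


def map_reduce_answer(
    question: str,
    chunks: List[str],
    llm: Callable[[str], str],
) -> str:
    """Summarise each chunk independently, then answer from the summaries."""
    # Map step: one call per chunk, keeping only what bears on the question.
    summaries = [
        llm(
            "Summarise the passage below, keeping only facts relevant to the "
            f"question.\nQuestion: {question}\nPassage: {chunk}"
        )
        for chunk in chunks
    ]
    # Reduce step: a single call that sees the short summaries, not raw chunks.
    notes = "\n".join(f"- {summary}" for summary in summaries)
    return llm(
        "Answer the question using only these notes.\n"
        f"Question: {question}\nNotes:\n{notes}"
    )
```

The map calls do not depend on each other, so in practice they can probably be issued concurrently to claw back some of the added latency.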

Failure mode 3: Hallucination on schema

For our SQL generation use case, the model would occasionally invent column names that didn't exist in the schema. The fix: a validation layer that runs EXPLAIN on the generated SQL against a sandboxed database before returning the result. EXPLAIN plans the query without executing it, so a reference to a nonexistent column fails at this step. Failed queries trigger a retry loop with the error message injected into the prompt.
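A minimal sketch of the validate-and-retry loop, under some loud assumptions: an in-memory SQLite database stands in for the sandbox, `generate` is a hypothetical callable wrapping the model, and the `orders` table exists only for the example. The article does not say which database engine we use, so treat the SQLite specifics as illustrative.

```python
import sqlite3
from typing import Callable, Optional


def validate_sql(conn: sqlite3.Connection, sql: str) -> Optional[str]:
    """Return None if the query compiles against the schema, else the error text."""
    try:
        # EXPLAIN compiles the statement without running it, so an unknown
        # table or column raises here and no data is read or modified.
        conn.execute(f"EXPLAIN {sql}")
        return None
    except sqlite3.Error as exc:
        return str(exc)


def generate_valid_sql(
    question: str,
    generate: Callable[[str, Optional[str]], str],
    conn: sqlite3.Connection,
    max_attempts: int = 3,
) -> str:
    """Ask the model for SQL, feeding each validation error into the next attempt."""
    error: Optional[str] = None
    for _ in range(max_attempts):
        sql = generate(question, error)  # error is None on the first attempt
        error = validate_sql(conn, sql)
        if error is None:
            return sql
    raise ValueError(f"no valid SQL after {max_attempts} attempts: {error}")


# Hypothetical sandbox schema for demonstration.
sandbox = sqlite3.connect(":memory:")
sandbox.execute("CREATE TABLE orders (id INTEGER, total REAL)")

print(validate_sql(sandbox, "SELECT total FROM orders"))   # valid query
print(validate_sql(sandbox, "SELECT amount FROM orders"))  # invented column
```

Capping the number of attempts matters: without `max_attempts`, a model that keeps inventing the same column would loop indefinitely.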

Failure mode 4: Latency at scale

With 14,000 queries per day, p99 latency was 4.2 seconds, which is acceptable for a chatbot but unacceptable for an analytics tool. The biggest wins: semantic caching with a 0.95 cosine similarity threshold (returning cached results for near-identical queries, which accounted for 31% of traffic), and streaming responses for queries that couldn't be cached.
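The caching idea can be sketched roughly like this. The class name, the linear scan, and the toy two-dimensional embeddings are assumptions for illustration; a real deployment would more likely look up the nearest cached query in a vector index and would need an eviction policy.

```python
import math
from typing import List, Optional, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Return a stored answer when a new query embedding is close enough to an old one."""

    def __init__(self, threshold: float = 0.95) -> None:
        self.threshold = threshold
        self.entries: List[Tuple[List[float], str]] = []

    def get(self, query_embedding: List[float]) -> Optional[str]:
        best_score, best_answer = 0.0, None
        # Linear scan over cached entries, keeping the most similar one.
        for embedding, answer in self.entries:
            score = cosine(query_embedding, embedding)
            if score > best_score:
                best_score, best_answer = score, answer
        # Only a match at or above the threshold counts as a cache hit.
        return best_answer if best_score >= self.threshold else None

    def put(self, query_embedding: List[float], answer: str) -> None:
        self.entries.append((query_embedding, answer))


cache = SemanticCache(threshold=0.95)
cache.put([1.0, 0.0], "cached answer")
near_hit = cache.get([0.99, 0.05])  # nearly the same direction: cache hit
miss = cache.get([0.5, 0.5])        # similarity about 0.71: cache miss
```

The threshold is the main tuning knob: set it too low and users get answers to a different question, set it too high and the hit rate collapses.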

What we'd tell ourselves six months ago

Build the evaluation harness first. Every RAG system needs a golden set of question/expected-answer pairs that you run against every prompt change. Without it, you're flying blind on quality regressions. We wasted three weeks of iteration time because we trusted vibes over metrics.
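For a sense of how small a first harness can be, here is a sketch. The function names, the substring-based grader, and the sample questions are all hypothetical; real grading usually needs something stricter, such as exact match on structured output or a model-based judge.

```python
from typing import Callable, List, Tuple


def contains_expected(answer: str, expected: str) -> bool:
    """Naive grader: pass if the expected text appears in the answer, ignoring case."""
    return expected.lower() in answer.lower()


def run_golden_set(
    answer_fn: Callable[[str], str],
    golden: List[Tuple[str, str]],
    is_correct: Callable[[str, str], bool] = contains_expected,
) -> Tuple[float, List[str]]:
    """Run every golden question through the system; return pass rate and failed questions."""
    if not golden:
        raise ValueError("golden set is empty")
    failures = [
        question
        for question, expected in golden
        if not is_correct(answer_fn(question), expected)
    ]
    pass_rate = 1 - len(failures) / len(golden)
    return pass_rate, failures


# Hypothetical golden set and a stand-in for the RAG pipeline.
golden_set = [
    ("How do I cancel my subscription?", "Termination Process"),
    ("What is the refund window?", "30 days"),
    ("Which plan includes SSO?", "Enterprise"),
    ("How do I export my data?", "CSV"),
]

canned = {
    "How do I cancel my subscription?": "See Account Management > Termination Process.",
    "What is the refund window?": "Refunds are available within 30 days.",
    "Which plan includes SSO?": "SSO is part of the Enterprise plan.",
    "How do I export my data?": "Exports are not supported.",
}

pass_rate, failed = run_golden_set(lambda q: canned[q], golden_set)
```

Running this on every prompt change and failing the build when `pass_rate` drops below the previous run is enough to catch regressions before users do.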
