stackloader

© 2026 stackloader, Inc. All rights reserved.

AI & ML

What We Learned Shipping RAG to Production at Scale

Retrieval-augmented generation looks straightforward in a Jupyter notebook. At 14,000 queries per day, the failure modes are different — and the solutions are less obvious than the tutorials suggest.

Arjun Nair

Head of AI

December 12, 2024 · 11 min read

Retrieval-Augmented Generation (RAG) is one of those techniques that feels almost too simple when you first prototype it. Chunk your documents, embed them, stuff the top-k results into a prompt, call GPT-4o. The demo works. The stakeholders are impressed.

Then you put it in production.

Failure mode 1: The retrieval quality cliff

Top-k cosine similarity retrieval degrades badly when the query is ambiguous or the user doesn't phrase it the way the documents are written. A question about "how to cancel my subscription" won't retrieve a document titled "Account Management > Termination Process" unless your embedding model understands the semantic equivalence.

The fix that worked for us was hybrid search: combining dense vector retrieval (pgvector or Pinecone) with sparse BM25 retrieval, then running a cross-encoder reranker over the combined results. This improved our query success rate from 87% to 96% on our DataPulse deployment.
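One way to combine the dense and sparse result lists before reranking is reciprocal rank fusion (RRF). The post does not say which fusion method was used, so treat this as a minimal sketch under that assumption; the document IDs and the `reciprocal_rank_fusion` helper are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]], k: int = 60
) -> List[str]:
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default, not a tuned value.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for "how to cancel my subscription".
dense_hits = ["billing-faq", "termination-process", "pricing"]    # vector search
sparse_hits = ["termination-process", "cancel-form", "billing-faq"]  # BM25

candidates = reciprocal_rank_fusion([dense_hits, sparse_hits])
# A cross-encoder reranker would then rescore (query, document) pairs
# for the top candidates before they reach the prompt.
print(candidates)
```

Documents that rank well in both lists rise to the top, which is the property that makes the fused list a reasonable input for the more expensive reranker.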

Failure mode 2: Context window stuffing

Early implementations grabbed the top-5 chunks and dumped them verbatim into the prompt. For long documents, this burned most of the context window on marginally relevant content. The model would sometimes ignore the most relevant chunk in favour of text that appeared earlier in the prompt.

Solution: map-reduce retrieval for long contexts. Summarise each retrieved chunk independently, then synthesise the summaries into a final answer. This is slower, because it adds a model call per chunk, but it produces dramatically better answers for complex multi-step questions.
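The map and reduce steps can be sketched as follows. The `llm` callable, the prompt wording, and the choice to summarise each chunk with the question in view are all assumptions for illustration, not the prompts used in production.

```python
from typing import Callable, List

# Hypothetical signature: takes a prompt string, returns the model's text.
LLM = Callable[[str], str]


def map_reduce_answer(question: str, chunks: List[str], llm: LLM) -> str:
    """Summarise each chunk against the question, then synthesise an answer."""
    # Map step: one independent call per chunk, so these could run in parallel.
    summaries = [
        llm(
            f"Question: {question}\n\n"
            f"Summarise only the parts of this passage relevant to the question:\n{chunk}"
        )
        for chunk in chunks
    ]
    # Reduce step: a single call over the (much shorter) summaries.
    joined = "\n".join(f"- {s}" for s in summaries)
    return llm(
        f"Question: {question}\n\n"
        f"Answer using only these notes:\n{joined}"
    )


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return f"[{len(prompt)} chars in]"


answer = map_reduce_answer(
    "How do I cancel?", ["chunk one text", "chunk two text"], fake_llm
)
```

With N retrieved chunks this makes N + 1 model calls, which is where the extra latency comes from.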

Failure mode 3: Hallucination on schema

For our SQL generation use case, the model would occasionally invent column names that didn't exist in the schema. The fix: a validation layer that runs EXPLAIN on the generated SQL against a sandboxed database before returning the result. Because EXPLAIN plans the query without executing it, references to missing tables or columns fail at this step. Failed queries trigger a retry loop with the error message injected into the prompt.
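A minimal sketch of the validate-and-retry loop, using an in-memory SQLite database as a stand-in for the sandbox. The post does not name the database engine, so SQLite, the `orders` schema, the `generate` callable, and the three-attempt limit are all assumptions here.

```python
import sqlite3
from typing import Callable, Optional

# Hypothetical generator: (question, previous_error) -> SQL string.
SqlGenerator = Callable[[str, Optional[str]], str]


def validate_sql(conn: sqlite3.Connection, sql: str) -> Optional[str]:
    """Return None if the query compiles cleanly, else the database error text."""
    try:
        # EXPLAIN compiles the statement without running the query itself,
        # so unknown tables or columns fail here.
        conn.execute(f"EXPLAIN {sql}")
        return None
    except (sqlite3.Error, sqlite3.Warning) as exc:
        return str(exc)


def generate_valid_sql(
    question: str,
    generate: SqlGenerator,
    conn: sqlite3.Connection,
    max_attempts: int = 3,
) -> str:
    error: Optional[str] = None
    for _ in range(max_attempts):
        sql = generate(question, error)  # the last error is fed back to the model
        error = validate_sql(conn, sql)
        if error is None:
            return sql
    raise ValueError(f"No valid SQL after {max_attempts} attempts: {error}")


# Sandbox with a toy schema (illustrative only).
sandbox = sqlite3.connect(":memory:")
sandbox.execute("CREATE TABLE orders (id INTEGER, total_cents INTEGER)")


def fake_generator(question: str, previous_error: Optional[str]) -> str:
    # First attempt invents a column; the retry uses the real one.
    if previous_error is None:
        return "SELECT SUM(revenue) FROM orders"
    return "SELECT SUM(total_cents) FROM orders"


sql = generate_valid_sql("What is total revenue?", fake_generator, sandbox)
print(sql)
```

Capping the number of attempts matters: without a limit, a model that keeps inventing the same column would loop indefinitely.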

Failure mode 4: Latency at scale

With 14,000 queries per day, p99 latency was 4.2 seconds: acceptable for a chatbot, unacceptable for an analytics tool. The biggest wins were semantic caching with a 0.95 cosine similarity threshold (returning cached results for near-identical queries, which accounted for 31% of traffic) and streaming responses for queries that couldn't be cached. Streaming does not shorten total generation time, but it cuts the wait before the first token appears.
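A semantic cache along these lines can be sketched in a few lines. This version does a linear scan over hand-written toy embeddings; the `SemanticCache` class is hypothetical, and a production system would more plausibly use the existing vector store for the lookup and an embedding model for the keys.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Linear-scan cache keyed by query embedding."""

    def __init__(self, threshold: float = 0.95) -> None:
        self.threshold = threshold
        self._entries: List[Tuple[List[float], str]] = []

    def get(self, embedding: Sequence[float]) -> Optional[str]:
        """Return the answer of the closest cached query, if close enough."""
        best_score, best_answer = 0.0, None
        for cached_embedding, answer in self._entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def put(self, embedding: Sequence[float], answer: str) -> None:
        self._entries.append((list(embedding), answer))


cache = SemanticCache(threshold=0.95)
cache.put([1.0, 0.0, 0.0], "cached answer")

# Toy 3-dimensional embeddings; real ones come from an embedding model.
near_duplicate = cache.get([0.99, 0.05, 0.0])  # similarity above 0.95: hit
different_query = cache.get([0.5, 0.5, 0.7])   # similarity near 0.5: miss
```

The threshold is the main tuning knob: set it too low and users get answers to a question they did not ask, set it too high and the hit rate collapses.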

What we'd tell ourselves six months ago

Build the evaluation harness first. Every RAG system needs a golden set of question/expected-answer pairs that you run against every prompt change. Without it, you're flying blind on quality regressions. We wasted three weeks of iteration time because we trusted vibes over metrics.
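A golden-set harness does not need to be elaborate to be useful. The sketch below assumes a substring check as the grader and a hypothetical `answer_fn` wrapping the pipeline; the example questions and expected phrases are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GoldenCase:
    question: str
    expected: str  # a phrase the answer must contain


def run_golden_set(
    cases: List[GoldenCase], answer_fn: Callable[[str], str]
) -> float:
    """Return the pass rate and print each failing question."""
    passed = 0
    for case in cases:
        answer = answer_fn(case.question)
        # Substring match is the crudest possible grader; exact-match,
        # embedding similarity, or a model-based judge could replace it.
        if case.expected.lower() in answer.lower():
            passed += 1
        else:
            print(f"FAIL: {case.question!r} -> {answer!r}")
    return passed / len(cases) if cases else 0.0


golden = [
    GoldenCase("How do I cancel my subscription?", "termination process"),
    GoldenCase("Where can I download invoices?", "billing page"),
]


def fake_pipeline(question: str) -> str:
    """Stand-in for the real RAG pipeline."""
    return "See the Termination Process page under Account Management."


pass_rate = run_golden_set(golden, fake_pipeline)
print(f"pass rate: {pass_rate:.0%}")
```

Running this on every prompt or retrieval change, and failing the build when the pass rate drops, turns quality regressions into something you see the same day.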


Written by

Arjun Nair

Head of AI

Arjun leads AI engineering at stackloader, specialising in retrieval-augmented generation, LLM fine-tuning, and production ML systems. He was previously a research engineer at a major ML lab and has shipped AI features used by hundreds of thousands of daily active users.
