Retrieval-augmented generation looks straightforward in a Jupyter notebook. At 14,000 queries per day, the failure modes are different — and the solutions are less obvious than the tutorials suggest.
Arjun Nair
Head of AI
Retrieval-Augmented Generation (RAG) is one of those techniques that feels almost too simple when you first prototype it. Chunk your documents, embed them, stuff the top-k results into a prompt, call GPT-4o. The demo works. The stakeholders are impressed.
Then you put it in production.
Top-k cosine similarity retrieval degrades badly when the query is ambiguous or the user doesn't phrase it the way the documents are written. A question about "how to cancel my subscription" won't retrieve a document titled "Account Management > Termination Process" unless your embedding model understands the semantic equivalence.
The fix that worked for us: hybrid search — combining dense vector retrieval (pgvector or Pinecone) with sparse BM25 retrieval, and using a cross-encoder reranker on the combined results. This improved our query success rate from 87% to 96% on our DataPulse deployment.
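The combination step can be sketched roughly as follows. This is a minimal illustration, not our production code: it assumes the dense and sparse retrievers each return a ranked list of document IDs, merges them with reciprocal rank fusion (one common choice; the fusion method is an assumption here), and takes a hypothetical `rerank_score` callable standing in for the cross-encoder.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of document IDs into a single ranking.

    Each document earns 1 / (k + rank) from every list it appears in.
    k = 60 is a conventional default, not a tuned value.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def hybrid_search(
    query: str,
    dense_ids: Sequence[str],
    sparse_ids: Sequence[str],
    rerank_score: Callable[[str, str], float],  # hypothetical cross-encoder stand-in
    top_n: int = 5,
    candidates: int = 20,
) -> List[str]:
    """Fuse dense and sparse results, then rerank the fused candidates."""
    fused = reciprocal_rank_fusion([dense_ids, sparse_ids])[:candidates]
    reranked = sorted(fused, key=lambda doc_id: rerank_score(query, doc_id), reverse=True)
    return reranked[:top_n]
```

Fusing on ranks rather than raw scores sidesteps the fact that cosine similarities and BM25 scores live on different scales. The reranker only sees the fused candidate set, which keeps the expensive cross-encoder calls bounded.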
Early implementations grabbed the top-5 chunks and dumped them verbatim into the prompt. For long documents, this burned most of the context window on marginally relevant content. The model would sometimes ignore the most relevant chunk in favour of text that appeared earlier in the prompt.
Solution: map-reduce retrieval for long contexts. Summarise each retrieved chunk independently, then synthesise the summaries into the final answer. This is slower, since it adds one model call per chunk, but it produces dramatically better answers for complex multi-step questions.
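A rough sketch of the map-reduce step, under a few assumptions: `llm` is a hypothetical callable that takes a prompt string and returns the model's text, the prompt wording is illustrative only, and each summary is steered towards the question (the description above only says the chunks are summarised independently).

```python
from typing import Callable, List


def map_reduce_answer(question: str, chunks: List[str], llm: Callable[[str], str]) -> str:
    """Summarise each chunk on its own (map), then answer from the summaries (reduce)."""
    summaries = []
    for chunk in chunks:
        # Map step: one model call per chunk, with no other chunk in the prompt.
        map_prompt = (
            "Summarise the passage below, keeping only what helps answer the question.\n\n"
            f"Question: {question}\n\nPassage:\n{chunk}"
        )
        summaries.append(llm(map_prompt))

    # Reduce step: a single call that sees only the short summaries.
    numbered = "\n".join(f"[{i}] {summary}" for i, summary in enumerate(summaries, start=1))
    reduce_prompt = (
        "Answer the question using only the summaries below.\n\n"
        f"Question: {question}\n\nSummaries:\n{numbered}"
    )
    return llm(reduce_prompt)
```

The map calls are independent of each other, so in practice they could be issued concurrently to claw back some of the added latency.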
For our SQL generation use case, the model would occasionally invent column names that didn't exist in the schema. The fix: a validation layer that runs EXPLAIN on the generated SQL against the sandboxed database before returning the result. Failed queries trigger a retry loop with the error message injected into the prompt.
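The validation layer might look something like the sketch below. It uses an in-process SQLite connection as a stand-in for the sandboxed database, and `generate` is a hypothetical callable wrapping the model. The cap of three attempts is an assumption added here so the loop always terminates.

```python
import sqlite3
from typing import Callable, Optional


def validate_sql(conn: sqlite3.Connection, sql: str) -> Optional[str]:
    """Return None if the statement compiles against the schema, else the error text."""
    try:
        # EXPLAIN compiles the statement without running it, so unknown
        # tables or columns surface here as errors.
        conn.execute(f"EXPLAIN {sql}")
        return None
    except (sqlite3.Error, sqlite3.Warning) as exc:
        return str(exc)


def generate_valid_sql(
    question: str,
    generate: Callable[[str], str],  # hypothetical: prompt in, SQL string out
    conn: sqlite3.Connection,
    max_attempts: int = 3,  # assumed cap, not taken from the original system
) -> str:
    """Ask the model for SQL, retrying with the database error in the prompt."""
    prompt = question
    for _ in range(max_attempts):
        sql = generate(prompt)
        error = validate_sql(conn, sql)
        if error is None:
            return sql
        # Feed the failed SQL and the error message back into the next attempt.
        prompt = (
            f"{question}\n\nThe previous SQL failed validation.\n"
            f"SQL: {sql}\nError: {error}\nReturn corrected SQL only."
        )
    raise ValueError(f"No valid SQL after {max_attempts} attempts")
```

Because the statement is only planned, never executed, a hallucinated column is caught without touching any data.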
At 14,000 queries per day, p99 latency was 4.2 seconds: acceptable for a chatbot, unacceptable for an analytics tool. The biggest wins were semantic caching with a 0.95 cosine similarity threshold, which returns cached results for near-identical queries (31% of our traffic), and streaming responses for queries that couldn't be cached.
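To make the caching idea concrete, here is a minimal in-memory sketch. It assumes query embeddings are computed elsewhere and passed in as plain lists of floats, and it uses a linear scan for clarity; the class and method names are illustrative, and a real deployment would more likely query a vector index.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Returns a stored answer when a new query embedding is close enough to a cached one."""

    def __init__(self, threshold: float = 0.95) -> None:
        self.threshold = threshold
        self._entries: List[Tuple[List[float], str]] = []

    def get(self, embedding: Sequence[float]) -> Optional[str]:
        """Return the answer of the most similar cached query at or above the threshold."""
        best_score, best_answer = -1.0, None
        for cached_embedding, answer in self._entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score >= self.threshold and score > best_score:
                best_score, best_answer = score, answer
        return best_answer

    def put(self, embedding: Sequence[float], answer: str) -> None:
        """Store an answer under the embedding of the query that produced it."""
        self._entries.append((list(embedding), answer))
```

On a miss, the caller runs the full pipeline and stores the result with `put`. The threshold is the main lever: set it too low and users get answers to questions they did not ask.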
Build the evaluation harness first. Every RAG system needs a golden set of question/expected-answer pairs that you run against every prompt change. Without it, you're flying blind on quality regressions. We wasted three weeks of iteration time because we trusted vibes over metrics.
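A golden-set harness does not need to be elaborate to be useful. The sketch below assumes a hypothetical `answer_fn` that wraps the whole RAG pipeline, and it scores with a simple normalised substring check, which is an assumption made for brevity; many teams would swap in a stricter or model-graded comparison.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GoldenCase:
    question: str
    expected: str  # a phrase that a correct answer must contain


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so formatting changes don't cause failures."""
    return " ".join(text.lower().split())


def run_golden_set(cases: List[GoldenCase], answer_fn: Callable[[str], str]) -> dict:
    """Run every golden question through the pipeline and report the pass rate."""
    failures = []
    for case in cases:
        actual = answer_fn(case.question)
        if normalise(case.expected) not in normalise(actual):
            failures.append((case.question, actual))
    passed = len(cases) - len(failures)
    return {
        "pass_rate": passed / len(cases) if cases else 0.0,
        "failures": failures,
    }
```

Running this on every prompt change, and failing the change when `pass_rate` drops, turns a quality regression into a visible number rather than a hunch.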

Arjun leads AI engineering at stackloader, specialising in retrieval-augmented generation, LLM fine-tuning, and production ML systems. He was previously a research engineer at a major ML lab and has shipped AI features used by hundreds of thousands of daily active users.