Agentic RAG Patterns That Beat Classic Retrieval
Classic retrieval-augmented generation solves the most common question shape: "here is my question, fetch relevant docs, answer using them." It hits a ceiling on questions that require multiple lookups, query refinement, or reasoning about whether the retrieval actually worked. Agentic RAG, which treats retrieval as a tool inside an agent loop, routinely outperforms it on exactly those questions.
What Classic RAG Does and Where It Hits a Ceiling
Classic RAG is linear: user query → embedding → top-k similarity retrieval → context window → generation. It is simple, fast, and sufficient for question shapes that match a single lookup. It fails predictably on:
- Multi-hop questions. "Which customers bought X and also churned in Q3?" needs two retrievals with an intermediate reasoning step.
- Queries where the user's wording diverges from the corpus wording. Vector similarity is only as good as the embedding's training; domain-specific vocabulary often confuses it.
- Ambiguous questions. "What did we ship last sprint?" has no single good retrieval.
- Questions that need date or entity filtering. Pure similarity retrieval does not respect structured constraints.
Pattern 1. Retrieval as a Tool
Instead of pre-fetching, let the model decide when and what to retrieve. Expose search as a tool the agent can call with its own synthesised query. The model learns to reshape the question into a search query, examine results, and retrieve again if needed.
```typescript
const tools = [{
  name: 'search_docs',
  description: 'Search the internal documentation. Returns the top 5 chunks with source.',
  input_schema: {
    type: 'object',
    properties: {
      query: { type: 'string' },
      since: { type: 'string', format: 'date', description: 'Optional date filter' },
      tags: { type: 'array', items: { type: 'string' } },
    },
    required: ['query'],
  },
}];

// The agent now chooses when to search and with what query.
// It may search, see the results are weak, and search again with a refined query.
```
Pattern 2. Query Decomposition
For multi-hop questions, have the agent decompose the question into sub-questions, retrieve for each, and compose the final answer. The decomposition can be explicit (first call a decompose tool) or emergent (the agent reasons about sub-questions inline and makes multiple retrieval calls).
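The explicit variant can be sketched as follows. `llm` and `searchDocs` are hypothetical stand-ins for your model client and retrieval endpoint; the prompts are illustrative, not a fixed recipe:

```typescript
type Chunk = { text: string; source: string };

// Sketch of explicit query decomposition: decompose, retrieve per
// sub-question, then compose the final answer from pooled evidence.
async function answerMultiHop(
  question: string,
  llm: (prompt: string) => Promise<string>,
  searchDocs: (query: string) => Promise<Chunk[]>,
): Promise<string> {
  // 1. Ask the model to break the question into independent sub-questions.
  const raw = await llm(
    `Split this question into the minimal list of sub-questions, one per line:\n${question}`,
  );
  const subQuestions = raw.split('\n').map(s => s.trim()).filter(Boolean);

  // 2. Retrieve separately for each sub-question.
  const evidence: string[] = [];
  for (const sq of subQuestions) {
    const chunks = await searchDocs(sq);
    evidence.push(`Sub-question: ${sq}\n` + chunks.map(c => c.text).join('\n'));
  }

  // 3. Compose the final answer from the combined evidence.
  return llm(
    `Answer the question using only this evidence.\n\nQuestion: ${question}\n\nEvidence:\n${evidence.join('\n\n')}`,
  );
}
```

The emergent variant skips the explicit decompose call and simply gives the agent the retrieval tool plus an instruction that multi-part questions may need multiple searches.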
Pattern 3. HyDE (Hypothetical Document Embeddings)
When the user's query language differs sharply from the corpus language, generate a hypothetical answer first and embed that for retrieval. The hypothetical is usually wrong in detail, but it shares vocabulary with real answers, which makes embedding-based search much sharper.
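A minimal sketch, where `llm`, `embed`, and `vectorSearch` are hypothetical stand-ins for your model client, embedding model, and vector store:

```typescript
type Scored = { text: string; score: number };

// HyDE: embed a generated hypothetical answer instead of the raw query,
// so the search vector lives in the corpus's own vocabulary.
async function hydeRetrieve(
  query: string,
  llm: (prompt: string) => Promise<string>,
  embed: (text: string) => Promise<number[]>,
  vectorSearch: (vector: number[], k: number) => Promise<Scored[]>,
  k = 5,
): Promise<Scored[]> {
  // Generate a plausible (likely wrong-in-detail) answer in corpus style...
  const hypothetical = await llm(
    `Write a short passage that would answer this question, in the style of internal docs:\n${query}`,
  );
  // ...then embed the hypothetical, not the user's original wording.
  return vectorSearch(await embed(hypothetical), k);
}
```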
Pattern 4. In-Loop Reranking
Agentic retrieval can run a reranker inside the loop. Fetch twenty candidates, let a cross-encoder score them, feed the top five to the model. Classic RAG can do this too, but the agent-in-the-loop version lets the model request a fresh rerank with different criteria if the initial top-five do not answer the question.
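The fetch-then-rerank step can be sketched like this. `searchDocs` and `crossEncoderScore` are hypothetical stand-ins (the latter for a cross-encoder or hosted reranker API, where higher means more relevant):

```typescript
type Chunk = { text: string; source: string };

// Over-fetch candidates, score each against the query with a cross-encoder,
// and keep only the best few for the context window.
async function fetchAndRerank(
  query: string,
  searchDocs: (query: string, k: number) => Promise<Chunk[]>,
  crossEncoderScore: (query: string, text: string) => Promise<number>,
  fetchK = 20,
  keepK = 5,
): Promise<Chunk[]> {
  const candidates = await searchDocs(query, fetchK);
  const scored = await Promise.all(
    candidates.map(async c => ({ c, score: await crossEncoderScore(query, c.text) })),
  );
  // Sort descending by relevance and keep the top keepK.
  return scored.sort((a, b) => b.score - a.score).slice(0, keepK).map(s => s.c);
}
```

In the agentic version, the model can call this with a different query or different criteria if the first top-five does not answer the question.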
Pattern 5. Self-Correction
After generating an answer, let the agent check it against the retrieved context (a short LLM call: "is this answer fully supported by these sources?"). If not, it can retrieve more or flag uncertainty. This turns confident wrong answers into honest partial answers, a significant UX improvement for many B2B search interfaces.
```typescript
// Self-correction check — called by the agent after drafting an answer.
// Assumes an `llm` client and a `Chunk` retrieval type defined elsewhere.
type Verdict = 'yes' | 'partial' | 'no';

async function supportsAnswer(answer: string, sources: Chunk[]): Promise<Verdict> {
  const res = await llm.run({
    system: 'Reply with only: yes, partial, or no.',
    messages: [{
      role: 'user',
      content: `Does the following answer stay within the provided sources?\n\nAnswer:\n${answer}\n\nSources:\n${sources.map(c => c.text).join('\n---\n')}`,
    }],
  });
  const verdict = res.text.trim().toLowerCase();
  // Treat anything unexpected as 'no' rather than trusting an unchecked cast.
  return verdict === 'yes' || verdict === 'partial' ? verdict : 'no';
}
```
When Classic RAG Still Wins
- Latency-sensitive product features. Agent loops add latency; classic RAG returns in a single round-trip.
- Cost-constrained workloads. Agents make multiple LLM calls per query; classic RAG makes one.
- Simple FAQ-style questions. The upside of the agentic approach is small when the question is linear.
The Production Pattern
In practice, the strongest production systems route queries: simple ones go through classic RAG, complex ones enter the agent loop. A tiny classifier model at the front decides the route. The result is low latency for the 80% of easy queries and high accuracy for the 20% that need real search intelligence, without paying the agent cost on every query.
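The router can be sketched as below. In production the route decision would come from a small classifier model; the keyword heuristic here is a hypothetical placeholder that keeps the sketch self-contained, and `classicRag` and `agentLoop` stand in for the two pipelines:

```typescript
type Route = 'classic' | 'agent';

// Placeholder for the front classifier: multi-clause or comparative
// phrasing routes to the agent loop, everything else to classic RAG.
function classifyComplexity(query: string): Route {
  const multiHopSignals = [' and ', ' also ', ' compare', ' vs ', ' between '];
  return multiHopSignals.some(s => query.toLowerCase().includes(s)) ? 'agent' : 'classic';
}

async function answer(
  query: string,
  classicRag: (q: string) => Promise<string>,
  agentLoop: (q: string) => Promise<string>,
): Promise<string> {
  return classifyComplexity(query) === 'agent' ? agentLoop(query) : classicRag(query);
}
```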