The Road to Agentic RAG: Why Classic RAG Stops Being Enough
Retrieval looks fine, answers still drift
Your first RAG prototype usually works. Embed the corpus, push it into a vector index, pull the top-K matches, stuff them into the prompt. Good enough. Then users start asking questions like:
- "Tell me our full refund policy." — The policy is spread across multiple docs; top-K only covers a slice.
- "The Premium plan is unlimited, so there's no rate limit, right?" — The chunk that matches the query phrasing doesn't mention rate limits.
- "What's the difference between Standard and Premium?" — Both plan descriptions rarely land in the same top-K.
- "How do I fix a 500 error?" — No single chunk tells the full symptom → cause → fix chain.
This is the wall Classic RAG hits. A single-hop retrieval grabs "the chunk most similar to the question" and misses the surrounding facts that actually complete the answer.
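To make the single-hop limitation concrete, here is a minimal Python sketch of the classic retrieve step. The 2-D vectors and the `cosine` and `top_k` helpers are illustrative assumptions, not AICLUDE code; a real system would use an embedding model and a vector index.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunks, k):
    """Return the k chunk texts most similar to the query (one hop, no expansion)."""
    scored = [(cosine(query_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Toy embeddings, hand-picked for illustration only.
chunks = [
    ("The Premium plan has unlimited API calls.", [0.9, 0.1]),
    ("However, a 100 req/s rate limit applies.", [0.2, 0.9]),
]
query = [1.0, 0.0]  # stands in for a question phrased around "Premium" and "unlimited"
result = top_k(query, chunks, k=1)
```

With these toy vectors the rate-limit chunk sits far from the query, so a small K returns the "unlimited" fact alone and the caveat never reaches the prompt.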
Three generations of RAG
| Generation | Approach | Strength | Weakness |
|---|---|---|---|
| Classic RAG | Query → Embed → Vector DB → Top-K → LLM | Fast, simple | Single-hop, brittle on distributed facts |
| Graph RAG | Entity/relation graph → Connected context → LLM | Relational reasoning | Expensive to build, error propagation |
| Agentic RAG | Reasoning agent orchestrates sources + tools → Self-evaluation | Adaptive, self-correcting | Needs real verification infrastructure |
AICLUDE doesn't pick a generation; it layers all three inside one pipeline, so each layer covers a weakness of the others.
The mix we actually run
1. Proposition-level chunking
Instead of fixed-size splits, an LLM extracts atomic propositions using a 5W1H frame. "The Premium plan has unlimited API calls." "However, a 100 req/s rate limit applies." Each sentence becomes a self-contained fact. Ingestion is more expensive, but every downstream stage wins.
2. Two knowledge bases — SAM + RAG
| Type | Contents | Use |
|---|---|---|
| SAM | Past Q&A pairs | Reuse a prior accurate answer on near-duplicate queries |
| RAG | Document chunks | Ground a brand-new question in the corpus |
Both are queried in parallel. If SAM similarity clears a high threshold, a fast path returns the stored answer with zero LLM calls. Repeat questions are effectively free.
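The parallel lookup and fast path could be sketched roughly as follows. This is a minimal illustration under assumptions: the threshold value `0.95`, the function name `answer`, and the three callables passed in are all hypothetical, since the text above only says "a high threshold".

```python
from concurrent.futures import ThreadPoolExecutor

SAM_THRESHOLD = 0.95  # assumed value; the article only specifies "a high threshold"

def answer(query, search_sam, search_rag, generate):
    """Query both knowledge bases in parallel, then pick a path.

    search_sam(query) -> (similarity, stored_answer) or None
    search_rag(query) -> list of chunk texts
    generate(query, chunks) -> answer string (the only LLM call)
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        sam_future = pool.submit(search_sam, query)
        rag_future = pool.submit(search_rag, query)
        sam_hit = sam_future.result()
        chunks = rag_future.result()
    if sam_hit is not None and sam_hit[0] >= SAM_THRESHOLD:
        return sam_hit[1]  # fast path: reuse the stored answer, no LLM call
    return generate(query, chunks)

# Stub backends to show both paths.
llm_calls = []

def fake_generate(query, chunks):
    llm_calls.append(query)
    return "generated from " + str(len(chunks)) + " chunks"

fast = answer("repeat question", lambda q: (0.98, "stored answer"),
              lambda q: ["c1"], fake_generate)
slow = answer("new question", lambda q: (0.40, "stored answer"),
              lambda q: ["c1", "c2"], fake_generate)
```

In this sketch the RAG search still runs on the fast path because both lookups start together; its result is simply discarded when the SAM hit is strong enough.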
3. Chunk-link graph + N-hop expansion
Instead of building a traditional Knowledge Graph (entity-relation triples), AICLUDE auto-links chunks using semantic similarity + shared keywords.
When two chunks are similar enough to share context but not so similar that they're duplicates — and they also share a keyword — we automatically create a bidirectional "related" link between them.
- similarity > 0.9 → likely a duplicate, skip.
- similarity < 0.7 → likely unrelated, skip.
- keyword intersection filters out noise links.
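The linking rule above can be sketched as a small predicate plus a pairwise pass. The thresholds come from the list; the names `should_link` and `build_links`, the chunk ids, and the similarity numbers are hypothetical and only illustrate the rule. A production system would get candidate pairs from a vector index rather than comparing every pair.

```python
from itertools import combinations

DUPLICATE_ABOVE = 0.9   # similarity > 0.9: likely a duplicate
UNRELATED_BELOW = 0.7   # similarity < 0.7: likely unrelated

def should_link(similarity, keywords_a, keywords_b):
    """True when two chunks are related, not duplicates, and share a keyword."""
    if similarity > DUPLICATE_ABOVE or similarity < UNRELATED_BELOW:
        return False
    return bool(set(keywords_a) & set(keywords_b))

def build_links(keywords_by_chunk, similarity):
    """Create bidirectional "related" links for every qualifying pair."""
    links = {chunk_id: set() for chunk_id in keywords_by_chunk}
    for a, b in combinations(keywords_by_chunk, 2):
        if should_link(similarity(a, b), keywords_by_chunk[a], keywords_by_chunk[b]):
            links[a].add(b)
            links[b].add(a)
    return links

# Made-up keywords and similarities for illustration.
keywords = {
    "standard": {"plan", "pricing"},
    "premium": {"plan", "api"},
    "premium_dup": {"plan", "api"},
    "outage": {"incident"},
}
sims = {
    frozenset(("standard", "premium")): 0.82,     # in band, shares "plan"
    frozenset(("premium", "premium_dup")): 0.97,  # near-duplicate
    frozenset(("standard", "outage")): 0.75,      # in band, no shared keyword
}
links = build_links(keywords, lambda a, b: sims.get(frozenset((a, b)), 0.1))
```

Only the Standard and Premium chunks end up linked: the near-duplicate pair is skipped by the upper threshold, and the in-band pair without a shared keyword is filtered out as noise.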
At query time, a recursive CTE expands up to 3 hops, decaying the score by 0.7 per hop. You get the core benefit of Graph RAG inside a single PostgreSQL instance — no Neo4j, no separate graph infra.
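For readers who want to see what the traversal computes, here is an in-memory Python sketch of the same idea. It is not the recursive CTE itself, and it makes two assumptions: the decay is multiplicative (score × 0.7 at each hop), and a chunk reachable by several paths keeps its best score. The function name `expand` and the extra "hop 3" and "hop 4" chunks are hypothetical.

```python
def expand(seed_scores, links, max_hops=3, decay=0.7):
    """Spread retrieval scores along chunk links for up to max_hops.

    seed_scores: dict chunk_id -> score of the directly retrieved chunks
    links: dict chunk_id -> iterable of linked chunk ids
    Returns dict chunk_id -> best score found within max_hops.
    """
    best = dict(seed_scores)
    frontier = dict(seed_scores)
    for _ in range(max_hops):
        next_frontier = {}
        for chunk_id, score in frontier.items():
            hop_score = score * decay  # assumed multiplicative decay
            for neighbor in links.get(chunk_id, ()):
                if hop_score > best.get(neighbor, 0.0):
                    best[neighbor] = hop_score
                    next_frontier[neighbor] = hop_score
        if not next_frontier:
            break
        frontier = next_frontier
    return best

# Bidirectional links forming a chain; the last two nodes are made up.
links = {
    "500 error": ["DB connection pool exhausted"],
    "DB connection pool exhausted": ["500 error", "set max_connections=100"],
    "set max_connections=100": ["DB connection pool exhausted", "hop 3"],
    "hop 3": ["set max_connections=100", "hop 4"],
    "hop 4": ["hop 3"],
}
scores = expand({"500 error": 1.0}, links)
```

Starting from a single seed with score 1.0, the chain is followed for three hops with scores of roughly 0.7, 0.49, and 0.343, and the chunk four hops away is never added.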
4. Adaptive RAG — the system judges its own retrieval
We compute a coverageLevel (none / low / medium / high) over the retrieved context. If it's below high and the question looks realtime/factual/news-like, a web-search tool is invoked automatically to patch the gap. No extra LLM call to ask "is the context enough?" — it's a rule.
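Because the decision is a rule rather than a model call, it fits in a few lines. The sketch below assumes the coverage level and a question category are already computed upstream; the `Coverage` enum, the category labels, and `needs_web_search` are hypothetical names chosen for illustration.

```python
from enum import IntEnum

class Coverage(IntEnum):
    """Ordered so that levels can be compared directly."""
    NONE = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3

# Assumed labels for questions that look realtime, factual, or news-like.
WEB_SEARCH_CATEGORIES = {"realtime", "factual", "news"}

def needs_web_search(coverage, question_category):
    """Rule-based check: no LLM is asked whether the context is sufficient."""
    return coverage < Coverage.HIGH and question_category in WEB_SEARCH_CATEGORIES

patch_gap = needs_web_search(Coverage.LOW, "news")
enough_context = needs_web_search(Coverage.HIGH, "news")
not_time_sensitive = needs_web_search(Coverage.LOW, "how-to")
```

Only the first case triggers the web-search tool; full coverage or a question outside the listed categories leaves the retrieved context as is.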
5. Three-stage Self-Evaluation
At response time a Quick Verification decides whether to regenerate immediately. Once the response has streamed, Deep Verify runs asynchronously to score it on multiple axes and persist the result. On the same user's next turn, Cross-turn Correction automatically folds that score back into the system prompt.
- Quick Verification: a lightweight verifier LLM reviews the response; on failure we inject a short corrective hint and regenerate immediately with more conservative settings.
- Deep Verify: fire-and-forget after streaming, scores persona fit, factual accuracy, tool grounding, and safety.
- Cross-turn Correction: the Deep Verify score is written back into the next turn's system prompt, so the model self-corrects across turns.
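The control flow of the three stages might look like the sketch below. Everything concrete in it is an assumption made for illustration: the in-memory `score_store`, the 0.5 cutoff for a "low" axis, the temperatures standing in for "more conservative settings", and the callables passed to `respond`.

```python
import threading

score_store = {}  # user_id -> latest Deep Verify scores (stand-in for persistence)

def build_system_prompt(base_prompt, user_id):
    """Cross-turn Correction: fold the previous turn's weak axes into the prompt."""
    scores = score_store.get(user_id) or {}
    weak = sorted(axis for axis, value in scores.items() if value < 0.5)  # assumed cutoff
    if not weak:
        return base_prompt
    return base_prompt + "\nPrevious answer scored low on: " + ", ".join(weak) + "."

def respond(user_id, question, base_prompt, generate, quick_verify, deep_verify):
    prompt = build_system_prompt(base_prompt, user_id)
    answer = generate(prompt, question, temperature=0.7)
    ok, hint = quick_verify(question, answer)  # Quick Verification
    if not ok:
        # Regenerate once with a corrective hint and a lower temperature.
        answer = generate(prompt + "\n" + hint, question, temperature=0.2)

    def run_deep_verify():  # Deep Verify: runs after the answer is returned
        score_store[user_id] = deep_verify(question, answer)

    worker = threading.Thread(target=run_deep_verify)
    worker.start()
    return answer, worker  # worker is returned only so callers can wait in tests

# Stub model and verifiers to exercise the flow.
calls = []

def fake_generate(prompt, question, temperature):
    calls.append(temperature)
    return "draft" if temperature > 0.5 else "careful"

def fake_quick_verify(question, answer):
    return (answer != "draft", "Only state facts found in the context.")

def fake_deep_verify(question, answer):
    return {"persona_fit": 0.9, "factual_accuracy": 0.4,
            "tool_grounding": 0.8, "safety": 1.0}

first, worker = respond("u1", "q1", "You are a support bot.",
                        fake_generate, fake_quick_verify, fake_deep_verify)
worker.join()
next_prompt = build_system_prompt("You are a support bot.", "u1")
```

Here the first draft fails the quick check and is regenerated, the deep scores are stored in the background, and the next turn's prompt carries a note about the low factual-accuracy score.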
Where the answers actually diverge
"What's the difference between Standard and Premium?"
- Classic: only one plan lands in top-K → half an answer.
- AICLUDE: finds the Standard chunk, follows the shared-keyword link to the Premium chunk, and returns a side-by-side comparison.
"How do I fix a 500 error?"
- Classic: returns "500 means internal server error" — true, but useless.
- AICLUDE: hops "500 error" → "DB connection pool exhausted" → "set max_connections=100", answering with the full symptom → cause → fix chain.
A note on the marketing vocabulary
"Graph RAG" gets used in two very different ways:
- Entity-relation Knowledge Graph — extracts people/orgs/concepts and their relations into a formal graph.
- Chunk-link graph — auto-linked chunks via vector similarity + keywords.
AICLUDE runs the second. The first is expensive to build and prone to error propagation, so we haven't adopted it. Instead, proposition chunking, chunk links, and N-hop expansion together give us most of what the first approach tries to achieve in practice. Deep entity chains (e.g. "CEO's alma mater's founder") still benefit from a formal KG, and that layer is deliberately left open for future expansion.
Retrieval quality dictates answer quality, and verification loops guarantee it. AICLUDE's RAG is built to keep those two axes in the same product — not in two different add-ons.