Breaking an AI Agent Pipeline into 8 Stages: From Intent Understanding to Self-Correction

AICLUDE Engineering

Why "one LLM call" agents break

A single LLM call with a tool list in the system prompt goes a long way in a demo. Real traffic is a different story:

  • The user's intent is ambiguous → the wrong tool gets invoked.
  • A prompt-injection payload slips in → the system rules get bypassed.
  • A tool call fails → the broken result is handed straight back to the user.
  • The answer is factually wrong → the model does not catch it.

AICLUDE's core pipeline solves this by decomposing execution into 8 explicit stages. Each stage owns a specific failure mode, and each can be replaced, measured, and re-planned independently.

The 8-stage pipeline

Stage 1 — Input Processing

Language detection, binary sanitization, length limits. The job is to hand the next stage a clean input. If length limits trip here, no LLM call happens at all.
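A minimal sketch of this stage, assuming UTF-8 input and a character-based limit. The names process_input, InputTooLong, and MAX_CHARS are hypothetical, the limit value is made up, and language detection is left out because it would need a third-party library.

```python
import unicodedata

MAX_CHARS = 8_000  # assumed limit for illustration, not a value from the article


class InputTooLong(ValueError):
    """Raised before any LLM call when the input exceeds the limit."""


def process_input(raw: bytes, max_chars: int = MAX_CHARS) -> str:
    # Decode defensively: invalid bytes become U+FFFD instead of raising.
    text = raw.decode("utf-8", errors="replace")
    text = unicodedata.normalize("NFC", text)
    # Drop control characters (category Cc) except newline and tab.
    cleaned = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    if len(cleaned) > max_chars:
        raise InputTooLong(f"{len(cleaned)} chars exceeds limit of {max_chars}")
    return cleaned
```

Raising here, rather than truncating silently, keeps the "no LLM call happens" behaviour explicit for the caller.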

Stage 2 — Understanding

We turn the user message into an ExecutionPlan, not just a "next tool to call". The plan carries a strategy (single / serial / parallel / dag), a task array, and a merge instruction — all in one structured object.

For example, when a user asks "Pick three priorities from today's schedule and inbox," the planner fans out calendar and inbox lookups in parallel, then merges both results into a single ranked summary.
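The plan for that request could look roughly like the structure below. The field names (tasks, depends_on, merge_instruction) and the tool names are assumptions for illustration; the article only states that the plan carries a strategy, a task array, and a merge instruction.

```python
from dataclasses import dataclass, field
from typing import Literal

Strategy = Literal["single", "serial", "parallel", "dag"]


@dataclass
class Task:
    id: str
    tool: str                                   # hypothetical tool identifier
    args: dict = field(default_factory=dict)
    depends_on: list[str] = field(default_factory=list)  # only used by "dag"


@dataclass
class ExecutionPlan:
    strategy: Strategy
    tasks: list[Task]
    merge_instruction: str


# Two independent lookups, merged into one ranked answer.
plan = ExecutionPlan(
    strategy="parallel",
    tasks=[
        Task(id="calendar", tool="calendar.list_events", args={"day": "today"}),
        Task(id="inbox", tool="mail.list_unread"),
    ],
    merge_instruction="Rank the three highest priorities across both results.",
)
```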

Stage 3 — Pre-flight check

Rule-based, no LLM. It catches destructive tool calls, ambiguous intents, and missing RAG context before any expensive inference happens. The point is to stop bad executions without paying for another round trip.
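A rule-based check of this kind might be sketched as follows. The destructive-tool list, the confidence threshold, and every name here are hypothetical; the real rules are not described in the article.

```python
from dataclasses import dataclass

# Hypothetical set of tools that must not run without user confirmation.
DESTRUCTIVE_TOOLS = {"files.delete", "db.drop_table", "mail.send"}


@dataclass
class PreflightResult:
    ok: bool
    reasons: list[str]


def preflight(
    tools: list[str],
    intent_confidence: float,
    needs_rag: bool,
    rag_chunks: int,
    confirmed: bool = False,
    min_confidence: float = 0.6,  # assumed threshold
) -> PreflightResult:
    reasons: list[str] = []
    destructive = sorted(set(tools) & DESTRUCTIVE_TOOLS)
    if destructive and not confirmed:
        reasons.append("destructive tools need confirmation: " + ", ".join(destructive))
    if intent_confidence < min_confidence:
        reasons.append("intent is ambiguous")
    if needs_rag and rag_chunks == 0:
        reasons.append("RAG context is missing")
    # The plan may proceed only when no rule fired.
    return PreflightResult(ok=not reasons, reasons=reasons)
```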

Stage 4 — Execution

The plan runs. For DAGs we walk the dependency graph and execute each wave in parallel, cascading a skip to every task whose dependency failed. Inside each task a ReAct loop (LLM → tool call → observation → LLM, capped at 10 iterations) drives tool use. On failure, Adaptive Replan feeds the failure cause and the partial successes back to the LLM to generate a new plan, and the strategy itself can change (for example from parallel to serial, or to dag).
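The wave-by-wave DAG walk with cascading skips can be sketched with asyncio. This is an illustration under assumptions, not AICLUDE's executor: run_dag and the status strings are hypothetical, and the ReAct loop and replanning are omitted.

```python
import asyncio
from typing import Awaitable, Callable

TaskFn = Callable[[], Awaitable[str]]


async def run_dag(tasks: dict[str, TaskFn], deps: dict[str, set[str]]) -> dict[str, str]:
    """Run tasks in dependency waves; returns 'ok', 'failed', or 'skipped' per task."""
    status: dict[str, str] = {}
    pending = set(tasks)
    while pending:
        # Cascade: skip anything whose dependency failed or was itself skipped.
        for name in sorted(pending):
            if any(status.get(d) in ("failed", "skipped") for d in deps.get(name, set())):
                status[name] = "skipped"
        pending -= set(status)
        # A wave is every pending task whose dependencies have all succeeded.
        wave = sorted(
            n for n in pending
            if all(status.get(d) == "ok" for d in deps.get(n, set()))
        )
        if not wave:
            # Nothing runnable is left (e.g. a cycle or unknown dependency).
            for name in pending:
                status[name] = "skipped"
            break
        results = await asyncio.gather(*(tasks[n]() for n in wave), return_exceptions=True)
        for name, result in zip(wave, results):
            status[name] = "failed" if isinstance(result, BaseException) else "ok"
        pending -= set(wave)
    return status
```

Using return_exceptions=True lets one failing task in a wave be recorded without cancelling its siblings, which is what makes partial successes available for a later replan.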

Stage 5 — Fast Gate

A regex-based real-time filter that runs before streaming. Harmful content, PII leakage, XSS patterns — caught in milliseconds. Deterministic and fast, so it is the right first line of defence.
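A regex gate along these lines could look like the sketch below. The three patterns are deliberately simple illustrations, not the production rule set, and fast_gate is a hypothetical name.

```python
import re

# Illustrative patterns only; a real gate would carry a much larger, tuned set.
PATTERNS = {
    "pii_email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "xss_script_tag": re.compile(r"<\s*script\b", re.IGNORECASE),
    "xss_javascript_url": re.compile(r"javascript\s*:", re.IGNORECASE),
}


def fast_gate(text: str) -> list[str]:
    """Return the names of all patterns that match; an empty list means pass."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(text)]
```

Because the patterns are compiled once at import time, each check is a handful of linear scans over the text with no network or model call involved.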

Stage 6 — Quick Verification (Reflection Loop)

A lightweight verifier LLM reviews the generated response. When validation fails, the verifier returns a short corrective hint; we inject that hint into the system message and regenerate on the spot with more conservative settings. This is cheaper than trying to "get it right in one shot" because only failures pay the regeneration cost.
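The loop can be sketched with the model calls passed in as plain functions. Everything named here is hypothetical, and treating "more conservative settings" as a lower temperature is an assumption rather than something the article specifies.

```python
from typing import Callable, Optional

Generate = Callable[[str, float], str]   # (system_message, temperature) -> response
Verify = Callable[[str], Optional[str]]  # response -> corrective hint, or None on pass


def generate_with_reflection(
    system_message: str,
    generate: Generate,
    verify: Verify,
    max_retries: int = 1,
    temperature: float = 0.7,
    retry_temperature: float = 0.2,  # assumed meaning of "more conservative"
) -> str:
    response = generate(system_message, temperature)
    for _ in range(max_retries):
        hint = verify(response)
        if hint is None:
            return response  # the common path: no regeneration cost
        # Inject the verifier's hint and regenerate with conservative settings.
        system_message = f"{system_message}\n\nCorrection hint: {hint}"
        response = generate(system_message, retry_temperature)
    return response
```

With max_retries=1 the worst case is two generations and one verification per turn, so the extra cost stays bounded.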

Stage 7 — Post-flight check + auto-patch

Deterministic post-checks: are artifact URLs alive, is the output empty, are required fields present? Anything auto-fixable triggers a targeted patch — the LLM rewrites just the broken region instead of regenerating the whole response, so latency stays predictable.
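The deterministic part of this stage might be sketched as below. The response shape (a dict with text and artifact_urls keys) and the issue labels are assumptions; the URL probe is passed in as a function so the check itself stays free of network code.

```python
from typing import Callable


def postflight(
    response: dict,
    required_fields: list[str],
    url_is_alive: Callable[[str], bool],
) -> list[str]:
    """Return a list of issue labels; an empty list means the response passes."""
    issues: list[str] = []
    if not str(response.get("text", "")).strip():
        issues.append("empty_output")
    for name in required_fields:
        if name not in response:
            issues.append(f"missing_field:{name}")
    for url in response.get("artifact_urls", []):
        if not url_is_alive(url):
            issues.append(f"dead_url:{url}")
    return issues
```

Each label identifies one broken region, which is what a targeted patch step would need in order to rewrite only that part.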

Stage 8 — Deep Verify + Cross-turn Correction

After the response has been streamed to the user, a fire-and-forget deep verification scores the turn across persona fit, factual accuracy, tool grounding, and safety. The corrective hint produced here is auto-injected into the next turn's system message, so the model self-corrects across turns — no extra call needed.
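The cross-turn hand-off can be sketched as a small per-session store: the asynchronous verifier writes hints, and the next turn consumes them while building its system message. HintStore and its methods are hypothetical names, and an in-memory dict stands in for whatever storage is actually used.

```python
from collections import defaultdict


class HintStore:
    def __init__(self) -> None:
        self._pending: dict[str, list[str]] = defaultdict(list)

    def add(self, session_id: str, hint: str) -> None:
        """Called by the deep verifier after the turn has already been streamed."""
        self._pending[session_id].append(hint)

    def build_system_message(self, session_id: str, base: str) -> str:
        """Called at the start of the next turn; consumes pending hints once."""
        hints = self._pending.pop(session_id, [])
        if not hints:
            return base
        lines = "\n".join(f"- {h}" for h in hints)
        return f"{base}\n\nCorrections from the previous turn:\n{lines}"
```

Popping the hints means each correction is applied to exactly one following turn instead of accumulating in every later prompt.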

Why this is cheaper than "one shot perfection"

A common reaction: "won't running multiple verifiers cost more?" In practice it's the other way around.

  • Only failures pay — Quick Verify lets most responses through on the first try. Regeneration cost only hits the minority that fails.
  • Targeted patching replaces full regeneration — patching a broken artifact uses 10–20% of the tokens a full regeneration would.
  • Deep Verify is async — it builds a quality dataset without touching user-visible latency.
  • Cross-turn Correction is free — it's just a system-message injection on the next turn, no extra LLM call.

Stacking more rules into a single prompt to "get it right in one shot" keeps the average token cost high. Detecting failures fast and patching them cheaply is better on both latency and cost.
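The cost argument can be made concrete with a back-of-the-envelope model. All numbers below are hypothetical and chosen only to show the shape of the comparison; they are not measurements from the pipeline.

```python
def staged_cost(base_tokens: float, verify_tokens: float,
                fail_rate: float, regen_tokens: float) -> float:
    """Expected tokens per turn when only failing responses are regenerated."""
    return base_tokens + verify_tokens + fail_rate * regen_tokens


def one_shot_cost(base_tokens: float, extra_rule_tokens: float) -> float:
    """Tokens per turn when every request carries the extra prompt rules."""
    return base_tokens + extra_rule_tokens


# Hypothetical inputs: 1000-token turn, 150-token verifier call,
# 10% failure rate, 1000-token regeneration, 500 tokens of extra rules.
staged = staged_cost(1000, 150, 0.10, 1000)
one_shot = one_shot_cost(1000, 500)
print(f"staged: {staged:.0f} tokens, one-shot: {one_shot:.0f} tokens")
```

The staged approach wins whenever the verifier cost plus the failure rate times the regeneration cost is smaller than the rules added to every prompt, so the comparison depends on the actual failure rate.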

The real value: replacing and observing stages independently

The most practical win of the 8-stage split is that every stage is independently replaceable. Understanding can use a strong reasoning model, Fast Gate can be a pure rule engine, Deep Verify can be a cheap model — each stage picks its own optimum. Per-stage DB metrics make it obvious where failures happen, which turns agent improvement from guesswork into targeted work.
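Per-stage measurement can be sketched as a context manager wrapped around each stage. Here an in-memory list stands in for the metrics database, and the record fields (stage, outcome, ms) are assumptions.

```python
import time
from contextlib import contextmanager

METRICS: list[dict] = []  # stand-in for a metrics table


@contextmanager
def stage(name: str):
    """Record the outcome and wall-clock duration of one pipeline stage."""
    start = time.perf_counter()
    outcome = "ok"
    try:
        yield
    except Exception:
        outcome = "failed"
        raise  # the failure still propagates to the caller
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        METRICS.append({"stage": name, "outcome": outcome, "ms": elapsed_ms})
```

Wrapping a stage is then a one-liner such as `with stage("fast_gate"): ...`, and grouping the records by stage and outcome shows where failures concentrate.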

This is why AICLUDE runs agents on top of this pipeline. The gap between "seems to work" and "actually useful in production" does not close without it.

