Evaluating Multi-Agent Systems: Catching Tool-Use Hallucinations in Production
When AI agents use external tools, hallucinations stop being just bad text and become corrupted databases and spiked API bills. Here is how to evaluate and trace multi-agent trajectories before they fail.
An AI model hallucinating a historical fact during a chat session is an embarrassment. An AI agent hallucinating a required parameter in a CRM API call, failing, and retrying the exact same malformed request dozens of times in a loop is an operational failure. It silently burns through your daily rate limits, spikes your inference bill, and degrades system performance.
As multi-agent architectures move from experimental sandboxes to live business environments in 2026, the definition of a "hallucination" has fundamentally changed. Single-prompt systems generate text; agents execute actions. They query databases, draft emails, update records, and trigger downstream workflows. When an agent invents a tool that does not exist, or confidently passes the wrong JSON arguments to a tool that does, the blast radius extends far beyond a poor user experience.
Across the industry, most enterprise AI projects stall in pilot purgatory precisely because of this risk. Companies accumulate AI technical debt—tangled prompt chains, unmonitored agents, and brittle architectures that work perfectly in a controlled demo but collapse under edge cases. This stalled state represents a massive sunk cost in engineering salaries and delayed digital transformation timelines. Verel takes AI from this spaghetti state to production. To build production-grade multi-agent systems, you must abandon the idea of evaluating just the final text output and start evaluating the entire decision trajectory.
The Financial and Operational Cost of Silent Failures
To understand why multi-agent system evaluation requires a distinct engineering approach, you have to look at how these systems actually fail in production. They rarely fail by announcing an error to the user. Instead, they fail silently, expensively, and recursively.
Consider a multi-agent system designed to qualify inbound sales leads. The orchestration graph includes a research agent that searches the web for company background, a data agent that queries the internal CRM via an API, and a synthesis agent that drafts the final brief.
If the user asks about a company the system cannot find, a poorly evaluated agent will not simply say "I don't know." The research agent might hallucinate a slightly different company name to force a search result. The data agent, receiving this hallucinated name, queries the CRM. When the CRM API returns a 404 Not Found, the agent's internal logic often prompts it to "try again."
Without hard system constraints, the agent enters a retry loop: it alters the query slightly, hits the API again, fails, and repeats until it hits a framework timeout.
The cost of this failure is highly quantifiable:
- Inference Overhead: If your agent runs on a frontier model family, each loop iteration requires the model to process the entire history of its previous failed attempts. If the context window averages 10,000 input tokens per retry, and the loop runs 50 times before hitting a hard timeout, a single user query consumes 500,000 tokens. At an illustrative cost of $3.00 per million input tokens, that single failed query costs $1.50 in pure inference. Multiply that by 100 concurrent sessions experiencing similar edge cases, and you waste $150 in an hour.
- Engineering Downtime: Without proper tracing, engineering teams waste an average of 14 hours of highly paid developer time per incident trying to replicate and diagnose a single untraced agent loop.
- Customer Churn: For a SaaS platform, a 2% API error rate caused by recursive loops can trigger a 15% spike in customer churn within 30 days due to degraded platform performance and rate-limiting issues.
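The inference arithmetic above is easy to reproduce. The sketch below is only a back-of-the-envelope calculator: the function name `retry_loop_cost` is ours, and the token counts and the $3.00 price are the illustrative figures from the text, not measured values.

```python
def retry_loop_cost(tokens_per_retry, retries, usd_per_million_tokens):
    """Input-token cost in USD of one query stuck in a retry loop."""
    total_tokens = tokens_per_retry * retries
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Illustrative figures from the text: 10,000 tokens per retry, 50 retries, $3.00 per million.
per_query = retry_loop_cost(10_000, 50, 3.00)
print(f"one looping query: ${per_query:.2f}")
print(f"100 concurrent sessions: ${per_query * 100:.2f}")
```

Swapping in your own average context size and provider pricing gives a quick upper bound on what an uncapped loop can cost.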
Evaluating multi-agent systems is about catching these trajectory failures before they reach production, and tracing them instantly when they occur in the wild.
The Trajectory Over The Output: In multi-agent systems, a correct final answer is not proof of a working system. An agent might stumble into the right answer after 15 unnecessary API calls. Evaluation must measure the efficiency and accuracy of the path taken, not just the destination.
Why Standard LLM Metrics Fail for Agents
If you attempt to evaluate a multi-agent system using the tools built for chatbots, you will be blind to your system's actual performance. Traditional LLM evaluation relies on reference-based metrics or simple LLM-as-a-judge scoring of the final output.
For business leaders, relying on these outdated metrics introduces severe compliance and operational risks. If you evaluate an agent solely on the politeness or formatting of its final response, you risk shipping a system that passes tests but silently corrupts backend databases, risking multi-million dollar contract renewals and regulatory penalties.
Metrics designed to compare text similarity are insufficient when evaluating an agent's decision-making process. You do not care if the agent's final summary has a high semantic overlap with a human's summary; you care whether the agent actually queried the live database or if it just guessed the numbers based on its pre-training weights.
Agent evaluation requires measuring "faithfulness" and "tool-selection accuracy."
Faithfulness in an agent context means: Did the final output rely exclusively on the data retrieved by the tools, or did the model inject outside information? Tool-selection accuracy asks: Given the user's prompt and the available tools, did the agent select the optimal tool with the correct parameters on the first try?
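Tool-selection accuracy can be computed directly from labeled trajectories. The following is a minimal sketch, assuming you have a small set of cases where a reviewer has written down the expected first tool call; `ToolCall`, `LabeledCase`, and `first_try_accuracy` are hypothetical names, and it does not attempt to score faithfulness.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    tool: str
    args: dict


@dataclass
class LabeledCase:
    expected: ToolCall                          # what a reviewer says the first call should be
    calls: list = field(default_factory=list)   # ToolCall objects, in the order the agent made them


def first_try_accuracy(cases):
    """Fraction of cases where the agent's first call matched the expected tool and arguments."""
    if not cases:
        return 0.0
    hits = sum(1 for case in cases if case.calls and case.calls[0] == case.expected)
    return hits / len(cases)


cases = [
    LabeledCase(ToolCall("search_crm", {"company_name": "Acme"}),
                [ToolCall("search_crm", {"company_name": "Acme"})]),
    # Right tool eventually, but only after a wrong first attempt: counts as a miss.
    LabeledCase(ToolCall("search_crm", {"company_name": "Globex"}),
                [ToolCall("web_search", {"q": "Globex"}),
                 ToolCall("search_crm", {"company_name": "Globex"})]),
]
print(first_try_accuracy(cases))
```

Scoring only the first call is deliberate: an agent that reaches the right tool on its second or third attempt still paid for the wrong ones.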
To measure this, engineering teams adapt frameworks like RAGAS (Retrieval Augmented Generation Assessment). While originally designed for search pipelines, the mathematical principles apply directly to agents. By logging the exact JSON payload the agent attempts to pass to a tool, you can write deterministic tests that check if the parameters match the required schema. You can then use a secondary, cheaper model to review the agent's trajectory and score its tool selection from 0 to 1 based on necessity and correctness.
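A deterministic parameter check needs nothing beyond the standard library. The sketch below assumes a hypothetical `search_crm` tool with one required string parameter and one optional integer; the schema format and the `validate_tool_args` helper are illustrative, and a team already using a JSON Schema validator would likely use that instead.

```python
import json

# Hypothetical schema for a `search_crm` tool: parameter name -> (type, required)
SEARCH_CRM_SCHEMA = {
    "company_name": (str, True),
    "limit": (int, False),
}


def validate_tool_args(raw_payload, schema):
    """Return a list of problems; an empty list means the payload matches the schema."""
    try:
        args = json.loads(raw_payload)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(args, dict):
        return ["payload must be a JSON object"]

    problems = []
    for name, (expected_type, required) in schema.items():
        if name not in args:
            if required:
                problems.append(f"missing required parameter: {name}")
        elif type(args[name]) is not expected_type:
            problems.append(f"wrong type for parameter: {name}")
    # Parameters the tool never defined are a common hallucination signature.
    for name in args:
        if name not in schema:
            problems.append(f"unknown parameter: {name}")
    return problems


print(validate_tool_args('{"company_name": "Acme", "limit": 5}', SEARCH_CRM_SCHEMA))
print(validate_tool_args('{"company": "Acme"}', SEARCH_CRM_SCHEMA))
```

Because the check is pure and fast, it can run in CI against recorded payloads and again at runtime before the real API is called.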
Trajectory Evaluation and Tool-Use Tracing
You cannot evaluate what you cannot see. The foundation of multi-agent evaluation is comprehensive, asynchronous tracing. Every time an agent thinks, decides, or acts, that event must be logged as a distinct span within a larger trace.
From a strategic standpoint, observability is not an engineering luxury; it is your primary risk-mitigation tool. Robust tracing can sharply reduce your Mean Time to Resolution (MTTR), because the failing step is already visible in the trace instead of having to be reproduced by hand. That protects your SLA commitments in demanding enterprise markets like the US and the Gulf region.
When we rebuild failed AI pilots into production systems at Verel, the first things we rip out are the raw print statements and basic database logs. We replace them with dedicated AI observability platforms like Langfuse or Weave to trace tool use and trajectories in real time.
A production trace captures the exact state of the system at every node:
- The raw user input.
- The prompt injected into the routing agent.
- The routing agent's decision (e.g., calling the search_crm tool).
- The exact JSON arguments generated for the tool.
- The execution time of the external API call.
- The raw response from the API.
- The agent's interpretation of that response.
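As a data structure, the list above maps onto spans that share one trace ID. The sketch below is a plain-Python illustration of that shape, not the schema of Langfuse, Weave, or any other platform; the `Span` and `Trace` classes, the span names, and the sample payloads are all assumptions.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str                       # links every span back to one user request
    name: str
    payload: dict
    duration_ms: Optional[float] = None


@dataclass
class Trace:
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: list = field(default_factory=list)

    def record(self, name, payload, duration_ms=None):
        span = Span(self.trace_id, name, payload, duration_ms)
        self.spans.append(span)
        return span


trace = Trace()
trace.record("user_input", {"text": "Qualify the lead from Acme"})
trace.record("router_prompt", {"prompt": "You are a routing agent."})
trace.record("router_decision", {"tool": "search_crm"})
trace.record("tool_args", {"company_name": "Acme"})
trace.record("tool_call", {"endpoint": "crm"}, duration_ms=182.0)
trace.record("tool_response", {"status": 200, "body": {"matches": 1}})
trace.record("agent_interpretation", {"summary": "One matching CRM record found"})
print(len(trace.spans), "spans in trace", trace.trace_id)
```

The shared `trace_id` is what separates a trace from a pile of logs: any single span can be joined back to everything that happened before and after it.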
Business leaders often worry that capturing this level of telemetry will slow down the application. This is a misconception rooted in older synchronous logging practices. Modern tracing libraries operate asynchronously. The telemetry data is sent in the background, adding negligible latency to the user's request-response cycle.
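The pattern behind that claim is a queue drained by a background worker. The sketch below shows the idea with the standard library only; `BackgroundExporter` is a hypothetical class, and real tracing SDKs add batching, retries, and shutdown handling that are omitted here.

```python
import queue
import threading


class BackgroundExporter:
    """Accepts telemetry events on the request path and ships them from a worker thread."""

    def __init__(self, sink):
        self._queue = queue.Queue()
        self._sink = sink  # any callable that delivers one event, e.g. an HTTP client wrapper
        threading.Thread(target=self._drain, daemon=True).start()

    def emit(self, event):
        # Called on the request path: enqueue and return immediately.
        self._queue.put(event)

    def _drain(self):
        while True:
            event = self._queue.get()
            try:
                self._sink(event)
            except Exception:
                pass  # a telemetry failure must never take down the worker
            finally:
                self._queue.task_done()

    def flush(self):
        # Block until every queued event has been handed to the sink.
        self._queue.join()


received = []
exporter = BackgroundExporter(received.append)
exporter.emit({"span": "tool_call", "status": 404})
exporter.emit({"span": "tool_call", "status": 200})
exporter.flush()
print(received)
```

The request handler only pays for a queue insert; the network round trip to the observability backend happens off the critical path.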
With this data captured, you can run nightly evaluations. You do not need to evaluate every single trace manually. Instead, you filter the traces. You pull every trace where an API call returned a 400-level error, or every trace where the execution time exceeded 10 seconds. You then run an LLM-as-a-judge over just those failed trajectories to categorize the root cause: was it a bad user input, a flaky external API, or a genuine tool-use hallucination by the agent?
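The filtering step is a simple predicate over stored traces. This sketch assumes each trace is a dictionary with a `duration_s` field and a list of `tool_calls` carrying HTTP status codes; those field names are hypothetical and will differ by tracing platform.

```python
def needs_review(trace, max_seconds=10.0):
    """True if any tool call returned a 4xx status or the whole trace ran too long."""
    has_client_error = any(400 <= call["status"] < 500 for call in trace["tool_calls"])
    return has_client_error or trace["duration_s"] > max_seconds


traces = [
    {"id": "t1", "duration_s": 2.1, "tool_calls": [{"tool": "search_crm", "status": 200}]},
    {"id": "t2", "duration_s": 3.4, "tool_calls": [{"tool": "search_crm", "status": 404}]},
    {"id": "t3", "duration_s": 14.0, "tool_calls": [{"tool": "search_crm", "status": 200}]},
]
review_queue = [t["id"] for t in traces if needs_review(t)]
print(review_queue)
```

Only the traces in `review_queue` would be passed to the judge model, which keeps the nightly inference bill proportional to failures rather than to total traffic.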
This turns evaluation from a vague "vibe check" into a strict engineering discipline. When we partner with enterprise teams, we build these exact guardrails and observability pipelines directly into their architecture to secure their AI investments.
Implementing Hard Breakpoints and Human-in-the-Loop
Evaluation tells you that a system is failing; architecture prevents the failure from causing damage.
The most critical architectural pattern for multi-agent systems is stateful orchestration. Frameworks like LangGraph allow engineers to define agents not as black boxes of text generation, but as nodes in a cyclical graph with explicit edges and state management.
This stateful approach allows for hard breakpoints. If evaluation data shows that an agent frequently hallucinates parameters when dealing with a specific edge case, you do not just prompt the agent to "be more careful." You implement a conditional edge in the graph.
If the agent attempts to call a destructive tool (like delete_record or send_client_email), the graph pauses execution. The current state is saved to a database, and the system waits for human approval. This is human-in-the-loop (HITL) not as an interface feature, but as a risk management primitive.
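The breakpoint logic itself is small. The sketch below is framework-agnostic Python rather than LangGraph's actual API: `route_tool_call`, `approve`, and the dictionary standing in for a database are hypothetical, and the tool names are the two examples from the text.

```python
import json

DESTRUCTIVE_TOOLS = {"delete_record", "send_client_email"}


def route_tool_call(call, pending_store, execute):
    """Run safe tools immediately; persist destructive ones and wait for approval."""
    if call["tool"] in DESTRUCTIVE_TOOLS:
        pending_store[call["id"]] = json.dumps(call)  # saved state survives a restart
        return {"status": "awaiting_approval", "call_id": call["id"]}
    return {"status": "executed", "result": execute(call)}


def approve(call_id, pending_store, execute):
    """Called by a human reviewer to resume a paused call."""
    call = json.loads(pending_store.pop(call_id))
    return {"status": "executed", "result": execute(call)}


executed = []

def fake_execute(call):
    executed.append(call["tool"])
    return "ok"

store = {}
first = route_tool_call({"id": "c1", "tool": "search_crm", "args": {}}, store, fake_execute)
second = route_tool_call({"id": "c2", "tool": "delete_record", "args": {"id": 7}}, store, fake_execute)
print(first["status"], second["status"], executed)
third = approve("c2", store, fake_execute)
print(third["status"], executed)
```

The important property is that the pause is enforced by the router, not requested in the prompt, so the model cannot talk its way past it.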
For enterprise buyers, integrating human-in-the-loop for high-risk actions sharply reduces the risk of liability, compliance breaches, and customer-facing errors, because no destructive action runs without a human sign-off. It protects your brand reputation while still capturing the efficiency gains of automation.
Furthermore, stateful graphs allow you to hardcode recursion limits. You can dictate that an agent may only attempt to fix a broken tool call three times. On the fourth failure, the graph forces a graceful exit, returning a standardized error message to the user and flagging the trace for engineering review. This strictly limits the risk of deep recursion loops and the associated API cost overruns, transforming an expensive operational failure into a routine, logged exception.
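A repair limit can be expressed as a bounded loop around the tool call. This is a minimal sketch in plain Python, not a LangGraph configuration: `ToolCallError`, `call_with_repair_limit`, and the always-failing demo tool are hypothetical, and the limit of three repairs follows the example in the text.

```python
class ToolCallError(Exception):
    """Raised by a tool wrapper when the external call fails."""


def call_with_repair_limit(call_tool, repair_args, args, max_repairs=3):
    failures = []
    for attempt in range(max_repairs + 1):  # the first call plus up to max_repairs fixes
        try:
            return {"ok": True, "result": call_tool(args), "attempts": attempt + 1}
        except ToolCallError as exc:
            failures.append(str(exc))
            if attempt < max_repairs:
                args = repair_args(args, str(exc))
    # Fourth failure: exit gracefully and flag the trace instead of looping on.
    return {
        "ok": False,
        "message": "Sorry, this request could not be completed.",
        "flag_for_review": True,
        "failures": failures,
    }


calls_made = []

def always_404(args):
    calls_made.append(args)
    raise ToolCallError("404 Not Found")

def tweak_name(args, error):
    return {"company_name": args["company_name"] + "?"}

outcome = call_with_repair_limit(always_404, tweak_name, {"company_name": "Acme"})
print(outcome["ok"], len(calls_made), outcome["flag_for_review"])
```

With the cap in place, the worst case is four API calls and four model turns, which is a cost you can budget for.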
Production Multi-Agent Evaluation Framework
Choosing how to evaluate your agents depends entirely on the risk profile of the actions they take. We structure evaluation across four distinct layers, balancing the cost of the evaluation against the cost of a system failure.
| Evaluation Method | Implementation Cost | Runtime Latency Impact | Primary Business Value |
|---|---|---|---|
| Deterministic Unit Tests | Low (Engineering time) | None (Runs in CI/CD pipeline) | Prevents regressions in basic tool selection and JSON formatting before deployment. |
| Asynchronous Tracing | Low (SaaS platform costs) | Negligible (async) | Provides complete visibility into production failures; required for auditing. |
| LLM-as-a-Judge (Nightly) | Medium (Inference costs)* | None (Runs offline) | Automatically scores trajectories and categorizes silent failures without human labor. |
| Human-in-the-Loop | High (Operational labor) | Pauses execution until a human responds | Strongest safeguard against destructive actions; no hallucinated write executes without approval. |
*The inference cost of nightly evaluation is highly predictable. Running an LLM-as-a-judge over 1,000 complex traces using a frontier model (consuming roughly 5,000 input tokens per trace) requires 5 million tokens. At an illustrative $3.00 per million, the nightly cost is $15.00. This $450 monthly premium is negligible compared to the cost of a rogue agent corrupting production data.
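The footnote's numbers can be checked, and adapted to your own volumes, with a few lines. The function name `nightly_judge_cost` is ours, and the 30-day month and $3.00 price are the illustrative assumptions from the text.

```python
def nightly_judge_cost(traces, input_tokens_per_trace, usd_per_million_tokens):
    """Input-token cost in USD of one nightly LLM-as-a-judge run."""
    return traces * input_tokens_per_trace / 1_000_000 * usd_per_million_tokens


nightly = nightly_judge_cost(1_000, 5_000, 3.00)
monthly = nightly * 30  # assumes a 30-day month
print(f"nightly: ${nightly:.2f}, monthly: ${monthly:.2f}")
```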
Frequently Asked Questions
What is the expected ROI of setting up an evaluation and tracing framework compared to the cost of implementation? The initial setup cost of a tracing and evaluation framework (typically $10,000 to $20,000 in engineering hours or partner fees) is usually recovered within the first quarter of deployment. This payback is achieved by preventing runaway API loops (which can cost thousands of dollars in a single weekend), reducing developer debugging time by up to 80%, and avoiding customer churn caused by silent system failures.
How do we measure if an agent is actually saving time if it sometimes requires human intervention? You measure the "autonomous completion rate" alongside the "escalation rate." If an agent handles an illustrative 80% of inquiries start-to-finish without human intervention, and routes the remaining 20% to a human with a fully prepared context brief, the net time saved is still massive. The goal of human-in-the-loop is to handle the high-risk edge cases safely, not to eliminate automation. You track the total time spent by humans on the escalated tasks versus the time they previously spent handling 100% of the volume.
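Both rates, and the human time before and after automation, fall out of four inputs. In the sketch below the function `automation_summary` is hypothetical, and the volumes and per-task minutes are made-up illustrative values; only the 80/20 split comes from the text.

```python
def automation_summary(total, escalated, minutes_per_manual_task, minutes_per_escalation):
    """Compare human time before automation with human time spent on escalations after it."""
    return {
        "autonomous_completion_rate": (total - escalated) / total,
        "escalation_rate": escalated / total,
        "human_minutes_before": total * minutes_per_manual_task,
        "human_minutes_after": escalated * minutes_per_escalation,
    }


# Illustrative only: 1,000 inquiries, 200 escalated, 12 min per manual task, 5 min per escalation.
summary = automation_summary(1_000, 200, 12, 5)
print(summary)
```

Escalations are usually faster than fully manual handling because the agent has already prepared the context brief, which is why the two per-task times are separate inputs.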
Can we use smaller, cheaper models to evaluate the main agent's tool use? Yes, and this is a standard practice for managing evaluation costs. While the primary agent might require a massive frontier model to reason through complex tool selection, the evaluator model operates with the benefit of hindsight. It is much easier to verify if a tool call was correct than it is to generate the correct tool call from scratch. A fine-tuned model in the 8B parameter class can reliably score tool-use trajectories for a fraction of the cost of running a frontier model as a judge.
What is the difference between tracing and logging for AI agents? Logging records isolated events: "Error 404 at 10:02 AM." Tracing records the entire lifecycle of a request as a connected graph of events. A trace shows the user's prompt, the agent's reasoning, the specific tool called, the API response, and the agent's next action, all linked together by a single trace ID. When an agent fails, a log tells you that it broke; a trace tells you exactly what the agent was thinking when it broke.
How do we prevent an agent from taking destructive actions in our database?
You separate read privileges from write privileges at the infrastructure level. The agent is given tools that can only execute SELECT queries or read APIs. If a workflow requires a database write or a destructive action, the agent is given a tool that merely drafts the proposed change and stages it. A separate orchestration layer then requires either a deterministic validation check or a human approval before the staged action is committed to the live database.
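The read/stage/commit split can be demonstrated with SQLite, which supports read-only connections through a `mode=ro` URI. Everything else in this sketch is an assumption for illustration: the `leads` table, the tool functions, the allowed-status rule, and the in-process list that stands in for a staging table.

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "crm.db")

# Privileged connection: owned by the orchestration layer, never handed to the agent.
admin = sqlite3.connect(db_path)
admin.execute("CREATE TABLE leads (id INTEGER PRIMARY KEY, status TEXT)")
admin.execute("INSERT INTO leads VALUES (1, 'new')")
admin.commit()

# Agent connection: opened read-only, so any write raises sqlite3.OperationalError.
agent_conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)

staged = []
ALLOWED_STATUSES = {"new", "qualified", "disqualified"}


def read_lead(lead_id):
    """Agent tool: read-only lookup."""
    return agent_conn.execute(
        "SELECT id, status FROM leads WHERE id = ?", (lead_id,)
    ).fetchone()


def stage_status_change(lead_id, new_status):
    """Agent tool: drafts a change without touching the live table."""
    staged.append({"lead_id": lead_id, "status": new_status})


def commit_staged(approved_by_human=False):
    """Orchestration layer: commit changes that pass validation or carry human approval."""
    committed = 0
    for change in list(staged):
        if change["status"] in ALLOWED_STATUSES or approved_by_human:
            admin.execute("UPDATE leads SET status = ? WHERE id = ?",
                          (change["status"], change["lead_id"]))
            staged.remove(change)
            committed += 1
    admin.commit()
    return committed


stage_status_change(1, "qualified")
committed_count = commit_staged()
print(committed_count, read_lead(1))
```

In a production database the same separation would come from distinct roles and credentials, so a hallucinated write fails at the permission layer regardless of what the agent generates.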
