Agent Evals in Production: Tracing Tool Use and Trajectories
Traditional single-turn RAG evaluations fail in multi-agent systems. Discover how tracing agent trajectories and evaluating intermediate tool use prevents compounding errors and silent failures in production.
Your multi-agent system did not fail because the underlying language model lacked reasoning capability. It failed because a routing agent hallucinated a date parameter on step two, passed that invalid parameter to your internal CRM tool on step three, received an empty array in response, and confidently told the user their account did not exist on step four. If you are only evaluating the final output of your AI systems, you are entirely blind to the compounding errors happening inside the black box—errors that directly translate to customer churn, spiked support costs, and wasted engineering hours.
Across the industry, engineering teams are discovering that the evaluation frameworks built for basic Retrieval-Augmented Generation (RAG) are dangerously inadequate for agentic workflows. Single-turn RAG is straightforward: a user asks a question, the system retrieves text, and the model generates an answer. You can measure retrieval accuracy and answer relevance. Multi-agent systems, however, execute complex, cyclic workflows. They make decisions, call external APIs, evaluate their own tool outputs, and route tasks to other specialized agents.
When these systems move past the demo stage, they accumulate AI technical debt rapidly. What starts as a clever prototype turns into "AI spaghetti"—tangled prompt chains, unmonitored agents looping endlessly, and brittle tool integrations that break under real load. For enterprise buyers and SaaS founders, this operational instability represents a severe business risk: unpredictable API bills, degraded user experience, and compromised data integrity. Verel Systems takes AI from this spaghetti state to production. A core part of that transition is implementing rigorous multi-agent evaluation, which means shifting focus from the final answer to the entire trajectory of the agent's execution.
The Compounding Cost of Multi-Agent Errors
The fundamental mathematical reality of multi-agent systems is that per-step success probabilities multiply, so errors compound with every step you add. This is the primary reason why so many enterprise AI projects stall in pilot purgatory, draining budgets without ever delivering a return on investment (ROI). A system that looks highly capable in isolated tests will reliably collapse when deployed into a multi-step business process.
Consider a simple agentic workflow designed to process a customer refund request. The system must classify the intent, retrieve the customer record via an API, verify the refund policy against a knowledge base, execute the refund via a payment gateway tool, and draft a confirmation email. That is five distinct steps.
If the language model driving the agent has a 95% success rate at executing any individual step correctly (a rate that feels excellent during a casual proof-of-concept), and the steps fail independently, the overall probability of the system completing the entire task successfully is $0.95^5 \approx 0.774$, or about 77.4%. If you add just two more steps for a manager approval loop and a database logging action, the success rate drops to $0.95^7 \approx 0.698$, just below 70%.
From a business operations perspective, a failure rate of roughly 30% is catastrophic. If your platform processes 1,000 refund requests a day, about 300 of them will fail silently or execute incorrectly. If resolving a single escalated customer support ticket costs an average of $15 in manual engineering and support time, this single unstable workflow introduces about $4,500 per day ($135,000 per 30-day month) in unnecessary operational overhead, completely wiping out the cost savings the AI was built to achieve.
[Step 1: Classify] (95%)
│
[Step 2: Retrieve CRM] (95%) ──> Cumulative Success: 90.3%
│
[Step 3: Verify Policy] (95%) ──> Cumulative Success: 85.7%
│
[Step 4: Execute Refund] (95%) ──> Cumulative Success: 81.5%
│
[Step 5: Email Confirm] (95%) ──> Total Workflow Success: 77.4% (22.6% Failure Rate)
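The compounding arithmetic is easy to reproduce. The sketch below assumes every step succeeds independently with the same probability, which is a simplification; the function names are illustrative and not part of any framework.

```python
def workflow_success(step_success: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return step_success ** steps


def daily_failure_cost(requests_per_day: int, failure_rate: float,
                       cost_per_failure: float) -> float:
    """Expected daily cost of failed runs that need manual handling."""
    return requests_per_day * failure_rate * cost_per_failure


# Cumulative success for the refund workflow at 95% per step.
for n in (2, 3, 4, 5, 7):
    print(f"{n} steps: {workflow_success(0.95, n):.1%}")

# The article rounds the seven-step failure rate to 30%.
print(daily_failure_cost(1_000, 0.30, 15.0))
```

In a real system the per-step rates differ and failures are often correlated, so treat the result as a rough planning estimate rather than a forecast.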
This compounding degradation is invisible if you only measure the final output. When the user receives an email stating "I cannot process your refund at this time," traditional evaluation metrics might grade that response as "polite" and "grammatically correct." The final output is technically fine. The failure occurred deep in the trajectory—perhaps the agent passed a string instead of an integer to the payment gateway tool, received a 400 Bad Request error, and defaulted to a generic failure message.
Without trajectory evaluation, engineering teams spend hours manually reading through server logs trying to reconstruct the state of the agent at the exact moment it failed. This manual debugging does not scale beyond a handful of concurrent users. To run multi-agent systems in production without exploding your engineering payroll, you must instrument the system to automatically track, trace, and evaluate the specific path the agent took to arrive at its conclusion.
Multi-agent evaluation requires measuring the intermediate state. You must evaluate whether the agent selected the correct tool from its available options, whether it formatted the JSON payload correctly according to the tool's schema, and whether it accurately interpreted the tool's response before proceeding to the next node in the graph.
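Two of these checks (did the agent pick a tool that exists, and does the payload match the tool's schema) can be done deterministically before any judge model is involved. The following is a minimal sketch; the tool names and the `TOOL_SCHEMAS` registry are hypothetical, and a production system would more likely validate against the JSON Schema the tool already publishes.

```python
from typing import Any

# Hypothetical registry: tool name -> {parameter name: required Python type}.
TOOL_SCHEMAS: dict[str, dict[str, type]] = {
    "execute_refund": {"customer_id": str, "amount_cents": int},
    "lookup_customer": {"email": str},
}


def check_tool_call(tool_name: str, payload: dict[str, Any]) -> list[str]:
    """Return the problems found in one tool call; an empty list means it passed."""
    schema = TOOL_SCHEMAS.get(tool_name)
    if schema is None:
        # The agent named a tool that is not registered (a hallucinated tool).
        return [f"unknown tool: {tool_name}"]
    problems = []
    for name, expected in schema.items():
        if name not in payload:
            problems.append(f"missing field: {name}")
        elif type(payload[name]) is not expected:
            actual = type(payload[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {actual}")
    for name in payload:
        if name not in schema:
            problems.append(f"unexpected field: {name}")
    return problems


# A string where the payment tool expects an integer is caught at this step.
print(check_tool_call("execute_refund", {"customer_id": "c_1", "amount_cents": "1500"}))
```

Whether the agent interpreted the tool's response correctly is harder to check with rules and is usually left to a judge model.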
Why Traditional Evals Fail on Agent Trajectories
For business leaders, relying on traditional RAG metrics means buying a false sense of security. You might see 95% accuracy on paper while your actual customer satisfaction (CSAT) plummets because silent failures are occurring where your metrics aren't looking. In 2024 and 2025, the industry standard for AI evaluation relied heavily on frameworks like RAGAS, which calculate metrics such as context precision, context recall, and answer faithfulness. These metrics are designed for static question-answering pipelines and assume a linear flow of data: query in, context retrieved, answer out.
Agentic systems are not linear; they are cyclic, stateful, and highly dynamic. An agent might attempt to use a search tool, realize the search results are insufficient, modify its search query, try again, and then decide to use a different tool entirely. This sequence of decisions, tool calls, and internal reasoning steps is called a trajectory.
Traditional evaluation fails on trajectories for three specific reasons:
First, string-matching and regex-based assertions are too brittle to evaluate reasoning. If you write a test that expects the agent to output a specific thought process before calling a tool, any slight variation in the model's phrasing will cause the test to fail, even if the underlying logic is perfectly sound. This leads to a high volume of false negatives, where developers spend expensive engineering sprints investigating "failed" runs that actually executed correctly.
Second, traditional evaluations cannot catch silent tool-use errors. A silent error occurs when a tool executes successfully from a software perspective (returning an HTTP 200 OK status) but fails from a business logic perspective. For example, an agent might query a database for "contracts signed in Q3" but mistakenly format the date range for Q2. The database successfully returns the Q2 contracts. The agent, lacking the context to realize its mistake, proceeds to summarize the wrong documents. The final output looks highly authoritative, but the data is entirely incorrect. Only by evaluating the exact parameters passed to the tool during that specific step can you catch this failure before it reaches a client.
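For the Q3 example, a step-level check can compare the date range the agent actually sent against the range the request implies. This is a sketch under assumptions: the requested year and quarter are taken as already extracted from the user's message, and `quarter_bounds` and `params_match_intent` are illustrative helper names.

```python
from datetime import date


def quarter_bounds(year: int, quarter: int) -> tuple[date, date]:
    """First and last calendar day of a quarter (1 through 4)."""
    start_month = 3 * (quarter - 1) + 1
    start = date(year, start_month, 1)
    # The last day is the day before the next quarter begins.
    if quarter == 4:
        next_start = date(year + 1, 1, 1)
    else:
        next_start = date(year, start_month + 3, 1)
    end = date.fromordinal(next_start.toordinal() - 1)
    return start, end


def params_match_intent(year: int, quarter: int,
                        start_date: date, end_date: date) -> bool:
    """True if the tool call's date range equals the requested quarter."""
    return (start_date, end_date) == quarter_bounds(year, quarter)


# The user asked for Q3, but the agent queried April through June (Q2).
print(params_match_intent(2024, 3, date(2024, 4, 1), date(2024, 6, 30)))
```

The database call in this scenario would still return HTTP 200, which is why the check has to run on the parameters rather than on the response status.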
Third, unconstrained agents suffer from elevated error rates in tool selection. When agents are given a list of several tools and asked to choose the correct one using free-form reasoning, selection accuracy reliably degrades. Traditional end-of-chain evaluation does not tell you why the agent chose the wrong tool; it only tells you that the final answer was wrong. To fix the system, you need to know if the agent misunderstood the user's intent, misunderstood the tool's description, or simply hallucinated a tool that does not exist.
→ The Cost of 'Vibes-Based' AI: How to Measure and Guarantee LLM Accuracy in Production
Tracing and LLM-as-a-Judge: What Good Looks Like
To move from AI spaghetti to production-grade infrastructure, you must implement comprehensive tracing and intermediate evaluation. This involves two distinct technical practices: capturing the trajectory via observability tools, and evaluating that trajectory using LLM-as-a-judge frameworks.
Implementing this architecture is the difference between blindly guessing why your AI system is leaking enterprise API budget and having a clear, auditable trail of every dollar spent on model calls. It turns your AI team from reactive fire-fighters into proactive system optimizers.
Tracing involves instrumenting your agent orchestrator (such as LangGraph) to log every state transition, LLM call, and tool execution. Using platforms like Langfuse or Weave, you capture the exact inputs and outputs at every node in the graph. If an agent loops three times, the trace shows three distinct tool calls, the latency of each call, the token consumption, and the exact JSON payload returned by the external API.
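To make the shape of a trace concrete, here is a framework-free sketch of what gets recorded per tool call. It deliberately does not use the Langfuse, Weave, or LangGraph APIs; the `Span`, `Trace`, and `traced` names are assumptions for illustration, and a real deployment would rely on the observability platform's own SDK.

```python
import time
from dataclasses import dataclass, field
from functools import wraps
from typing import Any, Callable, Optional


@dataclass
class Span:
    name: str
    inputs: dict
    output: Any = None
    error: Optional[str] = None
    latency_ms: float = 0.0


@dataclass
class Trace:
    spans: list = field(default_factory=list)


def traced(trace: Trace, name: str) -> Callable:
    """Decorator that records inputs, output, error, and latency of each call."""
    def decorator(fn: Callable) -> Callable:
        @wraps(fn)
        def wrapper(**kwargs):
            span = Span(name=name, inputs=dict(kwargs))
            start = time.perf_counter()
            try:
                span.output = fn(**kwargs)
                return span.output
            except Exception as exc:
                span.error = repr(exc)
                raise
            finally:
                span.latency_ms = (time.perf_counter() - start) * 1000
                trace.spans.append(span)
        return wrapper
    return decorator


trace = Trace()


@traced(trace, "lookup_customer")
def lookup_customer(email: str) -> list:
    return []  # stub standing in for a real CRM call


lookup_customer(email="a@example.com")
print(trace.spans[0])
```

An empty list returned by the CRM stub is now visible in the trace alongside the exact input that produced it, which is the information needed to diagnose the failure described in the introduction.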
Once you have the trace, you can apply multi-agent evaluation. Because human review is too slow and string-matching is too brittle, the production standard is LLM-as-a-judge. This involves using a fast, highly constrained language model to grade the intermediate steps of the primary agent.
Instead of asking a judge model "Is this final answer correct?", you configure targeted judge prompts for specific failure modes within the trajectory:
- Tool Selection Accuracy: "Given the user query X and the available tools Y, did the agent select the optimal tool? Answer YES or NO and provide a one-sentence justification."
- Parameter Adherence: "Given the tool schema X, did the agent provide a valid JSON payload that strictly adheres to the required types? Check for missing required fields."
- Loop Termination: "Did the agent successfully recognize that it had gathered enough information to answer the user, or did it execute a redundant tool call?"
By running these lightweight judge models asynchronously over a sample of your production traces, you generate a quantitative dashboard of your agent's internal health. You will immediately see if your agent is struggling with a specific tool schema or if a particular prompt is causing unnecessary recursive loops.
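A targeted judge such as the tool-selection check can be wrapped in a small function that fills the prompt and parses the verdict. In this sketch `call_judge` is a hypothetical callable standing in for whatever model client you use, and the reply format (a leading YES or NO) is an assumption that the prompt has to enforce.

```python
from typing import Callable

TOOL_SELECTION_PROMPT = (
    "Given the user query and the available tools, did the agent select "
    "the optimal tool? Answer YES or NO and provide a one-sentence "
    "justification.\n\n"
    "User query: {query}\n"
    "Available tools: {tools}\n"
    "Tool selected: {selected}"
)


def judge_tool_selection(query: str, tools: list, selected: str,
                         call_judge: Callable[[str], str]) -> tuple:
    """Return (passed, raw_reply) for one tool-selection step."""
    prompt = TOOL_SELECTION_PROMPT.format(
        query=query, tools=", ".join(tools), selected=selected
    )
    reply = call_judge(prompt).strip()
    # Take the first word and ignore trailing punctuation such as "YES." or "NO:".
    first_word = reply.split(maxsplit=1)[0].strip(".,:;-").upper() if reply else ""
    if first_word not in ("YES", "NO"):
        raise ValueError(f"unparseable judge reply: {reply!r}")
    return first_word == "YES", reply


def fake_judge(prompt: str) -> str:
    """Canned reply so the sketch runs without a model call."""
    return "NO. The agent should have called lookup_customer first."


print(judge_tool_selection(
    "Where is my refund?", ["lookup_customer", "execute_refund"],
    "execute_refund", fake_judge,
))
```

Raising on an unparseable reply, rather than silently counting it as a pass or a fail, keeps judge formatting problems from skewing the dashboard.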
The Cost of Production Multi-Agent Evaluation
Business leaders often hesitate to implement LLM-as-a-judge because evaluating AI with more AI sounds expensive. However, when architected correctly, the cost of evaluation is a fraction of a cent per run and serves as essential insurance against costly business logic errors and compounding API costs.
You do not need to use the most expensive frontier models to evaluate intermediate steps. Fast, specialized reasoning models or even fine-tuned smaller models are entirely capable of grading JSON payloads and tool selection accuracy. Furthermore, in a production environment, you do not evaluate every single run. You evaluate a representative random sample (e.g., 5% to 10% of all trajectories) to monitor system health, detect regressions, and catch infinite loops before they run up thousands of dollars in model usage.
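One common way to pick that sample is to hash the trace ID, so the decision is deterministic and every service agrees on which runs are evaluated. This is a sketch of that approach, not something the tracing platforms mentioned here require; `should_evaluate` is an illustrative name.

```python
import hashlib


def should_evaluate(trace_id: str, sample_rate: float) -> bool:
    """Deterministically select roughly `sample_rate` of all trace IDs."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


sampled = sum(should_evaluate(f"trace-{i}", 0.10) for i in range(10_000))
print(f"{sampled} of 10,000 traces selected for judging")
```

Because the same ID always lands in the same bucket, a trace that was judged once can be re-judged after a prompt change and compared like for like.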
| Evaluation Method | Implementation Complexity | Cost Formula (Illustrative Baseline) | Business Value |
|---|---|---|---|
| Manual Log Review | Low | Engineering hourly rate × hours spent debugging | Very low. Unscalable, reactive, and prone to missing silent errors. |
| End-to-End RAG Metrics | Medium | queries × 1k tokens × $0.15/1M tokens | Low for agents. Only measures final output; ignores tool use and looping costs. |
| Full Trajectory Tracing | High | Trace storage costs (e.g., Langfuse tier) | High. Provides complete visibility into agent state, latency, and token burn. |
| Sampled LLM-as-a-Judge | High | (queries × 10% sample) × steps × 500 tokens × $0.15/1M tokens | Maximum. Catches intermediate failures, schema mismatches, and routing errors automatically. |
To put the math into perspective: If your system processes 10,000 multi-step workflows per day, and you sample 10% of those (1,000 runs) for intermediate evaluation, and each run averages 4 steps requiring 500 tokens of context for the judge model, the daily token volume for evaluation is 2,000,000 tokens. Using a fast, cost-effective model priced around $0.15 per million input tokens, your daily evaluation cost is approximately $0.30.
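The same calculation as a function, so you can substitute your own volumes and prices. It counts judge input tokens only, as the paragraph above does; output tokens and trace storage would add to the total. The function name and parameters are illustrative.

```python
def daily_judge_cost(workflows_per_day: int, sample_rate: float,
                     steps_per_run: int, tokens_per_step: int,
                     usd_per_million_tokens: float) -> tuple:
    """Return (daily judge input tokens, daily cost in USD)."""
    sampled_runs = workflows_per_day * sample_rate
    tokens = sampled_runs * steps_per_run * tokens_per_step
    cost = tokens / 1_000_000 * usd_per_million_tokens
    return tokens, cost


tokens, cost = daily_judge_cost(10_000, 0.10, 4, 500, 0.15)
print(f"{tokens:,.0f} tokens per day, ${cost:.2f} per day")
```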
The alternative is allowing an unmonitored agent to enter an infinite loop, burning through dollars of token costs per minute while failing to serve your customer. Production engineering is about spending pennies on evaluation to protect dollars in execution.
→ LangGraph Development: 5 Patterns for Production-Safe Agents
Implementing Guardrails: Semantic Routing over Free-form Choice
Once your multi-agent evaluation framework highlights the failure points in your trajectories, the next step is fixing them. The most common insight teams gain from tracing is that giving an LLM complete autonomy over tool selection is a recipe for instability and uncontrolled operational costs.
When an agent has access to twenty different tools and is instructed to "use your best judgment," the error rate skyrockets. The model will inevitably confuse tools with similar names, hallucinate parameters, or string together inefficient sequences of actions. This unconstrained routing is a primary source of unpredictable agent behavior.
Verel Systems solves this by moving away from free-form agent loops and toward highly structured, stateful graphs using frameworks like LangGraph. Instead of relying on the LLM to guess the next step, we use semantic routing to constrain the agent's path.
By restricting the agent's pathways, you don't just improve accuracy; you protect your API budget. A constrained agent that solves a problem in 2 steps instead of wandering through 10 makes 80% fewer model calls for that task, and operational LLM costs fall roughly in proportion to the steps you eliminate.
Semantic routing involves analyzing the user's intent upfront and mapping it to a specific, deterministic workflow. If the user asks about a refund, the system routes them to the Refund Sub-Graph. Inside that sub-graph, the agent only has access to the two tools strictly necessary for refunds. It cannot accidentally call the marketing database or the HR directory.
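The routing idea can be sketched without any framework: classify the intent first, then expose only that route's tools. The keyword classifier below is a stand-in for an embedding-based or LLM-based classifier, and the route names and tool names are hypothetical.

```python
# Hypothetical routes: each intent maps to the only tools its sub-graph may call.
ROUTES = {
    "refund": ["lookup_customer", "execute_refund"],
    "billing_question": ["lookup_invoice"],
}
FALLBACK_ROUTE = "human_handoff"


def classify_intent(message: str) -> str:
    """Keyword stub standing in for a real semantic intent classifier."""
    text = message.lower()
    if "refund" in text:
        return "refund"
    if "invoice" in text or "charge" in text:
        return "billing_question"
    return FALLBACK_ROUTE


def allowed_tools(message: str) -> tuple:
    """Return (route, tools the agent is allowed to call on that route)."""
    intent = classify_intent(message)
    # Unknown intents get no tools at all rather than the full toolset.
    return intent, ROUTES.get(intent, [])


print(allowed_tools("I want a refund for order 12"))
print(allowed_tools("What is your favorite color?"))
```

The design choice worth noting is the fallback: an unrecognized intent receives an empty tool list, so a misclassification degrades to a handoff instead of an arbitrary tool call.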
By constraining the state space, you drastically reduce the probability of the agent making a wrong turn. The trajectory becomes predictable. When you apply LLM-as-a-judge to a constrained graph, your evaluation scores will immediately improve because the agent is no longer wasting tokens trying to navigate a massive, unstructured toolset.
Building production AI is not about giving models maximum freedom; it is about giving them maximum context within strict, verifiable boundaries. Tracing and evaluating trajectories is how you verify those boundaries are holding.
→ Beyond Vibe Checks: CI/CD Pipeline Architecture for Multi-Agent Systems
Frequently Asked Questions
Q: What is the business ROI of implementing a trajectory evaluation framework instead of just relying on manual QA? The ROI is realized in two main areas: engineering efficiency and customer retention. Manual QA of agentic workflows is virtually impossible at scale because of the sheer number of path permutations. By automating trajectory tracing, you reduce the time engineering teams spend debugging silent failures from days to minutes. More importantly, it prevents silent failures from reaching production, protecting your brand reputation and preventing customer churn caused by hallucinated or incorrect agent behavior.
Q: How much latency does trajectory tracing and evaluation add to the user experience? Tracing adds effectively zero latency to the user experience because the telemetry data is sent to the observability platform (like Langfuse) asynchronously in the background. LLM-as-a-judge evaluations are also run asynchronously after the execution is complete. The user receives their answer immediately; the evaluation happens offline to monitor system health.
Q: Can we use the same LLM that powers the agent to act as the judge? You can, but it is generally better practice to use a different model family for evaluation to reduce self-preference bias: a model will often agree with its own prior reasoning. Using a separate, highly constrained model specifically prompted for grading provides a more objective evaluation of the trajectory.
Q: What do we do when the LLM judge flags an intermediate step as a failure, but the final answer was correct? This is a critical signal. It usually indicates that your agent is self-correcting inefficiently. For example, it might be calling a tool with bad parameters, receiving an error, and then trying again. While the final answer is correct, the agent is adding latency and burning tokens on every retry. You use this evaluation data to refine your tool descriptions or adjust your system prompts so the agent gets it right on the first attempt.
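Retries of this kind can also be counted directly from a trace. The sketch below assumes each step is recorded as a dictionary with a `tool` name and an `ok` flag, which is an illustrative format rather than any platform's actual schema.

```python
def self_corrections(steps: list) -> dict:
    """Map each tool to the number of failed attempts before it eventually succeeded."""
    pending_failures = {}
    corrected = {}
    for step in steps:
        tool = step["tool"]
        if step["ok"]:
            # A success after earlier failures means the agent self-corrected.
            if pending_failures.get(tool):
                corrected[tool] = pending_failures.pop(tool)
        else:
            pending_failures[tool] = pending_failures.get(tool, 0) + 1
    return corrected


run = [
    {"tool": "execute_refund", "ok": False},
    {"tool": "execute_refund", "ok": False},
    {"tool": "execute_refund", "ok": True},
    {"tool": "send_email", "ok": True},
]
print(self_corrections(run))
```

Tools that show up here repeatedly are good candidates for a clearer description or a stricter schema.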
Q: Why can't we just write standard unit tests for our agent tools? You absolutely should write standard software unit tests for the underlying APIs and Python functions your tools execute. However, software unit tests only verify that the tool works when given the correct inputs. They do not verify that the agent understands when to use the tool, or that the agent is capable of generating those correct inputs dynamically based on a conversational context. Multi-agent evaluation tests the AI's decision-making layer, not just the software execution layer.
