
Score an agent’s tool-use trajectory, not the last message

Tool-using agents are control loops. Score the trajectory, not the final paragraph.

We keep shipping agents that look correct in the last message and behave incorrectly in the loop.

If we score only the final response, we measure narration. We do not measure whether the agent called the right tools, used the right scope, retried sanely, and stopped for the right reason.

Output-only evaluation assumes a single-shot function: input → output → score. That model holds when the output is the task. It breaks when the system calls tools, handles errors, retries, and decides when to stop, because correctness moves into intermediate decisions.

A tool-using agent is a control system. Tool choice, argument construction, retry behaviour, and stop conditions are the decisions that cause real-world side effects. A polished answer can sit on top of a broken execution.

Failure modes are boring and expensive. We get blind retries, skipped verification, tools called with the wrong scope, and errors swallowed and replaced with plausible text. Output-level scoring misses these because the prose still reads clean.

The unit we care about is the execution trajectory: the sequence of tool calls, arguments, ordering, repetition, and stopping condition. Evaluation becomes constraint checking over that sequence. The final response is downstream evidence, not the target.
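One way to make that unit concrete is a small record type. This is a minimal sketch, not a standard format: the names `ToolCall`, `Trajectory`, `ok`, and `stop_reason` are assumptions chosen for illustration, as is the `read_file` tool in the example.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    """One step in the loop: which tool, with which arguments, and whether it succeeded."""
    tool: str
    args: dict[str, Any]
    ok: bool = True


@dataclass
class Trajectory:
    """Ordered tool calls plus the reason the agent stopped (hypothetical labels)."""
    calls: list[ToolCall] = field(default_factory=list)
    stop_reason: str = "unknown"  # e.g. "answered", "max_steps", "error"

    def tools_in_order(self) -> list[str]:
        # Ordering matters, so keep the sequence rather than a set.
        return [call.tool for call in self.calls]

    def repeat_counts(self) -> dict[str, int]:
        # How many times each tool was called, for repetition checks.
        counts: dict[str, int] = {}
        for call in self.calls:
            counts[call.tool] = counts.get(call.tool, 0) + 1
        return counts


example = Trajectory(
    calls=[
        ToolCall("search", {"query": "validateToken"}),
        ToolCall("read_file", {"path": "auth.ts"}),
    ],
    stop_reason="answered",
)
```

Everything the constraints need (calls, arguments, order, repetition, stop reason) is then available from one object, independent of the final response text.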

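The fix is mechanical. Run the real agent loop against a deterministic environment, capture the full trajectory, and assert on it. Back the tools with fixtures so the world is repeatable, but expose the real tool schemas to the model so it constructs the same calls it would in production. ToolSandbox is one open-source example of a stateful tool environment built for this kind of evaluation.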
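The capture step can be sketched as follows. This is an illustration under stated assumptions, not ToolSandbox's API: `FIXTURES`, `run_agent`, and `scripted_policy` are hypothetical names, and the scripted policy stands in for the real model call so the example runs on its own.

```python
from typing import Any, Callable, Optional

# Canned results keyed by (tool, query), so every run sees the same world.
FIXTURES: dict[tuple[str, str], list[str]] = {
    ("search", "validateToken"): ["auth.ts"],
}


def search(query: str) -> list[str]:
    """Fixture-backed tool: no network, no clock, no randomness."""
    return FIXTURES.get(("search", query), [])


TOOLS: dict[str, Callable[..., Any]] = {"search": search}

Action = Optional[tuple[str, dict[str, Any]]]


def scripted_policy(history: list[dict[str, Any]]) -> Action:
    """Stand-in for the model: returns the next (tool, args), or None to stop."""
    if not history:
        return ("search", {"query": "validateToken"})
    return None


def run_agent(policy: Callable[[list[dict[str, Any]]], Action],
              tools: dict[str, Callable[..., Any]],
              max_steps: int = 8) -> tuple[list[dict[str, Any]], str]:
    """Run the loop and record every call, including failed ones."""
    trajectory: list[dict[str, Any]] = []
    for _ in range(max_steps):
        action = policy(trajectory)
        if action is None:
            return trajectory, "answered"
        tool, args = action
        try:
            result, ok = tools[tool](**args), True
        except Exception as exc:  # record the error instead of swallowing it
            result, ok = repr(exc), False
        trajectory.append({"tool": tool, "args": args, "result": result, "ok": ok})
    return trajectory, "max_steps"


trajectory, stop_reason = run_agent(scripted_policy, TOOLS)
```

In a real suite the policy would wrap the model under test; the loop, the fixtures, and the recorded trajectory stay the same.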

Trajectory constraints are the contract (see AgentLens and ToolSandbox). For example: must_call: ["search"], max_tool_calls: 4, max_repeated_tool: 2, forbidden_tools: ["web_research"]. If the trajectory violates any constraint, the run fails, however polished the output looks.
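A checker for those four keys might look like the sketch below. The key names come from the contract above; their exact semantics are assumptions here, in particular that max_repeated_tool caps how many times any single tool may be called. `check_trajectory` is a hypothetical helper, not a function from AgentLens or ToolSandbox.

```python
from collections import Counter

CONTRACT = {
    "must_call": ["search"],
    "max_tool_calls": 4,
    "max_repeated_tool": 2,
    "forbidden_tools": ["web_research"],
}


def check_trajectory(tool_names: list[str], contract: dict) -> list[str]:
    """Return a list of violations; an empty list means the trajectory passes."""
    violations: list[str] = []
    counts = Counter(tool_names)

    # Every required tool must appear at least once.
    for tool in contract.get("must_call", []):
        if counts[tool] == 0:
            violations.append(f"must_call: '{tool}' was never called")

    # Cap on the total number of tool calls in the run.
    max_calls = contract.get("max_tool_calls")
    if max_calls is not None and len(tool_names) > max_calls:
        violations.append(f"max_tool_calls: {len(tool_names)} > {max_calls}")

    # Assumed meaning: no single tool may be called more than this many times.
    max_repeat = contract.get("max_repeated_tool")
    if max_repeat is not None:
        for tool, n in counts.items():
            if n > max_repeat:
                violations.append(f"max_repeated_tool: '{tool}' called {n} times")

    # Forbidden tools must not appear at all.
    for tool in contract.get("forbidden_tools", []):
        if counts[tool] > 0:
            violations.append(f"forbidden_tools: '{tool}' was called")

    return violations
```

Returning every violation, rather than failing on the first, makes a failed run easier to diagnose.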

Output constraints verify downstream evidence. For example: must_contain_any: ["validateToken", "auth.ts"], must_not_contain: ["best practices"]. These check that the right execution produced the right artefacts, not that the agent can write a convincing explanation.
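The output side can be checked the same way. This sketch assumes plain case-insensitive substring matching, which the constraint names do not specify; `check_output` is a hypothetical helper.

```python
OUTPUT_CONTRACT = {
    "must_contain_any": ["validateToken", "auth.ts"],
    "must_not_contain": ["best practices"],
}


def check_output(text: str, contract: dict) -> list[str]:
    """Return output-level violations; an empty list means the response passes."""
    violations: list[str] = []
    lowered = text.lower()  # assumption: matching ignores case

    # At least one expected artefact must be mentioned.
    any_of = contract.get("must_contain_any", [])
    if any_of and not any(item.lower() in lowered for item in any_of):
        violations.append(f"must_contain_any: none of {any_of} found")

    # Filler phrases that suggest narration without evidence.
    for phrase in contract.get("must_not_contain", []):
        if phrase.lower() in lowered:
            violations.append(f"must_not_contain: found '{phrase}'")

    return violations
```

A response such as "Follow best practices for auth." fails both checks here: it names no artefact and contains a banned phrase.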

Once we have trajectory constraints, model selection becomes a cost-minimisation problem. Order candidates cheapest-first, run the suite, and stop at the first model that passes. We buy the cheapest model that satisfies the contract, which is what matters when we pay per token at scale.
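The selection loop is short enough to sketch in full. The model names, prices, and the `passes_suite` callback below are all hypothetical; in practice the callback would run the trajectory and output checks against each model.

```python
from typing import Callable, Optional


def cheapest_passing(candidates: list[tuple[str, float]],
                     passes_suite: Callable[[str], bool]) -> Optional[str]:
    """Try models in ascending cost order and return the first that passes."""
    for name, _cost in sorted(candidates, key=lambda c: c[1]):
        if passes_suite(name):
            return name  # stop early: pricier models are never evaluated
    return None  # no candidate satisfied the contract


# Hypothetical candidates as (name, cost per million tokens).
CANDIDATES = [("large", 10.0), ("small", 0.5), ("medium", 2.0)]
PASSING = {"medium", "large"}  # stand-in for real suite results

chosen = cheapest_passing(CANDIDATES, lambda name: name in PASSING)
```

Stopping at the first pass also saves evaluation cost, since the most expensive models are only run when the cheaper ones fail.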

Trajectory evaluation does not replace output scoring. It catches failures that output scoring cannot see, which is where most production breakage lives.


References

  • AgentLens, "Revealing The Lucky Pass Problem in SWE-Agent Evaluation" (2025). Quantified the divergence between outcome scoring and process-level trajectory validity. arxiv.org/abs/2605.12925
  • ToolSandbox (Apple, NAACL 2025). Open-source stateful tool execution benchmark with milestone evaluation over arbitrary trajectories. github.com/apple/ToolSandbox