Demystifying evals for AI agents \ Anthropic

English

Anthropic emphasizes that evaluating AI agents is a complex but crucial process, requiring a different approach than traditional single-turn language model evaluations due to agents' multi-step behavior, tool use, and potential for unexpected solutions. In their guide, "Demystifying evals for AI agents," they outline a structured methodology for effective agent evaluation.

The Nature of AI Agent Evaluations An evaluation, or "eval," is essentially a test for an AI system where an input is provided, and grading logic is applied to its output to measure success. Unlike earlier, non-agentic LLM evaluations which were typically single-turn, agent evaluations often involve multi-turn interactions where an agent executes a series of actions within an environment. The core challenge lies in the non-determinism of agents (identical inputs can lead to varied execution paths) and the cascading effect of small deviations in multi-turn interactions.

Key Components and Strategies:

Evaluation Structure: An eval involves giving an AI an input and then applying grading logic to its output to measure success. It's crucial to distinguish between the transcript (the full multi-step interaction and tool calls) and the outcome (the final state in the environment). For instance, an agent claiming "ticket booked!" means little if database records don't confirm it.

Types of Graders: Anthropic recommends combining three types of graders to match the complexity of the systems being measured: Code-based graders: These are fast, cheap, and objective, often using exact string matches, regex, or static analysis, but can be brittle to valid variations in output. They are effective for verifiable aspects and tool-call verification. Model-based graders: These utilize LLMs as judges with rubrics, natural-language assertions, or pairwise comparisons. They are flexible and can handle nuance but are non-deterministic and require careful calibration. Human graders: Providing the highest fidelity, human graders (SMEs, crowd workers) are the gold standard for calibration, but they are also slow and expensive.

Addressing Non-Determinism: To account for the inherent non-determinism of AI agents, Anthropic uses reliability metrics like pass@k (at least one success in k trials) for human-supervised tools and pass^k (all successes in k trials) for automation-ready reliability.

Eval-Driven Development: A central tenet is to "start early and don't wait for the perfect suite". This involves defining capabilities through evaluations first, and then iterating on the agent until it meets those standards. Evals should evolve; once an agent masters a capability eval, it "graduates" to a regression suite to prevent complacency.

Sourcing Realistic Tasks: Evals should be based on realistic tasks derived from observed failures in real-world scenarios. This helps ensure the evaluations are truly representative of how the agent will perform in practice.

Iterative Refinement: Evaluations should be continuously iterated upon to improve their signal-to-noise ratio. It's also critical to review transcripts of agent interactions, as sometimes a "failure" might actually be a valid or even superior solution that the initial grading logic didn't anticipate.

Ultimately, Anthropic argues that effective agent evaluations are less about generic benchmarks and more about carefully defining what "working" truly means in practice for specific tasks and iteratively refining both the agent and its evaluation criteria.

中文

揭秘 AI Agent 的评估方法 \ Anthropic

Anthropic 强调，评估 AI Agent 是一个复杂但至关重要的过程，需要与传统单轮语言模型评估不同的方法，因为 Agent 具有多步行为、工具使用和可能产生意外解决方案的特点。在他们撰写的《揭秘 AI Agent 的评估方法》指南中，概述了有效的 Agent 评估的结构化方法。

AI Agent 评估的本质评估（"eval"）本质上是为 AI 系统设计的测试，提供输入并应用评分逻辑来衡量输出结果的成功与否。与早期非 Agent 的 LLM 评估通常为单轮交互不同，Agent 评估通常涉及多轮交互，其中 Agent 在环境中执行一系列动作。核心挑战在于 Agent 的非确定性（相同的输入可能导致不同的执行路径）以及多轮交互中小偏差的级联效应。

关键组件和策略：

评估结构：评估涉及为 AI 提供输入，然后应用评分逻辑来衡量输出结果的成功与否。关键是要区分记录（完整的多步交互和工具调用）和结果（环境中的最终状态）。例如，Agent 声称"票已预订！"但如果数据库记录没有确认，则意义不大。

评分器类型：Anthropic 建议结合使用三种评分器类型来匹配被测量系统的复杂性：代码评分器：快速、廉价且客观，通常使用精确字符串匹配、正则表达式或静态分析，但对输出的有效变化可能比较脆弱。它们在可验证方面和工具调用验证方面非常有效。模型评分器：利用 LLM 作为评分者，使用评分标准、自然语言断言或成对比较。它们灵活且能处理细微差别，但具有非确定性，需要仔细校准。人工评分器：提供最高保真度的人工评分者（主题专家、众包工作者）是校准的黄金标准，但速度慢且成本高。

解决非确定性：为了应对 AI Agent 固有的非确定性，Anthropic 使用可靠性指标，如 pass@k（在 k 次试验中至少有一次成功）用于人工监督的工具，以及 pass^k（在 k 次试验中全部成功）用于自动化就绪的可靠性。

评估驱动开发：核心原则是"尽早开始，不要等待完美的测试套件"。这包括首先通过评估定义能力，然后迭代改进 Agent 直到达到标准。评估应该不断进化；一旦 Agent 掌握了能力评估，就"毕业"到回归测试套件以防止自满。

获取真实任务：评估应基于从现实场景观察到的失败中得出的真实任务。这有助于确保评估真正代表 Agent 在实际中的表现。

迭代改进：应持续迭代评估以改善信噪比。同样关键的是审查 Agent 交互的记录，因为有时"失败"实际上可能是有效的甚至是更好的解决方案，而初始评分逻辑没有预料到这一点。

总之，Anthropic 认为，有效的 Agent 评估较少关注通用基准，更多关注仔细定义"工作"在特定任务实践中的真正含义，并迭代改进 Agent 及其评估标准。

English

中文

继续阅读