Interactive Evaluation Requires a Design Science

Abstract

AI evaluation is undergoing a structural change. Large language models (LLMs) are increasingly deployed as systems that act over time through tools, environments, users, and other agents, yet many evaluation practices still inherit assumptions from response-centered benchmarks (e.g., fixed inputs, isolated outputs, and outcome judgments that can be made from a single response). The field has begun to build interactive benchmarks, but the resulting landscape is fragmented: benchmarks differ in which interaction artifacts they admit, how they score trajectories, and what claims their results support. This position paper argues that interactive evaluation should be treated as a principled evaluation paradigm, not merely as a new family of agent benchmarks. Carrying earlier evaluation paradigms over unchanged does not suffice. We define evaluation as an autonomous mapping from evidence to judgments, and show that interactive evaluation changes both sides of this mapping: the evidence becomes interaction-generated trajectories, while the evaluation procedure must assess process, recoverability, coordination, robustness, and system-level performance. Building on this definition, we propose a two-axis taxonomy, derive design principles and reporting standards, examine representative scenarios, and analyze how longstanding evaluation challenges reappear at the trajectory level.
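To make the "mapping from evidence to judgments" framing concrete, the following is a minimal illustrative sketch, not the paper's formalism. It contrasts a response-centered judge, whose evidence is a single output, with a trajectory-level judge, whose evidence is a sequence of steps. All names (`Step`, `judge_response`, `judge_trajectory`) and the crude "recovered" criterion are hypothetical assumptions introduced here for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Step:
    """One step of an interaction trajectory (hypothetical schema)."""
    action: str  # label for what the system did
    ok: bool     # whether this step succeeded


def judge_response(output: str, expected: str) -> bool:
    """Response-centered evaluation: the evidence is one isolated output."""
    return output.strip() == expected.strip()


def judge_trajectory(steps: List[Step], final_ok: bool) -> Dict[str, object]:
    """Interactive evaluation: the evidence is the whole trajectory.

    Besides the final outcome, the judgment records process-level
    signals. "recovered" is an assumed, deliberately crude proxy:
    at least one step failed and the run still ended successfully.
    """
    failed = sum(1 for s in steps if not s.ok)
    return {
        "outcome": final_ok,
        "num_steps": len(steps),
        "failed_steps": failed,
        "recovered": failed > 0 and final_ok,
    }


trajectory = [
    Step("search", True),
    Step("call_tool", False),
    Step("retry_tool", True),
]
judgment = judge_trajectory(trajectory, final_ok=True)
```

In this sketch, two runs with the same final outcome can receive different judgments because the evidence differs, which is the point of scoring at the trajectory level rather than from a single response.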