Interactive Evaluation Requires a Design Science

Abstract

AI evaluation is undergoing a structural change. Large language models (LLMs) are increasingly deployed as systems that act over time through tools, environments, users, and other agents, yet many evaluation practices still inherit assumptions from response-centered benchmarks (e.g., fixed inputs, isolated outputs, and outcome judgments that can be made from a single response). The field has begun to build interactive benchmarks, but the resulting landscape is fragmented: benchmarks differ in what interaction artifacts they admit, how trajectories are scored, and what claims their results support. This position paper argues that interactive evaluation should be treated as a principled evaluation paradigm, not merely a new family of agent benchmarks. Simply adopting previous evaluation paradigms does not suffice. We define evaluation as an autonomous mapping from evidence to judgments, and show that interactive evaluation changes both sides of this mapping: the evidence becomes interaction-generated trajectories, while the evaluation procedure must assess process, recoverability, coordination, robustness, and system-level performance. Building on this definition, we propose a two-axis taxonomy, derive design principles and reporting standards, examine representative scenarios, and analyze how longstanding evaluation challenges reappear at the trajectory level.