Agent-as-a-Judge: Evaluate Agents with Agents

Abstract

Contemporary evaluation techniques are inadequate for agentic systems. These approaches either focus exclusively on final outcomes, ignoring the step-by-step nature of agentic systems, or require excessive manual labour. To address this, we introduce the Agent-as-a-Judge framework, wherein agentic systems are used to evaluate agentic systems. This is an organic extension of the LLM-as-a-Judge framework, incorporating agentic features that enable intermediate feedback for the entire task-solving process. We apply Agent-as-a-Judge to the task of code generation. To overcome issues with existing benchmarks and provide a proof-of-concept testbed for Agent-as-a-Judge, we present DevAI, a new benchmark of 55 realistic automated AI development tasks. It comes with rich manual annotations, including a total of 365 hierarchical user requirements. We benchmark three popular agentic systems using Agent-as-a-Judge and find that it dramatically outperforms LLM-as-a-Judge and is as reliable as our human evaluation baseline. Altogether, we believe that Agent-as-a-Judge marks a concrete step forward for modern agentic systems by providing the rich and reliable reward signals necessary for dynamic and scalable self-improvement.
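
To make the idea of requirement-level (rather than outcome-only) evaluation concrete, here is a minimal Python sketch. It is not the DevAI format or the Agent-as-a-Judge implementation: the `Requirement` schema, the reading of "hierarchical" as prerequisite dependencies, the helper names, and the toy task are all assumptions made for illustration. It shows how per-requirement verdicts could be scored so that a requirement only counts when its prerequisites are also met, and how a judge's verdicts might be compared against a human baseline.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Requirement:
    """One user requirement (hypothetical schema, not the DevAI format)."""
    rid: str
    text: str
    # Ids of prerequisite requirements; an assumed reading of "hierarchical".
    depends_on: List[str] = field(default_factory=list)


def satisfied_with_dependencies(
    requirements: List[Requirement], verdicts: Dict[str, bool]
) -> Set[str]:
    """Return ids judged satisfied whose prerequisites are all satisfied too.

    `verdicts` maps requirement id -> bool, as any judge (human, LLM, or
    agent) might report it. Requirements are assumed to be listed so that
    every prerequisite appears before the requirements that depend on it.
    """
    ok: Set[str] = set()
    for req in requirements:
        if verdicts.get(req.rid, False) and all(d in ok for d in req.depends_on):
            ok.add(req.rid)
    return ok


def agreement_rate(judge: Dict[str, bool], reference: Dict[str, bool]) -> float:
    """Fraction of shared requirements on which two sets of verdicts agree."""
    shared = judge.keys() & reference.keys()
    if not shared:
        raise ValueError("no requirements in common")
    return sum(judge[r] == reference[r] for r in shared) / len(shared)


# Toy task with three requirements (invented for illustration only).
reqs = [
    Requirement("R0", "Load the dataset"),
    Requirement("R1", "Train a classifier", depends_on=["R0"]),
    Requirement("R2", "Save accuracy to a file", depends_on=["R1"]),
]
agent_judge = {"R0": True, "R1": False, "R2": True}
human = {"R0": True, "R1": False, "R2": False}

print(sorted(satisfied_with_dependencies(reqs, agent_judge)))
print(agreement_rate(agent_judge, human))
```

In this toy case an outcome-only check would see just the final artifact, whereas the requirement-level view also records that the intermediate training step failed, so the later requirement is not credited.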