AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation

Abstract

As reinforcement learning continues to scale the training of large language model-based agents, reliably verifying agent behavior in complex environments has become increasingly challenging. Existing approaches rely on rule-based verifiers or LLM-as-a-Judge models, which struggle to generalize beyond narrow domains. Agent-as-a-Judge addresses this limitation by actively interacting with environments and tools to acquire verifiable evidence, yet its capabilities remain underexplored. We introduce AJ-Bench, a benchmark for systematically evaluating Agent-as-a-Judge across three domains (search, data systems, and graphical user interfaces), comprising 155 tasks and 516 annotated trajectories. The benchmark comprehensively assesses judge agents' abilities in information acquisition, state verification, and process verification. Experiments demonstrate consistent performance gains over LLM-as-a-Judge baselines, while also revealing substantial open challenges in agent-based verification. Our data and code are available at https://aj-bench.github.io/.
