AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories

Abstract

Web agents enable users to perform tasks in a web browser through natural language interaction. Evaluating web agent trajectories is an important problem, since it determines whether an agent successfully completed its task. Rule-based methods are widely used for this purpose, but they are difficult to extend to new tasks and do not always recognize successful trajectories. Human evaluation may achieve higher accuracy, but it is substantially slower and more expensive. Automatic evaluation with LLMs may avoid both the need to design new rules and the need to annotate trajectories manually, enabling faster and more cost-effective evaluation. However, it is unclear how effective LLMs are at evaluating web agents. To this end, we propose AgentRewardBench, the first benchmark to assess the effectiveness of LLM judges for evaluating web agents. AgentRewardBench contains 1302 trajectories across 5 benchmarks and 4 LLMs. Each trajectory in AgentRewardBench is reviewed by an expert, who answers questions about the agent's success, side effects, and repetitiveness. Using our benchmark, we evaluate 12 LLM judges and find that no single LLM excels across all benchmarks. We also find that the rule-based evaluation used by common benchmarks tends to underreport the success rate of web agents, which highlights a key weakness of rule-based evaluation and the need for more flexible automatic evaluation. We release the benchmark at: https://agent-reward-bench.github.io
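The comparison described above, where expert annotations serve as the reference for both rule-based verdicts and LLM judge verdicts, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Trajectory` record, its field names, the toy labels, and the choice of precision and recall as agreement measures are hypothetical and are not taken from the benchmark's released code or data.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """Hypothetical record: one agent trajectory with three success verdicts."""
    expert_success: bool  # expert annotation, treated here as the reference label
    rule_success: bool    # verdict of a benchmark's rule-based evaluator
    judge_success: bool   # verdict of an LLM judge


def success_rate(labels):
    """Fraction of trajectories marked successful."""
    labels = list(labels)
    return sum(labels) / len(labels)


def precision_recall(predicted, actual):
    """Precision and recall of predicted success against reference labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(p and a for p, a in pairs)
    fp = sum(p and not a for p, a in pairs)
    fn = sum(a and not p for p, a in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy labels invented for this example (not benchmark data).
trajectories = [
    Trajectory(expert_success=True, rule_success=True, judge_success=True),
    Trajectory(expert_success=True, rule_success=False, judge_success=True),
    Trajectory(expert_success=True, rule_success=False, judge_success=False),
    Trajectory(expert_success=False, rule_success=False, judge_success=True),
    Trajectory(expert_success=False, rule_success=False, judge_success=False),
]

expert = [t.expert_success for t in trajectories]
rule = [t.rule_success for t in trajectories]
judge = [t.judge_success for t in trajectories]

expert_rate = success_rate(expert)
rule_rate = success_rate(rule)
rule_pr = precision_recall(rule, expert)
judge_pr = precision_recall(judge, expert)

print(f"expert success rate: {expert_rate:.2f}")
print(f"rule-based success rate: {rule_rate:.2f}")
print(f"rule-based precision/recall: {rule_pr[0]:.2f} / {rule_pr[1]:.2f}")
print(f"LLM judge precision/recall: {judge_pr[0]:.2f} / {judge_pr[1]:.2f}")
```

In this toy data the rule-based evaluator never marks a failed trajectory as successful but misses two of the three successes the expert found, so its reported success rate falls below the expert's. That is the kind of underreporting the abstract refers to, shown here only on made-up labels.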
