AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories

Abstract

Web agents enable users to perform tasks in web browsers through natural language interaction. Evaluating web agent trajectories is an important problem, since it determines whether the agent successfully completed its tasks. Rule-based methods are widely used for this purpose, but they are difficult to extend to new tasks and do not always recognize successful trajectories. Human evaluation can achieve higher accuracy, but it is substantially slower and more expensive. Automatic evaluation with large language models (LLMs) avoids the need to design new rules and manually annotate trajectories, enabling faster and more cost-effective evaluation. However, it is unclear how effective LLMs are at evaluating web agents. To this end, we propose AgentRewardBench, the first benchmark to assess the effectiveness of LLM judges for evaluating web agents. AgentRewardBench contains 1302 trajectories across 5 benchmarks and 4 LLMs. Each trajectory in AgentRewardBench is reviewed by an expert, who answers questions pertaining to the success, side effects, and repetitiveness of the agent. Using our benchmark, we evaluate 12 LLM judges and find that no single LLM excels across all benchmarks. We also find that the rule-based evaluation used by common benchmarks tends to underreport the success rate of web agents, highlighting a key weakness of rule-based evaluation and the need to develop more flexible automatic evaluations. We release the benchmark at: https://agent-reward-bench.github.io
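As a rough illustration of the comparison described above, the following minimal sketch scores a judge against expert annotations and contrasts success rates under different evaluators. It assumes binary success labels and uses precision and recall purely for illustration; the abstract does not state which metrics the benchmark reports. The `Trajectory` class, its field names, and the toy labels are all hypothetical and are not taken from the released benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    """Hypothetical record; field names are assumptions, not the benchmark schema."""
    expert_success: bool  # expert annotation, treated here as ground truth
    judge_success: bool   # verdict from an LLM judge
    rule_success: bool    # verdict from a rule-based evaluator


def precision(preds, golds):
    """Fraction of predicted successes that the expert also marked as successes."""
    true_pos = sum(1 for p, g in zip(preds, golds) if p and g)
    predicted_pos = sum(1 for p in preds if p)
    return true_pos / predicted_pos if predicted_pos else 0.0


def recall(preds, golds):
    """Fraction of expert-marked successes that the evaluator recognized."""
    true_pos = sum(1 for p, g in zip(preds, golds) if p and g)
    actual_pos = sum(1 for g in golds if g)
    return true_pos / actual_pos if actual_pos else 0.0


def success_rate(labels):
    """Share of trajectories labeled successful."""
    return sum(1 for x in labels if x) / len(labels) if labels else 0.0


# Toy labels invented for this example only; they are not benchmark data.
toy = [
    Trajectory(expert_success=True, judge_success=True, rule_success=False),
    Trajectory(expert_success=True, judge_success=True, rule_success=True),
    Trajectory(expert_success=False, judge_success=True, rule_success=False),
    Trajectory(expert_success=True, judge_success=False, rule_success=False),
    Trajectory(expert_success=False, judge_success=False, rule_success=False),
]

expert = [t.expert_success for t in toy]
judge = [t.judge_success for t in toy]
rule = [t.rule_success for t in toy]

print(f"judge precision: {precision(judge, expert):.2f}")
print(f"judge recall:    {recall(judge, expert):.2f}")
print(f"rule recall:     {recall(rule, expert):.2f}")
print(f"success rate (expert): {success_rate(expert):.2f}")
print(f"success rate (rules):  {success_rate(rule):.2f}")
```

In this toy data the rule-based evaluator never flags a failed trajectory as successful but misses two of the three expert-confirmed successes, so its reported success rate falls below the expert's. That is the kind of underreporting the abstract attributes to rule-based evaluation.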
