
AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?

July 22, 2024
Authors: Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, Jonathan Berant
cs.AI

Abstract

Language agents, built on top of language models (LMs), are systems that can interact with complex environments, such as the open web. In this work, we examine whether such agents can perform realistic and time-consuming tasks on the web, e.g., monitoring real-estate markets or locating relevant nearby businesses. We introduce AssistantBench, a challenging new benchmark consisting of 214 realistic tasks that can be automatically evaluated, covering different scenarios and domains. We find that AssistantBench exposes the limitations of current systems, including language models and retrieval-augmented language models, as no model reaches an accuracy of more than 25 points. While closed-book LMs perform well, they exhibit low precision since they tend to hallucinate facts. State-of-the-art web agents reach a score of near zero. Additionally, we introduce SeePlanAct (SPA), a new web agent that significantly outperforms previous agents, and an ensemble of SPA and closed-book models reaches the best overall performance. Moreover, we analyze failures of current systems and highlight that web navigation remains a major challenge.
