AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?

July 22, 2024
Authors: Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, Jonathan Berant
cs.AI

Abstract

Language agents, built on top of language models (LMs), are systems that can interact with complex environments, such as the open web. In this work, we examine whether such agents can perform realistic and time-consuming tasks on the web, e.g., monitoring real-estate markets or locating relevant nearby businesses. We introduce AssistantBench, a challenging new benchmark consisting of 214 realistic tasks that can be automatically evaluated, covering different scenarios and domains. We find that AssistantBench exposes the limitations of current systems, including language models and retrieval-augmented language models, as no model reaches an accuracy of more than 25 points. While closed-book LMs perform well, they exhibit low precision since they tend to hallucinate facts. State-of-the-art web agents reach a score of near zero. Additionally, we introduce SeePlanAct (SPA), a new web agent that significantly outperforms previous agents, and an ensemble of SPA and closed-book models reaches the best overall performance. Moreover, we analyze failures of current systems and highlight that web navigation remains a major challenge.
