WebArena: A Realistic Web Environment for Building Autonomous Agents
July 25, 2023
Authors: Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Yonatan Bisk, Daniel Fried, Uri Alon, Graham Neubig
cs.AI
Abstract
With generative AI advances, the exciting potential for autonomous agents to
manage daily tasks via natural language commands has emerged. However, current
agents are primarily created and tested in simplified synthetic environments,
substantially limiting real-world scenario representation. In this paper, we
build an environment for agent command and control that is highly realistic and
reproducible. Specifically, we focus on agents that perform tasks on websites,
and we create an environment with fully functional websites from four common
domains: e-commerce, social forum discussions, collaborative software
development, and content management. Our environment is enriched with tools
(e.g., a map) and external knowledge bases (e.g., user manuals) to encourage
human-like task-solving. Building upon our environment, we release a set of
benchmark tasks focusing on evaluating the functional correctness of task
completions. The tasks in our benchmark are diverse, long-horizon, and
designed to emulate tasks that humans routinely perform on the internet. We
design and implement several autonomous agents, integrating recent techniques
such as reasoning before acting. The results demonstrate that solving complex
tasks is challenging: our best GPT-4-based agent only achieves an end-to-end
task success rate of 10.59%. These results highlight the need for further
development of robust agents, show that current state-of-the-art LMs are far from
perfect performance on these real-life tasks, and demonstrate that WebArena can be used to
measure such progress. Our code, data, environment reproduction resources, and
video demonstrations are publicly available at https://webarena.dev/.
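To make the two ideas mentioned in the abstract concrete, below is a minimal, self-contained Python sketch of a "reasoning before acting" agent loop whose outcome is judged by the functional correctness of the final environment state rather than by the exact action sequence. This is an illustration only: WebEnv, call_llm, run_agent, and functionally_correct are hypothetical placeholders invented for this sketch, not part of the WebArena codebase or API.

from dataclasses import dataclass, field

@dataclass
class WebEnv:
    # Toy stand-in for a browser-based environment: it only tracks visited URLs.
    visited: list = field(default_factory=list)

    def observe(self) -> str:
        return f"current page: {self.visited[-1] if self.visited else 'blank'}"

    def step(self, action: str) -> None:
        if action.startswith("goto "):
            self.visited.append(action[len("goto "):])

def call_llm(prompt: str) -> str:
    # Placeholder for a language-model call; a real agent would query an LLM here.
    return "reasoning: the task asks for the order history page\naction: goto /orders"

def run_agent(task: str, env: WebEnv, max_steps: int = 3) -> None:
    for _ in range(max_steps):
        prompt = f"Task: {task}\nObservation: {env.observe()}\nThink step by step, then act."
        output = call_llm(prompt)                     # the model reasons before acting
        action = output.split("action:")[-1].strip()  # keep only the proposed action
        env.step(action)

def functionally_correct(env: WebEnv) -> bool:
    # Judge the final environment state, not the literal actions the agent took.
    return "/orders" in env.visited

env = WebEnv()
run_agent("Open my order history", env)
print("success:", functionally_correct(env))

The outcome-based check mirrors the benchmark's emphasis on functional correctness: any action sequence that leaves the environment in the required state counts as success.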