WebArena: A Realistic Web Environment for Building Autonomous Agents
July 25, 2023
Authors: Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Yonatan Bisk, Daniel Fried, Uri Alon, Graham Neubig
cs.AI
Abstract
With generative AI advances, the exciting potential for autonomous agents to
manage daily tasks via natural language commands has emerged. However, current
agents are primarily created and tested in simplified synthetic environments,
substantially limiting real-world scenario representation. In this paper, we
build an environment for agent command and control that is highly realistic and
reproducible. Specifically, we focus on agents that perform tasks on websites,
and we create an environment with fully functional websites from four common
domains: e-commerce, social forum discussions, collaborative software
development, and content management. Our environment is enriched with tools
(e.g., a map) and external knowledge bases (e.g., user manuals) to encourage
human-like task-solving. Building upon our environment, we release a set of
benchmark tasks focusing on evaluating the functional correctness of task
completions. The tasks in our benchmark are diverse, long-horizon, and are
designed to emulate tasks that humans routinely perform on the internet. We
design and implement several autonomous agents, integrating recent techniques
such as reasoning before acting. The results demonstrate that solving complex
tasks is challenging: our best GPT-4-based agent only achieves an end-to-end
task success rate of 10.59%. These results highlight the need for further
development of robust agents, show that current state-of-the-art LMs are far
from perfect performance on these real-life tasks, and demonstrate that
WebArena can be used to
measure such progress. Our code, data, environment reproduction resources, and
video demonstrations are publicly available at https://webarena.dev/.
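
To make the "reasoning before acting" agents and the functional-correctness evaluation mentioned in the abstract more concrete, below is a minimal Python sketch. It is not the WebArena implementation: the environment interface (reset/step), the llm helper, the action vocabulary, and the functionally_correct check are hypothetical placeholders standing in for the actual code released at https://webarena.dev/.

```python
# A minimal, hypothetical sketch of a reason-before-act agent loop over a
# generic web environment interface. Names and signatures here are assumptions
# for illustration, not the WebArena API.

from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    observation: str  # e.g., a text rendering of the current web page
    done: bool        # whether the environment considers the episode finished


def llm(prompt: str) -> str:
    """Placeholder for a call to a language model such as GPT-4."""
    raise NotImplementedError


def react_episode(intent: str,
                  reset: Callable[[], str],
                  step: Callable[[str], StepResult],
                  max_steps: int = 30) -> list[str]:
    """Run one episode: at each step the model first writes free-form
    reasoning, then emits a single web action (e.g. `click [id]`,
    `type [id] [text]`, or `stop [answer]`) on its last line."""
    observation = reset()
    trajectory: list[str] = []
    for _ in range(max_steps):
        prompt = (
            f"Objective: {intent}\n"
            f"Observation:\n{observation}\n"
            "Reason step by step about what to do next, then output "
            "exactly one action on the last line."
        )
        response = llm(prompt)
        action = response.strip().splitlines()[-1]  # reasoning precedes the action
        trajectory.append(action)
        if action.startswith("stop"):
            break
        result = step(action)
        observation = result.observation
        if result.done:
            break
    return trajectory


def functionally_correct(final_answer: str, expected: str) -> bool:
    """Toy functional-correctness check: compare the agent's final answer to a
    reference value, ignoring case and whitespace. The real benchmark also
    inspects the resulting site state (e.g., a created issue or placed order)."""
    return final_answer.strip().lower() == expected.strip().lower()
```

The sketch only illustrates the control flow implied by the abstract: the model interleaves reasoning with single actions until it stops or the step budget runs out, and success is judged by the outcome rather than by matching a fixed action sequence.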