WebArena: A Realistic Web Environment for Building Autonomous Agents

Abstract

Advances in generative AI have highlighted the exciting potential of autonomous agents to manage daily tasks via natural language commands. However, current agents are primarily created and tested in simplified synthetic environments, which limits how well they reflect real-world scenarios. In this paper, we build an environment for agent command and control that is highly realistic and reproducible. Specifically, we focus on agents that perform tasks on websites, and we create an environment with fully functional websites from four common domains: e-commerce, social forum discussions, collaborative software development, and content management. Our environment is enriched with tools (e.g., a map) and external knowledge bases (e.g., user manuals) to encourage human-like task-solving. Building upon our environment, we release a set of benchmark tasks focusing on evaluating the functional correctness of task completions. The tasks in our benchmark are diverse and long-horizon, and they are designed to emulate tasks that humans routinely perform on the internet. We design and implement several autonomous agents, integrating recent techniques such as reasoning before acting. The results demonstrate that solving complex tasks is challenging: our best GPT-4-based agent achieves an end-to-end task success rate of only 10.59%. These results highlight the need for further development of robust agents, show that current state-of-the-art LMs are far from perfect performance on these real-life tasks, and indicate that WebArena can be used to measure such progress. Our code, data, environment reproduction resources, and video demonstrations are publicly available at https://webarena.dev/.

