AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents
July 26, 2024
Authors: Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, Niranjan Balasubramanian
cs.AI
Abstract
Autonomous agents that address day-to-day digital tasks (e.g., ordering
groceries for a household) must not only operate multiple apps (e.g., notes,
messaging, and shopping apps) via APIs, but also generate rich code with complex
control flow in an iterative manner based on their interaction with the
environment. However, existing benchmarks for tool use are inadequate, as they
only cover tasks that require a simple sequence of API calls.
To remedy this gap, we built AppWorld Engine, a high-quality
execution environment (60K lines of code) of 9 day-to-day apps operable via 457
APIs and populated with realistic digital activities simulating the lives of
~100 fictitious users. We then created AppWorld Benchmark (40K lines
of code), a suite of 750 natural, diverse, and challenging autonomous agent
tasks requiring rich and interactive code generation. It supports robust
programmatic evaluation with state-based unit tests, allowing for different
ways of completing a task while also checking for unexpected changes, i.e.,
collateral damage. The state-of-the-art LLM, GPT-4o, solves only ~49% of our
'normal' tasks and ~30% of 'challenge' tasks, while other models solve at least
16% fewer. This highlights the benchmark's difficulty and AppWorld's potential
to push the frontiers of interactive coding agents. The project website is
available at https://appworld.dev/.
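
The state-based evaluation mentioned in the abstract can be illustrated with a minimal sketch. This is not AppWorld's actual evaluation code: the `diff_states` and `evaluate` helpers, the table names, and the record layout are all hypothetical, chosen only to show the idea of comparing the world state before and after an agent runs, accepting any solution path that reaches the required end state, and flagging changes outside an allowed set as collateral damage.

```python
from typing import Any, Callable

# Hypothetical state layout (an assumption, not AppWorld's schema):
# {table_name: {record_id: record_dict}}
State = dict[str, dict[int, dict[str, Any]]]


def diff_states(before: State, after: State) -> dict[str, set[int]]:
    """Return, per table, the ids of records that were added, removed, or modified."""
    changed: dict[str, set[int]] = {}
    for table in before.keys() | after.keys():
        old = before.get(table, {})
        new = after.get(table, {})
        ids = {rid for rid in old.keys() | new.keys() if old.get(rid) != new.get(rid)}
        if ids:
            changed[table] = ids
    return changed


def evaluate(
    before: State,
    after: State,
    goal_checks: list[Callable[[State], bool]],
    allowed_tables: set[str],
) -> dict[str, Any]:
    """Check the final state against the goal and look for unexpected changes."""
    changes = diff_states(before, after)
    # The goal is checked on the end state only, so any path that reaches it is accepted.
    goal_met = all(check(after) for check in goal_checks)
    # Any change in a table the task was not supposed to touch counts as collateral damage.
    collateral = {t: ids for t, ids in changes.items() if t not in allowed_tables}
    return {
        "goal_met": goal_met,
        "collateral_damage": collateral,
        "passed": goal_met and not collateral,
    }


# Toy task: order the groceries listed in a note, without touching the note itself.
before = {"orders": {}, "notes": {1: {"text": "buy milk and eggs"}}}
after_good = {
    "orders": {1: {"items": ["milk", "eggs"]}},
    "notes": {1: {"text": "buy milk and eggs"}},
}
after_bad = {
    "orders": {1: {"items": ["eggs", "milk"]}},
    "notes": {},  # the agent also deleted the note
}


def order_has_groceries(state: State) -> bool:
    return any(set(o["items"]) == {"milk", "eggs"} for o in state["orders"].values())


good = evaluate(before, after_good, [order_has_groceries], allowed_tables={"orders"})
bad = evaluate(before, after_bad, [order_has_groceries], allowed_tables={"orders"})
```

In this toy run, both end states satisfy the goal check, but the second one fails overall because the `notes` table changed even though the task only permitted changes to `orders`.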