AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents

Abstract

Autonomous agents that address day-to-day digital tasks (e.g., ordering groceries for a household) must not only operate multiple apps (e.g., notes, messaging, and shopping apps) via APIs, but also generate rich code with complex control flow in an iterative manner based on their interaction with the environment. However, existing benchmarks for tool use are inadequate, as they only cover tasks that require a simple sequence of API calls. To remedy this gap, we built AppWorld Engine, a high-quality execution environment (60K lines of code) of 9 day-to-day apps operable via 457 APIs and populated with realistic digital activities simulating the lives of ~100 fictitious users. We then created AppWorld Benchmark (40K lines of code), a suite of 750 natural, diverse, and challenging autonomous agent tasks requiring rich and interactive code generation. It supports robust programmatic evaluation with state-based unit tests, allowing for different ways of completing a task while also checking for unexpected changes, i.e., collateral damage. The state-of-the-art LLM, GPT-4o, solves only ~49% of our 'normal' tasks and ~30% of 'challenge' tasks, while other models solve at least 16% fewer. This highlights the benchmark's difficulty and AppWorld's potential to push the frontiers of interactive coding agents. The project website is available at https://appworld.dev/.

