AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents

July 26, 2024
Authors: Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, Niranjan Balasubramanian
cs.AI

Abstract

Autonomous agents that address day-to-day digital tasks (e.g., ordering groceries for a household) must not only operate multiple apps (e.g., notes, messaging, and shopping apps) via APIs, but also generate rich code with complex control flow in an iterative manner based on their interaction with the environment. However, existing benchmarks for tool use are inadequate, as they only cover tasks that require a simple sequence of API calls. To remedy this gap, we built AppWorld Engine, a high-quality execution environment (60K lines of code) of 9 day-to-day apps operable via 457 APIs and populated with realistic digital activities simulating the lives of ~100 fictitious users. We then created AppWorld Benchmark (40K lines of code), a suite of 750 natural, diverse, and challenging autonomous agent tasks requiring rich and interactive code generation. It supports robust programmatic evaluation with state-based unit tests, allowing for different ways of completing a task while also checking for unexpected changes, i.e., collateral damage. The state-of-the-art LLM, GPT-4o, solves only ~49% of our 'normal' tasks and ~30% of 'challenge' tasks, while other models solve at least 16% fewer. This highlights the benchmark's difficulty and AppWorld's potential to push the frontiers of interactive coding agents. The project website is available at https://appworld.dev/.
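
To make the idea of state-based evaluation with a collateral-damage check more concrete, here is a minimal sketch in Python. It is not AppWorld's actual evaluation code or API: the `State` layout, `diff_states`, `evaluate`, and the toy `orders`/`notes` tables are all hypothetical names chosen for illustration. The sketch assumes the environment state can be snapshotted as nested dictionaries before and after the agent runs.

```python
from typing import Any, Callable

# Hypothetical snapshot layout: table name -> {row id -> row contents}.
State = dict[str, dict[str, Any]]


def diff_states(before: State, after: State) -> set[tuple[str, str]]:
    """Return (table, row_id) pairs that were added, removed, or modified."""
    changed: set[tuple[str, str]] = set()
    for table in before.keys() | after.keys():
        old_rows = before.get(table, {})
        new_rows = after.get(table, {})
        for row_id in old_rows.keys() | new_rows.keys():
            if old_rows.get(row_id) != new_rows.get(row_id):
                changed.add((table, row_id))
    return changed


def evaluate(
    before: State,
    after: State,
    required: list[Callable[[State], bool]],
    allowed_tables: set[str],
) -> dict[str, Any]:
    """Check the final state, not the sequence of calls that produced it."""
    changed = diff_states(before, after)
    # The goal is met if every required predicate holds on the final state.
    goal_met = all(check(after) for check in required)
    # Any change outside the tables the task may touch counts as collateral damage.
    collateral = {c for c in changed if c[0] not in allowed_tables}
    return {
        "goal_met": goal_met,
        "collateral": collateral,
        "passed": goal_met and not collateral,
    }


# Toy task: order the groceries listed in a note, without touching the note.
before = {"orders": {}, "notes": {"n1": {"text": "milk, eggs"}}}
after_good = {
    "orders": {"o1": {"items": ["milk", "eggs"]}},
    "notes": {"n1": {"text": "milk, eggs"}},
}
after_bad = {"orders": {"o1": {"items": ["eggs", "milk"]}}, "notes": {}}

required = [
    lambda s: any(sorted(o["items"]) == ["eggs", "milk"] for o in s["orders"].values())
]

good = evaluate(before, after_good, required, allowed_tables={"orders"})
bad = evaluate(before, after_bad, required, allowed_tables={"orders"})
```

In this toy setup both runs place the right order, in different item orders, so both satisfy the goal predicate; only the second run fails, because it also deleted a note the task never asked it to change. Checking predicates on the final state is what allows different solution paths to pass, while the diff against the allowed tables catches unintended side effects.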
