Agent Workflow Memory

Abstract

Language model-based agents show promise on real-world tasks such as web navigation, but current methods still struggle with long-horizon tasks that require complex action trajectories. Humans, in contrast, flexibly solve complex tasks by learning reusable task workflows from past experience and using them to guide future actions. To build agents that can similarly benefit from this process, we introduce Agent Workflow Memory (AWM), a method for inducing commonly reused routines, i.e., workflows, and selectively providing them to the agent to guide subsequent generations. AWM applies to both offline and online scenarios: agents induce workflows either from training examples beforehand or from test queries on the fly. We experiment on two major web navigation benchmarks, Mind2Web and WebArena, which together cover 1000+ tasks from 200+ domains across travel, shopping, and social media, among others. AWM substantially improves over the baseline, raising relative success rate by 24.6% on Mind2Web and 51.1% on WebArena, while reducing the number of steps taken to solve WebArena tasks successfully. Furthermore, online AWM generalizes robustly in cross-task, cross-website, and cross-domain evaluations, surpassing baselines by 8.9 to 14.0 absolute points as the train-test task distribution gap widens.
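
The induce-then-reuse loop summarized above can be illustrated with a small sketch. This is only a sketch under stated assumptions, not the authors' implementation: the names `Workflow`, `WorkflowMemory`, `build_offline`, and `run_online` are hypothetical, the word-overlap selection rule is a stand-in for whatever selection AWM actually uses, and the `act`, `judge`, and `induce` callables are placeholders for the agent, a success check, and the workflow induction step. Adding workflows only after a trajectory is judged successful is likewise an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple


@dataclass
class Workflow:
    """A reusable routine: a short description plus an ordered list of steps."""
    description: str
    steps: List[str]


@dataclass
class WorkflowMemory:
    """Hypothetical store of induced workflows (not the authors' data structure)."""
    workflows: List[Workflow] = field(default_factory=list)

    def add(self, workflow: Workflow) -> None:
        # Skip workflows whose description is already stored.
        if all(w.description != workflow.description for w in self.workflows):
            self.workflows.append(workflow)

    def select(self, query: str, k: int = 3) -> List[Workflow]:
        # Assumed toy relevance score: number of words shared by the query
        # and the workflow description. Workflows with no overlap are dropped.
        query_words = set(query.lower().split())
        scored = [
            (len(query_words & set(w.description.lower().split())), w)
            for w in self.workflows
        ]
        relevant = [pair for pair in scored if pair[0] > 0]
        relevant.sort(key=lambda pair: pair[0], reverse=True)
        return [w for _, w in relevant[:k]]


Act = Callable[[str, List[Workflow]], List[str]]   # query, hints -> trajectory
Judge = Callable[[str, List[str]], bool]           # query, trajectory -> success?
Induce = Callable[[str, List[str]], Workflow]      # query, trajectory -> workflow


def build_offline(examples: Sequence[Tuple[str, List[str]]], induce: Induce) -> WorkflowMemory:
    """Offline scenario: induce workflows from training examples beforehand."""
    memory = WorkflowMemory()
    for query, trajectory in examples:
        memory.add(induce(query, trajectory))
    return memory


def run_online(queries: Sequence[str], act: Act, judge: Judge,
               induce: Induce, memory: WorkflowMemory) -> List[bool]:
    """Online scenario: induce workflows from test queries on the fly."""
    results = []
    for query in queries:
        hints = memory.select(query)        # selectively provide workflows
        trajectory = act(query, hints)      # agent acts, guided by the hints
        success = judge(query, trajectory)
        if success:                         # assumption: keep only successes
            memory.add(induce(query, trajectory))
        results.append(success)
    return results
```

In this sketch the memory starts empty in the online setting and grows as queries are processed, so later queries can draw on workflows induced from earlier ones; the offline variant simply fills the same memory before any test query is seen.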