Recon-Act: A Self-Evolving Multi-Agent Browser-Use System via Web Reconnaissance, Tool Generation, and Task Execution

Abstract

In recent years, multimodal models have made remarkable strides, paving the way for intelligent browser-use agents. However, when solving tasks on real-world webpages over multi-turn, long-horizon trajectories, current agents still suffer from disordered action sequencing and excessive trial and error during execution. This paper introduces Recon-Act, a self-evolving multi-agent framework grounded in the Reconnaissance-Action behavioral paradigm. The system comprises a Reconnaissance Team and an Action Team: the former conducts comparative analysis and tool generation, while the latter handles intent decomposition, tool orchestration, and execution. By contrasting erroneous trajectories with successful ones, the Reconnaissance Team infers remedies, abstracts them into a unified notion of generalized tools, expressed either as hints or as rule-based code, and registers them in the tool archive in real time. The Action Team then re-runs inference equipped with these targeted tools, establishing a closed-loop training pipeline of data-tools-action-feedback. Following the six-level implementation roadmap proposed in this work, we have currently reached Level 3 (with limited human-in-the-loop intervention). Leveraging generalized tools obtained through reconnaissance, Recon-Act substantially improves adaptability to unseen websites and solvability of long-horizon tasks, and achieves state-of-the-art performance on the challenging VisualWebArena dataset.
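As an illustration only, the data-tools-action-feedback loop described above might be sketched as follows. This is a toy sketch under stated assumptions, not the authors' implementation: `Trajectory`, `Tool`, `ToolArchive`, `reconnoiter`, and `act` are hypothetical names, the action strings are invented, and the "first diverging step" heuristic merely stands in for the model-driven comparative analysis performed by the Reconnaissance Team.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Trajectory:
    """A recorded attempt at a task (hypothetical structure)."""
    task: str
    actions: List[str]
    success: bool


@dataclass
class Tool:
    """A generalized tool: here a hint that may suggest one action for a task."""
    name: str
    kind: str  # "hint" or "code"; only hints are modeled in this sketch
    apply: Callable[[str], Optional[str]]


class ToolArchive:
    """In-memory registry that tools are added to as they are generated."""

    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def suggest(self, task: str) -> List[str]:
        # Collect every non-empty suggestion from the registered tools.
        return [s for t in self._tools.values() if (s := t.apply(task)) is not None]

    def __len__(self) -> int:
        return len(self._tools)


def reconnoiter(failed: Trajectory, succeeded: Trajectory) -> Tool:
    """Toy comparative analysis: take the successful action at the first step
    where the two runs diverge and wrap it as a hint keyed on a task keyword."""
    remedy = next(
        (good for bad, good in zip(failed.actions, succeeded.actions) if bad != good),
        succeeded.actions[-1],  # fallback if no divergence is found
    )
    keyword = succeeded.task.split()[0].lower()
    return Tool(
        name=f"hint_{keyword}",
        kind="hint",
        apply=lambda task: remedy if keyword in task.lower() else None,
    )


def act(task: str, archive: ToolArchive) -> List[str]:
    """Toy action step: a fixed baseline plan, enriched by archived hints."""
    return ["open_page", *archive.suggest(task), "submit"]


# One pass through the loop: fail, compare, register a tool, act again.
failed = Trajectory("Search red shoes", ["open_page", "click_banner", "submit"], False)
succeeded = Trajectory("Search red shoes", ["open_page", "type_in_search_box", "submit"], True)

archive = ToolArchive()
before = act("search blue hats", archive)
archive.register(reconnoiter(failed, succeeded))
after = act("search blue hats", archive)
print(before)
print(after)
```

In this sketch, the hint learned from one task is reused on a different task that shares the same keyword, which is a deliberately crude stand-in for the generalization to unseen websites claimed above.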
