Recon-Act: A Self-Evolving Multi-Agent Browser-Use System via Web Reconnaissance, Tool Generation, and Task Execution
September 25, 2025
Authors: Kaiwen He, Zhiwei Wang, Chenyi Zhuang, Jinjie Gu
cs.AI
Abstract
In recent years, multimodal models have made remarkable strides, paving the way
for intelligent browser-use agents. However, when solving tasks on real-world
webpages over multi-turn, long-horizon trajectories, current agents still suffer
from disordered action sequencing and excessive trial and error during
execution. This paper introduces Recon-Act, a self-evolving multi-agent
framework grounded in the Reconnaissance-Action behavioral paradigm. The system
comprises a Reconnaissance Team and an Action Team: the former conducts
comparative analysis and tool generation, while the latter handles intent
decomposition, tool orchestration, and execution. By contrasting erroneous
trajectories with successful ones, the Reconnaissance Team infers remedies,
abstracts them into a unified notion of generalized tools, expressed either as
hints or as rule-based code, and registers them in the tool archive in real
time. The Action Team then re-runs inference equipped with these targeted tools,
thus establishing a closed-loop training pipeline of
data-tools-action-feedback. Following the six-level implementation roadmap
proposed in this work, we have currently reached Level 3 (with limited
human-in-the-loop intervention). Leveraging generalized tools obtained through
reconnaissance, Recon-Act substantially improves adaptability to unseen
websites and solvability of long-horizon tasks, and achieves state-of-the-art
performance on the challenging VisualWebArena dataset.
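
To make the data-tools-action-feedback loop more concrete, the following is a minimal sketch, not the authors' implementation. All names (`Hint`, `RuleTool`, `ToolArchive`, `recon`), the action strings, and the observation format are hypothetical. A plain first-divergence comparison stands in for the comparative analysis that the Reconnaissance Team performs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Union


@dataclass
class Hint:
    """Hypothetical generalized tool expressed as natural-language guidance."""
    name: str
    text: str


@dataclass
class RuleTool:
    """Hypothetical generalized tool expressed as rule-based code."""
    name: str
    run: Callable[[dict], str]  # maps a page observation to an action string


Tool = Union[Hint, RuleTool]


class ToolArchive:
    """Registry that the action side can query at inference time."""

    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        # Later registrations with the same name overwrite earlier ones.
        self._tools[tool.name] = tool

    def hints(self) -> List[str]:
        return [t.text for t in self._tools.values() if isinstance(t, Hint)]

    def rule_tools(self) -> List[RuleTool]:
        return [t for t in self._tools.values() if isinstance(t, RuleTool)]


def recon(failed: List[str], succeeded: List[str]) -> Optional[Hint]:
    """Assumed stand-in for comparative analysis: find the first step where
    the failed trajectory diverges from the successful one."""
    for step, (bad, good) in enumerate(zip(failed, succeeded)):
        if bad != good:
            return Hint(
                name=f"fix_step_{step}",
                text=f"At step {step}, prefer '{good}' over '{bad}'.",
            )
    return None


# Feedback loop: contrast trajectories, then register the inferred remedy.
archive = ToolArchive()
failed = ["open_home", "click_search", "type_query"]
succeeded = ["open_home", "sort_by_price", "click_first_item"]
remedy = recon(failed, succeeded)
if remedy is not None:
    archive.register(remedy)

# A rule-based tool: pick the cheapest item from a (hypothetical) observation.
cheapest = RuleTool(
    name="click_cheapest",
    run=lambda obs: "click [%s]" % min(obs["items"], key=lambda i: i["price"])["id"],
)
archive.register(cheapest)

observation = {"items": [{"id": "a1", "price": 30}, {"id": "b2", "price": 12}]}
action = archive.rule_tools()[0].run(observation)
```

In this sketch the hints would be prepended to the Action Team's prompt on the next attempt, while rule tools would be invoked directly on the current observation. How Recon-Act actually dispatches between the two forms is not specified in the abstract.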