Recon-Act: A Self-Evolving Multi-Agent Browser-Use System via Web Reconnaissance, Tool Generation, and Task Execution

Abstract

In recent years, multimodal models have made remarkable strides, paving the way for intelligent browser-use agents. However, when solving tasks on real-world webpages over multi-turn, long-horizon trajectories, current agents still suffer from disordered action sequencing and excessive trial and error during execution. This paper introduces Recon-Act, a self-evolving multi-agent framework grounded in the Reconnaissance-Action behavioral paradigm. The system comprises a Reconnaissance Team and an Action Team: the former conducts comparative analysis and tool generation, while the latter handles intent decomposition, tool orchestration, and execution. By contrasting erroneous trajectories with successful ones, the Reconnaissance Team infers remedies, abstracts them into a unified notion of generalized tools, expressed either as hints or as rule-based code, and registers them in the tool archive in real time. The Action Team then re-runs inference with these targeted tools, establishing a closed-loop training pipeline of data-tools-action-feedback. Following the six-level implementation roadmap proposed in this work, we have currently reached Level 3 (with limited human-in-the-loop intervention). Leveraging generalized tools obtained through reconnaissance, Recon-Act substantially improves adaptability to unseen websites and solvability on long-horizon tasks, and achieves state-of-the-art performance on the challenging VisualWebArena dataset.
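The loop described in the abstract (compare a failed trajectory with a successful one, distill a generalized tool, register it, then re-run the action side) can be illustrated with a toy sketch. This is not the authors' implementation: the names `Tool`, `ToolArchive`, `reconnoiter`, and `build_prompt` are hypothetical, trajectories are assumed to be plain lists of action strings, and the "comparative analysis" is reduced to finding the first diverging step.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Tool:
    """Hypothetical generalized tool: a natural-language hint or a rule-based function."""
    name: str
    hint: str = ""                               # set when the tool is a hint
    rule: Optional[Callable[[str], str]] = None  # set when the tool is rule-based code


@dataclass
class ToolArchive:
    """Hypothetical registry that the action side reads from."""
    tools: Dict[str, Tool] = field(default_factory=dict)

    def register(self, tool: Tool) -> None:
        # Later registrations with the same name overwrite earlier ones.
        self.tools[tool.name] = tool

    def hints(self) -> List[str]:
        return [t.hint for t in self.tools.values() if t.hint]


def reconnoiter(failed: List[str], succeeded: List[str]) -> Optional[Tool]:
    """Toy comparative analysis: find the first step where the two
    trajectories diverge and turn the successful step into a hint."""
    for step, (bad, good) in enumerate(zip(failed, succeeded)):
        if bad != good:
            return Tool(
                name=f"hint_step_{step}",
                hint=f"At step {step}, prefer '{good}' over '{bad}'.",
            )
    return None  # no divergence found in the shared prefix


def build_prompt(task: str, archive: ToolArchive) -> str:
    """Toy action-side input: the task followed by every registered hint."""
    return "\n".join([f"Task: {task}", *archive.hints()])
```

In this sketch, a failed run that skips a "sort by price" step would yield a hint naming that step, and the next call to `build_prompt` would carry the hint along with the task. A real system would replace the string comparison with model-driven analysis and would also populate the `rule` field for code-based tools.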
