Recon-Act: A Self-Evolving Multi-Agent Browser-Use System via Web Reconnaissance, Tool Generation, and Task Execution

Abstract

In recent years, multimodal models have made remarkable strides, paving the way for intelligent browser-use agents. However, when solving tasks on real-world webpages over multi-turn, long-horizon trajectories, current agents still suffer from disordered action sequencing and excessive trial and error during execution. This paper introduces Recon-Act, a self-evolving multi-agent framework grounded in the Reconnaissance-Action behavioral paradigm. The system comprises a Reconnaissance Team and an Action Team: the former conducts comparative analysis and tool generation, while the latter handles intent decomposition, tool orchestration, and execution. By contrasting erroneous trajectories with successful ones, the Reconnaissance Team infers remedies, abstracts them into a unified notion of generalized tools, expressed either as hints or as rule-based code, and registers them in the tool archive in real time. The Action Team then re-runs inference with these targeted tools, establishing a closed-loop training pipeline of data-tools-action-feedback. Following the six-level implementation roadmap proposed in this work, we have currently reached Level 3 (with limited human-in-the-loop intervention). Leveraging generalized tools obtained through reconnaissance, Recon-Act substantially improves adaptability to unseen websites and solvability of long-horizon tasks, and achieves state-of-the-art performance on the challenging VisualWebArena dataset.
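The data-tools-action-feedback loop described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: every name here (`Tool`, `ToolArchive`, `reconnoiter`, `run_action_team`, `closed_loop`) is hypothetical, trajectories are assumed to be equal-length lists of action strings, and a simple positional diff stands in for the model-driven comparative analysis. The generated tools resemble the hint form rather than rule-based code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Tool:
    """Hypothetical generalized tool: a hint plus the action it recommends."""
    name: str
    step: int
    action: str
    hint: str


@dataclass
class ToolArchive:
    """Hypothetical archive that tools are registered into as they are created."""
    tools: Dict[int, Tool] = field(default_factory=dict)

    def register(self, tool: Tool) -> None:
        self.tools[tool.step] = tool


def reconnoiter(failed: List[str], succeeded: List[str]) -> Optional[Tool]:
    """Toy stand-in for the Reconnaissance Team: contrast a failed and a
    successful trajectory and turn the first divergence into a tool."""
    for i, (bad, good) in enumerate(zip(failed, succeeded)):
        if bad != good:
            hint = f"At step {i}, prefer '{good}' over '{bad}'."
            return Tool(name=f"fix_step_{i}", step=i, action=good, hint=hint)
    return None


def run_action_team(plan: List[str], archive: ToolArchive) -> List[str]:
    """Toy stand-in for the Action Team: execute the plan, letting any
    registered tool override the step it targets."""
    return [
        archive.tools[i].action if i in archive.tools else action
        for i, action in enumerate(plan)
    ]


def closed_loop(
    plan: List[str], reference: List[str], max_rounds: int = 5
) -> Tuple[List[str], ToolArchive]:
    """Alternate acting and reconnaissance until the trajectory matches the
    successful reference or the round budget is exhausted."""
    archive = ToolArchive()
    trajectory = run_action_team(plan, archive)
    for _ in range(max_rounds):
        if trajectory == reference:
            break
        tool = reconnoiter(trajectory, reference)
        if tool is None:
            break
        archive.register(tool)
        trajectory = run_action_team(plan, archive)
    return trajectory, archive


if __name__ == "__main__":
    plan = ["open_search", "click_first_result", "submit"]
    reference = ["open_search", "sort_by_price", "submit"]
    trajectory, archive = closed_loop(plan, reference)
    print(trajectory)
    for tool in archive.tools.values():
        print(tool.hint)
```

In this sketch each round adds at most one tool, so a plan that diverges from the reference at several steps needs several rounds to converge, which loosely mirrors the incremental growth of the tool archive.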
