WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent

Abstract

Web agents such as Deep Research have demonstrated superhuman cognitive abilities, solving highly challenging information-seeking problems. However, most research remains primarily text-centric and overlooks visual information in the real world. Multimodal Deep Research is therefore highly challenging: compared with text-based agents, such agents require much stronger reasoning abilities in perception, logic, and knowledge, as well as the use of more sophisticated tools. To address this limitation, we introduce WebWatcher, a multimodal agent for Deep Research equipped with enhanced vision-language reasoning capabilities. It leverages high-quality synthetic multimodal trajectories for efficient cold-start training, uses a variety of tools for deep reasoning, and further improves generalization through reinforcement learning. To better evaluate the capabilities of multimodal agents, we propose BrowseComp-VL, a BrowseComp-style benchmark that requires complex information retrieval involving both visual and textual information. Experimental results show that WebWatcher significantly outperforms proprietary baselines, RAG workflows, and open-source agents on four challenging VQA benchmarks, paving the way for solving complex multimodal information-seeking tasks.
