WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent

Abstract

Web agents such as Deep Research have demonstrated superhuman cognitive abilities and can solve highly challenging information-seeking problems. However, most research remains text-centric and overlooks visual information in the real world. This makes multimodal Deep Research highly challenging, because such agents need much stronger reasoning abilities in perception, logic, and knowledge, and must use more sophisticated tools than text-based agents. To address this limitation, we introduce WebWatcher, a multimodal agent for Deep Research equipped with enhanced vision-language reasoning capabilities. It leverages high-quality synthetic multimodal trajectories for efficient cold-start training, uses a variety of tools for deep reasoning, and further improves generalization through reinforcement learning. To better evaluate the capabilities of multimodal agents, we propose BrowseComp-VL, a BrowseComp-style benchmark that requires complex information retrieval involving both visual and textual information. Experimental results show that WebWatcher significantly outperforms proprietary baselines, RAG workflows, and open-source agents on four challenging VQA benchmarks, paving the way for solving complex multimodal information-seeking tasks.
