Struct-Searcher: Agentic Structural Thinking Advances Multimodal Deep Information Seeking

Abstract

Deep research agents have attracted growing attention for their ability to gather large-scale online information and acquire target knowledge, and recent work has shifted from purely text-based information seeking to multimodal settings. However, existing agentic workflows largely follow evidence accumulation models, which aggregate evidence linearly and lack principled mechanisms for handling contradictory information across heterogeneous modalities. To this end, we propose Struct-Searcher, a structural agentic workflow grounded in belief revision theory that explicitly maintains an evolving multimodal structural graph throughout the reasoning process, enabling effective conflict-aware multimodal deep information seeking. Extensive experiments across multiple benchmark datasets and backbone models demonstrate that Struct-Searcher is (1) plug-and-play and model-agnostic, yielding an average relative accuracy improvement of 17.2% on BrowseComp-VL across five different backbones; and (2) top-performing, consistently outperforming state-of-the-art vision-language models (VLMs) and deep research agents, with relative accuracy improvements of 3.7% on MM-BrowseComp, 1.5% on HLE-VL, and 0.7% on BrowseComp-VL over the second-best competing approach.