Struct-Searcher: Agentic Structural Thinking Drives Multimodal Deep Information Seeking

Abstract

Deep research agents have attracted increasing attention for their ability to gather large-scale online information in pursuit of target knowledge, and recent work has shifted from purely text-based information seeking to multimodal settings. However, existing agentic workflows largely follow an evidence-accumulation model: they aggregate evidence linearly and lack a principled mechanism for handling contradictory information across heterogeneous modalities. To this end, we propose Struct-Searcher, a structural agentic workflow grounded in belief revision theory that explicitly maintains an evolving multimodal structural graph throughout the reasoning process, enabling effective conflict-aware multimodal deep information seeking. Extensive experiments across multiple benchmark datasets and backbone models demonstrate that Struct-Searcher is (1) plug-and-play and model-agnostic, yielding an average relative accuracy improvement of 17.2% on BrowseComp-VL across five different backbones; and (2) top-performing, consistently outperforming state-of-the-art vision-language models (VLMs) and deep research agents, with relative accuracy improvements of 3.7% on MM-BrowseComp, 1.5% on HLE-VL, and 0.7% on BrowseComp-VL over the second-best competing approach.
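
The improvements above are reported as relative rather than absolute gains. As a reading aid, the following minimal sketch shows how a relative accuracy improvement is conventionally computed. The function name and the accuracy values are hypothetical and chosen only for illustration; they are not taken from the paper, and the abstract does not state how the per-backbone figures are averaged.

```python
def relative_improvement(baseline_acc: float, new_acc: float) -> float:
    """Return the relative accuracy improvement in percent.

    Conventional definition: 100 * (new - baseline) / baseline.
    Both accuracies must be on the same scale (e.g. both in percent).
    """
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return 100.0 * (new_acc - baseline_acc) / baseline_acc


# Hypothetical accuracies (in percent), NOT results from the paper.
baseline, improved = 40.0, 46.88
print(f"{relative_improvement(baseline, improved):.1f}% relative improvement")
```

Under this definition, a 17.2% relative improvement over a hypothetical 40.0% baseline corresponds to an absolute gain of 6.88 percentage points, so relative and absolute figures should not be compared directly.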