Struct-Searcher: Autonomous Structural Thinking Advances Multimodal Deep Information Seeking

Abstract

Deep research agents have attracted increasing attention for their ability to collect large-scale online information to acquire target knowledge, with recent efforts shifting from purely text-based information seeking to multimodal settings. However, existing agentic workflows largely follow evidence accumulation models, which aggregate evidence linearly and lack principled mechanisms for handling contradictory information across heterogeneous modalities. To this end, we propose Struct-Searcher, a structural agentic workflow grounded in belief revision theory that explicitly maintains an evolving multimodal structural graph throughout the reasoning process, enabling effective conflict-aware multimodal deep information seeking. Extensive experiments across multiple benchmark datasets and backbone models demonstrate that Struct-Searcher is (1) plug-and-play and model-agnostic, yielding an average relative accuracy improvement of 17.2% on BrowseComp-VL across five different backbones, and (2) top-performing, consistently outperforming state-of-the-art vision-language models (VLMs) and deep research agents, with relative accuracy improvements of 3.7% on MM-BrowseComp, 1.5% on HLE-VL, and 0.7% on BrowseComp-VL over the second-best competing approach.