Visual-Seeker: Towards Visual-Native Multimodal Agentic Search via Active Visual Reasoning

Abstract

Multimodal large language models (MLLMs) have demonstrated impressive capabilities on many visual tasks, but they often struggle with factual grounding in complex, open-world scenarios. Recent multimodal deep search agents attempt to address this issue by calling external tools, yet the visual-native search paradigm remains underexplored. Existing methods rely primarily on simple images with explicit semantics and on text-only evidence trajectories, which limits an agent's ability to perform multi-hop, cross-modal reasoning and search. To address these limitations, we propose Visual-Seeker, a visual-native multimodal deep search agent built on active visual reasoning. Rather than treating vision as a static input, the agent actively attends to fine-grained visual details and dynamically harvests visual evidence throughout the search process. To unlock its visual-native potential, we design an active visual reasoning data pipeline and synthesize 5K high-quality multimodal trajectories for model training. Extensive experiments show that Visual-Seeker achieves state-of-the-art performance on five challenging multimodal search benchmarks, even surpassing several proprietary models, which validates its robust visual-native reasoning and search in real-world web environments. The code and data are available at: https://github.com/ZhengboZhang/Visual-Seeker.