DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search
October 14, 2025
Authors: Kartik Narayan, Yang Xu, Tian Cao, Kavya Nerella, Vishal M. Patel, Navid Shiee, Peter Grasch, Chao Jia, Yinfei Yang, Zhe Gan
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) in real-world applications require
access to external knowledge sources and must remain responsive to dynamic,
ever-changing real-world information in order to address
information-seeking and knowledge-intensive user queries. Existing approaches,
such as retrieval-augmented generation (RAG) methods, search agents, and
search-equipped MLLMs, often suffer from rigid pipelines, excessive search calls, and
poorly constructed search queries, which result in inefficiencies and
suboptimal outcomes. To address these limitations, we present DeepMMSearch-R1,
the first multimodal LLM capable of performing on-demand, multi-turn web
searches and dynamically crafting queries for both image and text search tools.
Specifically, DeepMMSearch-R1 can initiate web searches based on relevant crops
of the input image, making image search more effective, and can iteratively
adapt text search queries based on retrieved information, thereby enabling
self-reflection and self-correction. Our approach relies on a two-stage
training pipeline: a cold-start supervised finetuning phase followed by
online reinforcement learning optimization. For training, we introduce
DeepMMSearchVQA, a novel multimodal VQA dataset created through an automated
pipeline and intermixed with real-world information from web search tools. This
dataset contains diverse, multi-hop queries that integrate textual and visual
information, teaching the model when to search, what to search for, which
search tool to use, and how to reason over the retrieved information. We conduct
extensive experiments across a range of knowledge-intensive benchmarks to
demonstrate the superiority of our approach. Finally, we analyze the results
and provide insights that are valuable for advancing multimodal web search.
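
To make the on-demand, multi-turn behavior described in the abstract concrete, below is a minimal Python sketch of the kind of tool-calling loop such a model can drive. It is an illustration only, not the paper's implementation: the names `image_search`, `text_search`, `model_step`, the `Turn` record, and the turn budget are assumptions introduced here. In DeepMMSearch-R1 the finetuned MLLM itself decides when to stop searching, whether to query with a relevant crop of the input image or with a rewritten text query, and how to reason over the retrieved results.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-ins for real web-search backends (assumed names, not the
# paper's API); each returns retrieved text for the model to reason over.
def image_search(crop_description: str) -> str:
    return f"[image-search results for crop: {crop_description}]"

def text_search(query: str) -> str:
    return f"[text-search results for query: {query}]"

@dataclass
class Turn:
    tool: str         # "image_search", "text_search", or "answer"
    argument: str     # crop description, text query, or final answer
    observation: str  # retrieved information for this turn

# `model_step` abstracts the MLLM policy: given the question, the image, and
# the search history so far, it returns the next (tool, argument) action.
ModelStep = Callable[[str, str, list], tuple]

def multimodal_search_loop(model_step: ModelStep, question: str,
                           image: str, max_turns: int = 5) -> str:
    """Run an on-demand, multi-turn search loop and return a final answer."""
    history: list[Turn] = []
    for _ in range(max_turns):
        tool, argument = model_step(question, image, history)
        if tool == "answer":        # model decides no (further) search is needed
            return argument
        if tool == "image_search":  # search with a relevant crop of the input image
            observation = image_search(argument)
        else:                       # rewrite the text query in light of earlier results
            observation = text_search(argument)
        history.append(Turn(tool, argument, observation))
    return "no confident answer within the turn budget"

# Toy policy for demonstration: one image-search turn, then answer.
def toy_policy(question: str, image: str, history: list) -> tuple:
    if not history:
        return "image_search", "landmark region of the photo"
    return "answer", f"answer grounded in: {history[-1].observation}"

print(multimodal_search_loop(toy_policy, "When was this building completed?", "photo.jpg"))
```

The toy policy at the end simply performs one image-search turn and then answers; the point of the structure is that tool choice, query content, and when to stop are decisions made by the model rather than by a fixed pipeline.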