DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search
October 14, 2025
Authors: Kartik Narayan, Yang Xu, Tian Cao, Kavya Nerella, Vishal M. Patel, Navid Shiee, Peter Grasch, Chao Jia, Yinfei Yang, Zhe Gan
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) in real-world applications require access to external knowledge sources and must remain responsive to dynamic, ever-changing real-world information in order to address information-seeking and knowledge-intensive user queries. Existing approaches, such as retrieval-augmented generation (RAG) methods, search agents, and search-equipped MLLMs, often suffer from rigid pipelines, excessive search calls, and poorly constructed search queries, which result in inefficiencies and suboptimal outcomes. To address these limitations, we present DeepMMSearch-R1, the first multimodal LLM capable of performing on-demand, multi-turn web searches and dynamically crafting queries for both image and text search tools. Specifically, DeepMMSearch-R1 can initiate web searches based on relevant crops of the input image, making image search more effective, and can iteratively adapt text search queries based on retrieved information, thereby enabling self-reflection and self-correction. Our approach relies on a two-stage training pipeline: a cold-start supervised finetuning phase followed by online reinforcement learning optimization. For training, we introduce DeepMMSearchVQA, a novel multimodal VQA dataset created through an automated pipeline and intermixed with real-world information from web search tools. This dataset contains diverse, multi-hop queries that integrate textual and visual information, teaching the model when to search, what to search for, which search tool to use, and how to reason over the retrieved information. We conduct extensive experiments across a range of knowledge-intensive benchmarks to demonstrate the superiority of our approach. Finally, we analyze the results and provide insights that are valuable for advancing multimodal web search.
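
As a rough illustration of the on-demand, multi-turn search behavior the abstract describes, the following Python sketch shows how such a tool-calling loop might be structured. It is not the authors' implementation: all names (`Turn`, `generate_step`, `crop_relevant_region`, `image_search`, `text_search`) are hypothetical placeholders standing in for the finetuned MLLM and the search backends.

```python
# Minimal sketch (assumed structure, not the paper's code) of an on-demand,
# multi-turn multimodal search loop: per turn the model decides to answer,
# search with a relevant crop of the input image, or issue/refine a text query.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Turn:
    thought: str                 # model's reasoning for this turn
    tool: Optional[str] = None   # "image_search", "text_search", or None (final answer)
    query: Optional[str] = None  # self-crafted query when a tool is called
    result: Optional[str] = None # retrieved evidence fed back to the model

def generate_step(question: str, image: bytes, history: List[Turn]) -> Turn:
    """Stand-in for one MLLM decoding step that emits reasoning plus an
    optional tool call; a real system would run the finetuned model here."""
    if history:  # dummy behavior: answer after one retrieval
        return Turn(thought=f"Answer based on: {history[-1].result}")
    return Turn(thought="Need external evidence.", tool="text_search", query=question)

def crop_relevant_region(image: bytes, query: str) -> bytes:
    """Stand-in for cropping the query-relevant region so image search
    operates on the informative part of the picture, not the full frame."""
    return image

def image_search(image_crop: bytes) -> str:
    return "stub image-search result"              # placeholder retrieval backend

def text_search(query: str) -> str:
    return f"stub text-search result for: {query}" # placeholder retrieval backend

def answer(question: str, image: bytes, max_turns: int = 5) -> str:
    """On-demand, multi-turn loop: search only when the model asks for it,
    and let each retrieval inform (and possibly correct) the next query."""
    history: List[Turn] = []
    for _ in range(max_turns):
        turn = generate_step(question, image, history)
        if turn.tool is None:                      # model chose to answer directly
            return turn.thought
        if turn.tool == "image_search":
            turn.result = image_search(crop_relevant_region(image, turn.query or ""))
        else:                                      # "text_search"
            turn.result = text_search(turn.query or "")
        history.append(turn)                       # evidence is visible next turn
    return generate_step(question, image, history).thought

if __name__ == "__main__":
    print(answer("Which landmark is shown in this photo?", image=b""))
```

The loop keeps retrieved evidence in the turn history, which is how the abstract's iterative query adaptation and self-correction would plug in: each new `generate_step` call can revise its next text query or switch tools based on what earlier searches returned.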