MMSearch-R1: Incentivizing LMMs to Search
June 25, 2025
Authors: Jinming Wu, Zihao Deng, Wei Li, Yiding Liu, Bo You, Bo Li, Zejun Ma, Ziwei Liu
cs.AI
Abstract
Robust deployment of large multimodal models (LMMs) in real-world scenarios
requires access to external knowledge sources, given the complexity and dynamic
nature of real-world information. Existing approaches such as
retrieval-augmented generation (RAG) and prompt-engineered search agents rely
on rigid pipelines, often leading to inefficient or excessive search behaviors.
We present MMSearch-R1, the first end-to-end reinforcement learning framework
that enables LMMs to perform on-demand, multi-turn search in real-world
Internet environments. Our framework integrates both image and text search
tools, allowing the model to reason about when and how to invoke them, guided by
an outcome-based reward with a search penalty. To support training, we collect
a multimodal search VQA dataset through a semi-automated pipeline that covers
diverse visual and textual knowledge needs and curate a search-balanced subset
with both search-required and search-free samples, which proves essential for
shaping efficient and on-demand search behavior. Extensive experiments on
knowledge-intensive and information-seeking VQA tasks show that our model not only
outperforms RAG-based baselines of the same model size, but also matches the
performance of a larger RAG-based model while reducing search calls by over
30%. We further analyze key empirical findings to offer actionable insights for
advancing research in multimodal search.
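
To make the idea of an outcome-based reward with a search penalty concrete, the following is a minimal sketch only. The abstract does not give the reward formula, so the function name `outcome_reward`, the multiplicative form of the discount, and the default penalty value of 0.1 are all assumptions chosen for illustration, not the authors' definition.

```python
def outcome_reward(is_correct: bool, num_search_calls: int,
                   search_penalty: float = 0.1) -> float:
    """Hypothetical outcome-based reward with a search penalty.

    Assumptions (not taken from the abstract): a correct answer earns 1.0,
    a wrong answer earns 0.0, and a correct answer that used any search
    call is discounted once by `search_penalty`.
    """
    if not is_correct:
        # Wrong answers earn nothing, whether or not search was used.
        return 0.0
    if num_search_calls > 0:
        # Correct, but the model relied on search: apply the discount.
        return 1.0 - search_penalty
    # Correct without searching: full reward.
    return 1.0


if __name__ == "__main__":
    print(outcome_reward(True, 0))   # answered from internal knowledge
    print(outcome_reward(True, 2))   # answered after two search calls
    print(outcome_reward(False, 3))  # wrong despite searching
```

Under these assumptions, a policy is rewarded most for answering correctly without searching, slightly less for answering correctly with search, and not at all for a wrong answer, which is one way to encourage on-demand rather than excessive search.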