MMSearch-R1: Incentivizing LMMs to Search
June 25, 2025
Authors: Jinming Wu, Zihao Deng, Wei Li, Yiding Liu, Bo You, Bo Li, Zejun Ma, Ziwei Liu
cs.AI
Abstract
Robust deployment of large multimodal models (LMMs) in real-world scenarios
requires access to external knowledge sources, given the complexity and dynamic
nature of real-world information. Existing approaches such as
retrieval-augmented generation (RAG) and prompt-engineered search agents rely
on rigid pipelines, often leading to inefficient or excessive search behaviors.
We present MMSearch-R1, the first end-to-end reinforcement learning framework
that enables LMMs to perform on-demand, multi-turn search in real-world
Internet environments. Our framework integrates both image and text search
tools, allowing the model to reason about when and how to invoke them, guided by
an outcome-based reward with a search penalty. To support training, we collect
a multimodal search VQA dataset through a semi-automated pipeline that covers
diverse visual and textual knowledge needs and curate a search-balanced subset
with both search-required and search-free samples, which proves essential for
shaping efficient and on-demand search behavior. Extensive experiments on
knowledge-intensive and info-seeking VQA tasks show that our model not only
outperforms RAG-based baselines of the same model size, but also matches the
performance of a larger RAG-based model while reducing search calls by over
30%. We further analyze key empirical findings to offer actionable insights for
advancing research in multimodal search.
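To make the reward design concrete, the following is a minimal sketch of one plausible form an outcome-based reward with a search penalty could take; the symbols (a correctness score Acc(y) in {0, 1}, a penalty coefficient alpha, and an indicator of whether search was invoked) are illustrative assumptions, not the paper's exact formulation:

$$ R(y) = \mathrm{Acc}(y)\,\bigl(1 - \alpha \cdot \mathbb{1}[\text{search invoked}]\bigr), \qquad 0 < \alpha < 1. $$

Under such a scheme, a correct answer produced without searching earns the full reward, a correct answer that required search earns a discounted reward, and an incorrect answer earns nothing, which pressures the model to call the search tools only when its internal knowledge is insufficient.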