DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search
October 14, 2025
Authors: Kartik Narayan, Yang Xu, Tian Cao, Kavya Nerella, Vishal M. Patel, Navid Shiee, Peter Grasch, Chao Jia, Yinfei Yang, Zhe Gan
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) in real-world applications require access to external knowledge sources and must remain responsive to dynamic, ever-changing real-world information in order to address information-seeking and knowledge-intensive user queries. Existing approaches, such as retrieval-augmented generation (RAG) methods, search agents, and search-equipped MLLMs, often suffer from rigid pipelines, excessive search calls, and poorly constructed search queries, which result in inefficiencies and suboptimal outcomes. To address these limitations, we present DeepMMSearch-R1, the first multimodal LLM capable of performing on-demand, multi-turn web searches and dynamically crafting queries for both image and text search tools. Specifically, DeepMMSearch-R1 can initiate web searches based on relevant crops of the input image, making image search more effective, and can iteratively adapt text search queries based on retrieved information, thereby enabling self-reflection and self-correction. Our approach relies on a two-stage training pipeline: a cold-start supervised finetuning phase followed by online reinforcement learning optimization. For training, we introduce DeepMMSearchVQA, a novel multimodal VQA dataset created through an automated pipeline and intermixed with real-world information from web search tools. This dataset contains diverse, multi-hop queries that integrate textual and visual information, teaching the model when to search, what to search for, which search tool to use, and how to reason over the retrieved information. We conduct extensive experiments across a range of knowledge-intensive benchmarks to demonstrate the superiority of our approach. Finally, we analyze the results and provide insights that are valuable for advancing multimodal web search.
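
As a rough illustration of the on-demand, multi-turn search behavior the abstract describes, the following Python sketch shows how such a tool-calling loop might be structured. It is not the authors' implementation: all names (`Turn`, `generate_step`, `crop_relevant_region`, `image_search`, `text_search`) are hypothetical placeholders standing in for the finetuned MLLM and the search backends.

```python
# Minimal sketch (assumed structure, not the paper's code) of an on-demand,
# multi-turn multimodal search loop: per turn the model decides to answer,
# search with a relevant crop of the input image, or issue/refine a text query.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Turn:
    thought: str                 # model's reasoning for this turn
    tool: Optional[str] = None   # "image_search", "text_search", or None (final answer)
    query: Optional[str] = None  # self-crafted query when a tool is called
    result: Optional[str] = None # retrieved evidence fed back to the model

def generate_step(question: str, image: bytes, history: List[Turn]) -> Turn:
    """Stand-in for one MLLM decoding step that emits reasoning plus an
    optional tool call; a real system would run the finetuned model here."""
    if history:  # dummy behavior: answer after one retrieval
        return Turn(thought=f"Answer based on: {history[-1].result}")
    return Turn(thought="Need external evidence.", tool="text_search", query=question)

def crop_relevant_region(image: bytes, query: str) -> bytes:
    """Stand-in for cropping the query-relevant region so image search
    operates on the informative part of the picture, not the full frame."""
    return image

def image_search(image_crop: bytes) -> str:
    return "stub image-search result"              # placeholder retrieval backend

def text_search(query: str) -> str:
    return f"stub text-search result for: {query}" # placeholder retrieval backend

def answer(question: str, image: bytes, max_turns: int = 5) -> str:
    """On-demand, multi-turn loop: search only when the model asks for it,
    and let each retrieval inform (and possibly correct) the next query."""
    history: List[Turn] = []
    for _ in range(max_turns):
        turn = generate_step(question, image, history)
        if turn.tool is None:                      # model chose to answer directly
            return turn.thought
        if turn.tool == "image_search":
            turn.result = image_search(crop_relevant_region(image, turn.query or ""))
        else:                                      # "text_search"
            turn.result = text_search(turn.query or "")
        history.append(turn)                       # evidence is visible next turn
    return generate_step(question, image, history).thought

if __name__ == "__main__":
    print(answer("Which landmark is shown in this photo?", image=b""))
```

The loop keeps retrieved evidence in the turn history, which is how the abstract's iterative query adaptation and self-correction would plug in: each new `generate_step` call can revise its next text query or switch tools based on what earlier searches returned.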