Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models
January 29, 2026
Authors: Wenxuan Huang, Yu Zeng, Qiuchen Wang, Zhen Fang, Shaosheng Cao, Zheng Chu, Qingyu Yin, Shuang Chen, Zhenfei Yin, Lin Chen, Zehui Chen, Yao Hu, Philip Torr, Feng Zhao, Wanli Ouyang
cs.AI
Abstract
Multimodal large language models (MLLMs) have achieved remarkable success across a broad range of vision tasks. However, constrained by the capacity of their internal world knowledge, prior work has proposed augmenting MLLMs with a ``reasoning-then-tool-call'' paradigm over visual and textual search engines, obtaining substantial gains on tasks that require extensive factual information. Yet these approaches typically define multimodal search in a naive setting, assuming that a single full-image or entity-level image query and a few text queries suffice to retrieve the key evidence needed to answer the question, which is unrealistic in real-world scenarios with substantial visual noise. Moreover, they are often limited in reasoning depth and search breadth, making it difficult to solve complex questions that require aggregating evidence from diverse visual and textual sources. Building on this, we propose Vision-DeepResearch, a new multimodal deep-research paradigm that performs multi-turn, multi-entity, and multi-scale visual and textual search to robustly query real-world search engines under heavy noise. Vision-DeepResearch supports dozens of reasoning steps and hundreds of engine interactions, while internalizing deep-research capabilities into the MLLM via cold-start supervision and RL training, resulting in a strong end-to-end multimodal deep-research MLLM. It substantially outperforms existing multimodal deep-research MLLMs as well as workflows built on strong closed-source foundation models such as GPT-5, Gemini-2.5-pro, and Claude-4-Sonnet. The code will be released at https://github.com/Osilly/Vision-DeepResearch.
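To make the described ``reasoning-then-tool-call'' loop concrete, below is a minimal sketch of a multi-turn agent that alternates free-form reasoning with visual and textual search calls until it commits to an answer. The tool names (`text_search`, `image_search`, `crop`), the `<tool>...</tool>` tag format, and the `mllm.generate` interface are illustrative assumptions for this sketch, not the paper's actual tool schema or implementation.

```python
# Hypothetical sketch of a multi-turn "reasoning-then-tool-call" deep-research loop.
# Assumed interfaces: `mllm.generate(messages) -> str` and a dict of callable tools,
# e.g. {"text_search": ..., "image_search": ..., "crop": ...}.
import re

MAX_STEPS = 50  # budget for "dozens of reasoning steps"; each step may trigger engine calls
TOOL_TAG = re.compile(r"<tool>(\w+)\((.*?)\)</tool>", re.S)

def deep_research(mllm, tools, image, question):
    """Alternate MLLM reasoning with visual/textual search until a final answer.

    The model may issue full-image or entity-level (cropped) image queries as well
    as text queries across turns, aggregating evidence from multiple sources.
    """
    context = [{"role": "user", "content": [image, question]}]
    for _ in range(MAX_STEPS):
        reply = mllm.generate(context)                 # free-form reasoning text
        context.append({"role": "assistant", "content": reply})
        call = TOOL_TAG.search(reply)
        if call is None:                               # no tool call -> treat as final answer
            return reply
        tool_name, tool_arg = call.group(1), call.group(2)
        observation = tools[tool_name](tool_arg)       # e.g. search-engine results or a crop
        context.append({"role": "tool", "content": observation})
    return None  # interaction budget exhausted without a final answer
```

In a trained end-to-end model, the loop above would be internalized: cold-start supervision teaches the tag-and-call format, and RL training shapes when and how often the model issues search calls, rather than relying on a hand-written workflow.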