
Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models

January 29, 2026
Authors: Wenxuan Huang, Yu Zeng, Qiuchen Wang, Zhen Fang, Shaosheng Cao, Zheng Chu, Qingyu Yin, Shuang Chen, Zhenfei Yin, Lin Chen, Zehui Chen, Yao Hu, Philip Torr, Feng Zhao, Wanli Ouyang
cs.AI

Abstract

Multimodal large language models (MLLMs) have achieved remarkable success across a broad range of vision tasks. However, constrained by the capacity of their internal world knowledge, prior work has proposed augmenting MLLMs with a "reasoning-then-tool-call" scheme over visual and textual search engines, yielding substantial gains on tasks that require extensive factual information. Yet these approaches typically cast multimodal search in an idealized setting, assuming that a single full-image or entity-level image query plus a few text queries suffice to retrieve the key evidence needed to answer the question, which is unrealistic in real-world scenarios with substantial visual noise. Moreover, they are often limited in reasoning depth and search breadth, making it difficult to solve complex questions that require aggregating evidence from diverse visual and textual sources. Building on these observations, we propose Vision-DeepResearch, a new multimodal deep-research paradigm that performs multi-turn, multi-entity, and multi-scale visual and textual search to robustly query real-world search engines under heavy noise. Vision-DeepResearch supports dozens of reasoning steps and hundreds of engine interactions, and internalizes deep-research capability into the MLLM via cold-start supervision and RL training, resulting in a strong end-to-end multimodal deep-research MLLM. It substantially outperforms existing multimodal deep-research MLLMs, as well as workflows built on strong closed-source foundation models such as GPT-5, Gemini-2.5-pro, and Claude-4-Sonnet. The code will be released at https://github.com/Osilly/Vision-DeepResearch.
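
The abstract describes the paradigm only in prose; the following minimal Python sketch illustrates the kind of multi-turn reasoning-then-tool-call loop it implies. Every name here (the Action type, the call_mllm, image_search, and text_search stubs, and the step budget) is an assumption made for illustration, not the authors' implementation, which is slated for the GitHub URL above.

```python
# Hypothetical sketch of a multi-turn reasoning-then-tool-call loop.
# None of these names come from the paper; the official code is slated
# for https://github.com/Osilly/Vision-DeepResearch.
from dataclasses import dataclass


@dataclass
class Action:
    kind: str     # "image_search", "text_search", or "answer"
    payload: str  # search query (text or image-crop spec) or the final answer


def call_mllm(context: list[str]) -> Action:
    """Stub for the policy MLLM: reads the running context and decides the
    next action. A real system would decode a structured tool call here."""
    return Action(kind="answer", payload="(stub answer)")


def image_search(query: str) -> str:
    """Stub visual search engine (e.g., queried with a full image or an
    entity-level crop at some scale)."""
    return f"[image evidence for {query!r}]"


def text_search(query: str) -> str:
    """Stub textual search engine."""
    return f"[text evidence for {query!r}]"


def deep_research(question: str, image_views: list[str], max_steps: int = 50) -> str:
    """Multi-turn loop: at each step the model reasons over all evidence so
    far, then either issues a visual or textual query (possibly for different
    entities or crop scales) or commits to an answer. `max_steps` caps the
    dozens of reasoning steps the abstract mentions."""
    context = [question, *image_views]
    for _ in range(max_steps):
        action = call_mllm(context)
        if action.kind == "answer":
            return action.payload
        tool = image_search if action.kind == "image_search" else text_search
        context.append(tool(action.payload))  # aggregate multi-source evidence
    return "(no answer within the step budget)"


if __name__ == "__main__":
    print(deep_research("Who designed this building?", ["<full image>", "<entity crop>"]))
```

The structural point the sketch tries to capture is that evidence accumulates in the context across many engine interactions, rather than a single image query being assumed to retrieve all the evidence at once.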