DeepEyesV2: Toward Agentic Multimodal Model
November 7, 2025
Authors: Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, Xing Yu
cs.AI
Abstract
Agentic multimodal models should not only comprehend text and images, but
also actively invoke external tools, such as code execution environments and
web search, and integrate these operations into reasoning. In this work, we
introduce DeepEyesV2 and explore how to build an agentic multimodal model from
the perspectives of data construction, training methods, and model evaluation.
We observe that direct reinforcement learning alone fails to induce robust
tool-use behavior. This phenomenon motivates a two-stage training pipeline: a
cold-start stage to establish tool-use patterns, and a reinforcement learning
stage to further refine tool invocation. We curate a diverse, moderately
challenging training dataset, specifically including examples where tool use is
beneficial. We further introduce RealX-Bench, a comprehensive benchmark
designed to evaluate real-world multimodal reasoning, which inherently requires
the integration of multiple capabilities, including perception, search, and
reasoning. We evaluate DeepEyesV2 on RealX-Bench and other representative
benchmarks, demonstrating its effectiveness across real-world understanding,
mathematical reasoning, and search-intensive tasks. Moreover, DeepEyesV2
exhibits task-adaptive tool invocation, tending to use image operations for
perception tasks and numerical computations for reasoning tasks. Reinforcement
learning further enables complex tool combinations and allows the model to
selectively invoke tools based on context. We hope our study can provide
guidance for the community in developing agentic multimodal models.
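
To make the abstract's notion of "integrating tool calls into reasoning" concrete, here is a minimal, illustrative sketch of a generic agentic loop in which a multimodal model interleaves reasoning steps with calls to a code-execution environment and a web-search tool. This is an assumption-laden sketch, not the DeepEyesV2 implementation: the names `run_python`, `web_search`, `agentic_loop`, and the `model.generate` interface are hypothetical placeholders.

```python
# Illustrative sketch of an agentic tool-use loop (not the DeepEyesV2 API).
import io
import contextlib


def run_python(code: str) -> str:
    """Execute model-written code and return captured stdout (code-execution tool)."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})  # sandboxing omitted for brevity
        return buffer.getvalue()
    except Exception as exc:  # surface errors so the model can self-correct
        return f"Error: {exc}"


def web_search(query: str) -> str:
    """Placeholder web-search tool; a real system would call a search API."""
    return f"[search results for: {query}]"


TOOLS = {"python": run_python, "search": web_search}


def agentic_loop(model, image, question, max_turns: int = 8) -> str:
    """Interleave model reasoning with tool calls until a final answer is produced."""
    context = [{"role": "user", "image": image, "text": question}]
    for _ in range(max_turns):
        step = model.generate(context)       # hypothetical model interface
        if step.tool is None:                # no tool call -> final answer
            return step.text
        observation = TOOLS[step.tool](step.argument)
        context.append({"role": "assistant", "text": step.text})
        context.append({"role": "tool", "text": observation})
    return "No answer produced within the turn budget."
```

In the paper's terms, the cold-start stage would teach the model to emit such tool calls at all, while the reinforcement learning stage refines when and how they are invoked (e.g., image operations for perception tasks, numerical computation for reasoning tasks).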