

DeepEyesV2: Toward Agentic Multimodal Model

November 7, 2025
作者: Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, Xing Yu
cs.AI

Abstract

Agentic multimodal models should not only comprehend text and images, but also actively invoke external tools, such as code execution environments and web search, and integrate these operations into reasoning. In this work, we introduce DeepEyesV2 and explore how to build an agentic multimodal model from the perspectives of data construction, training methods, and model evaluation. We observe that direct reinforcement learning alone fails to induce robust tool-use behavior. This phenomenon motivates a two-stage training pipeline: a cold-start stage to establish tool-use patterns, and a reinforcement learning stage to further refine tool invocation. We curate a diverse, moderately challenging training dataset, specifically including examples where tool use is beneficial. We further introduce RealX-Bench, a comprehensive benchmark designed to evaluate real-world multimodal reasoning, which inherently requires the integration of multiple capabilities, including perception, search, and reasoning. We evaluate DeepEyesV2 on RealX-Bench and other representative benchmarks, demonstrating its effectiveness across real-world understanding, mathematical reasoning, and search-intensive tasks. Moreover, DeepEyesV2 exhibits task-adaptive tool invocation, tending to use image operations for perception tasks and numerical computations for reasoning tasks. Reinforcement learning further enables complex tool combinations and allows the model to selectively invoke tools based on context. We hope our study can provide guidance for the community in developing agentic multimodal models.
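To make the abstract's agentic loop concrete, the sketch below shows one common way a model's reasoning can be interleaved with tool calls (code execution and web search) until a final answer is produced. This is only an illustrative assumption: the tag format, the function names (`agentic_loop`, `generate_step`, `run_python`, `web_search`), and the dispatch logic are hypothetical and are not DeepEyesV2's actual interface.

```python
# Hypothetical sketch of an agentic tool-use loop; not the authors' implementation.
import re

def run_python(code: str) -> str:
    """Stub for a sandboxed code-execution tool (assumed, not the paper's API)."""
    return f"[stub execution of: {code!r}] -> 42"

def web_search(query: str) -> str:
    """Stub for a web-search tool (assumed, not the paper's API)."""
    return f"[stub search results for: {query}]"

TOOLS = {"python": run_python, "search": web_search}

def generate_step(context: str) -> str:
    """Stand-in for the multimodal model's next reasoning segment.
    A real model would decide between emitting a tool call and a final answer;
    this stub first requests code execution, then answers from the tool result."""
    if "<result>" not in context:
        return 'Let me compute this. <tool name="python">print(6 * 7)</tool>'
    return "<answer>42</answer>"

def agentic_loop(question: str, max_turns: int = 5) -> str:
    """Interleave model reasoning with tool calls until an answer tag appears."""
    context = question
    for _ in range(max_turns):
        step = generate_step(context)
        context += "\n" + step
        # Tool calls are assumed to be tagged, e.g. <tool name="python">...</tool>.
        call = re.search(r'<tool name="(\w+)">(.*?)</tool>', step, re.S)
        if call:
            name, payload = call.group(1), call.group(2)
            result = TOOLS[name](payload)              # execute the requested tool
            context += f"\n<result>{result}</result>"  # feed the output back into reasoning
            continue
        answer = re.search(r"<answer>(.*?)</answer>", step, re.S)
        if answer:
            return answer.group(1)
    return context  # fall back to the accumulated trace if no answer is emitted

if __name__ == "__main__":
    print(agentic_loop("What is 6 * 7?"))
```

Under this reading, the cold-start stage would teach the model to emit well-formed tool calls at all, while the reinforcement learning stage would refine when and which tools to invoke, consistent with the task-adaptive behavior described above.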