DeepEyesV2: Toward Agentic Multimodal Model
November 7, 2025
Authors: Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, Xing Yu
cs.AI
Abstract
Agentic multimodal models should not only comprehend text and images, but
also actively invoke external tools, such as code execution environments and
web search, and integrate these operations into reasoning. In this work, we
introduce DeepEyesV2 and explore how to build an agentic multimodal model from
the perspectives of data construction, training methods, and model evaluation.
We observe that direct reinforcement learning alone fails to induce robust
tool-use behavior. This phenomenon motivates a two-stage training pipeline: a
cold-start stage to establish tool-use patterns, and a reinforcement learning
stage to further refine tool invocation. We curate a diverse, moderately
challenging training dataset, specifically including examples where tool use is
beneficial. We further introduce RealX-Bench, a comprehensive benchmark
designed to evaluate real-world multimodal reasoning, which inherently requires
the integration of multiple capabilities, including perception, search, and
reasoning. We evaluate DeepEyesV2 on RealX-Bench and other representative
benchmarks, demonstrating its effectiveness across real-world understanding,
mathematical reasoning, and search-intensive tasks. Moreover, DeepEyesV2
exhibits task-adaptive tool invocation, tending to use image operations for
perception tasks and numerical computations for reasoning tasks. Reinforcement
learning further enables complex tool combinations and allows the model to
selectively invoke tools based on context. We hope our study can provide
guidance for the community in developing agentic multimodal models.
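To make the agentic behavior described above concrete, the sketch below shows one way a model could interleave reasoning with tool calls (code execution or web search), feeding tool outputs back into the context before the next reasoning step. This is a minimal illustration assuming a hypothetical tag format and a stand-in `generate` function; it is not the authors' implementation or training code.

```python
"""
Minimal sketch of an agentic reasoning loop with tool feedback.
The tag format, `generate`, `run_code`, and `run_search` are assumptions
made for illustration; they do not reflect DeepEyesV2 internals.
"""
import io
import re
import contextlib

# Hypothetical tag format for tool calls emitted by the model.
TOOL_PATTERN = re.compile(r'<tool name="(code|search)">(.*?)</tool>', re.S)


def run_code(snippet: str) -> str:
    """Execute a Python snippet and capture its stdout as the observation."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(snippet, {}, {})
    except Exception as exc:  # return the error so the model can recover
        return f"[error] {exc}"
    return buffer.getvalue().strip()


def run_search(query: str) -> str:
    """Placeholder for a web-search backend; a real system would call an API."""
    return f"[search results for: {query!r}]"


def generate(context: str) -> str:
    """Stand-in for the multimodal model: one scripted tool call, then an answer."""
    if "<observation>" not in context:
        return 'Let me compute this. <tool name="code">print(17 * 23)</tool>'
    return "The product is 391. <answer>391</answer>"


def agentic_loop(question: str, max_turns: int = 4) -> str:
    context = question
    for _ in range(max_turns):
        output = generate(context)
        match = TOOL_PATTERN.search(output)
        if match is None:  # no tool call -> treat the output as final
            return output
        tool, payload = match.groups()
        observation = run_code(payload) if tool == "code" else run_search(payload)
        # Feed the tool output back so the next reasoning step can use it.
        context += f"\n{output}\n<observation>{observation}</observation>"
    return context


if __name__ == "__main__":
    print(agentic_loop("What is 17 * 23?"))
```

In this sketch the loop terminates as soon as the model stops emitting tool tags, which mirrors the selective, context-dependent tool invocation the abstract attributes to the reinforcement-learning stage.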