

DeepEyesV2: Toward Agentic Multimodal Model

November 7, 2025
Authors: Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, Xing Yu
cs.AI

Abstract

Agentic multimodal models should not only comprehend text and images, but also actively invoke external tools, such as code execution environments and web search, and integrate these operations into reasoning. In this work, we introduce DeepEyesV2 and explore how to build an agentic multimodal model from the perspectives of data construction, training methods, and model evaluation. We observe that direct reinforcement learning alone fails to induce robust tool-use behavior. This phenomenon motivates a two-stage training pipeline: a cold-start stage to establish tool-use patterns, and a reinforcement learning stage to further refine tool invocation. We curate a diverse, moderately challenging training dataset, specifically including examples where tool use is beneficial. We further introduce RealX-Bench, a comprehensive benchmark designed to evaluate real-world multimodal reasoning, which inherently requires the integration of multiple capabilities, including perception, search, and reasoning. We evaluate DeepEyesV2 on RealX-Bench and other representative benchmarks, demonstrating its effectiveness across real-world understanding, mathematical reasoning, and search-intensive tasks. Moreover, DeepEyesV2 exhibits task-adaptive tool invocation, tending to use image operations for perception tasks and numerical computations for reasoning tasks. Reinforcement learning further enables complex tool combinations and allows the model to selectively invoke tools based on context. We hope our study can provide guidance for the community in developing agentic multimodal models.
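
To make the described behavior concrete, below is a minimal sketch of the kind of agentic tool-use loop the abstract refers to: the model interleaves reasoning with code execution and web search, and each tool result is fed back into the context for the next reasoning step. All names and the tag-based tool-call format (`run_model`, `execute_code`, `web_search`, `<code>`, `<search>`) are illustrative assumptions for this sketch, not details taken from the paper.

```python
# Minimal sketch of an agentic tool-use loop (assumed interface, not the paper's code).
import re

def execute_code(snippet: str) -> str:
    """Hypothetical sandboxed code executor; returns captured output."""
    raise NotImplementedError

def web_search(query: str) -> str:
    """Hypothetical search API; returns a text summary of top results."""
    raise NotImplementedError

def run_model(messages: list[dict]) -> str:
    """Hypothetical call to the multimodal model; returns its next message."""
    raise NotImplementedError

def agentic_loop(question: str, image: bytes, max_turns: int = 6) -> str:
    """Interleave model reasoning with tool calls until a final answer appears."""
    messages = [{"role": "user", "content": question, "image": image}]
    for _ in range(max_turns):
        reply = run_model(messages)
        messages.append({"role": "assistant", "content": reply})

        # Assume the model emits tool requests inside simple tags.
        code = re.search(r"<code>(.*?)</code>", reply, re.S)
        query = re.search(r"<search>(.*?)</search>", reply, re.S)

        if code:
            result = execute_code(code.group(1))
        elif query:
            result = web_search(query.group(1))
        else:
            return reply  # no tool call: treat the reply as the final answer

        # Feed the tool output back so the next reasoning step can use it.
        messages.append({"role": "tool", "content": result})
    return messages[-1]["content"]
```

Under this reading, the cold-start stage would teach the model to emit well-formed tool calls at all, while the reinforcement learning stage refines when and which tools to invoke, including combinations of them.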