Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning
July 7, 2025
Authors: Yana Wei, Liang Zhao, Jianjian Sun, Kangheng Lin, Jisheng Yin, Jingcheng Hu, Yinmin Zhang, En Yu, Haoran Lv, Zejia Weng, Jia Wang, Chunrui Han, Yuang Peng, Qi Han, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Vishal M. Patel
cs.AI
Abstract
The remarkable reasoning capability of large language models (LLMs) stems
from cognitive behaviors that emerge through reinforcement with verifiable
rewards. This work investigates how to transfer this principle to Multimodal
LLMs (MLLMs) to unlock advanced visual reasoning. We introduce a two-stage
paradigm built on Qwen2.5-VL-7B: a massive linguistic cold-start fine-tuning
stage, followed by multimodal reinforcement learning (RL) spanning nearly
1,000 steps, surpassing all previous open-source efforts in scale. This work
reveals three fundamental insights: 1) Behavior transfer emerges surprisingly
early in cold start due to linguistic mental imagery. 2) Cold start broadly
memorizes visual behaviors, while RL critically discerns and scales up
effective patterns. 3) Transfer strategically favors high-utility behaviors
such as visual reflection. Our resulting model, Open-Vision-Reasoner (OVR),
achieves state-of-the-art performance on a suite of reasoning benchmarks,
including 95.3% on MATH500, 51.8% on MathVision, and 54.6% on MathVerse. We
release our model, data, and training dynamics to catalyze the development of
more capable, behavior-aligned multimodal reasoners.
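
This page includes no code, but to make the "verifiable rewards" idea concrete, below is a minimal sketch of a rule-based reward function of the kind typically used for math RL. It assumes the model emits its final answer in a \boxed{...} span and that grading is exact string match; both are illustrative assumptions, not details confirmed by this abstract, and the authors' actual grader may differ.

```python
import re


def verifiable_reward(response: str, ground_truth: str) -> float:
    """Rule-based verifiable reward: 1.0 if the model's final answer
    matches the reference exactly, else 0.0.

    A minimal sketch only; real graders often add answer normalization
    (e.g., stripping LaTeX formatting or comparing numeric values).
    """
    # Assumption: the final answer appears in a \boxed{...} span, a common
    # convention on math benchmarks such as MATH500.
    match = re.search(r"\\boxed\{([^{}]*)\}", response)
    if match is None:
        return 0.0  # No parseable final answer: no reward.
    answer = match.group(1).strip()
    return 1.0 if answer == ground_truth.strip() else 0.0


# Usage: a correct boxed answer earns the full reward.
print(verifiable_reward(r"... so the result is \boxed{42}", "42"))  # 1.0
print(verifiable_reward(r"... so the result is \boxed{41}", "42"))  # 0.0
```

Because the reward is computed by a deterministic rule rather than a learned judge, it cannot be gamed by stylistic tricks, which is what lets cognitive behaviors such as reflection emerge from the RL signal alone.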