Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning
July 7, 2025
Authors: Yana Wei, Liang Zhao, Jianjian Sun, Kangheng Lin, Jisheng Yin, Jingcheng Hu, Yinmin Zhang, En Yu, Haoran Lv, Zejia Weng, Jia Wang, Chunrui Han, Yuang Peng, Qi Han, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Vishal M. Patel
cs.AI
Abstract
The remarkable reasoning capability of large language models (LLMs) stems
from cognitive behaviors that emerge through reinforcement with verifiable
rewards. This work investigates how to transfer this principle to Multimodal
LLMs (MLLMs) to unlock advanced visual reasoning. We introduce a two-stage
paradigm built on Qwen2.5-VL-7B: a massive linguistic cold-start fine-tuning
stage, followed by multimodal reinforcement learning (RL) spanning nearly
1,000 steps, surpassing all previous open-source efforts in scale. This work
reveals three fundamental insights: 1) Behavior transfer emerges surprisingly
early in cold start due to linguistic mental imagery. 2) Cold start broadly
memorizes visual behaviors, while RL critically discerns and scales up
effective patterns. 3) Transfer strategically favors high-utility behaviors
such as visual reflection. Our resulting model, Open-Vision-Reasoner (OVR),
achieves state-of-the-art performance on a suite of reasoning benchmarks,
including 95.3% on MATH500, 51.8% on MathVision, and 54.6% on MathVerse. We
release our model, data, and training dynamics to catalyze the development of
more capable, behavior-aligned multimodal reasoners.
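
This page includes no code, but to make the "verifiable rewards" idea concrete, below is a minimal sketch of a rule-based reward function of the kind typically used for math RL. It assumes the model emits its final answer in a \boxed{...} span and that grading is exact string match; both are illustrative assumptions, not details confirmed by this abstract, and the authors' actual grader may differ.

```python
import re


def verifiable_reward(response: str, ground_truth: str) -> float:
    """Rule-based verifiable reward: 1.0 if the model's final answer
    matches the reference exactly, else 0.0.

    A minimal sketch only; real graders often add answer normalization
    (e.g., stripping LaTeX formatting or comparing numeric values).
    """
    # Assumption: the final answer appears in a \boxed{...} span, a common
    # convention on math benchmarks such as MATH500.
    match = re.search(r"\\boxed\{([^{}]*)\}", response)
    if match is None:
        return 0.0  # No parseable final answer: no reward.
    answer = match.group(1).strip()
    return 1.0 if answer == ground_truth.strip() else 0.0


# Usage: a correct boxed answer earns the full reward.
print(verifiable_reward(r"... so the result is \boxed{42}", "42"))  # 1.0
print(verifiable_reward(r"... so the result is \boxed{41}", "42"))  # 0.0
```

Because the reward is computed by a deterministic rule rather than a learned judge, it cannot be gamed by stylistic tricks, which is what lets cognitive behaviors such as reflection emerge from the RL signal alone.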