Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning
July 7, 2025
Authors: Yana Wei, Liang Zhao, Jianjian Sun, Kangheng Lin, Jisheng Yin, Jingcheng Hu, Yinmin Zhang, En Yu, Haoran Lv, Zejia Weng, Jia Wang, Chunrui Han, Yuang Peng, Qi Han, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Vishal M. Patel
cs.AI
Abstract
The remarkable reasoning capability of large language models (LLMs) stems
from cognitive behaviors that emerge through reinforcement with verifiable
rewards. This work investigates how to transfer this principle to Multimodal
LLMs (MLLMs) to unlock advanced visual reasoning. We introduce a two-stage
paradigm built on Qwen2.5-VL-7B: massive linguistic cold-start fine-tuning,
followed by multimodal reinforcement learning (RL) spanning nearly 1,000 steps,
a scale that surpasses all previous open-source efforts. This pioneering work
reveals three fundamental insights: 1) Behavior transfer emerges surprisingly
early in cold start due to linguistic mental imagery. 2) Cold start broadly
memorizes visual behaviors, while RL critically discerns and scales up
effective patterns. 3) Transfer strategically favors high-utility behaviors
such as visual reflection. Our resulting model, Open-Vision-Reasoner (OVR),
achieves state-of-the-art performance on a suite of reasoning benchmarks,
including 95.3% on MATH500, 51.8% on MathVision, and 54.6% on MathVerse. We
release our model, data, and training dynamics to catalyze the development of
more capable, behavior-aligned multimodal reasoners.
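
For context on the "reinforcement with verifiable rewards" mentioned above: in this recipe, the reward is a programmatic check of the model's final answer against a known reference, rather than a score from a learned reward model. Below is a minimal, hypothetical Python sketch of such a verifiable reward, assuming a \boxed{...} final-answer convention; it illustrates the general idea only, not the paper's released implementation.

import re

def extract_answer(response: str) -> str | None:
    # Pull the final answer from a \boxed{...} span, a common convention
    # in math-reasoning outputs (assumed here for illustration).
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    return match.group(1).strip() if match else None

def verifiable_reward(response: str, gold_answer: str) -> float:
    # Binary reward: 1.0 when the extracted answer exactly matches the
    # reference string, 0.0 otherwise (including when no answer is found).
    predicted = extract_answer(response)
    return 1.0 if predicted is not None and predicted == gold_answer.strip() else 0.0

# Usage: a rollout whose boxed answer matches the reference earns reward 1.0.
print(verifiable_reward(r"... so the result is \boxed{42}.", "42"))  # prints 1.0

Such exact-match rewards are cheap and unambiguous to compute, which is part of what makes RL runs at the scale described above (nearly 1,000 multimodal steps) practical.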