VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning
July 30, 2025
Authors: Ruifeng Yuan, Chenghao Xiao, Sicong Leng, Jianyu Wang, Long Li, Weiwen Xu, Hou Pong Chan, Deli Zhao, Tingyang Xu, Zhongyu Wei, Hao Zhang, Yu Rong
cs.AI
Abstract
Reinforcement learning has proven its effectiveness in enhancing the
reasoning capabilities of large language models. Recent research efforts have
progressively extended this paradigm to multimodal reasoning tasks. Due to the
inherent complexity and diversity of multimodal tasks, especially in semantic
content and problem formulations, existing models often exhibit unstable
performance across various domains and difficulty levels. To address these
limitations, we propose VL-Cogito, an advanced multimodal reasoning model
trained via a novel multi-stage Progressive Curriculum Reinforcement Learning
(PCuRL) framework. PCuRL systematically guides the model through tasks of
gradually increasing difficulty, substantially improving its reasoning
abilities across diverse multimodal contexts. The framework introduces two key
innovations: (1) an online difficulty soft weighting mechanism, which dynamically
adjusts training difficulty across successive RL training stages; and (2) a
dynamic length reward mechanism, which encourages the model to adaptively
regulate its reasoning path length according to task complexity, thus balancing
reasoning efficiency with correctness. Experimental evaluations demonstrate
that VL-Cogito consistently matches or surpasses existing reasoning-oriented
models across mainstream multimodal benchmarks spanning mathematics, science,
logic, and general understanding, validating the effectiveness of our approach.