

VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning

July 30, 2025
作者: Ruifeng Yuan, Chenghao Xiao, Sicong Leng, Jianyu Wang, Long Li, Weiwen Xu, Hou Pong Chan, Deli Zhao, Tingyang Xu, Zhongyu Wei, Hao Zhang, Yu Rong
cs.AI

Abstract

Reinforcement learning has proven its effectiveness in enhancing the reasoning capabilities of large language models. Recent research efforts have progressively extended this paradigm to multimodal reasoning tasks. Due to the inherent complexity and diversity of multimodal tasks, especially in semantic content and problem formulations, existing models often exhibit unstable performance across various domains and difficulty levels. To address these limitations, we propose VL-Cogito, an advanced multimodal reasoning model trained via a novel multi-stage Progressive Curriculum Reinforcement Learning (PCuRL) framework. PCuRL systematically guides the model through tasks of gradually increasing difficulty, substantially improving its reasoning abilities across diverse multimodal contexts. The framework introduces two key innovations: (1) an online difficulty soft weighting mechanism, dynamically adjusting training difficulty across successive RL training stages; and (2) a dynamic length reward mechanism, which encourages the model to adaptively regulate its reasoning path length according to task complexity, thus balancing reasoning efficiency with correctness. Experimental evaluations demonstrate that VL-Cogito consistently matches or surpasses existing reasoning-oriented models across mainstream multimodal benchmarks spanning mathematics, science, logic, and general understanding, validating the effectiveness of our approach.
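The abstract describes two reward-shaping ideas: a soft weight that focuses each curriculum stage on samples near its target difficulty, and a length reward that peaks when the reasoning length matches a task-dependent target. The paper does not give formulas here, so the following is only a minimal illustrative sketch of how such mechanisms could be shaped; the function names, the Gaussian forms, and the parameters (`sigma`, `tau`, `stage_center`, `target_len`) are all assumptions, not the authors' actual formulation.

```python
import math

def dynamic_length_reward(response_len: int, target_len: int,
                          is_correct: bool, sigma: float = 512.0) -> float:
    """Hypothetical dynamic length reward: a correct answer earns a bonus
    that peaks when the reasoning length is near a task-dependent target
    and decays with a Gaussian; incorrect answers earn no length bonus."""
    if not is_correct:
        return 0.0
    return math.exp(-((response_len - target_len) ** 2) / (2 * sigma ** 2))

def difficulty_soft_weight(rollout_accuracy: float, stage_center: float,
                           tau: float = 0.25) -> float:
    """Hypothetical online difficulty soft weight: samples whose observed
    rollout accuracy sits near the current stage's target difficulty get
    the largest weight, so each RL stage emphasizes its own difficulty band."""
    return math.exp(-((rollout_accuracy - stage_center) ** 2) / (2 * tau ** 2))
```

In a sketch like this, an easy curriculum stage would use a high `stage_center` (samples the model mostly solves), while later stages lower it, shifting training weight toward harder samples without a hard cutoff.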
PDF · July 31, 2025