VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning

Abstract

Reinforcement learning has proven effective at enhancing the reasoning capabilities of large language models, and recent work has progressively extended this paradigm to multimodal reasoning tasks. However, because multimodal tasks are inherently complex and diverse, especially in their semantic content and problem formulations, existing models often perform inconsistently across domains and difficulty levels. To address these limitations, we propose VL-Cogito, an advanced multimodal reasoning model trained via a novel multi-stage Progressive Curriculum Reinforcement Learning (PCuRL) framework. PCuRL systematically guides the model through tasks of gradually increasing difficulty, substantially improving its reasoning abilities across diverse multimodal contexts. The framework introduces two key innovations: (1) an online difficulty soft weighting mechanism, which dynamically adjusts training difficulty across successive RL training stages; and (2) a dynamic length reward mechanism, which encourages the model to adaptively regulate its reasoning path length according to task complexity, thus balancing reasoning efficiency with correctness. Experimental evaluations demonstrate that VL-Cogito consistently matches or surpasses existing reasoning-oriented models across mainstream multimodal benchmarks spanning mathematics, science, logic, and general understanding, validating the effectiveness of our approach.
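
To make the two mechanisms concrete, the following is a minimal illustrative sketch, not the authors' implementation. The abstract gives no formulas, so everything here is an assumption: the function names, the Gaussian-shaped difficulty weight centered on a per-stage target accuracy, the cosine-shaped length reward, the choice of the mean length of correct rollouts as the target length, and the `length_coef` mixing coefficient.

```python
import math


def difficulty_weight(accuracy: float, target: float, sharpness: float = 4.0) -> float:
    """Soft weight in (0, 1] that peaks when a prompt's rollout accuracy
    equals the target accuracy of the current curriculum stage.
    (Assumed Gaussian shape; the abstract does not specify one.)"""
    return math.exp(-sharpness * (accuracy - target) ** 2)


def length_reward(length: int, target_length: float) -> float:
    """Cosine-shaped reward in [0, 1]: rises as the response length approaches
    the target and saturates at 1.0 at or beyond it. (Assumed shape.)"""
    if target_length <= 0:
        return 0.0
    ratio = min(length / target_length, 1.0)
    return 0.5 * (1.0 - math.cos(math.pi * ratio))


def score_prompt(correct, lengths, stage_target, length_coef=0.2):
    """Return (prompt_weight, per_rollout_rewards) for one prompt.

    correct: list of bools, one per sampled rollout
    lengths: list of response lengths (e.g. token counts), same order
    stage_target: rollout accuracy the current stage emphasizes
                  (lower targets emphasize harder prompts)
    """
    accuracy = sum(correct) / len(correct)
    # Assumption: the target length adapts per prompt to the mean length
    # of the correct rollouts, falling back to the longest rollout.
    correct_lengths = [n for ok, n in zip(correct, lengths) if ok]
    if correct_lengths:
        target_length = sum(correct_lengths) / len(correct_lengths)
    else:
        target_length = max(lengths)
    rewards = [
        float(ok) + length_coef * length_reward(n, target_length)
        for ok, n in zip(correct, lengths)
    ]
    return difficulty_weight(accuracy, stage_target), rewards
```

Under these assumptions, a curriculum would lower `stage_target` from one RL stage to the next so that harder prompts (those with low rollout accuracy) receive progressively more weight, while the length term lets the preferred reasoning length vary from prompt to prompt.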