VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning

Abstract

Reinforcement learning has proven effective at enhancing the reasoning capabilities of large language models, and recent research has progressively extended this paradigm to multimodal reasoning tasks. However, because multimodal tasks are inherently complex and diverse, especially in semantic content and problem formulation, existing models often perform unstably across domains and difficulty levels. To address these limitations, we propose VL-Cogito, an advanced multimodal reasoning model trained with a novel multi-stage Progressive Curriculum Reinforcement Learning (PCuRL) framework. PCuRL systematically guides the model through tasks of gradually increasing difficulty, substantially improving its reasoning abilities across diverse multimodal contexts. The framework introduces two key innovations: (1) an online difficulty soft weighting mechanism, which dynamically adjusts training difficulty across successive RL training stages; and (2) a dynamic length reward mechanism, which encourages the model to adaptively regulate its reasoning path length according to task complexity, thus balancing reasoning efficiency with correctness. Experimental evaluations demonstrate that VL-Cogito consistently matches or surpasses existing reasoning-oriented models across mainstream multimodal benchmarks spanning mathematics, science, logic, and general understanding, validating the effectiveness of our approach.
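
The two mechanisms named in the abstract can be illustrated with a minimal sketch. The abstract gives no formulas, so everything below is an assumption made for illustration and not the authors' implementation: difficulty is approximated by per-prompt rollout accuracy, the soft weight is a Gaussian around a hypothetical per-stage target accuracy, and the length target is the mean length of the correct rollouts for the same prompt. The names `STAGE_TARGET_ACCURACY`, `soft_difficulty_weight`, and `dynamic_length_reward` are hypothetical.

```python
import math
from statistics import mean

# Hypothetical stage targets (assumption): the rollout accuracy each
# curriculum stage emphasizes, moving from easy prompts to hard ones.
STAGE_TARGET_ACCURACY = {"easy": 0.75, "medium": 0.5, "hard": 0.25}


def soft_difficulty_weight(rollout_accuracy, stage, sigma=0.2):
    """Weight a prompt by how close its rollout accuracy is to the stage target.

    Assumed Gaussian form: prompts near the target get weight close to 1,
    and other prompts are down-weighted smoothly instead of being filtered out.
    """
    target = STAGE_TARGET_ACCURACY[stage]
    return math.exp(-((rollout_accuracy - target) ** 2) / (2 * sigma ** 2))


def dynamic_length_reward(length, correct_lengths):
    """Reward lengths near the mean length of correct rollouts for one prompt.

    Assumed linear form, clipped to [0, 1]. Lengths are positive token counts.
    """
    if not correct_lengths:
        return 0.0  # no correct rollout, so there is no length target
    target = mean(correct_lengths)
    return max(0.0, 1.0 - abs(length - target) / target)


# One prompt with four sampled responses: (is_correct, token_length).
rollouts = [(True, 200), (True, 300), (False, 900), (False, 50)]
accuracy = mean(1.0 if ok else 0.0 for ok, _ in rollouts)
correct_lengths = [n for ok, n in rollouts if ok]

for stage in STAGE_TARGET_ACCURACY:
    print(stage, round(soft_difficulty_weight(accuracy, stage), 3))
for ok, n in rollouts:
    print(n, round(dynamic_length_reward(n, correct_lengths), 3))
```

In this sketch, a prompt that the model answers correctly about half the time receives the largest weight in the hypothetical "medium" stage, and a response far longer than the correct rollouts receives no length reward. How such terms would be combined with a correctness reward is left open, since the abstract does not specify it.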