VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning

Abstract

Reinforcement learning has proven effective at enhancing the reasoning capabilities of large language models, and recent work has progressively extended this paradigm to multimodal reasoning tasks. Because multimodal tasks are inherently complex and diverse, especially in semantic content and problem formulation, existing models often perform unstably across domains and difficulty levels. To address these limitations, we propose VL-Cogito, an advanced multimodal reasoning model trained via a novel multi-stage Progressive Curriculum Reinforcement Learning (PCuRL) framework. PCuRL systematically guides the model through tasks of gradually increasing difficulty, substantially improving its reasoning abilities across diverse multimodal contexts. The framework introduces two key innovations: (1) an online difficulty soft weighting mechanism, which dynamically adjusts training difficulty across successive RL training stages; and (2) a dynamic length reward mechanism, which encourages the model to adaptively regulate the length of its reasoning path according to task complexity, thereby balancing reasoning efficiency with correctness. Experimental evaluations show that VL-Cogito consistently matches or surpasses existing reasoning-oriented models on mainstream multimodal benchmarks spanning mathematics, science, logic, and general understanding, validating the effectiveness of our approach.
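The abstract names the two mechanisms without giving their formulas, so the following is only a minimal illustrative sketch of how such mechanisms could look, not the authors' formulation. Every function name, the Gaussian-shaped weight, the linear length penalty, the per-stage target accuracies, and the choice of the mean length of correct rollouts as the length target are assumptions introduced here for illustration.

```python
import math
from typing import Sequence

# Hypothetical per-stage target accuracies: later stages emphasize prompts
# that the current policy solves less often (i.e. harder prompts).
STAGE_TARGET_ACCURACY = {"easy": 0.75, "medium": 0.5, "hard": 0.25}


def difficulty_soft_weight(accuracy: float, target: float, sharpness: float = 4.0) -> float:
    """Assumed soft weight in (0, 1] for one prompt.

    `accuracy` is the fraction of correct rollouts sampled online for the
    prompt. The weight is 1.0 when accuracy equals the stage target and decays
    smoothly (Gaussian shape, an assumption) as the prompt is easier or harder.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    return math.exp(-sharpness * (accuracy - target) ** 2)


def target_length_from_rollouts(
    lengths: Sequence[int], correct: Sequence[bool], fallback: float
) -> float:
    """Assumed per-prompt length target: mean length of the correct rollouts.

    Falls back to `fallback` when no rollout is correct.
    """
    correct_lengths = [n for n, ok in zip(lengths, correct) if ok]
    if not correct_lengths:
        return fallback
    return sum(correct_lengths) / len(correct_lengths)


def dynamic_length_reward(length: int, target_length: float) -> float:
    """Assumed length reward in [0, 1].

    Equals 1.0 at the target length and decreases linearly with the relative
    deviation from it, clipped at 0.0.
    """
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    deviation = abs(length - target_length) / target_length
    return max(0.0, 1.0 - deviation)
```

Under these assumptions, a training loop would multiply each prompt's advantage (or loss term) by `difficulty_soft_weight`, moving the target accuracy from the "easy" to the "hard" setting across stages, and would add `dynamic_length_reward` to the correctness reward so that the preferred reasoning length tracks what each prompt appears to require.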