Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start

Abstract

Recent advances in large language models (LLMs) have demonstrated impressive chain-of-thought reasoning capabilities, with reinforcement learning (RL) playing a crucial role in this progress. "Aha moment" patterns, in which models exhibit self-correction through reflection, are often attributed to emergent properties of RL. We first demonstrate that these patterns exist in multimodal LLMs (MLLMs) prior to RL training but may not necessarily correlate with improved reasoning performance. Building on these insights, we present a comprehensive study on enhancing multimodal reasoning through a two-stage approach: (1) supervised fine-tuning (SFT) as a cold start with structured chain-of-thought reasoning patterns, followed by (2) reinforcement learning via GRPO to further refine these capabilities. Our extensive experiments show that this combined approach consistently outperforms both SFT-only and RL-only methods across challenging multimodal reasoning benchmarks. The resulting models achieve state-of-the-art performance among open-source MLLMs at both 3B and 7B scales. Our 7B model shows substantial improvements over base models (e.g., 66.3% → 73.4% on MathVista, 62.9% → 70.4% on We-Math), and our 3B model achieves performance competitive with several 7B models. Overall, this work provides practical guidance for building advanced multimodal reasoning models. Our code is available at https://github.com/waltonfuture/RL-with-Cold-Start.
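
For readers unfamiliar with GRPO, the second stage can be illustrated by its central step: rewards for several responses sampled from the same prompt are normalized against the group's own mean and standard deviation, so no separate value model is needed. The snippet below is a minimal sketch of that normalization only, not the authors' implementation (see the repository linked above for that). The function name, the `eps` constant, the use of the population standard deviation, and the binary correctness rewards in the example are all assumptions made for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Hypothetical helper for illustration: each response's advantage is its
    reward minus the group mean, divided by the group standard deviation.
    `eps` (an assumed value) avoids division by zero when all rewards match.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]


# Assumed example: four sampled answers, reward 1.0 if correct, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Correct answers receive positive advantages, incorrect ones negative.
for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f} advantage={advantage:+.3f}")
```

When every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal. This is one plausible reason a cold start that raises the base rate of correct, well-structured answers could help the RL stage, though the abstract itself does not state this mechanism.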
