SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models

Abstract

This work revisits the dominant paradigm of supervised fine-tuning (SFT) followed by reinforcement learning (RL) for training Large Vision-Language Models (LVLMs), and reveals a key finding: SFT can significantly undermine subsequent RL by inducing "pseudo reasoning paths" imitated from expert models. While these paths may resemble the native reasoning paths of RL models, they often involve prolonged, hesitant, less informative steps and incorrect reasoning. To study this effect systematically, we introduce VLAA-Thinking, a new multimodal dataset designed to support reasoning in LVLMs. Constructed via a six-step pipeline involving captioning, reasoning distillation, answer rewriting, and verification, VLAA-Thinking comprises high-quality, step-by-step visual reasoning traces for SFT, along with a more challenging RL split from the same data source. Using this dataset, we conduct extensive experiments comparing SFT, RL, and their combinations. Results show that while SFT helps models learn reasoning formats, it often locks aligned models into imitative, rigid reasoning modes that impede further learning. In contrast, our RL approach, which builds on Group Relative Policy Optimization (GRPO) with a novel mixed reward module integrating both perception and cognition signals, fosters more genuine, adaptive reasoning behavior. Notably, our model VLAA-Thinker, based on Qwen2.5VL 3B, achieves top-1 performance on the Open LMM Reasoning Leaderboard (https://huggingface.co/spaces/opencompass/Open_LMM_Reasoning_Leaderboard) among 4B-scale LVLMs, surpassing the previous state of the art by 1.8%. We hope our findings provide valuable insights for developing reasoning-capable LVLMs and can inform future research in this area.
