Beyond SFT-to-RL: Pre-alignment via Black-Box On-Policy Distillation for Multimodal RL

Abstract

The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations, followed by reinforcement learning with verifiable rewards (RLVR). However, SFT introduces distributional drift that neither preserves the model's original capabilities nor faithfully matches the supervision distribution. This problem is amplified in multimodal reasoning, where perception errors and reasoning failures follow distinct drift patterns that compound during subsequent RL. We introduce PRISM, a three-stage pipeline that mitigates this drift by inserting an explicit distribution-alignment stage between SFT and RLVR. Building on the principle of on-policy distillation (OPD), PRISM casts alignment as a black-box, response-level adversarial game between the policy and a Mixture-of-Experts (MoE) discriminator with dedicated perception and reasoning experts. The discriminator provides disentangled corrective signals that steer the policy toward the supervision distribution without requiring access to teacher logits. While 1.26M public demonstrations suffice for broad SFT initialization, distribution alignment demands higher-fidelity supervision; we therefore curate 113K additional demonstrations from Gemini 3 Flash, featuring dense visual grounding and step-by-step reasoning on the hardest unsolved problems. Experiments on Qwen3-VL show that PRISM consistently improves downstream RLVR performance across multiple RL algorithms (GRPO, DAPO, GSPO) and diverse multimodal benchmarks, improving average accuracy over the SFT-to-RLVR baseline by +4.4 points on the 4B model and +6.0 points on the 8B model. Our code, data, and model checkpoints are publicly available at https://github.com/XIAO4579/PRISM.
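The response-level adversarial game described above can be illustrated with a small sketch. The abstract does not state the exact objective, so everything below is an assumption: a standard GAN-style binary cross-entropy loss for the discriminator, a fixed scalar gate that blends a perception-expert logit with a reasoning-expert logit, and a policy reward equal to the discriminator's probability that a response is teacher-like. The function names (`moe_logit`, `discriminator_loss`, `policy_reward`) are hypothetical and are not taken from the PRISM codebase. Note that only sampled responses are scored, so no teacher logits are needed.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def moe_logit(perception_logit: float, reasoning_logit: float, gate: float) -> float:
    """Blend two expert logits (assumed form).

    gate in [0, 1] is the weight given to the perception expert;
    the reasoning expert receives the remaining weight.
    """
    return gate * perception_logit + (1.0 - gate) * reasoning_logit


def discriminator_loss(teacher_logits, policy_logits) -> float:
    """Binary cross-entropy with teacher responses labeled 1 and policy responses 0."""
    teacher_term = -sum(log_sigmoid(z) for z in teacher_logits) / len(teacher_logits)
    # log(1 - sigmoid(z)) equals log(sigmoid(-z)), which is stable to compute.
    policy_term = -sum(log_sigmoid(-z) for z in policy_logits) / len(policy_logits)
    return teacher_term + policy_term


def policy_reward(logit: float) -> float:
    """Assumed reward: probability the discriminator assigns to 'teacher-like'."""
    return math.exp(log_sigmoid(logit))


# Toy logits standing in for the scores of one teacher and one policy response.
teacher_scores = [moe_logit(2.0, 1.0, gate=0.5)]
policy_scores = [moe_logit(-1.0, 0.5, gate=0.5)]
print("discriminator loss:", discriminator_loss(teacher_scores, policy_scores))
print("policy reward:", policy_reward(policy_scores[0]))
```

In this sketch the two experts contribute separate logits, so a low perception score and a low reasoning score can be inspected independently before blending, which is one plausible reading of the "disentangled corrective signals" in the abstract.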
