

Beyond SFT-to-RL: Pre-alignment via Black-Box On-Policy Distillation for Multimodal RL

May 1, 2026
Authors: Sudong Wang, Weiquan Huang, Xiaomin Yu, Zuhao Yang, Hehai Lin, Keming Wu, Chaojun Xiao, Chen Chen, Wenxuan Wang, Beier Zhu, Yunjian Zhang, Chengwei Qin
cs.AI

Abstract

The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations followed by reinforcement learning with verifiable rewards (RLVR). However, SFT introduces distributional drift that neither preserves the model's original capabilities nor faithfully matches the supervision distribution. This problem is further amplified in multimodal reasoning, where perception errors and reasoning failures follow distinct drift patterns that compound during subsequent RL. We introduce PRISM, a three-stage pipeline that mitigates this drift by inserting an explicit distribution-alignment stage between SFT and RLVR. Building on the principle of on-policy distillation (OPD), PRISM casts alignment as a black-box, response-level adversarial game between the policy and a Mixture-of-Experts (MoE) discriminator with dedicated perception and reasoning experts, providing disentangled corrective signals that steer the policy toward the supervision distribution without requiring access to teacher logits. While 1.26M public demonstrations suffice for broad SFT initialization, distribution alignment demands higher-fidelity supervision; we therefore curate 113K additional demonstrations from Gemini 3 Flash, featuring dense visual grounding and step-by-step reasoning on the hardest unsolved problems. Experiments on Qwen3-VL show that PRISM consistently improves downstream RLVR performance across multiple RL algorithms (GRPO, DAPO, GSPO) and diverse multimodal benchmarks, improving average accuracy by +4.4 and +6.0 points over the SFT-to-RLVR baseline on the 4B and 8B models, respectively. Our code, data, and model checkpoints are publicly available at https://github.com/XIAO4579/PRISM.
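To make the alignment stage concrete: the abstract describes a GAN-style, response-level game in which an MoE discriminator with perception and reasoning experts scores on-policy samples against teacher demonstrations, and the policy is updated from those scores alone (black-box, no teacher logits). The sketch below is a minimal PyTorch illustration of that idea under stated assumptions; every name (`MoEDiscriminator`, `Expert`, `alignment_step`) and the embedding-level reward formulation are inventions of this note, not the authors' released implementation, which operates on full multimodal responses.

```python
# Hedged sketch of black-box on-policy distillation as a response-level
# adversarial game, per the abstract. All names, shapes, and the
# embedding-level loss are illustrative assumptions, not PRISM's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 256  # toy response-embedding dimension


class Expert(nn.Module):
    """One discriminator expert: scores how teacher-like a response embedding is."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)  # (B,) logit: real (teacher) vs. fake (policy)


class MoEDiscriminator(nn.Module):
    """Dedicated perception and reasoning experts mixed by a router,
    so the corrective signal can be disentangled per response."""
    def __init__(self, dim: int):
        super().__init__()
        self.perception = Expert(dim)
        self.reasoning = Expert(dim)
        self.router = nn.Linear(dim, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = F.softmax(self.router(x), dim=-1)                          # (B, 2) weights
        logits = torch.stack([self.perception(x), self.reasoning(x)], dim=-1)
        return (w * logits).sum(-1)                                    # (B,) mixed logit


def alignment_step(policy_embed, teacher_embed, disc, d_opt, p_opt):
    """One adversarial update. `policy_embed` embeds on-policy samples from the
    current policy; `teacher_embed` embeds curated demonstrations. Black-box:
    only sampled responses are compared, never teacher logits."""
    # Discriminator: separate real (teacher) from fake (policy) responses.
    d_loss = (F.binary_cross_entropy_with_logits(
                  disc(teacher_embed), torch.ones(len(teacher_embed))) +
              F.binary_cross_entropy_with_logits(
                  disc(policy_embed.detach()), torch.zeros(len(policy_embed))))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Policy: treat the discriminator logit as a reward-style signal. In a real
    # pipeline this would drive a policy-gradient update over token sequences;
    # here we backprop through the embedding to keep the sketch runnable.
    # (Discriminator grads from this backward are discarded at the next zero_grad.)
    p_loss = F.binary_cross_entropy_with_logits(
        disc(policy_embed), torch.ones(len(policy_embed)))
    p_opt.zero_grad(); p_loss.backward(); p_opt.step()
    return d_loss.item(), p_loss.item()


if __name__ == "__main__":
    torch.manual_seed(0)
    disc = MoEDiscriminator(D)
    policy_head = nn.Linear(D, D)  # stand-in for the policy model's response encoder
    d_opt = torch.optim.Adam(disc.parameters(), lr=1e-4)
    p_opt = torch.optim.Adam(policy_head.parameters(), lr=1e-4)
    for step in range(3):
        teacher = torch.randn(8, D)                   # embeddings of teacher demos
        on_policy = policy_head(torch.randn(8, D))    # embeddings of on-policy samples
        print(alignment_step(on_policy, teacher, disc, d_opt, p_opt))
```

The design point the sketch tries to capture is that, because the game is black-box, the only teacher signal available is the demonstrations themselves: the discriminator's mixed expert logit stands in for the role that teacher logits would play in white-box distillation.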