Beyond SFT-to-RL: Pre-alignment via Black-Box On-Policy Distillation for Multimodal RL

May 1, 2026
作者: Sudong Wang, Weiquan Huang, Xiaomin Yu, Zuhao Yang, Hehai Lin, Keming Wu, Chaojun Xiao, Chen Chen, Wenxuan Wang, Beier Zhu, Yunjian Zhang, Chengwei Qin
cs.AI

Abstract

The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations, followed by reinforcement learning with verifiable rewards (RLVR). However, SFT introduces distributional drift that neither preserves the model's original capabilities nor faithfully matches the supervision distribution. The problem is further amplified in multimodal reasoning, where perception errors and reasoning failures follow distinct drift patterns that compound during subsequent RL. We introduce PRISM, a three-stage pipeline that mitigates this drift by inserting an explicit distribution-alignment stage between SFT and RLVR. Building on the principle of on-policy distillation (OPD), PRISM casts alignment as a black-box, response-level adversarial game between the policy and a Mixture-of-Experts (MoE) discriminator with dedicated perception and reasoning experts, which provides disentangled corrective signals that steer the policy toward the supervision distribution without requiring access to teacher logits. While 1.26M public demonstrations suffice for broad SFT initialization, distribution alignment demands higher-fidelity supervision; we therefore curate 113K additional demonstrations from Gemini 3 Flash, featuring dense visual grounding and step-by-step reasoning on the hardest unsolved problems. Experiments on Qwen3-VL show that PRISM consistently improves downstream RLVR performance across multiple RL algorithms (GRPO, DAPO, GSPO) and diverse multimodal benchmarks, raising average accuracy by +4.4 and +6.0 points over the SFT-to-RLVR baseline at the 4B and 8B scales, respectively. Our code, data, and model checkpoints are publicly available at https://github.com/XIAO4579/PRISM.
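
To make the alignment stage concrete, here is a minimal, self-contained sketch of the adversarial game the abstract describes: a policy that samples on-policy responses, and a two-expert MoE discriminator (perception and reasoning heads mixed by a learned gate) that scores whole responses against teacher demonstrations, with the discriminator's score fed back to the policy as a REINFORCE-style reward. All module sizes, names (TinyPolicy, MoEDiscriminator), and the specific update rule are illustrative assumptions, not the paper's implementation; the sketch preserves only the two constraints stated in the abstract: corrective signals are response-level, and only sampled token ids are consumed, never teacher logits.

```python
# Toy sketch of PRISM's black-box on-policy alignment stage (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HID, MAX_LEN = 100, 64, 16

class TinyPolicy(nn.Module):
    """Stand-in for the LMM policy: autoregressively samples responses."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HID)
        self.rnn = nn.GRU(HID, HID, batch_first=True)
        self.head = nn.Linear(HID, VOCAB)

    def sample(self, batch):
        tok = torch.zeros(batch, 1, dtype=torch.long)  # BOS token = 0
        h, logps, toks = None, [], []
        for _ in range(MAX_LEN):
            out, h = self.rnn(self.embed(tok), h)
            dist = torch.distributions.Categorical(logits=self.head(out[:, -1]))
            tok = dist.sample().unsqueeze(1)
            logps.append(dist.log_prob(tok.squeeze(1)))
            toks.append(tok)
        # Return sampled responses and their total log-probability.
        return torch.cat(toks, 1), torch.stack(logps, 1).sum(1)

class MoEDiscriminator(nn.Module):
    """Response-level discriminator: two dedicated experts (perception,
    reasoning) mixed by a learned gate into one 'teacher-like' logit."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HID)
        self.perception = nn.Linear(HID, 1)  # expert 1 (illustrative)
        self.reasoning = nn.Linear(HID, 1)   # expert 2 (illustrative)
        self.gate = nn.Linear(HID, 2)

    def forward(self, resp):
        feat = self.embed(resp).mean(1)  # pooled whole-response feature
        experts = torch.cat([self.perception(feat), self.reasoning(feat)], 1)
        w = F.softmax(self.gate(feat), 1)
        return (w * experts).sum(1)      # gated mixture of expert scores

policy, disc = TinyPolicy(), MoEDiscriminator()
opt_p = torch.optim.Adam(policy.parameters(), 1e-3)
opt_d = torch.optim.Adam(disc.parameters(), 1e-3)

# Black-box demonstrations: token ids only, no teacher logits available.
teacher = torch.randint(1, VOCAB, (8, MAX_LEN))

for step in range(3):
    # 1) Discriminator update: teacher responses = real, on-policy = fake.
    with torch.no_grad():
        fake, _ = policy.sample(8)
    d_loss = (F.binary_cross_entropy_with_logits(disc(teacher), torch.ones(8)) +
              F.binary_cross_entropy_with_logits(disc(fake), torch.zeros(8)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Policy update: REINFORCE with the discriminator's score as a
    #    corrective, response-level reward on the policy's own samples.
    resp, logp = policy.sample(8)
    reward = torch.sigmoid(disc(resp)).detach()
    p_loss = -((reward - reward.mean()) * logp).mean()
    opt_p.zero_grad(); p_loss.backward(); opt_p.step()
```

Note how the black-box property shows up: the teacher enters only as a tensor of sampled token ids, so the same loop applies when demonstrations come from a proprietary model such as Gemini 3 Flash.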