Consolidating Reinforcement Learning for Multimodal Discrete Diffusion Models

October 3, 2025
Authors: Tianren Ma, Mu Zhang, Yibing Wang, Qixiang Ye
cs.AI

Abstract

Optimizing discrete diffusion models (DDMs) with rewards remains a challenge: the non-autoregressive paradigm makes importance sampling intractable and rollouts complex, confounding reinforcement learning methods such as Group Relative Policy Optimization (GRPO). In this study, we introduce MaskGRPO, the first viable approach to scalable multimodal reinforcement learning in discrete diffusion, with effective importance sampling and modality-specific adaptations. To this end, we first clarify the theoretical foundation of DDMs, which facilitates building an importance estimator that captures valuable token fluctuation for gradient updates. We then carefully tailor the rollout method for visual sequences, which yields diverse completions and reliable optimization gradients. On math reasoning, coding, and visual generation benchmarks, MaskGRPO brings more stable and efficient updates, leading to stronger reasoning performance and better generation quality. This study establishes MaskGRPO as a systematic policy optimization approach and the first practical route for discretized visual diffusion.
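
For reference, the abstract frames MaskGRPO as a GRPO-style update adapted to discrete diffusion. The sketch below shows the generic GRPO recipe such a method builds on: group-relative advantages combined with clipped token-level importance weighting. It is a minimal illustrative sketch, not the paper's MaskGRPO estimator; the function name `grpo_loss` and the assumption that per-token log-probabilities can be read off at each token's unmasking step are hypothetical simplifications.

```python
# Minimal sketch of a GRPO-style update with token-level importance ratios.
# Illustrates the generic recipe (group-relative advantages + clipped
# importance weighting), NOT the MaskGRPO importance estimator; logp_new and
# logp_old are assumed to be per-token log-probs taken from the discrete
# diffusion model at the step where each token was unmasked.
import torch

def grpo_loss(logp_new, logp_old, rewards, mask, clip_eps=0.2):
    """
    logp_new: (G, L) token log-probs under the current policy
    logp_old: (G, L) token log-probs under the rollout (old) policy
    rewards:  (G,)   scalar reward per completion in the group
    mask:     (G, L) 1 for generated tokens, 0 for prompt/padding
    """
    # Group-relative advantages: normalize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # (G,)
    adv = adv.unsqueeze(1)                                       # (G, 1)

    # Token-level importance ratios between current and rollout policy.
    ratio = torch.exp(logp_new - logp_old)                       # (G, L)

    # PPO-style clipped surrogate objective, averaged over valid tokens.
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    per_token = torch.minimum(unclipped, clipped) * mask
    return -(per_token.sum() / mask.sum())
```

Per the abstract, MaskGRPO's contribution lies in replacing the naive per-token ratio above with an importance estimator that captures token fluctuation across unmasking steps, together with a rollout scheme suited to visual sequences.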