Consolidating Reinforcement Learning for Multimodal Discrete Diffusion Models
October 3, 2025
Authors: Tianren Ma, Mu Zhang, Yibing Wang, Qixiang Ye
cs.AI
Abstract
Optimizing discrete diffusion models (DDMs) with rewards remains a challenge:
the non-autoregressive paradigm makes importance sampling intractable and
rollouts complex, confounding reinforcement learning methods such as Group
Relative Policy Optimization (GRPO). In this study, we introduce MaskGRPO, the
first viable approach to scalable multimodal reinforcement learning in discrete
diffusion, equipped with effective importance sampling and modality-specific
adaptations. To this end, we first clarify the theoretical foundation of DDMs,
which facilitates building an importance estimator that captures valuable token
fluctuations for gradient updates. We then tailor the rollout method for visual
sequences, which yields diverse completions and reliable optimization
gradients. On math reasoning, coding, and visual generation benchmarks,
MaskGRPO delivers more stable and efficient updates, leading to stronger
reasoning performance and better generation quality. This study establishes
MaskGRPO as a systematic policy optimization approach and the first practical
route for discretized visual diffusion.
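For context, a minimal sketch (not from the paper) of the standard GRPO surrogate that MaskGRPO builds on: rewards within a group of rollouts are normalized into a group-relative advantage, and a clipped per-token importance ratio weights the policy-gradient update. The ratio below assumes an autoregressive factorization over o_{i,<t}, which is exactly what a non-autoregressive DDM lacks; that gap is what the importance estimator described in the abstract addresses. Symbols G, R_i, and epsilon follow common GRPO notation and are not defined in this abstract.

```latex
% Group-relative advantage from the outcome rewards R_1..R_G of one prompt's G rollouts
\hat{A}_i = \frac{R_i - \operatorname{mean}(\{R_j\}_{j=1}^{G})}{\operatorname{std}(\{R_j\}_{j=1}^{G})}

% Clipped surrogate with a per-token importance ratio; the autoregressive
% conditioning on o_{i,<t} is what becomes intractable for DDMs
\mathcal{J}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}
  \min\!\big(\rho_{i,t}\,\hat{A}_i,\;
  \operatorname{clip}(\rho_{i,t},\,1-\epsilon,\,1+\epsilon)\,\hat{A}_i\big)\right],
\qquad
\rho_{i,t} = \frac{\pi_\theta(o_{i,t}\mid q,\,o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q,\,o_{i,<t})}
```

How MaskGRPO replaces the per-token ratio \rho_{i,t} with an estimator suited to masked-diffusion decoding is the paper's contribution and is not reproduced here.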