Consolidating Reinforcement Learning for Multimodal Discrete Diffusion Models

Abstract

Optimizing discrete diffusion models (DDMs) with rewards remains a challenge: the non-autoregressive paradigm makes importance sampling intractable and rollout complex, which confounds reinforcement learning methods such as Group Relative Policy Optimization (GRPO). In this study, we introduce MaskGRPO, the first viable approach to enable scalable multimodal reinforcement learning in discrete diffusion, with effective importance sampling and modality-specific adaptations. To this end, we first clarify the theoretical foundation of DDMs, which facilitates building an importance estimator that captures the token fluctuations that are valuable for gradient updates. We then carefully tailor the rollout method for visual sequences, which yields diverse completions and reliable optimization gradients. On math reasoning, coding, and visual generation benchmarks, MaskGRPO brings more stable and efficient updates, leading to stronger reasoning performance and better generation quality. This study establishes MaskGRPO as a systematic policy optimization approach and as the first practical method for discretized visual diffusion.

