Inpainting-Guided Policy Optimization for Diffusion Large Language Models

Abstract

Masked diffusion large language models (dLLMs) are emerging as promising alternatives to autoregressive LLMs, offering competitive performance while supporting unique generation capabilities such as inpainting. We explore how inpainting can inform reinforcement learning (RL) algorithm design for dLLMs. Aligning LLMs with RL faces an exploration challenge: reward signals are sparse, and samples are wasted whenever the model fails to discover a correct solution. While this inefficiency affects LLMs broadly, dLLMs offer a distinctive opportunity: their inpainting ability can guide exploration. We introduce IGPO (Inpainting Guided Policy Optimization), an RL framework that strategically inserts partial ground-truth reasoning traces during online sampling. Unlike providing full solutions, inpainting steers exploration toward promising trajectory spaces while preserving self-generated reasoning, bridging supervised fine-tuning and reinforcement learning. We apply IGPO to group-based optimization methods such as GRPO, where exploration failures lead to zero advantages and therefore zero gradients. IGPO restores meaningful gradients while improving sample efficiency. We also propose supervised fine-tuning on synthetically rewritten, concise traces that better align with dLLM generation patterns. Combined with additional techniques, including entropy-based filtering, our training recipe yields substantial gains on three mathematical benchmarks (GSM8K, Math500, and AMC), achieving new state-of-the-art results for full-attention masked dLLMs.
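The zero-advantage failure mode mentioned for group-based methods can be illustrated with a small sketch. It assumes the commonly described GRPO-style normalization, in which each reward is centered on the group mean and divided by the group standard deviation; the exact formulation used in the paper may differ. The function name `group_advantages`, the `eps` constant, and the reward values are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: (reward - group mean) / (group std + eps).

    Illustrative assumption of a GRPO-style normalization, not the
    paper's exact definition.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every sampled completion fails: rewards are identical, so each
# advantage is zero and the group contributes no learning signal.
all_wrong = [0.0, 0.0, 0.0, 0.0]
adv_failed = group_advantages(all_wrong)

# Hypothetical case: one completion, guided by an inpainted partial
# trace, reaches the correct answer. Rewards now vary within the
# group, so the advantages are no longer all zero.
with_inpainted = [0.0, 0.0, 0.0, 1.0]
adv_guided = group_advantages(with_inpainted)

print(adv_failed)
print(adv_guided)
```

In the first group the policy gradient vanishes because every advantage is zero; in the second, the successful completion receives a positive advantage and the failed ones receive negative advantages, which is the kind of signal the abstract describes IGPO as restoring.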
