
Inpainting-Guided Policy Optimization for Diffusion Large Language Models

September 12, 2025
Authors: Siyan Zhao, Mengchen Liu, Jing Huang, Miao Liu, Chenyu Wang, Bo Liu, Yuandong Tian, Guan Pang, Sean Bell, Aditya Grover, Feiyu Chen
cs.AI

Abstract

Masked diffusion large language models (dLLMs) are emerging as promising alternatives to autoregressive LLMs, offering competitive performance while supporting unique generation capabilities such as inpainting. We explore how inpainting can inform RL algorithm design for dLLMs. Aligning LLMs with reinforcement learning faces an exploration challenge: sparse reward signals and sample waste when models fail to discover correct solutions. While this inefficiency affects LLMs broadly, dLLMs offer a distinctive opportunity--their inpainting ability can guide exploration. We introduce IGPO (Inpainting Guided Policy Optimization), an RL framework that strategically inserts partial ground-truth reasoning traces during online sampling. Unlike providing full solutions, inpainting steers exploration toward promising trajectory spaces while preserving self-generated reasoning, bridging supervised fine-tuning and reinforcement learning. We apply IGPO to group-based optimization methods such as GRPO, where exploration failures cause zero advantages and gradients. IGPO restores meaningful gradients while improving sample efficiency. We also propose supervised fine-tuning on synthetically rewritten concise traces that better align with dLLM generation patterns. With additional techniques including entropy-based filtering, our training recipe yields substantial gains across three mathematical benchmarks--GSM8K, Math500, and AMC--achieving new state-of-the-art results for full-attention masked dLLMs.
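To make the exploration argument concrete, below is a minimal Python sketch of a GRPO-style group-relative advantage computation and an IGPO-style sampling loop that "inpaints" a partial ground-truth reasoning trace into a few group members. The `sampler`, `reward_fn`, `n_hinted`, and `hint_fraction` names are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """GRPO-style group-relative advantages: reward minus group mean,
    normalized by group std. If every sample in a group fails (all rewards
    equal), the advantages -- and hence the policy gradients -- are zero."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def sample_group_with_inpainting(prompt, ground_truth_trace, sampler, reward_fn,
                                 group_size=8, n_hinted=2, hint_fraction=0.5):
    """Illustrative IGPO-style sampling loop (hypothetical interfaces).

    A few group members receive a partial ground-truth reasoning trace as an
    inpainting hint, so the dLLM decodes the remaining masked tokens around
    it; the other members are sampled normally. `sampler(prompt, hint=...)`
    and `reward_fn(prompt, completion)` are assumed callables, not the
    paper's code."""
    hint = ground_truth_trace[: int(len(ground_truth_trace) * hint_fraction)]
    completions, rewards = [], []
    for i in range(group_size):
        use_hint = i < n_hinted
        completion = sampler(prompt, hint=hint if use_hint else None)
        completions.append(completion)
        rewards.append(reward_fn(prompt, completion))
    return completions, rewards, group_advantages(rewards)

# If all unguided samples fail (reward 0 everywhere), advantages collapse to
# zero and the update carries no learning signal; inpainted samples that
# succeed restore a non-trivial advantage spread within the group.
```

The sketch only illustrates the mechanism described in the abstract: partial ground-truth inpainting steers some rollouts into promising trajectory space, which recovers meaningful group-relative gradients without handing the model full solutions.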