
Inpainting-Guided Policy Optimization for Diffusion Large Language Models

September 12, 2025
Authors: Siyan Zhao, Mengchen Liu, Jing Huang, Miao Liu, Chenyu Wang, Bo Liu, Yuandong Tian, Guan Pang, Sean Bell, Aditya Grover, Feiyu Chen
cs.AI

Abstract

Masked diffusion large language models (dLLMs) are emerging as promising alternatives to autoregressive LLMs, offering competitive performance while supporting unique generation capabilities such as inpainting. We explore how inpainting can inform RL algorithm design for dLLMs. Aligning LLMs with reinforcement learning faces an exploration challenge: sparse reward signals and sample waste when models fail to discover correct solutions. While this inefficiency affects LLMs broadly, dLLMs offer a distinctive opportunity--their inpainting ability can guide exploration. We introduce IGPO (Inpainting Guided Policy Optimization), an RL framework that strategically inserts partial ground-truth reasoning traces during online sampling. Unlike providing full solutions, inpainting steers exploration toward promising trajectory spaces while preserving self-generated reasoning, bridging supervised fine-tuning and reinforcement learning. We apply IGPO to group-based optimization methods such as GRPO, where exploration failures cause zero advantages and gradients. IGPO restores meaningful gradients while improving sample efficiency. We also propose supervised fine-tuning on synthetically rewritten concise traces that better align with dLLM generation patterns. With additional techniques including entropy-based filtering, our training recipe yields substantial gains across three mathematical benchmarks--GSM8K, Math500, and AMC--achieving new state-of-the-art results for full-attention masked dLLMs.
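To make the zero-advantage failure mode concrete, the following is a minimal numerical sketch of GRPO-style group-relative advantages. It is illustrative only, not the paper's implementation: the helper name group_advantages, the binary correctness rewards, and the way an inpainted rollout earns reward are all assumptions introduced here for exposition.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """GRPO-style group-relative advantage: each reward minus the group mean,
    normalized by the group standard deviation (eps avoids division by zero).
    If every rollout in the group fails (identical rewards), all advantages --
    and hence the policy gradients -- are zero."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# All rollouts miss the correct answer: no learning signal from this group.
print(group_advantages([0.0, 0.0, 0.0, 0.0]))  # [0. 0. 0. 0.]

# Hypothetical IGPO-style group: some rollouts are sampled with a partial
# ground-truth reasoning segment inpainted into the masked sequence, so a
# few of them reach the correct answer and earn reward 1.
print(group_advantages([0.0, 1.0, 0.0, 1.0]))  # [-1.  1. -1.  1.]
```

In the all-failure group the rewards are identical, so every advantage collapses to zero; once inpainting guides even a subset of rollouts to a correct answer, the group rewards differ and nonzero advantages (and gradients) are restored.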