
Learning Unmasking Policies for Diffusion Language Models

December 9, 2025
Authors: Metod Jazbec, Theo X. Olausson, Louis Béthune, Pierre Ablin, Michael Kirchhof, Joao Monteiro, Victor Turrisi, Jason Ramapuram, Marco Cuturi
cs.AI

Abstract

Diffusion (Large) Language Models (dLLMs) now match the downstream performance of their autoregressive counterparts on many tasks, while holding the promise of being more efficient during inference. One particularly successful variant is masked discrete diffusion, in which a buffer filled with special mask tokens is progressively replaced with tokens sampled from the model's vocabulary. Efficiency can be gained by unmasking several tokens in parallel, but doing too many at once risks degrading the generation quality. Thus, one critical design aspect of dLLMs is the sampling procedure that selects, at each step of the diffusion process, which tokens to replace. Indeed, recent work has found that heuristic strategies such as confidence thresholding lead to both higher quality and token throughput compared to random unmasking. However, such heuristics have downsides: they require manual tuning, and we observe that their performance degrades with larger buffer sizes. In this work, we instead propose to train sampling procedures using reinforcement learning. Specifically, we formalize masked diffusion sampling as a Markov decision process in which the dLLM serves as the environment, and propose a lightweight policy architecture based on a single-layer transformer that maps dLLM token confidences to unmasking decisions. Our experiments show that these trained policies match the performance of state-of-the-art heuristics when combined with semi-autoregressive generation, while outperforming them in the full diffusion setting. We also examine the transferability of these policies, finding that they can generalize to new underlying dLLMs and longer sequence lengths. However, we also observe that their performance degrades when applied to out-of-domain data, and that fine-grained tuning of the accuracy-efficiency trade-off can be challenging with our approach.