Taming Masked Diffusion Language Models via Consistency Trajectory Reinforcement Learning with Fewer Decoding Steps
September 28, 2025
Authors: Jingyi Yang, Guanxu Chen, Xuhao Hu, Jing Shao
cs.AI
Abstract
Masked diffusion language models (MDLMs) have recently emerged as a promising
alternative to autoregressive (AR) language models, offering properties such as
parallel decoding, flexible generation orders, and the potential for fewer
inference steps. Despite these advantages, decoding strategies and
reinforcement learning (RL) algorithms tailored for MDLMs remain underexplored.
A naive approach is to directly transfer techniques well-established for AR
models to MDLMs. However, this raises an immediate question: Is such a naive
transfer truly optimal? For example, 1) Block-wise and semi-AR decoding
strategies are not employed during the training of MDLMs, so why do they
outperform full diffusion-style decoding during inference? 2) Applying RL
algorithms designed for AR models directly to MDLMs introduces a
training-inference inconsistency, since MDLM decoding is non-causal
(parallel); the rollout trajectory and the optimization trajectory therefore
diverge. To address these challenges, we propose EOS Early Rejection (EOSER)
and an Ascending Step-Size (ASS) decoding scheduler, which
unlock the potential of MDLMs to perform full diffusion-style decoding,
achieving competitive performance with fewer decoding steps. Additionally, we
introduce Consistency Trajectory Group Relative Policy Optimization (CJ-GRPO)
for taming MDLMs, which emphasizes the consistency between the rollout
trajectory and the optimization trajectory, and reduces the optimization
errors caused by
skip-step optimization. We conduct extensive experiments on reasoning tasks,
such as mathematical and planning benchmarks, using LLaDA-8B-Instruct. The
results demonstrate that the proposed EOSER and ASS mechanisms, together with
CJ-GRPO, hold significant promise for effectively and efficiently taming MDLMs.
Code: https://github.com/yjyddq/EOSER-ASS-RL.
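
Since the abstract only names the decoding-side mechanisms, the following is a minimal, hypothetical sketch of what a full diffusion-style decoding loop with an ascending step-size schedule and an EOS early-rejection penalty might look like. The helper names (unmask_schedule, decode_full_diffusion, DummyMDLM), the linear schedule, and the logit-penalty form are assumptions made for illustration, not the paper's released implementation.

```python
import torch


def unmask_schedule(seq_len: int, num_steps: int) -> list[int]:
    """Ascending step-size schedule: how many tokens to unmask at each step.

    Early steps commit only a few tokens (when predictions are least reliable);
    later steps commit progressively more, so the total number of decoding
    steps can be far smaller than seq_len.
    """
    weights = torch.arange(1, num_steps + 1, dtype=torch.float)   # 1, 2, ..., T
    sizes = torch.round(weights / weights.sum() * seq_len).long()
    sizes[-1] += seq_len - sizes.sum()        # absorb rounding error in last step
    return sizes.tolist()


def decode_full_diffusion(model, seq_len=64, num_steps=8,
                          mask_id=0, eos_id=2, eos_penalty=5.0):
    """Greedy full diffusion-style decoding with the ascending schedule.

    The EOS logit is penalized more strongly in earlier steps (an
    "EOS early rejection"-style heuristic), so the response is not terminated
    while most positions are still masked. mask_id / eos_id are placeholder ids.
    """
    x = torch.full((1, seq_len), mask_id)
    for step, k in enumerate(unmask_schedule(seq_len, num_steps)):
        logits = model(x)                                  # (1, seq_len, vocab)
        logits[..., mask_id] = -1e9                        # never emit the mask token
        logits[..., eos_id] -= eos_penalty * (1 - step / max(num_steps - 1, 1))
        conf, pred = logits.softmax(-1).max(-1)            # per-position confidence
        conf = conf.masked_fill(x != mask_id, -1.0)        # only fill masked slots
        top = conf.topk(k, dim=-1).indices                 # k most confident positions
        x.scatter_(1, top, pred.gather(1, top))
    return x


class DummyMDLM(torch.nn.Module):
    """Stand-in policy so the sketch runs end to end."""
    def __init__(self, vocab=100):
        super().__init__()
        self.vocab = vocab

    def forward(self, x):
        return torch.randn(x.shape[0], x.shape[1], self.vocab)


if __name__ == "__main__":
    print(decode_full_diffusion(DummyMDLM()).shape)        # torch.Size([1, 64])
```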
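
The abstract describes CJ-GRPO only at a high level, so the sketch below illustrates just one plausible reading of the rollout/optimization consistency it mentions: a standard group-relative (GRPO-style) advantage combined with per-step log-probabilities evaluated on the same intermediate masked states recorded during rollout, rather than on a re-sampled or skip-step trajectory. The function names and trajectory format are hypothetical; this is not the paper's exact objective.

```python
import torch


def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO-style advantage: normalize each rollout's reward within its group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)


def consistency_trajectory_logprob(model, trajectory):
    """Log-probability of the tokens committed at each recorded rollout step.

    `trajectory` is a list of (state, positions, tokens) tuples saved during
    rollout, so optimization replays exactly the intermediate masked states
    and transitions the policy produced, instead of skipping steps.
    """
    total = torch.zeros(())
    for state, positions, tokens in trajectory:
        logp = torch.log_softmax(model(state), dim=-1)          # (1, L, V)
        vocab = logp.size(-1)
        # log-probs of the rows at the unmasked positions: (1, k, V)
        rows = logp.gather(1, positions.unsqueeze(-1).expand(-1, -1, vocab))
        # log-prob of the specific token committed at each position: (1, k)
        tok_logp = rows.gather(2, tokens.unsqueeze(-1)).squeeze(-1)
        total = total + tok_logp.sum()
    return total


if __name__ == "__main__":
    # Toy example: a group of 4 rollouts, one of them with a recorded trajectory.
    rewards = torch.tensor([1.0, 0.0, 0.5, 0.0])
    adv = group_relative_advantages(rewards)[0]

    model = lambda x: torch.randn(1, x.shape[1], 100)           # stub policy
    trajectory = [
        (torch.zeros(1, 8, dtype=torch.long),                   # fully masked state
         torch.tensor([[1, 4, 6]]), torch.tensor([[7, 3, 9]])), # positions, tokens
    ]
    loss = -(adv * consistency_trajectory_logprob(model, trajectory))
    print(loss.item())
```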