Taming Masked Diffusion Language Models via Consistency Trajectory Reinforcement Learning with Fewer Decoding Steps
September 28, 2025
Authors: Jingyi Yang, Guanxu Chen, Xuhao Hu, Jing Shao
cs.AI
Abstract
Masked diffusion language models (MDLMs) have recently emerged as a promising
alternative to autoregressive (AR) language models, offering properties such as
parallel decoding, flexible generation orders, and the potential for fewer
inference steps. Despite these advantages, decoding strategies and
reinforcement learning (RL) algorithms tailored for MDLMs remain underexplored.
A naive approach is to directly transfer techniques well-established for AR
models to MDLMs. However, this raises an immediate question: Is such a naive
transfer truly optimal? For example, 1) Block-wise and semi-AR decoding
strategies are not employed during the training of MDLMs, so why do they
outperform full diffusion-style decoding during inference? 2) Applying RL
algorithms designed for AR models directly to MDLMs exhibits a
training-inference inconsistency, since MDLM decoding is non-causal
(parallel). This results in inconsistencies between the rollout trajectory and
the optimization trajectory. To address these challenges, we propose EOS Early
Rejection (EOSER) and an Ascending Step-Size (ASS) decoding scheduler, which
unlock the potential of MDLMs to perform full diffusion-style decoding,
achieving competitive performance with fewer decoding steps. Additionally, we
introduce Consistency Trajectory Group Relative Policy Optimization (CJ-GRPO)
for taming MDLMs, which emphasizes the consistency between the rollout and
optimization trajectories and reduces the optimization errors caused by
skip-step optimization. We conduct extensive experiments on reasoning tasks,
such as mathematical and planning benchmarks, using LLaDA-8B-Instruct. The
results demonstrate that the proposed EOSER and ASS mechanisms, together with
CJ-GRPO, hold significant promise for effectively and efficiently taming MDLMs.
Code: https://github.com/yjyddq/EOSER-ASS-RL.
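
To make the two decoding ideas named in the abstract concrete, below is a minimal sketch of full diffusion-style decoding that combines an ascending step-size schedule (reveal few tokens in early steps, more in later steps) with EOS early rejection (temporarily down-weighting the EOS logit so the sequence is not terminated prematurely). The helper names, the linear schedule, the decaying penalty form, and the HuggingFace-style `model(x).logits` interface are illustrative assumptions for this sketch, not the paper's exact formulation; consult the linked repository for the authors' implementation.

```python
import torch


def ascending_step_sizes(seq_len: int, num_steps: int) -> list[int]:
    """Return how many masked tokens to reveal at each step, smallest first (assumed linear ramp)."""
    weights = torch.arange(1, num_steps + 1, dtype=torch.float)   # 1, 2, ..., T
    sizes = torch.round(weights / weights.sum() * seq_len).long()
    sizes[-1] += seq_len - sizes.sum()                            # make the schedule cover every position
    return sizes.tolist()


@torch.no_grad()
def decode(model, prompt_ids, mask_id, eos_id,
           seq_len=128, num_steps=8, eoser_penalty=5.0):
    """Full diffusion-style decoding with an ascending step-size schedule and EOS early rejection.

    mask_id / eos_id are model-specific token ids (check the tokenizer of the MDLM you use).
    """
    device = prompt_ids.device
    x = torch.full((1, prompt_ids.size(1) + seq_len), mask_id, device=device)
    x[:, :prompt_ids.size(1)] = prompt_ids
    gen = slice(prompt_ids.size(1), x.size(1))                    # generated region

    for step, k in enumerate(ascending_step_sizes(seq_len, num_steps)):
        logits = model(x).logits[:, gen]                          # (1, seq_len, vocab)
        # EOSER (assumed form): subtract a penalty from the EOS logit that fades over steps.
        logits[..., eos_id] -= eoser_penalty * (1.0 - step / num_steps)
        conf, pred = logits.softmax(-1).max(-1)                   # per-position confidence and prediction
        still_masked = x[:, gen] == mask_id
        conf = conf.masked_fill(~still_masked, -1.0)              # only fill still-masked positions
        k = min(k, int(still_masked.sum()))
        if k > 0:
            topk = conf.topk(k, dim=-1).indices
            x[:, gen].scatter_(1, topk, pred.gather(1, topk))     # commit the k most confident tokens
        if not (x[:, gen] == mask_id).any():
            break
    return x
```

Under this sketch, early steps commit only a handful of high-confidence tokens while the EOS penalty is strong, and later steps fill in many positions in parallel once the context has stabilized, which is one plausible way to realize "competitive performance with fewer decoding steps."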