

Taming Masked Diffusion Language Models via Consistency Trajectory Reinforcement Learning with Fewer Decoding Steps

September 28, 2025
Authors: Jingyi Yang, Guanxu Chen, Xuhao Hu, Jing Shao
cs.AI

Abstract

Masked diffusion language models (MDLMs) have recently emerged as a promising alternative to autoregressive (AR) language models, offering properties such as parallel decoding, flexible generation orders, and the potential for fewer inference steps. Despite these advantages, decoding strategies and reinforcement learning (RL) algorithms tailored to MDLMs remain underexplored. A naive approach is to directly transfer techniques well established for AR models to MDLMs. However, this raises an immediate question: Is such a naive transfer truly optimal? For example, 1) block-wise and semi-AR decoding strategies are not employed during the training of MDLMs, so why do they outperform full diffusion-style decoding during inference? 2) Applying RL algorithms designed for AR models directly to MDLMs exhibits a training-inference inconsistency, since MDLM decoding is non-causal (parallel). This results in inconsistencies between the rollout trajectory and the optimization trajectory. To address these challenges, we propose EOS Early Rejection (EOSER) and an Ascending Step-Size (ASS) decoding scheduler, which unlock the potential of MDLMs to perform full diffusion-style decoding, achieving competitive performance with fewer decoding steps. Additionally, we introduce Consistency Trajectory Group Relative Policy Optimization (CJ-GRPO) for taming MDLMs, which emphasizes the consistency between the rollout trajectory and the optimization trajectory and reduces the optimization errors caused by skip-step optimization. We conduct extensive experiments on reasoning tasks, such as mathematical and planning benchmarks, using LLaDA-8B-Instruct. The results demonstrate that the proposed EOSER and ASS mechanisms, together with CJ-GRPO, hold significant promise for effectively and efficiently taming MDLMs. Code: https://github.com/yjyddq/EOSER-ASS-RL.
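Below is a minimal, hypothetical sketch of how an ascending step-size schedule and EOS early rejection could fit together in masked diffusion decoding. The abstract does not give the exact ascent rule or penalty form, so the linear ramp, the decaying EOS penalty, and the function names (`ascending_step_schedule`, `reject_early_eos`) are illustrative assumptions, not the authors' implementation.

```python
"""Illustrative sketch only: the ascent rule, penalty shape, and names below
are assumptions made for exposition, not the paper's exact EOSER/ASS method."""

import numpy as np


def ascending_step_schedule(num_masked: int, num_steps: int) -> list[int]:
    """Split `num_masked` tokens across `num_steps` decoding steps so that
    later steps unmask more tokens than earlier ones (ascending step size).
    A simple linear ramp is used here; the paper's ascent rule may differ."""
    weights = np.arange(1, num_steps + 1, dtype=float)
    raw = weights / weights.sum() * num_masked
    counts = np.floor(raw).astype(int)
    counts[-1] += num_masked - counts.sum()  # absorb rounding error in the last step
    return counts.tolist()


def reject_early_eos(logits: np.ndarray, eos_id: int, step: int, num_steps: int,
                     penalty: float = 10.0) -> np.ndarray:
    """Down-weight the EOS logit at still-masked positions early in decoding,
    so the sequence cannot terminate prematurely (EOS early rejection).
    The penalty decays linearly to zero by the final step; the actual EOSER
    rule may use a different shape."""
    scale = 1.0 - step / max(num_steps - 1, 1)  # 1 at the first step, 0 at the last
    out = logits.copy()
    out[..., eos_id] -= penalty * scale
    return out


if __name__ == "__main__":
    # Example: 128 masked tokens decoded in 8 steps -> few tokens first, many later.
    schedule = ascending_step_schedule(num_masked=128, num_steps=8)
    print(schedule)       # e.g. [3, 7, 10, 14, 17, 21, 24, 32]
    print(sum(schedule))  # 128
```

The intuition captured by this sketch is that early, high-uncertainty steps commit to only a few tokens (with EOS suppressed), while later steps, conditioned on a mostly filled-in sequence, can safely unmask many tokens at once, which is how full diffusion-style decoding can reach competitive quality with fewer total steps.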