Taming Masked Diffusion Language Models via Consistency Trajectory Reinforcement Learning with Fewer Decoding Step

Abstract

Masked diffusion language models (MDLMs) have recently emerged as a promising alternative to autoregressive (AR) language models, offering parallel decoding, flexible generation orders, and the potential for fewer inference steps. Despite these advantages, decoding strategies and reinforcement learning (RL) algorithms tailored to MDLMs remain underexplored. A naive approach is to transfer techniques that are well established for AR models directly to MDLMs. This raises an immediate question: is such a naive transfer truly optimal? For example, 1) block-wise and semi-AR decoding strategies are not used during MDLM training, so why do they outperform full diffusion-style decoding at inference time? 2) Applying RL algorithms designed for AR models directly to MDLMs introduces a training-inference inconsistency, because MDLM decoding is non-causal (parallel). The result is a mismatch between the rollout trajectory and the optimization trajectory. To address these challenges, we propose the EOS Early Rejection (EOSER) mechanism and the Ascending Step-Size (ASS) decoding scheduler, which unlock the potential of MDLMs to perform full diffusion-style decoding and achieve competitive performance with fewer decoding steps. Additionally, we introduce Consistency Trajectory Group Relative Policy Optimization (CJ-GRPO) for taming MDLMs, which enforces consistency between the rollout trajectory and the optimization trajectory and reduces the optimization errors caused by skip-step optimization. We conduct extensive experiments on reasoning tasks, including mathematical and planning benchmarks, using LLaDA-8B-Instruct. The results demonstrate that the proposed EOSER and ASS mechanisms, together with CJ-GRPO, hold significant promise for taming MDLMs effectively and efficiently. Code: https://github.com/yjyddq/EOSER-ASS-RL.
