Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models

Abstract

We propose TraceRL, a trajectory-aware reinforcement learning framework for diffusion language models (DLMs) that incorporates the preferred inference trajectory into post-training and is applicable across different architectures. Equipped with a diffusion-based value model that enhances training stability, TraceRL improves reasoning performance on complex math and coding tasks. It can also be applied to adapt block-specific models to larger blocks, which improves sampling flexibility. Using TraceRL, we derive a series of state-of-the-art diffusion language models, TraDo. Although smaller than 7B-scale autoregressive (AR) models, TraDo-4B-Instruct consistently outperforms them across complex math reasoning tasks. TraDo-8B-Instruct achieves relative accuracy improvements of 6.1% over Qwen2.5-7B-Instruct and 51.3% over Llama3.1-8B-Instruct on mathematical reasoning benchmarks. Through curriculum learning, we also derive the first long chain-of-thought (long-CoT) DLM, which outperforms Qwen2.5-7B-Instruct on MATH500 with an 18.1% relative accuracy gain. To facilitate reproducible research and practical applications, we release a comprehensive open-source framework for building, training, and deploying diffusion LLMs across diverse architectures. The framework integrates accelerated KV-cache techniques and inference engines for both inference and reinforcement learning, and includes implementations of various supervised fine-tuning and RL methods for mathematics, coding, and general tasks. Code and Models: https://github.com/Gen-Verse/dLLM-RL
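
The gains quoted above are relative, not absolute, which is easy to misread. The abstract does not spell out the formula, so the sketch below assumes the usual definition: the accuracy difference divided by the baseline accuracy. The function name `relative_gain` and the input numbers are hypothetical, chosen only to illustrate the arithmetic; they are not results from the paper.

```python
def relative_gain(model_acc: float, baseline_acc: float) -> float:
    """Relative accuracy gain of a model over a baseline, in percent.

    Assumed definition (not stated in the abstract):
        (model_acc - baseline_acc) / baseline_acc * 100
    """
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (model_acc - baseline_acc) / baseline_acc * 100.0


# Hypothetical accuracies in percent, for illustration only.
baseline = 50.0
model = 55.0

absolute_gain = model - baseline            # difference in percentage points
relative = relative_gain(model, baseline)   # difference as a share of the baseline
print(f"absolute: {absolute_gain:.1f} points, relative: {relative:.1f}%")
```

Under this definition, the same relative gain corresponds to a smaller absolute gain when the baseline accuracy is lower, so a relative figure should be read alongside the baseline's score.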