LFPO: Likelihood-Free Policy Optimization for Masked Diffusion Models

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has achieved remarkable success in improving autoregressive models, especially in domains requiring correctness such as mathematical reasoning and code generation. However, directly applying such paradigms to Diffusion Large Language Models (dLLMs) is fundamentally hindered by the intractability of exact likelihood computation, which forces existing methods to rely on high-variance approximations. To bridge this gap, we propose Likelihood-Free Policy Optimization (LFPO), a native framework that maps the concept of vector field flow matching to the discrete token space. Specifically, LFPO formulates alignment as geometric velocity rectification, which directly optimizes denoising logits via contrastive updates. This design bypasses the errors inherent in likelihood approximation, yielding precise gradient estimates. Furthermore, LFPO enforces consistency by predicting final solutions from intermediate steps, straightening the probability flow to enable high-quality generation with significantly fewer iterations. Extensive experiments demonstrate that LFPO not only outperforms state-of-the-art baselines on code and reasoning benchmarks but also accelerates inference by approximately 20% through reduced diffusion steps.
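To make the idea of a contrastive update on denoising logits more concrete, the following is a toy sketch only. The abstract does not state LFPO's actual objective, so the function name `contrastive_denoising_loss`, its arguments, and the mean-centered reward weighting are all assumptions introduced here for illustration. The sketch scores each sample's final tokens under the logits predicted from an intermediate, partially masked state, then pushes those logits toward above-average-reward samples and away from below-average ones, without estimating a sequence-level likelihood.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def contrastive_denoising_loss(logits, final_tokens, rewards, masked):
    """Hypothetical contrastive loss on denoising logits (not the paper's objective).

    logits:       (batch, seq_len, vocab) logits predicted from an intermediate state
    final_tokens: (batch, seq_len) token ids of each sample's final sequence
    rewards:      (batch,) verifiable rewards, e.g. 1.0 if correct, else 0.0
    masked:       (batch, seq_len) bool, True where the intermediate state was masked
    """
    logp = log_softmax(logits)
    # Log-score assigned to each sample's own final token at every position.
    token_logp = np.take_along_axis(logp, final_tokens[..., None], axis=-1)[..., 0]
    # Average over positions that were still masked at the intermediate step.
    per_sample = (token_logp * masked).sum(axis=1) / np.maximum(masked.sum(axis=1), 1)
    # Centered rewards: above-average samples are pulled in, below-average pushed away.
    weights = rewards - rewards.mean()
    return -(weights * per_sample).mean()

# Usage with made-up shapes: 2 samples, 4 positions, vocabulary of 5 tokens.
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 4, 5))
final_tokens = np.array([[1, 2, 3, 4], [0, 1, 2, 3]])
rewards = np.array([1.0, 0.0])
masked = np.array([[True, True, False, False], [True, False, True, False]])
loss = contrastive_denoising_loss(logits, final_tokens, rewards, masked)
```

Centering the rewards means that a batch in which every sample earns the same reward contributes zero loss, which is one simple way to obtain a purely contrastive signal. Whether LFPO uses this weighting is not specified in the abstract.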
