LFPO: Likelihood-Free Policy Optimization for Masked Diffusion Models

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has achieved remarkable success in improving autoregressive models, especially in domains that require correctness, such as mathematical reasoning and code generation. However, directly applying such paradigms to Diffusion Large Language Models (dLLMs) is fundamentally hindered by the intractability of exact likelihood computation, which forces existing methods to rely on high-variance approximations. To bridge this gap, we propose Likelihood-Free Policy Optimization (LFPO), a native framework that maps the concept of vector field flow matching to the discrete token space. Specifically, LFPO formulates alignment as geometric velocity rectification, which directly optimizes denoising logits via contrastive updates. This design effectively bypasses the errors inherent in likelihood approximation, yielding precise gradient estimates. Furthermore, LFPO enforces consistency by predicting final solutions from intermediate steps, effectively straightening the probability flow to enable high-quality generation with significantly fewer iterations. Extensive experiments demonstrate that LFPO not only outperforms state-of-the-art baselines on code and reasoning benchmarks but also accelerates inference by approximately 20% through reduced diffusion steps.
