The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models

Abstract

Diffusion Large Language Models (dLLMs) break the rigid left-to-right constraint of traditional LLMs, enabling token generation in arbitrary orders. Intuitively, this flexibility implies a solution space that is a strict superset of the fixed autoregressive trajectory, which should in theory unlock superior reasoning potential for general tasks such as mathematics and coding. Consequently, numerous works have used reinforcement learning (RL) to elicit the reasoning capability of dLLMs. In this paper, we reveal a counter-intuitive reality: arbitrary-order generation, in its current form, narrows rather than expands the reasoning boundary of dLLMs. We find that dLLMs tend to exploit this order flexibility to bypass high-uncertainty tokens that are crucial for exploration, leading to a premature collapse of the solution space. This observation challenges the premise of existing RL approaches for dLLMs, which often devote considerable complexity, such as handling combinatorial trajectories and intractable likelihoods, to preserving this flexibility. We demonstrate that effective reasoning is better elicited by intentionally forgoing arbitrary order and applying standard Group Relative Policy Optimization (GRPO) instead. Our approach, JustGRPO, is minimalist yet surprisingly effective (e.g., 89.1% accuracy on GSM8K) while fully retaining the parallel decoding ability of dLLMs. Project page: https://nzl-thu.github.io/the-flexibility-trap
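For readers unfamiliar with GRPO, the group-relative advantage it is named after can be sketched in a few lines. This is an illustrative sketch only, not the JustGRPO implementation: the function name, the `eps` constant, the use of the population standard deviation, and the binary rewards are all assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the group of samples drawn for one prompt.

    Assumption: population standard deviation plus a small eps for stability;
    actual implementations may differ in these details.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Samples above the group mean get a positive advantage, below it a negative one.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four completions for one prompt: 1.0 = correct, 0.0 = wrong.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline is the group mean, no separate value network is needed; a group in which every completion receives the same reward yields zero advantages and therefore no learning signal.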

