ChatPaper.ai


The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models

January 21, 2026
Authors: Zanlin Ni, Shenzhi Wang, Yang Yue, Tianyu Yu, Weilin Zhao, Yeguo Hua, Tianyi Chen, Jun Song, Cheng Yu, Bo Zheng, Gao Huang
cs.AI

Abstract
Diffusion Large Language Models (dLLMs) break the rigid left-to-right constraint of traditional LLMs, enabling token generation in arbitrary orders. Intuitively, this flexibility implies a solution space that is a strict superset of the fixed autoregressive trajectory, theoretically unlocking superior reasoning potential for general tasks like mathematics and coding. Consequently, numerous works have leveraged reinforcement learning (RL) to elicit the reasoning capability of dLLMs. In this paper, we reveal a counter-intuitive reality: arbitrary-order generation, in its current form, narrows rather than expands the reasoning boundary of dLLMs. We find that dLLMs tend to exploit this order flexibility to bypass high-uncertainty tokens that are crucial for exploration, leading to a premature collapse of the solution space. This observation challenges the premise of existing RL approaches for dLLMs, where considerable complexity, such as handling combinatorial trajectories and intractable likelihoods, is often devoted to preserving this flexibility. We demonstrate that effective reasoning is better elicited by intentionally forgoing arbitrary order and applying standard Group Relative Policy Optimization (GRPO) instead. Our approach, JustGRPO, is minimalist yet surprisingly effective (e.g., 89.1% accuracy on GSM8K) while fully retaining the parallel decoding ability of dLLMs. Project page: https://nzl-thu.github.io/the-flexibility-trap
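The abstract's central move is to apply standard GRPO rather than a diffusion-specific RL objective. As a minimal sketch of the group-relative advantage at GRPO's core (the function name and ε are illustrative assumptions, not details from the paper): each prompt is sampled several times, and each completion's reward is normalized against its own group.

```python
# Minimal sketch of GRPO's group-relative advantage:
# A_i = (r_i - mean(group)) / (std(group) + eps).
# Hypothetical helper for illustration; eps is an assumed stabilizer.
from statistics import mean, stdev

def grpo_advantages(rewards, eps=1e-6):
    """Normalize each sampled completion's reward against its group."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: binary correctness rewards for 4 completions of one prompt.
# Correct answers get positive advantage, incorrect ones negative,
# and the advantages sum to (approximately) zero by construction.
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Because the baseline is the group mean rather than a learned value function, no critic network is needed, which is part of what makes the "JustGRPO" recipe minimalist.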