The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models
January 21, 2026
Authors: Zanlin Ni, Shenzhi Wang, Yang Yue, Tianyu Yu, Weilin Zhao, Yeguo Hua, Tianyi Chen, Jun Song, Cheng Yu, Bo Zheng, Gao Huang
cs.AI
Abstract
Diffusion Large Language Models (dLLMs) break the rigid left-to-right constraint of traditional LLMs, enabling token generation in arbitrary orders. Intuitively, this flexibility implies a solution space that is a strict superset of the fixed autoregressive trajectory, theoretically unlocking superior reasoning potential on general tasks such as mathematics and coding. Consequently, numerous works have leveraged reinforcement learning (RL) to elicit the reasoning capability of dLLMs. In this paper, we reveal a counter-intuitive reality: arbitrary-order generation, in its current form, narrows rather than expands the reasoning boundary of dLLMs. We find that dLLMs tend to exploit this order flexibility to bypass high-uncertainty tokens that are crucial for exploration, leading to a premature collapse of the solution space. This observation challenges the premise of existing RL approaches for dLLMs, which often devote considerable complexity, such as handling combinatorial trajectories and intractable likelihoods, to preserving this flexibility. We demonstrate that effective reasoning is better elicited by intentionally forgoing arbitrary order and applying standard Group Relative Policy Optimization (GRPO) instead. Our approach, JustGRPO, is minimalist yet surprisingly effective (e.g., 89.1% accuracy on GSM8K) while fully retaining the parallel decoding ability of dLLMs. Project page: https://nzl-thu.github.io/the-flexibility-trap
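The two mechanisms the abstract leans on can be made concrete with a minimal sketch: confidence-ordered parallel decoding, the common dLLM sampling heuristic that lets the model defer high-uncertainty tokens, and the group-relative advantage at the core of standard GRPO. The code below is an illustrative assumption, not the paper's implementation: `model` stands for any network mapping a batch of token ids to per-position logits, and `mask_id` marks ungenerated positions.

```python
# Minimal sketch, assuming `model` maps (1, L) token ids to (1, L, V) logits
# and `mask_id` marks ungenerated positions. Not the paper's code.
import torch


def confidence_ordered_decode(model, x, mask_id, steps=8):
    """Fill masked positions most-confident-first, as dLLM samplers commonly do.

    Because the k highest-confidence positions are unmasked at every step,
    high-uncertainty tokens are systematically deferred -- the order-flexibility
    exploit the paper identifies as collapsing exploration.
    """
    for step in range(steps):
        masked = (x == mask_id).nonzero(as_tuple=True)[0]
        if masked.numel() == 0:
            break
        logits = model(x.unsqueeze(0))[0]          # (L, V)
        conf, tokens = logits.softmax(-1).max(-1)  # per-position confidence
        remaining = steps - step
        k = -(-masked.numel() // remaining)        # ceil: finish within `steps`
        pick = masked[conf[masked].topk(k).indices]
        x[pick] = tokens[pick]                     # commit the easy tokens first
    return x


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages, the core of standard GRPO.

    Each completion in a sampled group is scored against the group's own
    mean and std, so no learned value function is needed.
    """
    r = torch.as_tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + eps)
```

The decoding loop makes the "flexibility trap" visible: exploration never happens at the positions where the model is unsure, because those are exactly the positions left for last. `grpo_advantages` shows why the paper's alternative is minimalist; once arbitrary order is forgone, the standard group-normalized reward suffices with no machinery for combinatorial trajectories or intractable likelihoods.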