

LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs

June 17, 2025
Authors: Xiaoran Liu, Zhigeng Liu, Zengfeng Huang, Qipeng Guo, Ziwei He, Xipeng Qiu
cs.AI

Abstract

Large Language Diffusion Models, or diffusion LLMs, have emerged as a significant focus in NLP research, with substantial effort directed toward understanding their scalability and downstream task performance. However, their long-context capabilities remain unexplored, lacking systematic analysis or methods for context extension. In this work, we present the first systematic investigation comparing the long-context performance of diffusion LLMs and traditional auto-regressive LLMs. We first identify a unique characteristic of diffusion LLMs: unlike auto-regressive LLMs, they maintain remarkably stable perplexity during direct context extrapolation. Furthermore, where auto-regressive models fail outright on the Needle-In-A-Haystack task when the context exceeds their pretrained length, we find that diffusion LLMs exhibit a distinct local perception phenomenon, enabling successful retrieval from recent context segments. We explain both phenomena through the lens of Rotary Position Embedding (RoPE) scaling theory. Building on these observations, we propose LongLLaDA, a training-free method that integrates LLaDA with NTK-based RoPE extrapolation. Our results validate that established extrapolation scaling laws remain effective for extending the context windows of diffusion LLMs. Furthermore, we identify long-context tasks where diffusion LLMs outperform auto-regressive LLMs and others where they fall short. Consequently, this study establishes the first context extrapolation method for diffusion LLMs while providing essential theoretical insights and empirical benchmarks critical for advancing future research on long-context diffusion LLMs.
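For context, NTK-based RoPE extrapolation of the kind the abstract refers to is usually realized by enlarging the rotary base before computing position frequencies, so that high-frequency dimensions are preserved while low-frequency ones are effectively interpolated. The sketch below is an illustration of that general technique only, not code from the paper; the head dimension, scale factor, and target sequence length are hypothetical values.

```python
import torch

def ntk_scaled_rope_frequencies(head_dim: int,
                                base: float = 10000.0,
                                scale: float = 4.0) -> torch.Tensor:
    """NTK-aware RoPE sketch: rescale the rotary base by the desired
    context-extension factor (all parameter values here are hypothetical)."""
    # Common NTK-aware adjustment: base' = base * scale^(d / (d - 2))
    ntk_base = base * scale ** (head_dim / (head_dim - 2))
    # Inverse frequencies for each pair of rotary dimensions.
    inv_freq = 1.0 / (ntk_base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    return inv_freq

def rope_angles(positions: torch.Tensor, inv_freq: torch.Tensor) -> torch.Tensor:
    """Outer product of positions and inverse frequencies gives rotation angles."""
    return torch.outer(positions.float(), inv_freq)

# Example: stretch a 4k-token pretrained window toward 16k (scale = 4).
inv_freq = ntk_scaled_rope_frequencies(head_dim=128, scale=4.0)
angles = rope_angles(torch.arange(16384), inv_freq)   # shape: (seq_len, head_dim // 2)
cos, sin = angles.cos(), angles.sin()                  # would be fed to attention layers
```

Because this rescaling changes only how positions are encoded, it can be applied to a frozen model at inference time, which is consistent with the training-free framing of LongLLaDA in the abstract.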