LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs
June 17, 2025
Authors: Xiaoran Liu, Zhigeng Liu, Zengfeng Huang, Qipeng Guo, Ziwei He, Xipeng Qiu
cs.AI
Abstract
Large Language Diffusion Models, or diffusion LLMs, have emerged as a
significant focus in NLP research, with substantial effort directed toward
understanding their scalability and downstream task performance. However, their
long-context capabilities remain unexplored, lacking systematic analysis or
methods for context extension. In this work, we present the first systematic
investigation comparing the long-context performance of diffusion LLMs and
traditional auto-regressive LLMs. We first identify a unique characteristic of
diffusion LLMs: unlike auto-regressive LLMs, they maintain remarkably
stable perplexity during direct context extrapolation.
Furthermore, where auto-regressive models fail outright during the
Needle-In-A-Haystack task with context exceeding their pretrained length, we
discover that diffusion LLMs exhibit a distinct local perception
phenomenon, enabling successful retrieval from recent context segments. We
explain both phenomena through the lens of Rotary Position Embedding (RoPE)
scaling theory. Building on these observations, we propose LongLLaDA, a
training-free method that integrates LLaDA with NTK-based RoPE
extrapolation. Our results validate that established extrapolation scaling laws
remain effective for extending the context windows of diffusion LLMs.
Furthermore, we identify long-context tasks where diffusion LLMs outperform
auto-regressive LLMs and others where they fall short. Consequently, this study
establishes the first context extrapolation method for diffusion LLMs while
providing essential theoretical insights and empirical benchmarks critical for
advancing future research on long-context diffusion LLMs.
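
The abstract does not spell out the extrapolation mechanism, but NTK-based RoPE
extrapolation is commonly implemented by enlarging the RoPE base frequency
according to the context-extension ratio. The sketch below illustrates that
idea; the function and parameter names (ntk_scaled_base, scaling_factor) are
illustrative assumptions, not code from the paper.

```python
import torch

def ntk_scaled_base(base: float, scaling_factor: float, head_dim: int) -> float:
    # NTK-aware scaling: enlarge the RoPE base so that, after extending the
    # context window by `scaling_factor`, the low-frequency components complete
    # roughly the same number of rotations over the longer window.
    return base * scaling_factor ** (head_dim / (head_dim - 2))

def rope_inverse_frequencies(head_dim: int, base: float = 10000.0,
                             scaling_factor: float = 1.0) -> torch.Tensor:
    # Per-dimension inverse frequencies used by rotary position embeddings.
    # A scaling_factor > 1 applies the NTK-based rescaling above.
    if scaling_factor > 1.0:
        base = ntk_scaled_base(base, scaling_factor, head_dim)
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

# Example (hypothetical settings): 128-dim heads, extending a 4k pretraining
# window to a 16k evaluation window (scaling factor 4).
inv_freq = rope_inverse_frequencies(head_dim=128, scaling_factor=16384 / 4096)
```

Because the adjustment only changes how position indices are mapped to rotation
frequencies, it can be applied at inference time without retraining, which is
consistent with the training-free setting described in the abstract.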