LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs

Abstract

Large language diffusion models (diffusion LLMs) have emerged as a significant focus of NLP research, with substantial effort directed toward understanding their scalability and downstream task performance. However, their long-context capabilities remain unexplored: there is neither a systematic analysis nor a method for context extension. In this work, we present the first systematic investigation comparing the long-context performance of diffusion LLMs and traditional auto-regressive LLMs. We first identify a unique characteristic of diffusion LLMs: unlike auto-regressive LLMs, they maintain remarkably \textit{stable perplexity} during direct context extrapolation. Furthermore, on the Needle-In-A-Haystack task with contexts exceeding the pretrained length, where auto-regressive models fail outright, we find that diffusion LLMs exhibit a distinct \textit{local perception} phenomenon that enables successful retrieval from recent context segments. We explain both phenomena through the lens of Rotary Position Embedding (RoPE) scaling theory. Building on these observations, we propose LongLLaDA, a training-free method that integrates LLaDA with NTK-based RoPE extrapolation. Our results validate that established extrapolation scaling laws remain effective for extending the context windows of diffusion LLMs. We also identify long-context tasks on which diffusion LLMs outperform auto-regressive LLMs and others on which they fall short. This study thus establishes the first context extrapolation method for diffusion LLMs and provides theoretical insights and empirical benchmarks for future research on long-context diffusion LLMs.
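As a rough illustration of the NTK-based RoPE extrapolation mentioned above, the sketch below rescales the RoPE base with the commonly used NTK-aware rule, base' = base * s^(d / (d - 2)). It is only a sketch under stated assumptions: the function names are hypothetical, and the head dimension, base, and scaling factor are illustrative values, not settings taken from the paper or from LLaDA.

```python
import math


def rope_inv_freq(head_dim, base=10000.0):
    """Standard RoPE inverse frequencies: base^(-2i/d) for i in [0, d/2)."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def ntk_scaled_base(base, scale, head_dim):
    """NTK-aware base rescaling: base * scale^(d / (d - 2))."""
    return base * scale ** (head_dim / (head_dim - 2))


# Illustrative values only (assumptions, not from the paper).
head_dim = 128
base = 10000.0
scale = 4.0  # hypothetical context-extension factor

orig = rope_inv_freq(head_dim, base)
scaled = rope_inv_freq(head_dim, ntk_scaled_base(base, scale, head_dim))

# The highest-frequency dimension is unchanged, while the lowest-frequency
# dimension is slowed down by exactly the scaling factor.
print(orig[0], scaled[0])
print(orig[-1] / scaled[-1])
```

Because only the base changes, this kind of rescaling can be applied at inference time without updating any weights, which is consistent with the training-free framing of the abstract.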
