LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs

Abstract

Large Language Diffusion Models, or diffusion LLMs, have emerged as a significant focus in NLP research, with substantial effort directed toward understanding their scalability and downstream task performance. However, their long-context capabilities remain unexplored: there is neither a systematic analysis nor a method for context extension. In this work, we present the first systematic investigation comparing the long-context performance of diffusion LLMs and traditional auto-regressive LLMs. We first identify a characteristic unique to diffusion LLMs: unlike auto-regressive LLMs, they maintain remarkably stable perplexity during direct context extrapolation. Furthermore, on the Needle-In-A-Haystack task with contexts exceeding the pretrained length, auto-regressive models fail outright, whereas diffusion LLMs exhibit a distinct local perception phenomenon that enables successful retrieval from recent context segments. We explain both phenomena through the lens of Rotary Position Embedding (RoPE) scaling theory. Building on these observations, we propose LongLLaDA, a training-free method that integrates LLaDA with NTK-based RoPE extrapolation. Our results validate that established extrapolation scaling laws remain effective for extending the context windows of diffusion LLMs. We also identify long-context tasks where diffusion LLMs outperform auto-regressive LLMs and others where they fall short. This study thus establishes the first context extrapolation method for diffusion LLMs and provides theoretical insights and empirical benchmarks for future research on long-context diffusion LLMs.
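As a rough illustration of what NTK-based RoPE extrapolation involves, the sketch below applies the commonly used NTK-aware rule, which enlarges the rotary base to base * scale^(d / (d - 2)) so that the highest rotation frequency is unchanged while the lowest is divided by the scale factor. The abstract does not state which variant or hyperparameters LongLLaDA uses, so the function names, head dimension, base, and scale factor here are assumptions chosen purely for illustration.

```python
def rope_inv_freq(head_dim, base=10000.0):
    """Per-pair RoPE rotation frequencies: theta_i = base^(-2i / d)."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def ntk_scaled_base(base, scale, head_dim):
    """NTK-aware base adjustment: base * scale^(d / (d - 2))."""
    return base * scale ** (head_dim / (head_dim - 2))


# Hypothetical settings, not taken from the paper.
head_dim = 128
base = 10000.0
scale = 4.0  # assumed target-length / pretrained-length ratio

orig = rope_inv_freq(head_dim, base)
scaled = rope_inv_freq(head_dim, ntk_scaled_base(base, scale, head_dim))

# Highest frequency (i = 0) is untouched; lowest frequency shrinks by `scale`.
print(orig[0], scaled[0])
print(orig[-1] / scaled[-1])
```

Because only the base changes, a rule of this kind can be applied at inference time without updating any weights, which is consistent with the training-free framing above.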

