CLEX: Continuous Length Extrapolation for Large Language Models

Abstract

Transformer-based Large Language Models (LLMs) are driving advances in many natural language processing tasks, but their exceptional capabilities are confined to the preset context window of the Transformer. Position Embedding (PE) scaling methods are effective at extending the context window to a specific length, yet they either show notable limitations in extrapolation ability or sacrifice some performance within the context window. Length extrapolation methods can, in theory, extend the context window beyond the training sequence length, but they often underperform in practical long-context applications. To address these challenges, we propose Continuous Length EXtrapolation (CLEX) for LLMs. We generalise the PE scaling approaches to model the continuous dynamics by ordinary differential equations over the length scaling factor, thereby overcoming the constraints of current PE scaling methods, which are designed for specific lengths. Moreover, by extending the dynamics to desired context lengths beyond the training sequence length, CLEX enables length extrapolation with impressive performance in practical tasks. We demonstrate that CLEX can be seamlessly incorporated into LLMs equipped with Rotary Position Embedding, such as LLaMA and GPT-NeoX, with negligible impact on training and inference latency. Experimental results show that CLEX can effectively extend the context window to over 4x or almost 8x the training length with no deterioration in performance. Furthermore, when evaluated on the practical LongBench benchmark, our model trained on a 4k length exhibits competitive performance against state-of-the-art open-source models trained on context lengths up to 32k.
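For readers unfamiliar with PE scaling, the following is a minimal sketch of the fixed-factor case that the abstract says CLEX generalises. It assumes the common linear position-interpolation form of Rotary Position Embedding scaling, in which positions are divided by a constant scaling factor. The function names, the default base of 10000, and the example dimensions are illustrative assumptions, and the sketch does not implement CLEX or its ordinary differential equation.

```python
import math


def rope_frequencies(dim, base=10000.0):
    """Rotary frequency basis theta_i = base^(-2i/dim), one per channel pair.

    The default base of 10000 is an assumption taken from common RoPE setups.
    """
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]


def scaled_rope_angles(position, dim, scale=1.0, base=10000.0):
    """Rotation angles for one position under a fixed length scaling factor.

    Hypothetical helper: dividing the position by `scale` is the linear
    position-interpolation form of PE scaling, not the CLEX method itself.
    """
    return [(position / scale) * theta for theta in rope_frequencies(dim, base)]


# With scale=2, position 8192 yields the same angles as position 4096 did
# without scaling, so a doubled context maps back into the trained range.
angles_long = scaled_rope_angles(8192, dim=8, scale=2.0)
angles_train = scaled_rope_angles(4096, dim=8, scale=1.0)
same = all(math.isclose(a, b) for a, b in zip(angles_long, angles_train))
```

A scheme like this is tied to the single factor chosen in advance, which is the constraint the abstract attributes to current PE scaling methods; CLEX instead treats the scaling factor as a continuous variable.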

