CLEX: Continuous Length Extrapolation for Large Language Models

Abstract

Transformer-based Large Language Models (LLMs) have driven pioneering advances in many natural language processing tasks. However, their exceptional capabilities are restricted to the preset context window of the Transformer. Position Embedding (PE) scaling methods, while effective in extending the context window to a specific length, either show notable limitations in their extrapolation abilities or sacrifice some performance within the context window. Length extrapolation methods, although theoretically capable of extending the context window beyond the training sequence length, often underperform in practical long-context applications. To address these challenges, we propose Continuous Length EXtrapolation (CLEX) for LLMs. We generalise the PE scaling approaches to model the continuous dynamics by ordinary differential equations over the length scaling factor, thereby overcoming the constraints of current PE scaling methods designed for specific lengths. Moreover, by extending the dynamics to desired context lengths beyond the training sequence length, CLEX enables length extrapolation with strong performance in practical tasks. We demonstrate that CLEX can be seamlessly incorporated into LLMs equipped with Rotary Position Embedding, such as LLaMA and GPT-NeoX, with negligible impact on training and inference latency. Experimental results show that CLEX can effectively extend the context window to over 4x or almost 8x the training length with no deterioration in performance. Furthermore, when evaluated on the practical LongBench benchmark, our model trained on a 4k length exhibits competitive performance against state-of-the-art open-source models trained on context lengths up to 32k.
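To make the idea of "continuous dynamics over the length scaling factor" concrete, the following is a minimal sketch and not the method from the paper. It assumes the standard Rotary Position Embedding frequency basis theta_i = base^(-2i/d) and uses one well-known PE scaling scheme, position interpolation, which divides every frequency by a scaling factor t. That scheme can be written as the ordinary differential equation d(log theta)/dt = -1/t, so integrating from t = 1 to a target factor should reproduce the closed-form scaling. The abstract does not specify the form of the dynamics CLEX uses; the fixed function here is an assumption chosen because its solution is known, and all function names are hypothetical.

```python
import math

def rope_frequencies(dim, base=10000.0):
    """Standard RoPE frequency basis: theta_i = base^(-2i/dim) for i in [0, dim/2)."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]

def scale_by_interpolation(freqs, t):
    """Closed-form position interpolation: divide each frequency by the factor t."""
    return [f / t for f in freqs]

def interpolation_dynamics(t, log_freqs):
    """Assumed dynamics d(log theta)/dt = -1/t, identical for every dimension."""
    return [-1.0 / t for _ in log_freqs]

def scale_by_ode(freqs, t_end, dynamics, steps=1000):
    """Integrate the log-frequencies from t=1 to t=t_end with the midpoint method."""
    z = [math.log(f) for f in freqs]
    h = (t_end - 1.0) / steps
    for k in range(steps):
        t = 1.0 + k * h
        # Half step to estimate the state at the midpoint of the interval.
        slope = dynamics(t, z)
        z_mid = [zi + 0.5 * h * si for zi, si in zip(z, slope)]
        # Full step using the slope evaluated at the midpoint.
        slope_mid = dynamics(t + 0.5 * h, z_mid)
        z = [zi + h * si for zi, si in zip(z, slope_mid)]
    return [math.exp(zi) for zi in z]

if __name__ == "__main__":
    freqs = rope_frequencies(dim=8)
    closed_form = scale_by_interpolation(freqs, 4.0)
    integrated = scale_by_ode(freqs, 4.0, interpolation_dynamics)
    worst = max(abs(a - b) / a for a, b in zip(closed_form, integrated))
    print("max relative difference:", worst)
```

Because the scaling factor enters only as the integration endpoint, the same routine yields a frequency basis for any intermediate or larger factor, which is the property the abstract contrasts with PE scaling methods designed for one specific length. Replacing `interpolation_dynamics` with a different function changes the scaling scheme without changing the integration code.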
