InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU

Abstract

In modern large language models (LLMs), handling very long contexts is challenging because it slows inference and increases memory costs. Additionally, most existing pre-trained LLMs fail to generalize beyond their original training sequence lengths. To enable efficient and practical long-context utilization, we introduce InfiniteHiP, a novel and practical LLM inference framework that accelerates processing by dynamically eliminating irrelevant context tokens through a modular hierarchical token pruning algorithm. Our method also generalizes to longer sequences by selectively applying various RoPE adjustment methods according to the internal attention patterns within LLMs. Furthermore, we offload the key-value cache to host memory during inference, significantly reducing GPU memory pressure. As a result, InfiniteHiP can process up to 3 million tokens on a single L40s 48GB GPU (3x larger) without any permanent loss of context information. Our framework achieves an 18.95x speedup in attention decoding for a 1 million token context without requiring additional training. We implement our method in the SGLang framework and demonstrate its effectiveness and practicality through extensive evaluations.
