LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference

Abstract

The inference of transformer-based large language models consists of two sequential stages: 1) a prefilling stage that computes the KV cache of the prompt and generates the first token, and 2) a decoding stage that generates subsequent tokens. For long prompts, the KV cache must be computed for all tokens during the prefilling stage, which can significantly increase the time needed to generate the first token. Consequently, the prefilling stage may become a bottleneck in the generation process. Whether all prompt tokens are essential for generating the first token remains an open question. To answer it, we introduce a novel method, LazyLLM, that selectively computes the KV only for the tokens important for next-token prediction, in both the prefilling and decoding stages. In contrast to static pruning approaches that prune the prompt all at once, LazyLLM allows language models to dynamically select different subsets of tokens from the context at different generation steps, even if those tokens were pruned in earlier steps. Extensive experiments on standard datasets across various tasks demonstrate that LazyLLM is a generic method that can be seamlessly integrated with existing language models to significantly accelerate generation without fine-tuning. For instance, in the multi-document question-answering task, LazyLLM accelerates the prefilling stage of the Llama 2 7B model by 2.34x while maintaining accuracy.
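The dynamic selection described in the abstract can be illustrated with a small toy sketch. It is not the authors' implementation: the per-token importance scores, the `keep_ratio` parameter, and the names `select_tokens` and `LazyKVCache` are all hypothetical, and the abstract does not say how importance is measured. The sketch only shows the two properties the abstract states: KV entries are computed only for selected tokens, and a token skipped at one step can still be selected at a later step.

```python
import math
from typing import Dict, List, Sequence


def select_tokens(scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Return positions of the highest-scoring tokens, in prompt order.

    `scores` is a hypothetical per-token importance value; how the real
    method measures importance is not specified in the abstract.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n = len(scores)
    k = max(1, math.ceil(keep_ratio * n))
    # Rank by score (descending); break ties by earlier position.
    ranked = sorted(range(n), key=lambda i: (-scores[i], i))
    return sorted(ranked[:k])


class LazyKVCache:
    """Toy cache: a token's KV entry is computed the first time it is selected."""

    def __init__(self) -> None:
        self.cache: Dict[int, str] = {}
        self.computed_log: List[List[int]] = []  # tokens newly computed per step

    def step(self, scores: Sequence[float], keep_ratio: float) -> List[int]:
        selected = select_tokens(scores, keep_ratio)
        newly_computed = [i for i in selected if i not in self.cache]
        for i in newly_computed:
            self.cache[i] = f"kv[{i}]"  # placeholder for a real key/value pair
        self.computed_log.append(newly_computed)
        return selected


cache = LazyKVCache()
# Step 1 (prefilling): tokens 0, 3 and 5 score highest; 1, 2 and 4 are skipped.
step1 = cache.step([0.9, 0.1, 0.2, 0.8, 0.05, 0.3], keep_ratio=0.5)
# Step 2 (decoding): token 1, skipped in step 1, is now selected and computed.
step2 = cache.step([0.2, 0.9, 0.1, 0.8, 0.05, 0.3], keep_ratio=0.5)
print(step1, step2, cache.computed_log)
```

In this toy run, tokens 2 and 4 never have a KV entry computed, while token 1 is computed only when it first becomes relevant. A static pruning scheme, by contrast, would fix the kept set after step 1.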
