SubGen: Token Generation in Sublinear Time and Memory

Abstract

Despite the significant success of large language models (LLMs), their extensive memory requirements pose challenges for deploying them in long-context token generation. The substantial memory footprint of LLM decoders arises from the need to store all previous tokens in the attention module, a requirement imposed by key-value (KV) caching. In this work, we focus on developing an efficient compression technique for the KV cache. Empirical evidence indicates a significant clustering tendency within key embeddings in the attention module. Building on this insight, we devise a novel caching method with sublinear complexity, employing online clustering on key tokens and online $\ell_2$ sampling on values. The result is a provably accurate and efficient attention decoding algorithm, termed SubGen. This algorithm ensures a sublinear memory footprint and sublinear time complexity, and we also establish a tight error bound for our approach. Empirical evaluations on long-context question-answering tasks demonstrate that SubGen significantly outperforms existing and state-of-the-art KV cache compression methods in terms of performance and efficiency.
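To make the clustering intuition concrete, the following Python sketch compares exact single-query attention over a full KV cache with a toy cache that keeps only one representative key per cluster. It is an illustration under stated assumptions, not the SubGen algorithm: the class `ClusteredKVCache`, its `radius` threshold, and the greedy first-member-as-center rule are all hypothetical choices made here, and the online $\ell_2$ sampling on values mentioned in the abstract is omitted entirely.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def exact_attention(q, keys, values):
    """Softmax attention of one query over every cached key (memory grows with n)."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    dim = len(values[0])
    return [sum(wi * v[j] for wi, v in zip(w, values)) / z for j in range(dim)]


class ClusteredKVCache:
    """Hypothetical toy cache: one center, one count, and one value sum per cluster."""

    def __init__(self, radius):
        self.radius = radius   # assumed distance threshold for joining a cluster
        self.centers = []      # first key seen in each cluster
        self.counts = []       # number of keys absorbed by each cluster
        self.value_sums = []   # running sum of the values in each cluster

    def append(self, key, value):
        # Greedy online assignment: join the first cluster within `radius`.
        for i, c in enumerate(self.centers):
            if math.dist(key, c) <= self.radius:
                self.counts[i] += 1
                self.value_sums[i] = [s + v for s, v in zip(self.value_sums[i], value)]
                return
        self.centers.append(list(key))
        self.counts.append(1)
        self.value_sums.append(list(value))

    def attention(self, q):
        # Treat every key in a cluster as if it were equal to the cluster center.
        scores = [dot(q, c) for c in self.centers]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(wi * n for wi, n in zip(w, self.counts))
        dim = len(self.value_sums[0])
        return [sum(wi * vs[j] for wi, vs in zip(w, self.value_sums)) / z
                for j in range(dim)]


# Synthetic keys that concentrate tightly around three points (an assumption
# standing in for the clustering tendency the abstract reports).
random.seed(0)
true_centers = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
keys, values = [], []
for t in range(300):
    c = true_centers[t % 3]
    keys.append([x + random.uniform(-0.01, 0.01) for x in c])
    values.append([random.uniform(-1, 1), random.uniform(-1, 1)])

cache = ClusteredKVCache(radius=0.1)
for k, v in zip(keys, values):
    cache.append(k, v)

q = [0.5, -0.2]
exact = exact_attention(q, keys, values)
approx = cache.attention(q)
max_abs_err = max(abs(a - b) for a, b in zip(exact, approx))
```

In this toy setting the cache stores three clusters instead of 300 key-value pairs, and the approximation error stays small because every key lies close to its cluster center. The error guarantees claimed in the abstract concern the actual SubGen procedure and are not established by this sketch.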
