GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression

Abstract

We introduce GoldFinch, a hybrid Linear Attention/Transformer sequence model that uses a new technique to efficiently generate a highly compressed and reusable KV-Cache in linear time and space with respect to sequence length. GoldFinch stacks our new GOLD transformer on top of an enhanced version of the Finch (RWKV-6) architecture. We train models of up to the 1.5B parameter class for the Finch, Llama, and GoldFinch architectures, and find dramatically improved modeling performance relative to both Finch and Llama. Our cache size savings increase linearly with model layer count, ranging from 756 to 2550 times smaller than the traditional transformer cache for common sizes, enabling inference over extremely large context lengths even on limited hardware. Although autoregressive generation has O(n) time complexity per token because of attention, pre-fill computation of the entire initial cache state for a submitted context costs only O(1) time per token, because a recurrent neural network (RNN) is used to generate this cache. We release our trained weights and training code under the Apache 2.0 license for community use.
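The claim that cache savings grow linearly with layer count can be illustrated with simple arithmetic. The sketch below is not the paper's exact accounting: it assumes a traditional cache that stores keys and values for every layer, and a hypothetical alternative in which a single cache, narrower than the model width by a factor `compression`, is shared by all layers. The function names and the `compression` parameter are illustrative assumptions, and the resulting factors are not intended to reproduce the 756 to 2550 range quoted above.

```python
def traditional_entries_per_token(num_layers: int, d_model: int) -> int:
    """Traditional transformer cache: one key and one value vector
    of width d_model per layer, for every cached token."""
    return 2 * num_layers * d_model


def shared_entries_per_token(d_model: int, compression: int) -> int:
    """ASSUMPTION (hypothetical design): a single cache shared by all
    layers, storing d_model / compression numbers per cached token."""
    return d_model // compression


def savings_factor(num_layers: int, d_model: int, compression: int) -> float:
    """How many times smaller the shared cache is than the traditional one."""
    return (traditional_entries_per_token(num_layers, d_model)
            / shared_entries_per_token(d_model, compression))


if __name__ == "__main__":
    # Illustrative sizes only; doubling the layer count doubles the factor.
    for layers in (12, 24, 48):
        factor = savings_factor(layers, d_model=2048, compression=16)
        print(f"{layers} layers: {factor:.0f}x smaller")
```

Under these assumptions the factor is `2 * num_layers * compression`, so it does not depend on the model width and scales linearly with depth, which matches the qualitative behavior described in the abstract.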
