

GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression

July 16, 2024
Authors: Daniel Goldstein, Fares Obeid, Eric Alcaide, Guangyu Song, Eugene Cheah
cs.AI

Abstract

We introduce GoldFinch, a hybrid Linear Attention/Transformer sequence model that uses a new technique to efficiently generate a highly compressed and reusable KV-Cache in linear time and space with respect to sequence length. GoldFinch stacks our new GOLD transformer on top of an enhanced version of the Finch (RWKV-6) architecture. We train up to 1.5B parameter class models of the Finch, Llama, and GoldFinch architectures, and find dramatically improved modeling performance relative to both Finch and Llama. Our cache size savings increase linearly with model layer count, ranging from 756-2550 times smaller than the traditional transformer cache for common sizes, enabling inference of extremely large context lengths even on limited hardware. Although autoregressive generation has O(n) time complexity per token because of attention, pre-fill computation of the entire initial cache state for a submitted context costs only O(1) time per token due to the use of a recurrent neural network (RNN) to generate this cache. We release our trained weights and training code under the Apache 2.0 license for community use.
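The overall flow the abstract describes (a recurrent, linear-time prefill that emits a small compressed cache, followed by transformer-style attention over that cache during generation) can be sketched as below. This is a toy illustration only, not the released GoldFinch code: the GRUCell stands in for the Finch (RWKV-6) recurrence, and the compression/expansion projections, class names, and dimensions are hypothetical placeholders.

```python
import torch

# All names and sizes below are hypothetical placeholders for illustration.
D_MODEL, D_COMPRESSED, SEQ_LEN = 256, 16, 1024


class ToyRecurrentPrefill(torch.nn.Module):
    """Stand-in for the enhanced Finch (RWKV-6) stack: it scans the prompt
    recurrently (O(1) per token, no quadratic attention) and emits one small
    compressed entry per token instead of full per-layer K/V tensors."""

    def __init__(self):
        super().__init__()
        self.cell = torch.nn.GRUCell(D_MODEL, D_MODEL)       # placeholder recurrence
        self.compress = torch.nn.Linear(D_MODEL, D_COMPRESSED)

    def forward(self, prompt):                                # prompt: (seq, 1, d_model)
        h = torch.zeros(1, D_MODEL)
        cache = []
        for t in range(prompt.size(0)):                       # single linear pass
            h = self.cell(prompt[t], h)
            cache.append(self.compress(h))                    # keep only a tiny summary
        return torch.stack(cache), h                          # (seq, 1, d_compressed)


class ToyGoldDecode(torch.nn.Module):
    """Stand-in for the GOLD transformer layers: each newly generated token
    attends over the compressed cache, so decoding is O(n) per token."""

    def __init__(self):
        super().__init__()
        self.query = torch.nn.Linear(D_MODEL, D_COMPRESSED)
        self.expand = torch.nn.Linear(D_COMPRESSED, D_MODEL)

    def forward(self, x_t, cache):                            # x_t: (1, d_model)
        scores = (cache * self.query(x_t)).sum(-1)            # (seq, 1)
        attn = torch.softmax(scores, dim=0)
        mixed = (attn.unsqueeze(-1) * cache).sum(dim=0)       # (1, d_compressed)
        return self.expand(mixed)                             # back to model width


prompt = torch.randn(SEQ_LEN, 1, D_MODEL)
cache, _ = ToyRecurrentPrefill()(prompt)                      # linear-time prefill
out = ToyGoldDecode()(torch.randn(1, D_MODEL), cache)         # one decode step
print(cache.shape, out.shape)                                 # tiny cache vs. full K/V
```

In this sketch the cache holds one D_COMPRESSED-sized vector per prompt token rather than full keys and values for every layer and head, which is the intuition behind the 756-2550x cache-size reduction claimed in the abstract; the real architecture's compression scheme differs in detail.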

