GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression

July 16, 2024
Authors: Daniel Goldstein, Fares Obeid, Eric Alcaide, Guangyu Song, Eugene Cheah
cs.AI

Abstract

We introduce GoldFinch, a hybrid Linear Attention/Transformer sequence model that uses a new technique to efficiently generate a highly compressed and reusable KV-Cache in linear time and space with respect to sequence length. GoldFinch stacks our new GOLD transformer on top of an enhanced version of the Finch (RWKV-6) architecture. We train up to 1.5B parameter class models of the Finch, Llama, and GoldFinch architectures, and find dramatically improved modeling performance relative to both Finch and Llama. Our cache size savings increase linearly with model layer count, ranging from 756-2550 times smaller than the traditional transformer cache for common sizes, enabling inference of extremely large context lengths even on limited hardware. Although autoregressive generation has O(n) time complexity per token because of attention, pre-fill computation of the entire initial cache state for a submitted context costs only O(1) time per token due to the use of a recurrent neural network (RNN) to generate this cache. We release our trained weights and training code under the Apache 2.0 license for community use.
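The reported cache savings follow from storing a single compressed cache whose size does not depend on layer count, while a traditional transformer's KV-cache grows with every layer. Below is a minimal back-of-the-envelope sketch of that scaling argument; the dimensions (`n_layers`, `d_model`, `d_cache`) are hypothetical example values, not GoldFinch's actual configuration or the paper's exact accounting.

```python
# Hedged sketch (not the authors' code): compare a traditional per-layer
# KV-cache against a single layer-independent compressed cache of the kind
# the abstract describes. All sizes below are hypothetical examples.

def traditional_kv_cache_bytes(n_layers: int, d_model: int, seq_len: int,
                               bytes_per_value: int = 2) -> int:
    # Keys and values are stored for every layer and every token.
    return 2 * n_layers * d_model * seq_len * bytes_per_value

def layer_independent_cache_bytes(d_cache: int, seq_len: int,
                                  bytes_per_value: int = 2) -> int:
    # One compressed per-token cache shared by all attention layers,
    # so its size does not grow with layer count.
    return d_cache * seq_len * bytes_per_value

if __name__ == "__main__":
    n_layers, d_model, seq_len = 24, 2048, 32_768  # hypothetical 1.5B-class shape
    d_cache = 128                                  # hypothetical compressed width

    full = traditional_kv_cache_bytes(n_layers, d_model, seq_len)
    small = layer_independent_cache_bytes(d_cache, seq_len)
    print(f"traditional KV-cache: {full / 2**30:.1f} GiB")
    print(f"compressed cache:     {small / 2**20:.1f} MiB")
    # The ratio scales with n_layers, matching the abstract's claim that
    # savings increase linearly with model depth.
    print(f"ratio: {full / small:.0f}x smaller")
```

With these example numbers the ratio is 768x, in the same ballpark as the lower end of the 756-2550x range quoted above; deeper models push the ratio higher because only the traditional cache grows with layer count.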
