GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression

Abstract

We introduce GoldFinch, a hybrid Linear Attention/Transformer sequence model that uses a new technique to efficiently generate a highly compressed and reusable KV-Cache in linear time and space with respect to sequence length. GoldFinch stacks our new GOLD transformer on top of an enhanced version of the Finch (RWKV-6) architecture. We train models of up to the 1.5B parameter class for the Finch, Llama, and GoldFinch architectures, and find dramatically improved modeling performance for GoldFinch relative to both Finch and Llama. Our cache size savings increase linearly with model layer count, with the cache ranging from 756 to 2550 times smaller than the traditional transformer cache for common sizes, enabling inference of extremely large context lengths even on limited hardware. Although autoregressive generation has O(n) time complexity per token because of attention, pre-fill computation of the entire initial cache state for a submitted context costs only O(1) time per token, because a recurrent neural network (RNN) is used to generate this cache. We release our trained weights and training code under the Apache 2.0 license for community use.
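
The claim that cache savings grow linearly with layer count can be illustrated with simple arithmetic. The sketch below assumes a traditional cache that stores one key vector and one value vector of width d_model per layer per token, and a hypothetical compressed cache that stores a single vector of width d_model / compression per token, shared across all layers. The function names, the compression factor of 16, and the model width of 2048 are illustrative assumptions rather than figures taken from the paper, so the printed ratios are not expected to reproduce the 756 to 2550 range exactly.

```python
def traditional_cache_per_token(d_model: int, n_layers: int) -> int:
    """Values cached per token: one key and one value vector per layer."""
    return 2 * d_model * n_layers


def compressed_cache_per_token(d_model: int, compression: int) -> int:
    """Hypothetical layer-shared cache: one vector of width d_model / compression."""
    return d_model // compression


def cache_ratio(d_model: int, n_layers: int, compression: int) -> float:
    """How many times smaller the compressed cache is than the traditional one."""
    return traditional_cache_per_token(d_model, n_layers) / compressed_cache_per_token(
        d_model, compression
    )


if __name__ == "__main__":
    # Assumed values for illustration only: d_model=2048, compression=16.
    for n_layers in (12, 24, 48):
        print(n_layers, cache_ratio(2048, n_layers, 16))
```

Because the compressed cache in this sketch does not depend on the layer count while the traditional cache does, doubling the number of layers doubles the ratio, which is the linear relationship the abstract describes.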
