RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale

Abstract

We present Rapid Attention Distillation to Linear Attention Decoders at Scale (RADLADS), a protocol for rapidly converting softmax attention transformers into linear attention decoder models, along with two new RWKV-variant architectures and models converted from the popular Qwen2.5 open-source models in 7B, 32B, and 72B sizes. Our conversion process requires only 350-700M tokens, less than 0.005% of the token count used to train the original teacher models. Converting to our 72B linear attention model costs less than \$2,000 USD at today's prices, yet quality at inference remains close to that of the original transformer. These models achieve state-of-the-art downstream performance across a set of standard benchmarks for linear attention models of their size. We release all our models on HuggingFace under the Apache 2.0 license, with the exception of our 72B models, which are also governed by the Qwen License Agreement.

Models: https://huggingface.co/collections/recursal/radlads-6818ee69e99e729ba8a87102
Training code: https://github.com/recursal/RADLADS-paper
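As a rough sanity check on the "less than 0.005%" figure, the conversion budget can be compared against the teacher's pretraining budget. The sketch below assumes a teacher pretraining size of 18 trillion tokens, which is the figure publicly reported for Qwen2.5 and is not stated in this abstract. The function name `conversion_fraction_percent` and the constant `TEACHER_TOKENS` are illustrative only.

```python
def conversion_fraction_percent(conversion_tokens: float, teacher_tokens: float) -> float:
    """Return the conversion token budget as a percentage of the teacher's training tokens."""
    return 100.0 * conversion_tokens / teacher_tokens


# Assumption: 18T pretraining tokens for the Qwen2.5 teacher models
# (publicly reported figure, not taken from this abstract).
TEACHER_TOKENS = 18e12

# The abstract gives a conversion budget of 350-700M tokens.
for conversion_tokens in (350e6, 700e6):
    pct = conversion_fraction_percent(conversion_tokens, TEACHER_TOKENS)
    print(f"{conversion_tokens / 1e6:.0f}M tokens -> {pct:.4f}% of teacher pretraining tokens")
```

Under that assumption, both ends of the 350-700M range come out below the 0.005% bound stated in the abstract.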
