RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale
May 5, 2025
Authors: Daniel Goldstein, Eric Alcaide, Janna Lu, Eugene Cheah
cs.AI
Abstract
We present Rapid Attention Distillation to Linear Attention Decoders at Scale
(RADLADS), a protocol for rapidly converting softmax attention transformers
into linear attention decoder models, along with two new RWKV-variant
architectures, and models converted from popular Qwen2.5 open source models in
7B, 32B, and 72B sizes. Our conversion process requires only 350-700M tokens,
less than 0.005% of the token count used to train the original teacher models.
Converting to our 72B linear attention model costs less than $2,000 USD at
today's prices, yet quality at inference remains close to the original
transformer. These models achieve state-of-the-art downstream performance
across a set of standard benchmarks for linear attention models of their size.
We release all our models on HuggingFace under the Apache 2.0 license, with the
exception of our 72B models which are also governed by the Qwen License
Agreement.
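
The abstract does not spell out the training recipe, but a conversion of this kind typically follows a distillation loop: initialize the student from the teacher's weights, swap each softmax attention layer for a linear attention layer, and train the student to match the teacher's outputs on a modest token budget. Below is a minimal PyTorch sketch of the logit-matching step under those assumptions; the names (teacher, student, distill_step) are illustrative placeholders, not the released RADLADS code.

    import torch
    import torch.nn.functional as F

    def distill_step(teacher, student, input_ids, optimizer):
        # Frozen softmax-attention teacher provides the target distributions.
        with torch.no_grad():
            teacher_logits = teacher(input_ids).logits
        # Linear-attention student (initialized from teacher weights) is trained.
        student_logits = student(input_ids).logits
        # KL divergence between teacher and student next-token distributions.
        loss = F.kl_div(
            F.log_softmax(student_logits, dim=-1),
            F.softmax(teacher_logits, dim=-1),
            reduction="batchmean",
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

The sketch assumes HuggingFace-style models exposing a .logits attribute; a full conversion pipeline would also handle hidden-state alignment, data loading, and learning-rate scheduling.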
Models at
https://huggingface.co/collections/recursal/radlads-6818ee69e99e729ba8a87102
Training Code at https://github.com/recursal/RADLADS-paper