RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale
May 5, 2025
Authors: Daniel Goldstein, Eric Alcaide, Janna Lu, Eugene Cheah
cs.AI
Abstract
We present Rapid Attention Distillation to Linear Attention Decoders at Scale
(RADLADS), a protocol for rapidly converting softmax attention transformers
into linear attention decoder models, along with two new RWKV-variant
architectures, and models converted from popular Qwen2.5 open source models in
7B, 32B, and 72B sizes. Our conversion process requires only 350-700M tokens,
less than 0.005% of the token count used to train the original teacher models.
Converting to our 72B linear attention model costs less than $2,000 USD at
today's prices, yet quality at inference remains close to the original
transformer. These models achieve state-of-the-art downstream performance
across a set of standard benchmarks for linear attention models of their size.
We release all our models on HuggingFace under the Apache 2.0 license, with the
exception of our 72B models which are also governed by the Qwen License
Agreement.
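
The abstract does not spell out the training recipe, but a conversion of this kind typically follows a distillation loop: initialize the student from the teacher's weights, swap each softmax attention layer for a linear attention layer, and train the student to match the teacher's outputs on a modest token budget. Below is a minimal PyTorch sketch of the logit-matching step under those assumptions; the names (teacher, student, distill_step) are illustrative placeholders, not the released RADLADS code.

    import torch
    import torch.nn.functional as F

    def distill_step(teacher, student, input_ids, optimizer):
        # Frozen softmax-attention teacher provides the target distributions.
        with torch.no_grad():
            teacher_logits = teacher(input_ids).logits
        # Linear-attention student (initialized from teacher weights) is trained.
        student_logits = student(input_ids).logits
        # KL divergence between teacher and student next-token distributions.
        loss = F.kl_div(
            F.log_softmax(student_logits, dim=-1),
            F.softmax(teacher_logits, dim=-1),
            reduction="batchmean",
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

The sketch assumes HuggingFace-style models exposing a .logits attribute; a full conversion pipeline would also handle hidden-state alignment, data loading, and learning-rate scheduling.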
Models at
https://huggingface.co/collections/recursal/radlads-6818ee69e99e729ba8a87102
Training Code at https://github.com/recursal/RADLADS-paper