RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale
May 5, 2025
Authors: Daniel Goldstein, Eric Alcaide, Janna Lu, Eugene Cheah
cs.AI
Abstract
We present Rapid Attention Distillation to Linear Attention Decoders at Scale
(RADLADS), a protocol for rapidly converting softmax attention transformers
into linear attention decoder models, along with two new RWKV-variant
architectures, and models converted from popular Qwen2.5 open source models in
7B, 32B, and 72B sizes. Our conversion process requires only 350-700M tokens,
less than 0.005% of the token count used to train the original teacher models.
Converting to our 72B linear attention model costs less than $2,000 USD at
today's prices, yet quality at inference remains close to the original
transformer. These models achieve state-of-the-art downstream performance
across a set of standard benchmarks for linear attention models of their size.
We release all our models on HuggingFace under the Apache 2.0 license, with the
exception of our 72B models, which are also governed by the Qwen License
Agreement.
Models at
https://huggingface.co/collections/recursal/radlads-6818ee69e99e729ba8a87102
Training Code at https://github.com/recursal/RADLADS-paper
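
The token-budget claim in the abstract can be sanity-checked with simple arithmetic. The sketch below takes only the 350-700M range and the 0.005% bound from the abstract and derives the teacher pre-training token count that the bound implies. The helper name `implied_min_teacher_tokens` is hypothetical and introduced purely for illustration; it is not part of the released training code.

```python
# Back-of-the-envelope check of the "less than 0.005%" claim.
# Inputs (350-700M tokens, 0.005%) come from the abstract; the helper
# name is an illustrative assumption, not something from the RADLADS code.

def implied_min_teacher_tokens(distill_tokens: float, max_fraction: float) -> float:
    """Teacher token count at which distill_tokens equals max_fraction of it.

    Any teacher trained on more tokens than this satisfies
    distill_tokens / teacher_tokens < max_fraction.
    """
    return distill_tokens / max_fraction


MAX_FRACTION = 0.005 / 100  # 0.005% expressed as a plain fraction

for distill_tokens in (350e6, 700e6):
    floor = implied_min_teacher_tokens(distill_tokens, MAX_FRACTION)
    print(
        f"{distill_tokens / 1e6:.0f}M distillation tokens -> "
        f"teacher trained on more than {floor / 1e12:.0f}T tokens"
    )
```

Under these inputs the bound holds for any teacher trained on more than about 7 trillion tokens at the low end of the range, and more than about 14 trillion tokens at the high end.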