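RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale

Abstract

We present Rapid Attention Distillation to Linear Attention Decoders at Scale (RADLADS), a protocol for rapidly converting softmax attention transformers into linear attention decoder models, along with two new RWKV-variant architectures and models converted from the popular open-source Qwen2.5 models at 7B, 32B, and 72B sizes. Our conversion process requires only 350-700M tokens, less than 0.005% of the token count used to train the original teacher models. Converting to our 72B linear attention model costs less than $2,000 USD at today's prices, yet quality at inference remains close to that of the original transformer. These models achieve state-of-the-art downstream performance across a set of standard benchmarks for linear attention models of their size. We release all our models on HuggingFace under the Apache 2.0 license, with the exception of our 72B models, which are also governed by the Qwen License Agreement.

Models: https://huggingface.co/collections/recursal/radlads-6818ee69e99e729ba8a87102
Training code: https://github.com/recursal/RADLADS-paper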
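The token-budget claim can be sanity-checked with a short calculation. The sketch below assumes a teacher pretraining corpus of roughly 18 trillion tokens, the figure reported for Qwen2.5. That number is not stated in this abstract, and the names `TEACHER_TOKENS` and `budget_percent` are illustrative only.

```python
# Assumption (not stated in the abstract): the teacher models were
# pretrained on roughly 18 trillion tokens.
TEACHER_TOKENS = 18e12


def budget_percent(distill_tokens, teacher_tokens=TEACHER_TOKENS):
    """Return the distillation token budget as a percentage of teacher tokens."""
    return 100.0 * distill_tokens / teacher_tokens


# The 350-700M token range quoted in the abstract.
low = budget_percent(350e6)
high = budget_percent(700e6)
print(f"{low:.5f}% to {high:.5f}%")
```

Under that assumption, both ends of the 350-700M range fall below the 0.005% bound quoted above.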

