Effective Distillation to Hybrid xLSTM Architectures

Abstract

There have been numerous attempts to distill quadratic attention-based large language models (LLMs) into sub-quadratic linearized architectures. However, despite extensive research, such distilled models often fail to match the performance of their teacher LLMs on various downstream tasks. We set out the goal of lossless distillation, which we define in terms of tolerance-corrected Win-and-Tie rates between student and teacher on sets of tasks. To this end, we introduce an effective distillation pipeline for xLSTM-based students. We propose an additional merging stage, where individually linearized experts are combined into a single model. We show the effectiveness of this pipeline by distilling base and instruction-tuned models from the Llama, Qwen, and Olmo families. In many settings, our xLSTM-based students recover most of the teacher's performance, and even exceed it on some downstream tasks. Our contributions are an important step towards more energy-efficient and cost-effective replacements for transformer-based LLMs.
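
The abstract does not spell out how the tolerance-corrected Win-and-Tie rate is computed. As a minimal sketch under one plausible reading, a task counts as a win or tie when the student's score is no more than a fixed tolerance below the teacher's, and the rate is the fraction of tasks for which this holds. The function name `win_and_tie_rate`, the `tolerance` parameter, the task names, and all scores below are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from typing import Mapping


def win_and_tie_rate(
    student: Mapping[str, float],
    teacher: Mapping[str, float],
    tolerance: float = 0.0,
) -> float:
    """Fraction of shared tasks on which the student wins or ties.

    Assumption (not from the paper): higher scores are better, and a task
    counts as a win or tie when student >= teacher - tolerance.
    """
    tasks = sorted(set(student) & set(teacher))
    if not tasks:
        raise ValueError("student and teacher share no tasks")
    hits = sum(1 for t in tasks if student[t] >= teacher[t] - tolerance)
    return hits / len(tasks)


# Made-up scores for illustration only; these are not results from the paper.
teacher_scores = {"task_a": 0.70, "task_b": 0.55, "task_c": 0.80}
student_scores = {"task_a": 0.72, "task_b": 0.54, "task_c": 0.70}

strict_rate = win_and_tie_rate(student_scores, teacher_scores)
tolerant_rate = win_and_tie_rate(student_scores, teacher_scores, tolerance=0.02)
print(f"strict: {strict_rate:.2f}, with tolerance 0.02: {tolerant_rate:.2f}")
```

Under this reading, a rate of 1.0 on a task set would mean the student matches or beats the teacher on every task up to the tolerance, which is one way to make "lossless" concrete. The tolerance keeps small score differences, such as evaluation noise, from being counted as losses.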
