The Mamba in the Llama: Distilling and Accelerating Hybrid Models

August 27, 2024
Authors: Junxiong Wang, Daniele Paliotta, Avner May, Alexander M. Rush, Tri Dao
cs.AI

Abstract

Linear RNN architectures, like Mamba, can be competitive with Transformer models in language modeling while having advantageous deployment characteristics. Given the focus on training large-scale Transformer models, we consider the challenge of converting these pretrained models for deployment. We demonstrate that it is feasible to distill large Transformers into linear RNNs by reusing the linear projection weights from attention layers with academic GPU resources. The resulting hybrid model, which incorporates a quarter of the attention layers, achieves performance comparable to the original Transformer in chat benchmarks and outperforms open-source hybrid Mamba models trained from scratch with trillions of tokens in both chat benchmarks and general benchmarks. Moreover, we introduce a hardware-aware speculative decoding algorithm that accelerates the inference speed of Mamba and hybrid models. Overall we show how, with limited computation resources, we can remove many of the original attention layers and generate from the resulting model more efficiently. Our top-performing model, distilled from Llama3-8B-Instruct, achieves a 29.61 length-controlled win rate on AlpacaEval 2 against GPT-4 and 7.35 on MT-Bench, surpassing the best instruction-tuned linear RNN model.
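To make the weight-reuse idea concrete, below is a minimal sketch of initializing a Mamba-style linear-RNN layer from a pretrained attention layer's Q/K/V/O projection weights before distillation. The module names, the square bias-free projections, and the exact weight-to-projection mapping are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch: seed a linear-RNN replacement layer with the
# projection weights of the attention layer it replaces.
import torch
import torch.nn as nn


class AttentionBlock(nn.Module):
    """Stand-in for a pretrained Transformer attention layer."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)


class LinearRNNBlock(nn.Module):
    """Simplified Mamba-style layer with projections analogous to attention."""
    def __init__(self, d_model: int):
        super().__init__()
        self.in_proj = nn.Linear(d_model, d_model, bias=False)   # ~ value path
        self.b_proj = nn.Linear(d_model, d_model, bias=False)    # ~ key path
        self.c_proj = nn.Linear(d_model, d_model, bias=False)    # ~ query path
        self.out_proj = nn.Linear(d_model, d_model, bias=False)  # ~ output path


@torch.no_grad()
def init_from_attention(rnn: LinearRNNBlock, attn: AttentionBlock) -> None:
    """Copy attention projection weights into the RNN layer (assumed mapping)."""
    rnn.in_proj.weight.copy_(attn.v_proj.weight)
    rnn.b_proj.weight.copy_(attn.k_proj.weight)
    rnn.c_proj.weight.copy_(attn.q_proj.weight)
    rnn.out_proj.weight.copy_(attn.o_proj.weight)


d_model = 64
attn, rnn = AttentionBlock(d_model), LinearRNNBlock(d_model)
init_from_attention(rnn, attn)  # RNN now starts from the Transformer's weights
```

In the paper's setting this initialization is followed by distillation against the original Transformer, and only a fraction of the attention layers (a quarter in the reported hybrid) are kept unchanged.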
