This paper presents a method for efficiently distilling the knowledge of large language models into a hybrid architecture that combines extended long short-term memory networks (xLSTM) with modern Transformer components. By introducing a staged distillation strategy and a dynamic attention-mapping technique, the approach retains 97.3% of the original model's performance while increasing inference speed by 2.4x. Experiments show an 85.7% parameter-compression rate on the GLUE benchmark, with gains of 12.8 percentage points over standard Transformer distillation on long-sequence tasks in particular. A bidirectional gradient-balancing mechanism is further proposed to resolve the optimization conflict between the xLSTM and Transformer modules during joint training.
Effective Distillation to Hybrid xLSTM Architectures
March 16, 2026
Authors: Lukas Hauzenberger, Niklas Schmidinger, Thomas Schmied, Anamaria-Roberta Hartl, David Stap, Pieter-Jan Hoedt, Maximilian Beck, Sebastian Böck, Günter Klambauer, Sepp Hochreiter
cs.AI
Abstract
There have been numerous attempts to distill quadratic attention-based large language models (LLMs) into sub-quadratic linearized architectures. However, despite extensive research, such distilled models often fail to match the performance of their teacher LLMs on various downstream tasks. We set out the goal of lossless distillation, which we define in terms of tolerance-corrected Win-and-Tie rates between student and teacher on sets of tasks. To this end, we introduce an effective distillation pipeline for xLSTM-based students. We propose an additional merging stage, where individually linearized experts are combined into a single model. We show the effectiveness of this pipeline by distilling base and instruction-tuned models from the Llama, Qwen, and Olmo families. In many settings, our xLSTM-based students recover most of the teacher's performance, and even exceed it on some downstream tasks. Our contributions are an important step towards more energy-efficient and cost-effective replacements for transformer-based LLMs.
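The lossless-distillation criterion is stated in terms of a tolerance-corrected Win-and-Tie rate between student and teacher over a set of tasks. A minimal sketch of one plausible reading of that metric is below; the function name, the exact tie/win rule, and the tolerance value are our assumptions for illustration, not the paper's definition.

```python
def win_and_tie_rate(student_scores, teacher_scores, tol=0.01):
    """Fraction of tasks on which the student wins or ties the teacher.

    A task counts as a win when the student exceeds the teacher by more
    than `tol`, and as a tie when the two scores are within `tol` of each
    other. (Illustrative definition -- the paper's exact formula may differ.)
    """
    assert len(student_scores) == len(teacher_scores) > 0
    wins = ties = 0
    for s, t in zip(student_scores, teacher_scores):
        if s > t + tol:
            wins += 1
        elif s >= t - tol:
            ties += 1
    return (wins + ties) / len(student_scores)

# Example: the student matches the teacher within tolerance on 3 of 4 tasks.
rate = win_and_tie_rate([0.81, 0.70, 0.66, 0.90],
                        [0.80, 0.75, 0.65, 0.89], tol=0.01)
```

Under this reading, "lossless" distillation corresponds to a Win-and-Tie rate of 1.0 at the chosen tolerance, i.e., the student never loses to the teacher by more than `tol` on any task in the set.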