Effective Distillation to Hybrid xLSTM Architectures
March 16, 2026
Authors: Lukas Hauzenberger, Niklas Schmidinger, Thomas Schmied, Anamaria-Roberta Hartl, David Stap, Pieter-Jan Hoedt, Maximilian Beck, Sebastian Böck, Günter Klambauer, Sepp Hochreiter
cs.AI
Abstract
There have been numerous attempts to distill quadratic attention-based large language models (LLMs) into sub-quadratic linearized architectures. However, despite extensive research, such distilled models often fail to match the performance of their teacher LLMs on various downstream tasks. We set out the goal of lossless distillation, which we define in terms of tolerance-corrected Win-and-Tie rates between student and teacher on sets of tasks. To this end, we introduce an effective distillation pipeline for xLSTM-based students. We propose an additional merging stage, where individually linearized experts are combined into a single model. We show the effectiveness of this pipeline by distilling base and instruction-tuned models from the Llama, Qwen, and Olmo families. In many settings, our xLSTM-based students recover most of the teacher's performance, and even exceed it on some downstream tasks. Our contributions are an important step towards more energy-efficient and cost-effective replacements for transformer-based LLMs.
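The abstract defines lossless distillation via tolerance-corrected Win-and-Tie rates between student and teacher across a set of tasks. A minimal sketch of one plausible reading of that metric is below: a task counts as a win or tie whenever the student's score is within a tolerance of, or above, the teacher's. The function name, signature, and exact scoring rule are illustrative assumptions; the paper's precise definition may differ.

```python
def win_and_tie_rate(student_scores, teacher_scores, tolerance=0.01):
    """Hypothetical tolerance-corrected Win-and-Tie rate.

    Counts the fraction of tasks on which the student matches the
    teacher within `tolerance` (a tie) or outperforms it (a win).
    This is an illustrative sketch, not the paper's exact formula.
    """
    assert len(student_scores) == len(teacher_scores)
    wins_or_ties = sum(
        s >= t - tolerance
        for s, t in zip(student_scores, teacher_scores)
    )
    return wins_or_ties / len(student_scores)

# Example: the student wins or ties on 3 of 4 tasks.
rate = win_and_tie_rate(
    [0.80, 0.74, 0.91, 0.60],
    [0.79, 0.75, 0.95, 0.58],
    tolerance=0.01,
)
```

Under this reading, a rate of 1.0 over the chosen task set would correspond to the lossless-distillation goal.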