Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts

January 29, 2026
Authors: Yingfa Chen, Zhen Leng Thai, Zihan Zhou, Zhu Zhang, Xingyu Shen, Shuo Wang, Chaojun Xiao, Xu Han, Zhiyuan Liu
cs.AI

Abstract

Hybrid Transformer architectures, which combine softmax attention blocks and recurrent neural networks (RNNs), have shown a desirable performance-throughput tradeoff for long-context modeling, but their adoption and study are hindered by the prohibitive cost of large-scale pre-training from scratch. Recent studies have shown that pre-trained softmax attention blocks can be converted into RNN blocks through parameter transfer and knowledge distillation. However, these transfer methods require substantial amounts of training data (more than 10B tokens), and the resulting hybrid models exhibit poor long-context performance, precisely the scenario in which hybrid models enjoy significant inference speedups over Transformer-based models. In this paper, we present HALO (Hybrid Attention via Layer Optimization), a pipeline for distilling Transformer models into RNN-attention hybrid models. We then present HypeNet, a hybrid architecture with superior length generalization enabled by a novel position encoding scheme (named HyPE) and various architectural modifications. We convert the Qwen3 series into HypeNet using HALO, achieving performance comparable to the original Transformer models while enjoying superior long-context performance and efficiency. The conversion requires just 2.3B tokens, less than 0.01% of their pre-training data.
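
To make the block-level conversion idea concrete, below is a minimal, hypothetical sketch of distilling a pretrained softmax-attention block into a linear-attention (RNN-style) block by matching the frozen teacher's outputs. The `LinearAttention` class, the elu+1 feature map, the `distill_block` training loop, and the assumption that the teacher block maps hidden states directly to outputs are all illustrative simplifications, not the paper's actual HALO pipeline, HypeNet architecture, or HyPE position encoding.

```python
# Minimal sketch (assumed, not the paper's HALO implementation): fit a
# linear-attention student block to mimic a frozen softmax-attention teacher.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearAttention(nn.Module):
    """Causal linear attention with a constant-memory recurrent form."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        shape = (b, t, self.n_heads, self.d_head)
        # Positive feature map (elu + 1) so the kernelized attention is well defined.
        q = F.elu(self.q_proj(x).view(shape)) + 1
        k = F.elu(self.k_proj(x).view(shape)) + 1
        v = self.v_proj(x).view(shape)
        # Causal linear attention via running sums over the sequence dimension:
        # state S_t = sum_{i<=t} k_i v_i^T, normalizer z_t = sum_{i<=t} k_i.
        kv = torch.einsum("bthd,bthe->bthde", k, v).cumsum(dim=1)
        z = k.cumsum(dim=1)
        num = torch.einsum("bthd,bthde->bthe", q, kv)
        den = torch.einsum("bthd,bthd->bth", q, z).unsqueeze(-1).clamp(min=1e-6)
        return self.o_proj((num / den).reshape(b, t, -1))


def distill_block(teacher_attn: nn.Module, student: LinearAttention,
                  hidden_states: torch.Tensor, steps: int = 100) -> None:
    """Train the student block to reproduce the frozen teacher block's outputs.

    `teacher_attn` is assumed to be a callable mapping hidden states to
    attention outputs; in practice the student's Q/K/V/O projections could
    also be initialized from the teacher's weights (parameter transfer).
    """
    opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
    teacher_attn.eval()
    for _ in range(steps):
        with torch.no_grad():
            target = teacher_attn(hidden_states)
        loss = F.mse_loss(student(hidden_states), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
```

In a hybrid conversion, only a subset of layers would be replaced and distilled this way, with the remaining softmax-attention layers kept intact; this sketch shows a single block for clarity.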