Sliding Window Attention Adaptation
December 11, 2025
Authors: Yijiong Yu, Jiale Liu, Qingyun Wu, Huazheng Wang, Ji Pei
cs.AI
Abstract
The self-attention mechanism in Transformer-based Large Language Models (LLMs) scales quadratically with input length, making long-context inference expensive. Sliding window attention (SWA) reduces this cost to linear complexity, but naively enabling complete SWA at inference time for models pretrained with full attention (FA) causes severe long-context performance degradation due to the training-inference mismatch. This raises a natural question: can FA-pretrained LLMs be well adapted to SWA without re-pretraining? We investigate this by proposing Sliding Window Attention Adaptation (SWAA), a set of practical recipes that combine five methods for better adaptation: (1) applying SWA only during prefilling; (2) preserving "sink" tokens; (3) interleaving FA/SWA layers; (4) chain-of-thought (CoT); and (5) fine-tuning. Our experiments show that SWA adaptation is feasible yet non-trivial: no single method suffices, but specific synergistic combinations effectively recover the original long-context performance. We further analyze the performance-efficiency trade-offs of different SWAA configurations and provide recommended recipes for diverse scenarios. Our code is available at https://github.com/yuyijiong/sliding-window-attention-adaptation
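To make the masking ideas in the abstract concrete, below is a minimal sketch (not the authors' implementation) of a causal sliding-window attention mask that also preserves a few leading "sink" tokens, corresponding to methods (1) and (2). The window size (4) and number of sink tokens (2) are illustrative assumptions, as is the single-head PyTorch attention used to apply the mask.

```python
# Minimal sketch: sliding-window attention mask with preserved "sink" tokens.
# Illustrative only; window=4 and num_sink=2 are arbitrary choices, not values
# taken from the paper.
import torch

def swa_mask(seq_len: int, window: int, num_sink: int) -> torch.Tensor:
    """Boolean [seq_len, seq_len] mask; True means query i may attend to key j."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions (rows)
    j = torch.arange(seq_len).unsqueeze(0)   # key positions (columns)
    causal = j <= i                          # never attend to future tokens
    in_window = (i - j) < window             # keep only the last `window` keys
    sink = j < num_sink                      # always keep the first few tokens
    return causal & (in_window | sink)

def masked_attention(q, k, v, mask):
    """Plain single-head softmax attention with a boolean keep-mask."""
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

if __name__ == "__main__":
    torch.manual_seed(0)
    seq_len, dim = 10, 16
    q = k = v = torch.randn(seq_len, dim)
    mask = swa_mask(seq_len, window=4, num_sink=2)
    out = masked_attention(q, k, v, mask)
    print(mask.int())   # visualize which keys each query position can see
    print(out.shape)    # torch.Size([10, 16])
```

In this sketch, interleaving FA/SWA layers (method 3) would amount to using `swa_mask` in some layers and a plain causal mask in others; restricting SWA to prefilling (method 1) means the decode-time KV cache is not truncated.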