Sliding Window Attention Adaptation
December 11, 2025
Authors: Yijiong Yu, Jiale Liu, Qingyun Wu, Huazheng Wang, Ji Pei
cs.AI
Abstract
The self-attention mechanism in Transformer-based Large Language Models (LLMs) scales quadratically with input length, making long-context inference expensive. Sliding window attention (SWA) reduces this cost to linear complexity, but naively enabling complete SWA at inference time for models pretrained with full attention (FA) causes severe long-context performance degradation due to the training-inference mismatch. This raises a question: Can FA-pretrained LLMs be well adapted to SWA without re-pretraining? We investigate this by proposing Sliding Window Attention Adaptation (SWAA), a set of practical recipes that combine five methods for better adaptation: (1) applying SWA only during prefilling; (2) preserving "sink" tokens; (3) interleaving FA/SWA layers; (4) chain-of-thought (CoT); and (5) fine-tuning. Our experiments show that SWA adaptation is feasible yet non-trivial: no single method suffices, but specific synergistic combinations effectively recover the original long-context performance. We further analyze the performance-efficiency trade-offs of different SWAA configurations and provide recommended recipes for diverse scenarios. Our code is available at https://github.com/yuyijiong/sliding-window-attention-adaptation
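To make the core idea concrete, the following is a minimal illustrative sketch, not the authors' implementation (see the linked repository for the official code). It builds a sliding-window attention mask that also preserves a few leading "sink" tokens, the kind of attention pattern that methods (1) and (2) above rely on; the window size and sink count are arbitrary example values.

```python
# Illustrative sketch only: a causal sliding-window mask with preserved "sink" tokens.
# Window size and number of sinks are hypothetical example values, not the paper's settings.
import torch

def swa_mask_with_sinks(seq_len: int, window: int, num_sinks: int) -> torch.Tensor:
    """Boolean (seq_len, seq_len) mask; True marks key positions a query may attend to."""
    q = torch.arange(seq_len).unsqueeze(1)   # query positions (rows)
    k = torch.arange(seq_len).unsqueeze(0)   # key positions (columns)
    causal = k <= q                          # never attend to future tokens
    in_window = (q - k) < window             # keys within the most recent `window` tokens
    is_sink = k < num_sinks                  # always keep the first `num_sinks` tokens visible
    return causal & (in_window | is_sink)

mask = swa_mask_with_sinks(seq_len=8, window=3, num_sinks=1)
print(mask.int())
```

Because each query attends to only a bounded number of keys (its local window plus the sink tokens), the attention cost grows linearly with sequence length rather than quadratically, which is the efficiency gain SWA provides.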