Sliding Window Attention Adaptation
December 11, 2025
Authors: Yijiong Yu, Jiale Liu, Qingyun Wu, Huazheng Wang, Ji Pei
cs.AI
Abstract
The self-attention mechanism in Transformer-based Large Language Models (LLMs) scales quadratically with input length, making long-context inference expensive. Sliding window attention (SWA) reduces this cost to linear complexity, but naively enabling complete SWA at inference time for models pretrained with full attention (FA) causes severe long-context performance degradation due to the training-inference mismatch. This raises a question: Can FA-pretrained LLMs be well adapted to SWA without re-pretraining? We investigate this by proposing Sliding Window Attention Adaptation (SWAA), a set of practical recipes that combine five methods for better adaptation: (1) applying SWA only during prefilling; (2) preserving "sink" tokens; (3) interleaving FA/SWA layers; (4) chain-of-thought (CoT); and (5) fine-tuning. Our experiments show that SWA adaptation is feasible yet non-trivial: no single method suffices, but specific synergistic combinations effectively recover the original long-context performance. We further analyze the performance-efficiency trade-offs of different SWAA configurations and provide recommended recipes for diverse scenarios. Our code is available at https://github.com/yuyijiong/sliding-window-attention-adaptation
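To make the core idea concrete, the following is a minimal illustrative sketch, not the authors' implementation (see the linked repository for the official code). It builds a sliding-window attention mask that also preserves a few leading "sink" tokens, the kind of attention pattern that methods (1) and (2) above rely on; the window size and sink count are arbitrary example values.

```python
# Illustrative sketch only: a causal sliding-window mask with preserved "sink" tokens.
# Window size and number of sinks are hypothetical example values, not the paper's settings.
import torch

def swa_mask_with_sinks(seq_len: int, window: int, num_sinks: int) -> torch.Tensor:
    """Boolean (seq_len, seq_len) mask; True marks key positions a query may attend to."""
    q = torch.arange(seq_len).unsqueeze(1)   # query positions (rows)
    k = torch.arange(seq_len).unsqueeze(0)   # key positions (columns)
    causal = k <= q                          # never attend to future tokens
    in_window = (q - k) < window             # keys within the most recent `window` tokens
    is_sink = k < num_sinks                  # always keep the first `num_sinks` tokens visible
    return causal & (in_window | is_sink)

mask = swa_mask_with_sinks(seq_len=8, window=3, num_sinks=1)
print(mask.int())
```

Because each query attends to only a bounded number of keys (its local window plus the sink tokens), the attention cost grows linearly with sequence length rather than quadratically, which is the efficiency gain SWA provides.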