Sliding Window Attention Adaptation

Abstract

The self-attention mechanism in Transformer-based Large Language Models (LLMs) scales quadratically with input length, making long-context inference expensive. Sliding window attention (SWA) reduces this cost to linear complexity, but naively enabling complete SWA at inference time for models pretrained with full attention (FA) causes severe long-context performance degradation due to the training-inference mismatch. This raises the question: can FA-pretrained LLMs be adapted well to SWA without further pretraining? We investigate this by proposing Sliding Window Attention Adaptation (SWAA), a set of practical recipes that combine five methods for better adaptation: (1) applying SWA only during prefilling; (2) preserving "sink" tokens; (3) interleaving FA/SWA layers; (4) chain-of-thought (CoT); and (5) fine-tuning. Our experiments show that SWA adaptation is feasible but non-trivial: no single method suffices, yet specific synergistic combinations effectively recover the original long-context performance. We further analyze the performance-efficiency trade-offs of different SWAA configurations and provide recommended recipes for diverse scenarios. Our code is available at https://github.com/yuyijiong/sliding-window-attention-adaptation
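
To make the cost argument and the "sink" token idea concrete, the following is a minimal sketch of a causal sliding-window mask. It is not taken from the authors' repository: the names `swa_mask` and `attended_pairs`, the convention that the window includes the current token, and the choice to treat the first `num_sink` positions as sink tokens are all assumptions made for illustration.

```python
def swa_mask(seq_len, window, num_sink=0):
    """Return a boolean mask; mask[i][j] is True when query i may attend to key j.

    Hypothetical helper for illustration only. Assumes the window counts the
    current token, and that the first `num_sink` positions are always visible.
    """
    mask = []
    for i in range(seq_len):          # query position
        row = []
        for j in range(seq_len):      # key position
            causal = j <= i           # never attend to future tokens
            in_window = i - j < window  # key lies within the last `window` tokens
            is_sink = j < num_sink    # sink tokens stay visible to every query
            row.append(causal and (in_window or is_sink))
        mask.append(row)
    return mask


def attended_pairs(mask):
    """Count the (query, key) pairs that the mask allows."""
    return sum(sum(row) for row in mask)


# With window == seq_len the mask reduces to ordinary causal full attention.
full = swa_mask(8, window=8)
# A short window plus one sink token keeps far fewer pairs per query.
local = swa_mask(8, window=3, num_sink=1)
print(attended_pairs(full), attended_pairs(local))
```

Under these assumptions each query attends to at most `window + num_sink` keys, so the number of allowed pairs grows linearly with sequence length instead of quadratically.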