Full Attention Strikes Back: Converting Full Attention to Sparse Attention Within 100 Training Steps

Abstract

Long-context inference in large language models is bottlenecked by the quadratic cost of full attention. Existing efficient alternatives often rely on either native sparse training or heuristic token eviction, which forces an undesirable trade-off among efficiency, training cost, and accuracy. In this work, we show that full-attention LLMs are already intrinsically sparse and can be converted into highly sparse models with only minimal adaptation. Our approach is built on three observations: (1) only a small subset of attention heads truly requires full long-context processing; (2) long-range retrieval is governed primarily by a low-dimensional subspace, so relevant tokens can be retrieved efficiently with a 16-dimensional indexer; and (3) the useful token budget is strongly query-dependent, making dynamic top-p selection more suitable than fixed top-k sparsification. Based on these insights, we propose RTPurbo, which retains the full KV cache only for retrieval heads and introduces a lightweight token indexer for sparse attention. By exploiting the model's intrinsic sparsity, RTPurbo achieves sparsification with only a few hundred training steps. Experiments on long-context benchmarks and reasoning tasks show that RTPurbo preserves near-lossless accuracy while delivering substantial efficiency gains, including up to a 9.36× prefill speedup at 1M-token context and about a 2.01× decode speedup. These results suggest that strong sparse inference can be obtained from standard full-attention training without expensive native sparse pretraining.
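Observation (3) can be illustrated with a minimal sketch of top-p versus top-k token selection. This is not RTPurbo's implementation: the function names `top_p_select` and `top_k_select`, the threshold `p=0.9`, and the toy score vectors are all assumptions made for illustration, with the scores standing in for whatever relevance scores an indexer might produce for one query.

```python
import math


def top_p_select(scores, p=0.9):
    """Return the smallest set of token indices whose softmax mass reaches p.

    `scores` are hypothetical per-token relevance scores for a single query.
    """
    if not scores:
        return []
    peak = max(scores)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    # Visit tokens from most to least relevant.
    order = sorted(range(len(scores)), key=lambda i: weights[i], reverse=True)
    selected, mass = [], 0.0
    for i in order:
        selected.append(i)
        mass += weights[i] / total
        if mass >= p:
            break
    return sorted(selected)


def top_k_select(scores, k):
    """Return the indices of the k highest-scoring tokens (fixed budget)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


# A sharply peaked query: one token carries almost all of the mass.
peaked = [10.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
# A diffuse query: every token is equally relevant.
flat = [1.0] * 8

print(len(top_p_select(peaked)), len(top_p_select(flat)))
print(len(top_k_select(peaked, 4)), len(top_k_select(flat, 4)))
```

In this toy setting, top-p keeps a single token for the peaked query and all eight for the diffuse one, whereas a fixed `k=4` keeps four in both cases, spending budget the peaked query does not need and truncating the diffuse one.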