IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse

Abstract

Long-context agentic workflows have emerged as a defining use case for large language models, making attention efficiency critical for both inference speed and serving cost. Sparse attention addresses this challenge effectively, and DeepSeek Sparse Attention (DSA) is a representative production-grade solution: a lightweight lightning indexer selects the top-k most relevant tokens per query, reducing core attention from O(L^2) to O(Lk). However, the indexer itself retains O(L^2) complexity and must run independently at every layer, even though the resulting top-k selections are highly similar across consecutive layers. We present IndexCache, which exploits this cross-layer redundancy by partitioning layers into a small set of Full layers that run their own indexers and a majority of Shared layers that simply reuse the nearest Full layer's top-k indices. We propose two complementary approaches to determine and optimize this configuration. Training-free IndexCache applies a greedy search algorithm that selects which layers retain their indexers by directly minimizing language modeling loss on a calibration set, requiring no weight updates. Training-aware IndexCache introduces a multi-layer distillation loss that trains each retained indexer against the averaged attention distributions of all layers it serves, enabling even simple interleaved patterns to match full-indexer accuracy. Experimental results on a 30B DSA model show that IndexCache can remove 75% of indexer computations with negligible quality degradation, achieving up to 1.82× prefill speedup and 1.48× decode speedup compared to standard DSA. These positive results are further confirmed by our preliminary experiments on the production-scale GLM-5 model (Figure 1).
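The Full/Shared partition described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`topk_indices`, `interleaved_pattern`, `run_layers`) are hypothetical, the per-layer scores stand in for whatever the lightning indexer would produce for a single query, and "nearest Full layer" is assumed here to mean the most recent preceding Full layer, since its indices are already available during a forward pass.

```python
from typing import List, Sequence, Tuple


def topk_indices(scores: Sequence[float], k: int) -> List[int]:
    """Return the positions of the k largest scores, sorted by position."""
    k = min(k, len(scores))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def interleaved_pattern(num_layers: int, period: int) -> List[bool]:
    """Mark every `period`-th layer (starting at layer 0) as a Full layer."""
    return [layer % period == 0 for layer in range(num_layers)]


def run_layers(
    layer_scores: Sequence[Sequence[float]],
    is_full: Sequence[bool],
    k: int,
) -> Tuple[List[List[int]], int]:
    """Select top-k token indices per layer, reusing cached indices on Shared layers.

    layer_scores[l] holds stand-in indexer scores for one query at layer l.
    Assumption: a Shared layer reuses the indices of the most recent
    preceding Full layer, so layer 0 must be Full.
    """
    if not is_full or not is_full[0]:
        raise ValueError("the first layer must be a Full layer")
    cached: List[int] = []
    selected: List[List[int]] = []
    indexer_calls = 0
    for scores, full in zip(layer_scores, is_full):
        if full:
            # Full layer: run the (stand-in) indexer and refresh the cache.
            cached = topk_indices(scores, k)
            indexer_calls += 1
        # Shared layer: skip the indexer and reuse the cached indices.
        selected.append(cached)
    return selected, indexer_calls


# With 8 layers and one Full layer in every 4, only 2 of 8 layers run
# the indexer, i.e. 75% of indexer calls are skipped.
pattern = interleaved_pattern(num_layers=8, period=4)
```

In this sketch the saving comes entirely from skipping the indexer on Shared layers; the core sparse attention over the selected k tokens would still run at every layer. The fixed interleaved pattern is only one possible configuration, and the greedy search described in the abstract would instead choose `is_full` by measuring language modeling loss on a calibration set.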
