CompactAttention: Accelerating Chunked Prefill with Block-Union KV Selection

May 16, 2026
Authors: Jiwon Song, Dongwon Jo, Beomseok Kang, Jae-Joon Kim
cs.AI

Abstract

Chunked prefill has become a widely adopted serving strategy for long-context large language models, but efficient attention computation in this regime remains challenging. Existing sparse attention methods are primarily designed for one-shot prefill and do not translate efficiently to chunked prefill: block-sparse kernels lose efficiency when the query length is limited by the chunk size, while fine-grained pattern search becomes costly when repeated over the accumulated KV cache at every chunk. QUOKA, a recent method that directly targets chunked prefill, avoids sparse-kernel overhead but relies on query-subsampled, token-level KV selection, which can miss query-specific KV entries and introduce explicit KV-copy overhead. To address these limitations, we propose CompactAttention, a chunked-prefill attention mechanism based on Block-Union KV Selection. CompactAttention treats 2D block-sparse masks as KV-selection signals rather than direct sparse-kernel execution plans, and converts them into GQA-aware per-group KV block tables through Q-block union and intra-group union. This construction produces the minimal block tables that preserve all KV blocks selected by the input masks under paged execution constraints, enabling selected KV blocks to be accessed in place without explicit KV compaction. On LLaMA-3.1-8B-Instruct, CompactAttention maintains accuracy close to dense attention on the RULER benchmark while delivering up to 2.72× attention speedup at 128K context length under chunked prefill.
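
The two union steps named in the abstract can be illustrated with a small pure-Python sketch. This is not the authors' implementation: the mask layout (`[query_head][q_block][kv_block]`), the function name `block_union_tables`, and the toy sizes are assumptions made for illustration, and the sketch ignores how masks are produced and how logical block indices map to physical pages.

```python
from typing import List

# Assumed layout: mask[query_head][q_block][kv_block], truthy = block selected.
Mask = List[List[List[int]]]


def block_union_tables(mask: Mask, num_kv_heads: int) -> List[List[int]]:
    """Collapse per-head 2D block masks into one KV block list per GQA group."""
    num_q_heads = len(mask)
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV heads")
    group_size = num_q_heads // num_kv_heads

    tables = []
    for g in range(num_kv_heads):
        selected = set()
        # Intra-group union: every query head that shares KV head g.
        for h in range(g * group_size, (g + 1) * group_size):
            # Q-block union: every query block in the current chunk.
            for q_row in mask[h]:
                selected.update(kb for kb, keep in enumerate(q_row) if keep)
        # Sorted logical KV block indices act as this group's block table.
        tables.append(sorted(selected))
    return tables


# Toy input: 4 query heads, 2 KV heads, 2 Q blocks, 6 KV blocks.
toy_mask = [
    [[1, 0, 0, 0, 0, 1], [1, 0, 1, 0, 0, 1]],  # head 0 (group 0)
    [[1, 0, 0, 0, 0, 1], [1, 0, 0, 0, 0, 1]],  # head 1 (group 0)
    [[0, 1, 0, 0, 0, 1], [0, 1, 0, 0, 0, 1]],  # head 2 (group 1)
    [[0, 0, 0, 1, 0, 1], [0, 0, 0, 0, 0, 1]],  # head 3 (group 1)
]
tables = block_union_tables(toy_mask, num_kv_heads=2)
print(tables)  # [[0, 2, 5], [1, 3, 5]]
```

In this toy case each group keeps three of six KV blocks. Under the assumption of one shared table per group, dropping any entry would lose a block that some head and Q block selected, which is one way to read the abstract's minimality claim.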