CompactAttention: Accelerating Chunked Prefill with Block-Union KV Selection
May 16, 2026
Authors: Jiwon Song, Dongwon Jo, Beomseok Kang, Jae-Joon Kim
cs.AI
Abstract
Chunked prefill has become a widely adopted serving strategy for long-context large language models, but efficient attention computation in this regime remains challenging. Existing sparse attention methods are primarily designed for one-shot prefill and do not translate efficiently to chunked prefill: block-sparse kernels lose efficiency when the query length is limited by the chunk size, while fine-grained pattern search becomes costly when repeated over the accumulated KV cache at every chunk. QUOKA, a recent method that directly targets chunked prefill, avoids sparse-kernel overhead but relies on query-subsampled, token-level KV selection, which can miss query-specific KV entries and introduce explicit KV-copy overhead. To address these limitations, we propose CompactAttention, a chunked-prefill attention mechanism based on Block-Union KV Selection. CompactAttention treats 2D block-sparse masks as KV-selection signals rather than direct sparse-kernel execution plans, and converts them into GQA-aware per-group KV block tables through Q-block union and intra-group union. This construction produces the minimal block tables that preserve all KV blocks selected by the input masks under paged execution constraints, enabling selected KV blocks to be accessed in place without explicit KV compaction. On LLaMA-3.1-8B-Instruct, CompactAttention maintains accuracy close to dense attention on the RULER benchmark while delivering up to 2.72× attention speedup at 128K context length under chunked prefill.
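The two union steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `block_union_tables`, the nested-list mask layout `mask[head][q_block][kv_block]`, and the assumption that consecutive query heads share one KV head are all hypothetical choices made for illustration. Details such as page-size alignment, padding, and kernel launch are left out.

```python
from typing import List, Sequence


def block_union_tables(
    mask: Sequence[Sequence[Sequence[int]]],
    num_kv_heads: int,
) -> List[List[int]]:
    """Collapse a 2D block-sparse mask into one KV block table per GQA group.

    mask[h][qb][kb] is truthy when query head h, Q-block qb selects KV block kb.
    Assumed layout (not from the paper): query heads are grouped contiguously,
    so heads [g * group_size, (g + 1) * group_size) share KV head g.
    """
    num_q_heads = len(mask)
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV heads")
    group_size = num_q_heads // num_kv_heads

    tables: List[List[int]] = []
    for g in range(num_kv_heads):
        selected = set()
        # Intra-group union: every query head that shares this KV head.
        for h in range(g * group_size, (g + 1) * group_size):
            # Q-block union: every Q-block of the current chunk.
            for q_row in mask[h]:
                selected.update(kb for kb, keep in enumerate(q_row) if keep)
        # Sorted indices can be used as a block table for in-place paged access.
        tables.append(sorted(selected))
    return tables


# Toy input: 4 query heads, 2 KV heads, 2 Q-blocks per chunk, 6 cached KV blocks.
example_mask = [
    [[1, 0, 0, 0, 0, 0], [1, 0, 1, 0, 0, 0]],  # head 0 -> {0, 2}
    [[1, 0, 0, 0, 1, 0], [1, 0, 0, 0, 0, 1]],  # head 1 -> {0, 4, 5}
    [[0, 1, 0, 0, 0, 0], [0, 1, 0, 0, 0, 1]],  # head 2 -> {1, 5}
    [[0, 0, 0, 1, 0, 0], [0, 0, 0, 0, 0, 1]],  # head 3 -> {3, 5}
]
tables = block_union_tables(example_mask, num_kv_heads=2)
print(tables)  # [[0, 2, 4, 5], [1, 3, 5]]
```

In this toy case each group's table contains every KV block that any of its heads selected in any Q-block, and nothing else, which is the sense in which a union-based table is the smallest one that preserves all selections when a single table must serve the whole group.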