Adamas: Hadamard Sparse Attention for Efficient Long-Context Inference
October 21, 2025
Authors: Siyuan Yan, Guo-Qing Jiang, Yuchen Zhang, Xiaoxing Ma, Ran Zhu, Chun Cao, Jingwei Xu
cs.AI
Abstract
Large language models (LLMs) now support context windows of hundreds of
thousands to millions of tokens, enabling applications such as long-document
summarization, large-scale code synthesis, multi-document question answering
and persistent multi-turn dialogue. However, such extended contexts exacerbate
the quadratic cost of self-attention, leading to severe latency in
autoregressive decoding. Existing sparse attention methods alleviate these
costs but rely on heuristic patterns that struggle to recall critical key-value
(KV) pairs for each query, resulting in accuracy degradation. We introduce
Adamas, a lightweight yet highly accurate sparse attention mechanism designed
for long-context inference. Adamas applies the Hadamard transform,
bucketization and 2-bit compression to produce compact representations, and
leverages Manhattan-distance estimation for efficient top-k selection.
Experiments show that Adamas matches the accuracy of full attention with only a
64-token budget, achieves near-lossless performance with a 128-token budget, and supports up to
8x higher sparsity than prior state-of-the-art (SOTA) methods while delivering
up to 4.4x self-attention and 1.5x end-to-end speedups on 32K-length sequences.
Remarkably, Adamas attains perplexity comparable to, or even lower than, that of
full attention, underscoring its effectiveness in maintaining accuracy under
aggressive sparsity.
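
The selection pipeline named in the abstract (Hadamard transform, bucketization, 2-bit compression, Manhattan-distance top-k) can be illustrated with a minimal NumPy sketch. Everything below is an assumption made for illustration only: the bucket thresholds, the normalization, the head dimension, and the helper names (hadamard_transform, to_2bit_codes, topk_keys) are not taken from the paper.

```python
import numpy as np

def hadamard_transform(x):
    """Fast Walsh-Hadamard transform along the last axis (length must be a power of two)."""
    *lead, d = x.shape
    y = x.reshape(-1, d).astype(np.float64)
    h = 1
    while h < d:
        for i in range(0, d, 2 * h):
            a = y[:, i:i + h].copy()
            b = y[:, i + h:i + 2 * h].copy()
            y[:, i:i + h] = a + b
            y[:, i + h:i + 2 * h] = a - b
        h *= 2
    return (y / np.sqrt(d)).reshape(*lead, d)

def to_2bit_codes(x, thresholds=(-0.5, 0.0, 0.5)):
    """Bucketize each transformed coordinate into one of 4 buckets, i.e. a 2-bit code.
    The thresholds are placeholders; a real system would calibrate them."""
    return np.digitize(x, thresholds).astype(np.int8)   # values in {0, 1, 2, 3}

def topk_keys(query, keys, k):
    """Pick the k cached keys whose 2-bit codes are closest to the query's code
    in Manhattan distance, as a cheap proxy for attention relevance."""
    q_code = to_2bit_codes(hadamard_transform(query))    # shape (d,)
    # In a real decoder the key codes would be precomputed once and cached.
    k_codes = to_2bit_codes(hadamard_transform(keys))    # shape (n, d)
    dist = np.abs(k_codes - q_code).sum(axis=-1)         # Manhattan distance per key
    return np.argsort(dist)[:k]                          # indices of the k closest keys

# Toy usage: one decoding-step query against 1024 cached keys (head dimension 128).
rng = np.random.default_rng(0)
q = rng.standard_normal(128)
K = rng.standard_normal((1024, 128))
selected = topk_keys(q, K, k=64)   # attend only to these 64 keys instead of all 1024
print(selected[:8])
```

The intuition behind a scheme of this shape is that the Hadamard transform spreads each vector's information evenly across coordinates, so even a coarse 2-bit code per coordinate preserves enough geometry for Manhattan distance to rank keys by relevance; the paper's actual bucketization and kernel-level implementation will differ from this sketch.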