Elastic Attention: Test-time Adaptive Sparsity Ratios for Efficient Transformers

January 24, 2026
作者: Zecheng Tang, Quantong Qiu, Yi Yang, Zhiyi Hong, Haiya Xiang, Kebin Liu, Qingqing Dang, Juntao Li, Min Zhang
cs.AI

Abstract

The quadratic complexity of standard attention mechanisms poses a significant scalability bottleneck for large language models (LLMs) in long-context scenarios. While hybrid attention strategies that combine sparse and full attention within a single model offer a viable solution, they typically employ static computation ratios (i.e., fixed proportions of sparse versus full attention) and fail to adapt to the varying sparsity sensitivities of downstream tasks during inference. To address this issue, we propose Elastic Attention, which allows the model to dynamically adjust its overall sparsity based on the input. This is achieved by integrating a lightweight Attention Router into the existing pretrained model, which dynamically assigns each attention head to one of several computation modes. With only 12 hours of training on 8xA800 GPUs, our method enables models to achieve both strong performance and efficient inference. Experiments across three long-context benchmarks on widely used LLMs demonstrate the superiority of our method.
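
The abstract only states that a lightweight Attention Router assigns each attention head to a computation mode; the concrete design is not specified here. Below is a minimal PyTorch sketch of what such a per-head router could look like, under explicitly assumed details: the class name AttentionRouter, the mean-pooled input, the single linear gate, and the threshold-at-zero hard decision are all illustrative choices, not the paper's specification.

```python
import torch
import torch.nn as nn


class AttentionRouter(nn.Module):
    """Hypothetical lightweight router: scores each attention head on the
    current input and assigns it a computation mode (0 = sparse, 1 = full)."""

    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        # A single linear gate mapping pooled hidden states to one logit per head.
        self.gate = nn.Linear(hidden_size, num_heads)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size)
        pooled = hidden_states.mean(dim=1)   # (batch, hidden_size)
        logits = self.gate(pooled)           # (batch, num_heads)
        # Hard decision at inference time: positive logit -> full attention,
        # otherwise the head falls back to sparse attention.
        return (logits > 0).long()           # (batch, num_heads)


# Usage sketch: route 32 heads for a random long-context batch.
router = AttentionRouter(hidden_size=4096, num_heads=32)
x = torch.randn(2, 1024, 4096)
modes = router(x)   # per-head 0/1 modes, e.g. tensor([[1, 0, 0, 1, ...], ...])
```

A single linear layer over pooled hidden states keeps the routing overhead negligible next to the attention computation itself, which matches the abstract's emphasis on a lightweight router; the actual method may well use a soft or differently trained routing rule.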