Elastic Attention: Test-time Adaptive Sparsity Ratios for Efficient Transformers
January 24, 2026
Authors: Zecheng Tang, Quantong Qiu, Yi Yang, Zhiyi Hong, Haiya Xiang, Kebin Liu, Qingqing Dang, Juntao Li, Min Zhang
cs.AI
Abstract
The quadratic complexity of standard attention mechanisms poses a significant scalability bottleneck for large language models (LLMs) in long-context scenarios. While hybrid attention strategies that combine sparse and full attention within a single model offer a viable solution, they typically employ static computation ratios (i.e., fixed proportions of sparse versus full attention) and fail to adapt to the varying sparsity sensitivities of downstream tasks during inference. To address this issue, we propose Elastic Attention, which allows the model to dynamically adjust its overall sparsity based on the input. This is achieved by integrating a lightweight Attention Router into the existing pretrained model, which dynamically assigns each attention head to different computation modes. With only 12 hours of training on 8 A800 GPUs, our method enables models to achieve both strong performance and efficient inference. Experiments across three long-context benchmarks on widely used LLMs demonstrate the superiority of our method.
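To make the routing idea concrete, below is a minimal PyTorch sketch of how a lightweight router could assign each attention head to either full attention or a sparse (sliding-window) pattern based on the input. The names (`AttentionRouter`, `routed_attention`, `sliding_window_mask`), the mean-pooled linear gate, and the window size are illustrative assumptions; the abstract does not specify the router's architecture or the sparse pattern used.

```python
# Hypothetical sketch of input-dependent head routing between full and
# sparse attention. Not the authors' implementation; all design details
# below (pooling, gate, sliding-window pattern) are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionRouter(nn.Module):
    """Scores each head from a pooled summary of the input and assigns it
    to a 'full' or 'sparse' computation mode."""

    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_heads)  # one logit per head

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size)
        pooled = hidden_states.mean(dim=1)             # (batch, hidden_size)
        logits = self.gate(pooled)                     # (batch, num_heads)
        # Hard decision at inference; training would typically use a
        # differentiable relaxation (e.g. straight-through or Gumbel-softmax).
        return torch.sigmoid(logits) > 0.5             # True -> full attention


def sliding_window_mask(seq_len: int, window: int, device) -> torch.Tensor:
    """Causal mask restricted to the last `window` positions."""
    idx = torch.arange(seq_len, device=device)
    dist = idx[None, :] - idx[:, None]                 # j - i
    return (dist <= 0) & (dist > -window)


def routed_attention(q, k, v, use_full, window: int = 128):
    """q, k, v: (batch, heads, seq, dim); use_full: (batch, heads) bool."""
    b, h, n, d = q.shape
    causal = sliding_window_mask(n, n, q.device)       # full causal mask
    local = sliding_window_mask(n, window, q.device)   # sparse windowed mask
    # Select a mask per (batch, head) according to the router's decision.
    mask = torch.where(use_full[:, :, None, None], causal, local)
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v


if __name__ == "__main__":
    b, h, n, dim = 2, 8, 256, 64
    x = torch.randn(b, n, h * dim)
    router = AttentionRouter(hidden_size=h * dim, num_heads=h)
    use_full = router(x)                               # per-input head assignment
    q = k = v = torch.randn(b, h, n, dim)
    out = routed_attention(q, k, v, use_full)
    # Overall sparsity ratio varies with the input via the router.
    print(out.shape, use_full.float().mean().item())
```

In this sketch the fraction of heads routed to full attention, and hence the overall sparsity ratio, is decided per input at test time by the gate; the heavy attention computation itself is untouched, which is what keeps the router lightweight.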