
Flux Attention: Context-Aware Hybrid Attention for Efficient LLM Inference

April 8, 2026
Authors: Quantong Qiu, Zhiyi Hong, Yi Yang, Haitian Wang, Kebin Liu, Qingqing Dang, Juntao Li, Min Zhang
cs.AI

Abstract

The quadratic computational complexity of standard attention mechanisms presents a severe scalability bottleneck for LLMs in long-context scenarios. While hybrid attention mechanisms combining Full Attention (FA) and Sparse Attention (SA) offer a potential solution, existing methods typically rely on static allocation ratios that fail to accommodate the variable retrieval demands of different tasks. Furthermore, head-level dynamic sparsity often introduces severe computational load imbalance and synchronization long tails, which hinder hardware acceleration during autoregressive decoding. To bridge this gap, we introduce Flux Attention, a context-aware framework that dynamically optimizes attention computation at the layer level. By integrating a lightweight Layer Router into frozen pretrained LLMs, the proposed method adaptively routes each layer to FA or SA based on the input context. This layer-wise routing preserves high-fidelity information retrieval while ensuring contiguous memory access, translating theoretical computational reductions into practical wall-clock speedups. As a parameter-efficient approach, our framework requires only 12 hours of training on 8×A800 GPUs. Extensive experiments across multiple long-context and mathematical reasoning benchmarks demonstrate that Flux Attention achieves a superior trade-off between performance and inference speed compared with baseline models, with speedups of up to 2.8× and 2.0× in the prefill and decode stages, respectively.
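The layer-wise routing idea can be illustrated with a minimal sketch. The abstract does not specify the Layer Router's architecture or the sparse pattern, so the threshold rule, the `router_scores` values, and the sliding-window cost model below are all illustrative assumptions, not the paper's implementation; the sketch only shows how per-layer FA/SA decisions translate into a reduced attention cost.

```python
# Hypothetical sketch of layer-wise FA/SA routing; the real Layer Router is
# a learned module conditioned on the input context, and the real SA pattern
# is unspecified here. This mock uses a score threshold and a sliding window.

def route_layers(router_scores, threshold=0.5):
    """Map each layer's router score to 'full' or 'sparse' attention."""
    return ["full" if s >= threshold else "sparse" for s in router_scores]

def attention_cost(mode, seq_len, window=256):
    """Rough token-pair count: FA is O(n^2); sliding-window SA is O(n*w)."""
    return seq_len * seq_len if mode == "full" else seq_len * min(window, seq_len)

scores = [0.9, 0.2, 0.7, 0.1]          # hypothetical per-layer router outputs
plan = route_layers(scores)             # every layer gets one whole-layer mode
total = sum(attention_cost(m, 8192) for m in plan)
dense = 4 * attention_cost("full", 8192)
print(plan, f"cost ratio vs. all-full: {total / dense:.3f}")
```

Because each layer is routed as a unit, the sparse layers keep a single contiguous access pattern, which is the property the abstract credits for turning the theoretical cost reduction into wall-clock speedup (unlike per-head sparsity, which leaves heads within one layer with unequal work).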