Higher-order Linear Attention
October 31, 2025
Authors: Yifan Zhang, Zhen Qin, Quanquan Gu
cs.AI
Abstract
The quadratic cost of scaled dot-product attention is a central obstacle to
scaling autoregressive language models to long contexts. Linear-time attention
and State Space Models (SSMs) provide scalable alternatives but are typically
restricted to first-order or kernel-based approximations, which can limit
expressivity. We introduce Higher-order Linear Attention (HLA), a causal,
streaming mechanism that realizes higher-order interactions via compact prefix
sufficient statistics. In the second-order case, HLA maintains a constant-size
state and computes per-token outputs in linear time without materializing any
n × n matrices. We give closed-form streaming identities, a strictly
causal masked variant using two additional summaries, and a chunk-parallel
training scheme based on associative scans that reproduces the activations of a
serial recurrence exactly. We further outline extensions to third and higher
orders. Collectively, these results position HLA as a principled, scalable
building block that combines attention-like, data-dependent mixing with the
efficiency of modern recurrent architectures. Project Page:
https://github.com/yifanzhang-pro/HLA.
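
For intuition, the sketch below shows the constant-size prefix-state pattern that linear-time attention relies on. It is a minimal NumPy illustration of only the standard first-order causal linear-attention recurrence, not the second-order HLA identities (which the abstract does not spell out); the function name, normalization, and dimensions are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch (assumption: standard first-order causal linear attention,
# shown only to illustrate the constant-size prefix-statistics idea; the
# second-order HLA recurrences are not reproduced here).
import numpy as np

def linear_attention_stream(Q, K, V, eps=1e-6):
    """Causal linear attention with a constant-size streaming state.

    Q, K: (n, d_k) non-negative query/key feature maps
    V:    (n, d_v) values
    Returns O: (n, d_v), computed in O(n * d_k * d_v) time without ever
    materializing the n × n attention matrix.
    """
    n, d_k = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))   # prefix statistic: sum_t k_t v_t^T
    z = np.zeros(d_k)          # prefix statistic: sum_t k_t (normalizer)
    O = np.zeros((n, d_v))
    for t in range(n):
        S += np.outer(K[t], V[t])             # constant-size state update
        z += K[t]
        O[t] = (Q[t] @ S) / (Q[t] @ z + eps)  # per-token output, O(d_k * d_v)
    return O
```

Per the abstract, HLA's second-order case follows the same streaming pattern with richer prefix summaries (plus two additional summaries for the strictly causal masked variant), and the sequential loop can be replaced by a chunk-parallel associative scan that reproduces the serial recurrence's activations exactly.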