
Higher-order Linear Attention

October 31, 2025
Authors: Yifan Zhang, Zhen Qin, Quanquan Gu
cs.AI

Abstract

The quadratic cost of scaled dot-product attention is a central obstacle to scaling autoregressive language models to long contexts. Linear-time attention and State Space Models (SSMs) provide scalable alternatives but are typically restricted to first-order or kernel-based approximations, which can limit expressivity. We introduce Higher-order Linear Attention (HLA), a causal, streaming mechanism that realizes higher-order interactions via compact prefix sufficient statistics. In the second-order case, HLA maintains a constant-size state and computes per-token outputs in linear time without materializing any n × n matrices. We give closed-form streaming identities, a strictly causal masked variant using two additional summaries, and a chunk-parallel training scheme based on associative scans that reproduces the activations of a serial recurrence exactly. We further outline extensions to third and higher orders. Collectively, these results position HLA as a principled, scalable building block that combines attention-like, data-dependent mixing with the efficiency of modern recurrent architectures. Project Page: https://github.com/yifanzhang-pro/HLA.
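To make the streaming idea concrete, here is a minimal NumPy sketch of a causal second-order linear-attention update built from prefix sufficient statistics. The particular statistics (S1 = sum of k_i v_i^T and S2 = sum of (k_i kron k_i) v_i^T), the unnormalized output rule, and the name streaming_second_order are illustrative assumptions for exposition only, not the paper's exact HLA identities, masked variant, or chunk-parallel scan.

    # Illustrative sketch (assumed form, not the paper's exact HLA recurrence):
    # causal, streaming second-order linear attention with constant-size state.
    import numpy as np

    def streaming_second_order(Q, K, V):
        """Q, K, V: (n, d) arrays; returns per-token outputs of shape (n, d)."""
        n, d = Q.shape
        S1 = np.zeros((d, d))      # first-order prefix statistic:  sum_i k_i v_i^T
        S2 = np.zeros((d * d, d))  # second-order prefix statistic: sum_i (k_i kron k_i) v_i^T
        Y = np.empty((n, d))
        for t in range(n):
            q, k, v = Q[t], K[t], V[t]
            # Causal update: token t only incorporates positions i <= t.
            S1 += np.outer(k, v)
            S2 += np.outer(np.kron(k, k), v)
            # Per-token output read from the constant-size state;
            # no n x n attention matrix is ever materialized.
            Y[t] = q @ S1 + np.kron(q, q) @ S2
        return Y

    # Tiny usage example with random inputs.
    rng = np.random.default_rng(0)
    n, d = 8, 4
    Q, K, V = rng.normal(size=(3, n, d))
    print(streaming_second_order(Q, K, V).shape)  # -> (8, 4)

In this sketch the per-token cost depends only on the head dimension d, not on the sequence length n; the paper's chunk-parallel training scheme would instead compute such prefix statistics with associative scans so that training matches the serial recurrence exactly.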