Linear-Time Global Visual Modeling without Explicit Attention
May 3, 2026
Authors: Ruize He, Dongchen Han, Gao Huang
cs.AI
Abstract
Existing research largely attributes the global sequence modeling capability of Transformers to the explicit computation of attention weights, a process that inherently incurs quadratic computational complexity. In this work, we offer a novel perspective: we demonstrate that attention can be mathematically reframed as a Multi-Layer Perceptron (MLP) equipped with dynamically predicted parameters. Through this lens, we explain attention's global modeling power not as explicit token-wise aggregation, but as an implicit process in which dynamically generated parameters act as a compressed representation of the global context. Inspired by this insight, we investigate a fundamental question: can we achieve Transformer-level global sequence modeling entirely through dynamic parameterization while maintaining linear complexity, effectively replacing explicit attention? To explore this, we design various dynamic parameter prediction strategies and integrate them into standard network layers. Extensive empirical studies on vision models demonstrate that dynamic parameterization can indeed serve as a highly effective, linear-complexity alternative to explicit attention, opening new pathways for efficient sequence modeling. Code is available at https://github.com/LeapLabTHU/WeightFormer.
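As a concrete illustration of the reframing described in the abstract (the notation below is ours for exposition; the paper's exact formulation may differ): for a query $q_i = x_i W_Q$ over keys $K = X W_K$ and values $V = X W_V$, single-head attention can be read as a two-layer MLP whose weight matrices are predicted from the input sequence itself.

```latex
% Single-head attention: o_i = softmax(q_i K^T / sqrt(d)) V.
% Reading K^T and V as input-dependent weight matrices, this is a
% two-layer MLP with a softmax nonlinearity whose parameters are
% generated from the sequence X rather than learned statically:
\[
  o_i = \sigma\!\big(q_i\, W_1(X)\big)\, W_2(X),
  \qquad
  W_1(X) = \frac{K^{\top}}{\sqrt{d}}, \quad
  W_2(X) = V, \quad
  \sigma = \mathrm{softmax}.
\]
```

Note that the dynamically generated $W_1(X) \in \mathbb{R}^{d \times N}$ and $W_2(X) \in \mathbb{R}^{N \times d}$ grow with the sequence length $N$, which is where the quadratic cost comes from; the question the paper poses is whether fixed-size predicted weights can play the same role at linear cost.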
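The paper's actual prediction strategies are in the linked repository; the snippet below is only a minimal PyTorch sketch of the general idea, assuming the dynamic weights are predicted from a mean-pooled global summary with a fixed inner width `rank`. The module name, the pooling choice, the softmax nonlinearity, and `rank` are all illustrative assumptions, not the authors' design.

```python
import torch
import torch.nn as nn


class DynamicParamLayer(nn.Module):
    """Token mixing via dynamically predicted MLP weights (sketch).

    A fixed-size weight pair (W1, W2) is predicted from a global summary
    of the sequence, then applied to every token independently. Cost is
    O(N * dim * rank): linear in sequence length N, unlike O(N^2) attention.
    """

    def __init__(self, dim: int, rank: int = 64):
        super().__init__()
        self.dim, self.rank = dim, rank
        # Predicts both dynamic weight matrices from the pooled context.
        self.to_weights = nn.Linear(dim, 2 * dim * rank)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, dim)
        b, n, d = x.shape
        ctx = x.mean(dim=1)                    # (b, d): global summary, O(N)
        w = self.to_weights(ctx)               # (b, 2 * d * rank)
        w1, w2 = w.split(d * self.rank, dim=-1)
        w1 = w1.view(b, d, self.rank)          # dynamic first-layer weights
        w2 = w2.view(b, self.rank, d)          # dynamic second-layer weights
        # Per-token two-layer MLP with input-dependent parameters.
        h = torch.softmax(torch.bmm(self.norm(x), w1), dim=-1)
        return x + torch.bmm(h, w2)            # residual connection


if __name__ == "__main__":
    layer = DynamicParamLayer(dim=128, rank=32)
    tokens = torch.randn(2, 196, 128)          # e.g. 14x14 ViT patch tokens
    out = layer(tokens)
    print(out.shape)                           # torch.Size([2, 196, 128])
```

Because the predicted weight pair has a fixed size independent of N, every token still receives global information (through the pooled summary baked into W1 and W2) while the per-layer cost stays linear in sequence length.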