Linear-Time Global Visual Modeling without Explicit Attention

Abstract

Existing research largely attributes the global sequence modeling capability of Transformers to the explicit computation of attention weights, a process that inherently incurs quadratic computational complexity. In this work, we offer a different perspective: we show that attention can be mathematically reframed as a Multi-Layer Perceptron (MLP) equipped with dynamically predicted parameters. Through this lens, we explain attention's global modeling power not as explicit token-wise aggregation, but as an implicit process in which dynamically generated parameters act as a compressed representation of the global context. Building on this insight, we investigate a fundamental question: can dynamic parameterization alone achieve Transformer-level global sequence modeling at linear complexity, effectively replacing explicit attention? To explore this, we design various dynamic parameter prediction strategies and integrate them into standard network layers. Extensive empirical studies on vision models show that dynamic parameterization can serve as a highly effective, linear-complexity alternative to explicit attention, opening new pathways for efficient sequence modeling. Code is available at https://github.com/LeapLabTHU/WeightFormer.
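As a rough illustration of the reframing stated in the abstract, the sketch below checks numerically that single-head softmax attention for one query equals a two-layer MLP whose weights are built from the context tokens: the first layer is the scaled key matrix, the activation is a softmax, and the second layer is the transposed value matrix. This is a minimal sketch under our own assumptions, not code from the WeightFormer repository; the function names `dynamic_mlp` and `weights_from_context` are hypothetical, and the paper's exact formulation may differ.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_one_query(q, keys, values):
    """Standard single-head attention output for one query vector."""
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys])
    dim_v = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]


def dynamic_mlp(q, w1, w2):
    """Two-layer MLP (hypothetical name): output = W2 @ softmax(W1 @ q)."""
    hidden = softmax([sum(wij * qj for wij, qj in zip(row, q)) for row in w1])
    return [sum(wij * hj for wij, hj in zip(row, hidden)) for row in w2]


def weights_from_context(keys, values):
    """Build the MLP parameters from the context tokens (hypothetical name).

    W1 is the scaled key matrix (N x d); W2 is the transposed value matrix (d_v x N).
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    w1 = [[scale * x for x in k] for k in keys]
    w2 = [list(col) for col in zip(*values)]
    return w1, w2


# Random toy context: 5 tokens with 4-dimensional keys and values (assumed sizes).
random.seed(0)
n_tokens, dim = 5, 4
keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
values = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
query = [random.gauss(0, 1) for _ in range(dim)]

w1, w2 = weights_from_context(keys, values)
out_attention = attention_one_query(query, keys, values)
out_mlp = dynamic_mlp(query, w1, w2)

# The two outputs should agree up to floating-point rounding.
max_gap = max(abs(a - b) for a, b in zip(out_attention, out_mlp))
```

In this sketch the hidden width of the MLP equals the number of tokens N, so applying it to every query still costs O(N^2); it only illustrates the equivalence. The question the abstract poses is whether parameters of a fixed size, predicted from the context, can play the same role at linear cost.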
