Linear-Time Global Visual Modeling without Explicit Attention

Abstract

Existing research largely attributes the global sequence modeling capability of Transformers to the explicit computation of attention weights, a process that inherently incurs quadratic computational complexity. In this work, we offer a novel perspective: we demonstrate that attention can be mathematically reframed as a Multi-Layer Perceptron (MLP) equipped with dynamically predicted parameters. Through this lens, we explain attention's global modeling power not as explicit token-wise aggregation, but as an implicit process in which dynamically generated parameters act as a compressed representation of the global context. Inspired by this insight, we investigate a fundamental question: can Transformer-level global sequence modeling be achieved entirely through dynamic parameterization, at linear complexity, so that explicit attention is effectively replaced? To explore this, we design various dynamic parameter prediction strategies and integrate them into standard network layers. Extensive empirical studies on vision models demonstrate that dynamic parameterization can indeed serve as a highly effective, linear-complexity alternative to explicit attention, opening new pathways for efficient sequence modeling. Code is available at https://github.com/LeapLabTHU/WeightFormer.
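The reframing claim can be illustrated with a minimal sketch. It follows one standard reading, which is not necessarily the authors' derivation: for a single head, softmax(q Kᵀ / √d) V is a two-layer MLP applied to the query q, with first-layer weights Kᵀ / √d, softmax as the activation, and second-layer weights V. Both weight matrices are predicted from the input itself. The function names (`attention`, `predict_mlp_params`, `dynamic_mlp`) and the toy sizes are hypothetical, and the sketch reproduces neither the WeightFormer code nor the linear-complexity strategies studied in the paper.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(x, wq, wk, wv):
    """Standard single-head softmax attention over all N tokens."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = 1.0 / math.sqrt(len(wq[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = [[s * scale for s in row] for row in matmul(q, k_t)]
    return matmul([softmax(r) for r in scores], v)


def predict_mlp_params(x, wk, wv):
    """Predict MLP weights from the input: W1 = K^T / sqrt(d), W2 = V."""
    k, v = matmul(x, wk), matmul(x, wv)
    scale = 1.0 / math.sqrt(len(wk[0]))
    w1 = [[val * scale for val in col] for col in zip(*k)]  # shape (d, N)
    w2 = v  # shape (N, d)
    return w1, w2


def dynamic_mlp(q_row, w1, w2):
    """Two-layer MLP on one query token: softmax(q W1) W2."""
    hidden = softmax(matmul([q_row], w1)[0])
    return matmul([hidden], w2)[0]


def rand_matrix(rows, cols):
    """Toy Gaussian matrix; stands in for tokens or learned projections."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
n_tokens, dim = 5, 4  # assumed toy sizes
x = rand_matrix(n_tokens, dim)
wq, wk, wv = rand_matrix(dim, dim), rand_matrix(dim, dim), rand_matrix(dim, dim)

ref = attention(x, wq, wk, wv)
w1, w2 = predict_mlp_params(x, wk, wv)
mlp_out = [dynamic_mlp(q_row, w1, w2) for q_row in matmul(x, wq)]

# Largest element-wise gap between the two formulations.
max_err = max(abs(a - b) for ra, rb in zip(ref, mlp_out) for a, b in zip(ra, rb))
print(max_err < 1e-9)
```

In this reading the hidden width of the MLP equals the number of tokens N, so applying it to every token costs O(N²) overall. A linear-cost alternative would need predicted weights whose size does not grow with N, which is the direction the abstract describes.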
