ViT-AdaLA: Adapting Vision Transformers with Linear Attention

Abstract

Vision foundation models (VFMs) based on Vision Transformers (ViTs) have achieved remarkable performance across diverse vision tasks, but their quadratic complexity limits scalability to long sequences. Existing linear attention approaches for ViTs are typically trained from scratch, which requires substantial computational resources, while linearization methods developed for large language model decoders do not transfer well to ViTs. To address these challenges, we propose ViT-AdaLA, a novel framework for effectively adapting and transferring prior knowledge from VFMs to linear attention ViTs. ViT-AdaLA consists of three stages: attention alignment, feature alignment, and supervised fine-tuning. In the attention alignment stage, we align vanilla linear attention with the original softmax-based attention in each block so that it approximates the behavior of softmax attention. However, residual approximation errors inevitably accumulate across layers. We mitigate this by fine-tuning the linearized ViT to align its final-layer features with those of a frozen softmax VFM teacher. Finally, the adapted prior knowledge is transferred to downstream tasks through supervised fine-tuning. Extensive experiments on classification and segmentation tasks demonstrate the effectiveness and generality of ViT-AdaLA over various state-of-the-art linear attention counterparts.
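The attention alignment stage can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the abstract does not say which feature map or loss ViT-AdaLA uses, so the elu(x) + 1 feature map, the mean-squared-error loss, and every function name below are assumptions chosen only to show the idea. A softmax "teacher" attention builds an N x N score matrix, a linear "student" attention builds only d x d summaries, and a loss measures how far the student's output is from the teacher's.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def feature_map(m):
    """Positive feature map phi(x) = elu(x) + 1 (an assumed, common choice)."""
    return [[x + 1.0 if x > 0 else math.exp(x) for x in row] for row in m]

def softmax_attention(q, k, v):
    """Standard attention: forms an N x N weight matrix, so cost grows as N^2."""
    d = len(q[0])
    weights = []
    for row in matmul(q, transpose(k)):
        scaled = [s / math.sqrt(d) for s in row]
        top = max(scaled)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scaled]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return matmul(weights, v)

def linear_attention(q, k, v):
    """Kernelized attention: phi(Q) (phi(K)^T V), normalized per query.
    Only d x d summaries are formed, so cost grows linearly in N."""
    phi_q, phi_k = feature_map(q), feature_map(k)
    kv = matmul(transpose(phi_k), v)             # d x d_v summary of keys and values
    k_sum = [sum(col) for col in zip(*phi_k)]    # length-d normalizer
    out = []
    for row in phi_q:
        denom = sum(a * b for a, b in zip(row, k_sum))
        numer = matmul([row], kv)[0]
        out.append([x / denom for x in numer])
    return out

def alignment_loss(student, teacher):
    """Mean squared error between student and teacher outputs (assumed loss)."""
    count = sum(len(r) for r in student)
    return sum((s - t) ** 2
               for rs, rt in zip(student, teacher)
               for s, t in zip(rs, rt)) / count

# Toy inputs: N = 3 tokens, d = 2 channels (hypothetical values).
q = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]
k = [[0.2, 0.1], [-0.3, 0.4], [0.5, 0.0]]
v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

teacher = softmax_attention(q, k, v)   # frozen softmax attention output
student = linear_attention(q, k, v)    # linear attention output to be aligned
loss = alignment_loss(student, teacher)
```

In an actual training setup the student's projections would be learnable and this loss would be minimized block by block. The later feature alignment and supervised fine-tuning stages are not sketched here.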
