ViT-AdaLA: Adapting Vision Transformers with Linear Attention
March 17, 2026
Authors: Yifan Li, Seunghyun Yoon, Viet Dac Lai, Franck Dernoncourt, Jason Kuen, Yu Kong, Trung Bui
cs.AI
Abstract
Vision Transformer (ViT)-based vision foundation models (VFMs) have achieved remarkable performance across diverse vision tasks, but suffer from quadratic attention complexity that limits scalability to long sequences. Existing linear attention approaches for ViTs are typically trained from scratch, requiring substantial computational resources, while linearization methods developed for large language model decoders do not transfer well to ViTs. To address these challenges, we propose ViT-AdaLA, a novel framework for efficiently adapting and transferring prior knowledge from VFMs to linear attention ViTs. ViT-AdaLA consists of three stages: attention alignment, feature alignment, and supervised fine-tuning. In the attention alignment stage, we align vanilla linear attention with the original softmax-based attention in each block so that it approximates the behavior of softmax attention. However, residual approximation errors inevitably accumulate across layers. We mitigate this by fine-tuning the linearized ViT to align its final-layer features with those of a frozen softmax VFM teacher. Finally, the adapted prior knowledge is transferred to downstream tasks through supervised fine-tuning. Extensive experiments on classification and segmentation tasks demonstrate the effectiveness and generality of ViT-AdaLA over various state-of-the-art linear attention counterparts.
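To make the complexity contrast concrete, the following minimal sketch (not the authors' code; the `elu(x)+1` feature map is one common choice for vanilla linear attention, assumed here for illustration) compares softmax attention, which materializes an N×N attention matrix, with linear attention, which exploits associativity to aggregate keys and values once in O(N·d²):

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard attention: forms an (N, N) score matrix -> O(N^2) time/memory.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, eps=1e-6):
    # Vanilla linear attention: replace exp(q.k) with phi(q).phi(k) and use
    # associativity, (phi(Q) phi(K)^T) V = phi(Q) (phi(K)^T V),
    # so cost is O(N d^2) and no (N, N) matrix is ever formed.
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1 > 0
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                       # (d, d_v), aggregated once
    z = Qp @ Kp.sum(axis=0) + eps       # (N,) row normalizer
    return (Qp @ kv) / z[:, None]

rng = np.random.default_rng(0)
N, d = 16, 8
Q, K, V = rng.standard_normal((3, N, d))
out_soft = softmax_attention(Q, K, V)
out_lin = linear_attention(Q, K, V)
print(out_soft.shape, out_lin.shape)  # both (16, 8)
```

The two functions return outputs of identical shape but generally different values; the gap between them is exactly the per-block approximation error that the paper's attention alignment stage is designed to shrink.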