ViT-AdaLA: Adapting Vision Transformers with Linear Attention

Abstract

Vision Transformer (ViT)-based vision foundation models (VFMs) have achieved remarkable performance across diverse vision tasks, but their self-attention has quadratic complexity in sequence length, which limits scalability to long sequences. Existing linear attention approaches for ViTs are typically trained from scratch, which requires substantial computational resources, while linearization methods developed for large language model decoders do not transfer well to ViTs. To address these challenges, we propose ViT-AdaLA, a novel framework for effectively adapting and transferring prior knowledge from VFMs to linear attention ViTs. ViT-AdaLA consists of three stages: attention alignment, feature alignment, and supervised fine-tuning. In the attention alignment stage, we align vanilla linear attention with the original softmax-based attention in each block so that it approximates the behavior of softmax attention. However, residual approximation errors inevitably accumulate across layers. We mitigate this in the feature alignment stage by fine-tuning the linearized ViT to align its final-layer features with those of a frozen softmax VFM teacher. Finally, the adapted prior knowledge is transferred to downstream tasks through supervised fine-tuning. Extensive experiments on classification and segmentation tasks demonstrate the effectiveness and generality of ViT-AdaLA over various state-of-the-art linear attention counterparts.
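To make the attention alignment idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It contrasts softmax attention, which forms an n-by-n score matrix, with a kernelized linear attention that avoids it, and then measures how far apart their outputs are. The feature map (`elu(x) + 1`), the mean-squared-error objective, and all function names here are assumptions chosen for illustration; the abstract does not specify them.

```python
import numpy as np


def softmax_attention(q, k, v):
    """Standard attention: forms an (n, n) score matrix, so cost is O(n^2 * d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def feature_map(x):
    """Assumed positive feature map elu(x) + 1 (hypothetical choice)."""
    return np.where(x > 0, x + 1.0, np.exp(x))


def linear_attention(q, k, v, eps=1e-6):
    """Kernelized attention phi(Q) (phi(K)^T V); never builds the (n, n) matrix."""
    phi_q, phi_k = feature_map(q), feature_map(k)
    kv = phi_k.T @ v           # (d, d_v) summary of keys and values
    z = phi_k.sum(axis=0)      # (d,) normalizer summary
    return (phi_q @ kv) / (phi_q @ z + eps)[:, None]


def linear_attention_quadratic(q, k, v, eps=1e-6):
    """Same quantity computed through the explicit (n, n) matrix, for checking."""
    phi_q, phi_k = feature_map(q), feature_map(k)
    a = phi_q @ phi_k.T
    return (a @ v) / (a.sum(axis=-1, keepdims=True) + eps)


def alignment_loss(student_out, teacher_out):
    """Assumed alignment objective: mean squared error between the two outputs."""
    return float(np.mean((student_out - teacher_out) ** 2))


rng = np.random.default_rng(0)
n, d = 16, 8  # n tokens (patches), head dimension d
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))

teacher = softmax_attention(q, k, v)   # frozen softmax block output
student = linear_attention(q, k, v)    # linear block output to be aligned
loss = alignment_loss(student, teacher)
```

In a training setting, `loss` would be minimized per block with respect to the linear attention parameters while the softmax teacher stays fixed. The feature alignment stage applies the same kind of objective to final-layer features rather than per-block attention outputs.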
