AdapterTune: Zero-Initialized Low-Rank Adapters for Frozen Vision Transformers

Abstract

Frozen-backbone transfer with Vision Transformers faces two under-addressed issues: optimization instability when adapters are naively inserted into a fixed feature extractor, and the absence of principled guidance for setting adapter capacity. We introduce AdapterTune, which augments each transformer block with a residual low-rank bottleneck whose up-projection is zero-initialized. This guarantees that the adapted network starts exactly at the pretrained function and eliminates early-epoch representation drift. On the analytical side, we formalize adapter rank as a capacity budget for approximating downstream task shifts in feature space. The resulting excess-risk decomposition predicts monotonic but diminishing accuracy gains with increasing rank, an "elbow" behavior we confirm through controlled sweeps. We evaluate on 9 datasets and 3 backbone scales with multi-seed reporting throughout. On a core 5-dataset transfer suite, AdapterTune improves top-1 accuracy over head-only transfer by +14.9 points on average while training only 0.92% of the parameters required by full fine-tuning, and outperforms full fine-tuning on 10 of 15 dataset-backbone pairs. Across the full benchmark, AdapterTune improves over head-only transfer on every dataset-backbone pair tested. Ablations on rank, placement, and initialization isolate each design choice. The code is available at https://github.com/salimkhazem/adaptertune
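
The abstract describes a residual low-rank bottleneck whose up-projection starts at zero, so the adapted network initially computes exactly the pretrained function. The following is a minimal, framework-free sketch of that idea, not the authors' implementation (which lives in the linked repository). The names `ZeroInitAdapter` and `matvec`, the ReLU activation, the 0.02 initialization scale for the down-projection, and the omission of biases are all assumptions made for illustration.

```python
import random


def matvec(weight, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


class ZeroInitAdapter:
    """Residual low-rank bottleneck: h + up(act(down(h))).

    Hypothetical illustration: biases are omitted and ReLU is an assumed
    activation; the abstract does not specify either.
    """

    def __init__(self, dim, rank, seed=0):
        rng = random.Random(seed)
        # Down-projection (rank x dim): small random values (assumed scale).
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                     for _ in range(rank)]
        # Up-projection (dim x rank): all zeros, so the adapter branch
        # contributes nothing until training updates these weights.
        self.up = [[0.0] * rank for _ in range(dim)]

    def num_params(self):
        # Two weight matrices of rank * dim entries each (no biases here).
        return 2 * len(self.down) * len(self.up)

    def __call__(self, h):
        z = [max(0.0, v) for v in matvec(self.down, h)]  # bottleneck activation
        delta = matvec(self.up, z)                       # zero at initialization
        return [a + b for a, b in zip(h, delta)]         # residual connection


adapter = ZeroInitAdapter(dim=8, rank=2)
features = [0.5 * i - 1.0 for i in range(8)]
output = adapter(features)
```

At initialization `output` equals `features` element for element, which is the property the abstract relies on to avoid early-epoch representation drift. In this sketch the trainable parameter count grows linearly with the rank (`2 * dim * rank`), which is one way to read rank as a capacity budget.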
