Layer by layer, module by module: Choose both for optimal OOD probing of ViT

Abstract

Recent studies have observed that intermediate layers of foundation models often yield more discriminative representations than the final layer. While initially attributed to autoregressive pretraining, this phenomenon has also been identified in models trained via supervised and discriminative self-supervised objectives. In this paper, we systematically analyze the behavior of intermediate layers in pretrained vision transformers. Through extensive linear probing experiments across a diverse set of image classification benchmarks, we find that distribution shift between pretraining and downstream data is the primary cause of performance degradation in deeper layers. Furthermore, we perform a fine-grained analysis at the module level. Our findings reveal that standard probing of transformer block outputs is suboptimal; instead, probing the activation within the feedforward network yields the best performance under significant distribution shift, whereas the normalized output of the multi-head self-attention module is optimal when the shift is weak.

