Attention, Please! Revisiting Attentive Probing for Masked Image Modeling

Abstract

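As fine-tuning (FT) becomes increasingly impractical at scale, probing is emerging as the preferred evaluation protocol for self-supervised learning (SSL). Yet standard linear probing (LP) fails to adequately reflect the potential of models trained with masked image modeling (MIM), because of the distributed nature of patch tokens. This motivates attentive probing, an alternative that uses attention to selectively aggregate patch-level features. Despite its growing adoption, attentive probing remains under-explored, and existing methods suffer from excessive parameterization and poor computational efficiency. In this work, we revisit attentive probing through the lens of the accuracy-efficiency trade-off. We conduct a systematic study of existing methods, analyzing their mechanisms and benchmarking their performance. Building on this study, we introduce efficient probing (EP), a multi-query cross-attention mechanism that eliminates redundant projections, reduces the number of trainable parameters, and achieves up to a 10× speed-up over conventional multi-head attention. Despite its simplicity, EP outperforms LP and prior attentive probing approaches across seven benchmarks, generalizes well beyond MIM to diverse pre-training paradigms, produces interpretable attention maps, and achieves strong gains in low-shot and layer-wise settings. Code is available at https://github.com/billpsomas/efficient-probing.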
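To make the idea of attention-based pooling over patch tokens concrete, here is a minimal pure-Python sketch of a multi-query cross-attention pooling step. It is an illustration under stated assumptions, not the authors' implementation (see the repository linked above for that): the function name `multi_query_pool`, the choice to score patches directly against the raw features with no key or value projection, and the rule that each query pools its own slice of feature channels are all assumptions made here for clarity.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def multi_query_pool(patches, queries):
    """Pool N patch tokens (each of dimension d) into one d-dimensional vector.

    Assumptions (illustrative only): every query scores each patch with a
    scaled dot product against the raw patch features, and the m-th of the
    M queries pools only the m-th chunk of d // M feature channels. The
    pooled chunks are concatenated in query order.
    """
    d = len(patches[0])
    num_queries = len(queries)
    assert d % num_queries == 0, "feature dim must be divisible by the number of queries"
    chunk = d // num_queries
    pooled, attn_maps = [], []
    for m, q in enumerate(queries):
        # One attention map over the patches per query.
        scores = [sum(qi * xi for qi, xi in zip(q, x)) / math.sqrt(d) for x in patches]
        weights = softmax(scores)
        attn_maps.append(weights)
        # Attention-weighted average of this query's channel slice.
        for c in range(m * chunk, (m + 1) * chunk):
            pooled.append(sum(w * x[c] for w, x in zip(weights, patches)))
    return pooled, attn_maps

# Toy input: 3 patch tokens with 4 channels, 2 hypothetical queries.
patches = [[1.0, 0.0, 2.0, 0.0],
           [0.0, 1.0, 0.0, 2.0],
           [1.0, 1.0, 1.0, 1.0]]
queries = [[0.0, 0.0, 0.0, 0.0],   # zero query gives uniform attention
           [5.0, 0.0, 0.0, 0.0]]   # favors patches with a large first channel
pooled, attn_maps = multi_query_pool(patches, queries)
```

In a probing setup the queries would be learned while the backbone stays frozen, and the pooled vector would feed a linear classifier. The per-query weights in `attn_maps` are the kind of quantity one would visualize as attention maps.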

