

Attention, Please! Revisiting Attentive Probing for Masked Image Modeling

June 11, 2025
作者: Bill Psomas, Dionysis Christopoulos, Eirini Baltzi, Ioannis Kakogeorgiou, Tilemachos Aravanis, Nikos Komodakis, Konstantinos Karantzalos, Yannis Avrithis, Giorgos Tolias
cs.AI

Abstract

As fine-tuning (FT) becomes increasingly impractical at scale, probing is emerging as the preferred evaluation protocol for self-supervised learning (SSL). Yet, the standard linear probing (LP) fails to adequately reflect the potential of models trained with Masked Image Modeling (MIM), due to the distributed nature of patch tokens. This motivates the need for attentive probing, an alternative that uses attention to selectively aggregate patch-level features. Despite its growing adoption, attentive probing remains under-explored, with existing methods suffering from excessive parameterization and poor computational efficiency. In this work, we revisit attentive probing through the lens of the accuracy-efficiency trade-off. We conduct a systematic study of existing methods, analyzing their mechanisms and benchmarking their performance. We introduce efficient probing (EP), a multi-query cross-attention mechanism that eliminates redundant projections, reduces the number of trainable parameters, and achieves up to a 10× speed-up over conventional multi-head attention. Despite its simplicity, EP outperforms LP and prior attentive probing approaches across seven benchmarks, generalizes well beyond MIM to diverse pre-training paradigms, produces interpretable attention maps, and achieves strong gains in low-shot and layer-wise settings. Code available at https://github.com/billpsomas/efficient-probing.
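For intuition, here is a minimal PyTorch sketch of what a multi-query cross-attention probe over frozen patch tokens could look like. The class name, number of queries, and the choice to keep only a single value projection are illustrative assumptions made for this sketch, not the authors' exact EP implementation; see the linked repository for the actual code.

```python
import torch
import torch.nn as nn

class MultiQueryCrossAttentionProbe(nn.Module):
    """Hypothetical multi-query cross-attention pooling head.

    A small set of learnable queries attends over frozen patch tokens,
    and the pooled features feed a linear classifier. Dropping the
    query/key/output projections of full multi-head attention keeps the
    trainable parameter count low.
    """

    def __init__(self, dim: int, num_queries: int = 4, num_classes: int = 1000):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.value = nn.Linear(dim, dim)            # single value projection
        self.classifier = nn.Linear(num_queries * dim, num_classes)
        self.scale = dim ** -0.5

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, dim) frozen features from the pre-trained backbone
        attn = torch.einsum("qd,bnd->bqn", self.queries, patch_tokens) * self.scale
        attn = attn.softmax(dim=-1)                  # (B, Q, N) attention maps
        pooled = torch.einsum("bqn,bnd->bqd", attn, self.value(patch_tokens))
        return self.classifier(pooled.flatten(1))    # (B, num_classes)


# Usage sketch: probe frozen ViT features (8 images, 196 patch tokens, dim 768)
probe = MultiQueryCrossAttentionProbe(dim=768, num_queries=4, num_classes=100)
logits = probe(torch.randn(8, 196, 768))
```

Because the backbone stays frozen, only the queries, the value projection, and the classifier are trained, which is what makes this style of probe cheap relative to fine-tuning or a full attention block.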