Layer by layer, module by module: Choose both for optimal OOD probing of ViT

Abstract

Recent studies have observed that intermediate layers of foundation models often yield more discriminative representations than the final layer. While initially attributed to autoregressive pretraining, this phenomenon has also been identified in models trained with supervised and discriminative self-supervised objectives. In this paper, we conduct a comprehensive study of the behavior of intermediate layers in pretrained vision transformers. Through extensive linear probing experiments across a diverse set of image classification benchmarks, we find that distribution shift between pretraining and downstream data is the primary cause of performance degradation in deeper layers. Furthermore, we perform a fine-grained analysis at the module level. Our findings reveal that standard probing of transformer block outputs is suboptimal: probing the activation within the feedforward network yields the best performance under significant distribution shift, whereas the normalized output of the multi-head self-attention module is optimal when the shift is weak.
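The three probe locations contrasted in the abstract can be illustrated with a dependency-free toy. The sketch below is not the paper's implementation: it assumes a pre-LayerNorm block, uses a single attention head and random weights instead of a pretrained ViT, mean-pools tokens into one feature vector, and interprets "normalized output of the multi-head self-attention module" as the LayerNorm applied after the attention residual. All function and key names (`block_forward`, `attn_norm`, `ffn_act`, `block_out`) are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def layer_norm(a, eps=1e-6):
    """Normalize each row to zero mean and unit variance (no learned scale/shift)."""
    out = []
    for row in a:
        mean = sum(row) / len(row)
        var = sum((x - mean) ** 2 for x in row) / len(row)
        out.append([(x - mean) / math.sqrt(var + eps) for x in row])
    return out

def softmax_rows(a):
    """Numerically stable softmax over each row."""
    out = []
    for row in a:
        m = max(row)
        e = [math.exp(x - m) for x in row]
        s = sum(e)
        out.append([x / s for x in e])
    return out

def gelu(a):
    """Tanh approximation of GELU, applied element-wise."""
    c = math.sqrt(2.0 / math.pi)
    return [[0.5 * x * (1.0 + math.tanh(c * (x + 0.044715 * x ** 3))) for x in row]
            for row in a]

def random_matrix(rng, rows, cols, scale=0.2):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def make_block(rng, dim, hidden):
    """Random weights standing in for one pretrained block (assumption)."""
    p = {name: random_matrix(rng, dim, dim) for name in ("wq", "wk", "wv", "wo")}
    p["w1"] = random_matrix(rng, dim, hidden)
    p["w2"] = random_matrix(rng, hidden, dim)
    return p

def block_forward(x, p):
    """Run one pre-LN block and return its output plus three candidate probe taps."""
    h = layer_norm(x)
    q, k, v = matmul(h, p["wq"]), matmul(h, p["wk"]), matmul(h, p["wv"])
    k_t = [list(col) for col in zip(*k)]
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = [[s * scale for s in row] for row in matmul(q, k_t)]
    attn_out = matmul(matmul(softmax_rows(scores), v), p["wo"])
    x = add(x, attn_out)                    # residual after self-attention
    attn_norm = layer_norm(x)               # assumed "normalized MHSA output"
    ffn_act = gelu(matmul(attn_norm, p["w1"]))  # activation inside the FFN
    block_out = add(x, matmul(ffn_act, p["w2"]))  # standard block output
    return block_out, {"attn_norm": attn_norm, "ffn_act": ffn_act, "block_out": block_out}

def mean_pool(tokens):
    """Average over tokens to get one feature vector per image (assumption)."""
    return [sum(col) / len(tokens) for col in zip(*tokens)]

def extract_features(x, blocks):
    """Collect pooled features keyed by (layer index, module name)."""
    feats = {}
    for i, p in enumerate(blocks):
        x, taps = block_forward(x, p)
        for name, t in taps.items():
            feats[(i, name)] = mean_pool(t)
    return feats

rng = random.Random(0)
blocks = [make_block(rng, dim=8, hidden=32) for _ in range(3)]
tokens = random_matrix(rng, 5, 8, scale=1.0)  # 5 tokens standing in for patch embeddings
feats = extract_features(tokens, blocks)
print(sorted(feats))
```

In a real study, one linear classifier would be fit on each `(layer, module)` feature over a labeled downstream training set, and the location would be chosen by validation accuracy; that training step is omitted here.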
