Vision Transformer Finetuning Benefits from Non-Smooth Components

Abstract

The smoothness of the transformer architecture has been extensively studied in the context of generalization, training stability, and adversarial robustness. However, its role in transfer learning remains poorly understood. In this paper, we analyze the plasticity of vision transformer components, that is, their ability to adapt their outputs to changes in their inputs. Defined as an average rate of change, plasticity captures sensitivity to input perturbations; in particular, high plasticity implies low smoothness. We demonstrate through theoretical analysis and comprehensive experiments that this perspective provides principled guidance in choosing which components to prioritize during adaptation. A key takeaway for practitioners is that the high plasticity of the attention modules and feedforward layers consistently leads to better finetuning performance. Our findings depart from the prevailing assumption that smoothness is desirable, offering a novel perspective on the functional properties of transformers. The code is available at https://github.com/ambroiseodt/vit-plasticity.
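To make "an average rate of change" concrete, here is a minimal sketch of one way such a quantity could be estimated. It is not the paper's definition or the code from the linked repository. It assumes plasticity is read as the mean of the ratio ||f(x) - f(y)|| / ||x - y|| over pairs of inputs. The names `average_rate_of_change` and `scale`, and the toy components, are hypothetical.

```python
import math
import random


def l2_norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(a * a for a in v))


def average_rate_of_change(f, pairs):
    """Mean of ||f(x) - f(y)|| / ||x - y|| over the given input pairs.

    Hypothetical estimator used for illustration only; the paper's exact
    definition may differ.
    """
    ratios = []
    for x, y in pairs:
        dx = l2_norm([a - b for a, b in zip(x, y)])
        if dx == 0.0:
            continue  # identical inputs: the ratio is undefined, so skip
        dy = l2_norm([a - b for a, b in zip(f(x), f(y))])
        ratios.append(dy / dx)
    if not ratios:
        raise ValueError("need at least one pair of distinct inputs")
    return sum(ratios) / len(ratios)


def scale(c):
    """Toy stand-in for a component: multiplies every coordinate by c."""
    return lambda v: [c * a for a in v]


# Random input pairs (assumption: 8-dimensional Gaussian vectors).
rng = random.Random(0)
pairs = [
    ([rng.gauss(0, 1) for _ in range(8)], [rng.gauss(0, 1) for _ in range(8)])
    for _ in range(100)
]

smooth = average_rate_of_change(scale(0.5), pairs)   # low rate of change
plastic = average_rate_of_change(scale(3.0), pairs)  # high rate of change
print(smooth, plastic)
```

For these toy linear maps the ratio equals the scaling factor for every pair, so the second component has the higher average rate of change. Under the reading above, it is therefore the more plastic and less smooth of the two.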
