
Unveiling the Backbone-Optimizer Coupling Bias in Visual Representation Learning

October 8, 2024
Authors: Siyuan Li, Juanxi Tian, Zedong Wang, Luyuan Zhang, Zicheng Liu, Weiyang Jin, Yang Liu, Baigui Sun, Stan Z. Li
cs.AI

Abstract

This paper delves into the interplay between vision backbones and optimizers, unveiling an inter-dependent phenomenon termed \textbf{backbone-optimizer coupling bias} (BOCB). We observe that canonical CNNs, such as VGG and ResNet, exhibit a marked co-dependency with the SGD family, while recent architectures like ViTs and ConvNeXt are tightly coupled with adaptive learning-rate methods. We further show that BOCB can be introduced by both optimizers and certain backbone designs and may significantly impact the pre-training and downstream fine-tuning of vision models. Through in-depth empirical analysis, we summarize takeaways on recommended optimizers and insights into robust vision backbone architectures. We hope this work can inspire the community to question long-held assumptions on backbones and optimizers, stimulate further explorations, and thereby contribute to more robust vision systems. The source code and models are publicly available at https://bocb-ai.github.io/.
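To make the kind of backbone-optimizer grid the abstract describes more concrete, here is a minimal sketch of such an experiment. It is not the authors' benchmark code: the backbones (torchvision's resnet18 and vit_b_16), the optimizer hyperparameters, and the synthetic data are illustrative assumptions chosen only to show how each CNN/ViT backbone can be paired with an SGD-family versus an adaptive learning-rate optimizer.

```python
# Hypothetical sketch of a backbone x optimizer grid, not the paper's actual pipeline.
import torch
import torch.nn as nn
from torchvision import models


def make_backbone(name: str, num_classes: int = 10) -> nn.Module:
    # Two representative backbones: a classical CNN and a Vision Transformer.
    if name == "resnet18":
        return models.resnet18(weights=None, num_classes=num_classes)
    if name == "vit_b_16":
        return models.vit_b_16(weights=None, num_classes=num_classes)
    raise ValueError(f"unknown backbone: {name}")


def make_optimizer(name: str, params):
    # SGD family vs. adaptive learning-rate family; hyperparameters are placeholders.
    if name == "sgd":
        return torch.optim.SGD(params, lr=0.1, momentum=0.9, weight_decay=5e-4)
    if name == "adamw":
        return torch.optim.AdamW(params, lr=1e-3, weight_decay=0.05)
    raise ValueError(f"unknown optimizer: {name}")


def short_run(backbone_name: str, optimizer_name: str, steps: int = 3) -> float:
    # Train one pairing for a few steps on synthetic data just to exercise the loop.
    torch.manual_seed(0)
    model = make_backbone(backbone_name)
    opt = make_optimizer(optimizer_name, model.parameters())
    criterion = nn.CrossEntropyLoss()
    loss = torch.tensor(0.0)
    for _ in range(steps):
        x = torch.randn(4, 3, 224, 224)      # dummy images
        y = torch.randint(0, 10, (4,))       # dummy labels
        opt.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()


if __name__ == "__main__":
    # The paper compares such pairings on real benchmarks over full training schedules;
    # this grid only illustrates the experimental structure.
    for backbone in ("resnet18", "vit_b_16"):
        for optimizer in ("sgd", "adamw"):
            print(backbone, optimizer, f"loss={short_run(backbone, optimizer):.3f}")
```

In the study itself, the interesting quantity is how sensitive each backbone's final performance is to the choice of optimizer and its hyperparameters; the coupling bias shows up when one backbone family degrades sharply under the "wrong" optimizer family while another remains robust.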
