Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks

Abstract

Neural-network-based computer vision systems are typically built on a backbone, a pretrained or randomly initialized feature extractor. Several years ago, the default option was an ImageNet-trained convolutional neural network. However, the recent past has seen the emergence of countless backbones pretrained using various algorithms and datasets. While this abundance of choice has led to performance increases for a range of systems, it is difficult for practitioners to make informed decisions about which backbone to choose. Battle of the Backbones (BoB) makes this choice easier by benchmarking a diverse suite of pretrained models, including vision-language models, those trained via self-supervised learning, and the Stable Diffusion backbone, across a diverse set of computer vision tasks ranging from classification to object detection to OOD generalization and more. Furthermore, BoB sheds light on promising directions for the research community to advance computer vision by illuminating strengths and weaknesses of existing approaches through a comprehensive analysis conducted on more than 1500 training runs. While vision transformers (ViTs) and self-supervised learning (SSL) are increasingly popular, we find that convolutional neural networks pretrained in a supervised fashion on large training sets still perform best on most tasks among the models we consider. Moreover, in apples-to-apples comparisons on the same architectures and similarly sized pretraining datasets, we find that SSL backbones are highly competitive, indicating that future work should perform SSL pretraining with advanced architectures and larger pretraining datasets. We release the raw results of our experiments along with code that allows researchers to put their own backbones through the gauntlet here: https://github.com/hsouri/Battle-of-the-Backbones
