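ConvNets Match Vision Transformers at Scale

Abstract

Many researchers believe that ConvNets perform well on small or moderately sized datasets but are not competitive with Vision Transformers when given access to web-scale datasets. We challenge this belief by evaluating a performant ConvNet architecture pre-trained on JFT-4B, a large labelled dataset of images often used for training foundation models. We consider pre-training compute budgets between 0.4k and 110k TPU-v4 core compute hours, and train a series of networks of increasing depth and width from the NFNet model family. We observe a log-log scaling law between held-out loss and compute budget. After fine-tuning on ImageNet, NFNets match the reported performance of Vision Transformers with comparable compute budgets. Our strongest fine-tuned model achieves a Top-1 accuracy of 90.4%.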
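A log-log scaling law means that held-out loss is approximately a straight line in compute when both axes are logarithmic, i.e. a power law of the form loss ≈ a · compute^b with b < 0. The following is a minimal sketch of how such a trend could be fitted by least squares in log space. It is not the authors' fitting procedure: the function names `fit_power_law` and `predict_loss` are hypothetical, and the data points are synthetic values generated from an assumed power law purely for illustration. Only the endpoints of the compute range (0.4k and 110k core hours) are taken from the abstract.

```python
import math


def fit_power_law(compute, loss):
    """Least-squares fit of log(loss) = intercept + slope * log(compute)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(v) for v in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Ordinary least squares for a single predictor.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_loss(compute, slope, intercept):
    """Evaluate the fitted power law at a given compute budget."""
    return math.exp(intercept + slope * math.log(compute))


# Synthetic, made-up points generated from loss = 5.0 * compute ** -0.1.
# These are NOT measurements from the paper; the exponent and prefactor
# are arbitrary assumptions chosen so the fit can be checked by eye.
budgets = [400.0, 1_600.0, 6_400.0, 25_600.0, 110_000.0]  # core hours
losses = [5.0 * c ** -0.1 for c in budgets]

slope, intercept = fit_power_law(budgets, losses)
print(f"fitted exponent: {slope:.3f}, prefactor: {math.exp(intercept):.3f}")
```

Because the synthetic points lie exactly on a power law, the fit recovers the assumed exponent and prefactor. With real held-out losses the points would scatter around the line, and the quality of the fit would indicate how well a single power law describes the trend.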

