ConvNets Match Vision Transformers at Scale
October 25, 2023
Authors: Samuel L. Smith, Andrew Brock, Leonard Berrada, Soham De
cs.AI
Abstract
Many researchers believe that ConvNets perform well on small or moderately
sized datasets, but are not competitive with Vision Transformers when given
access to web-scale datasets. We challenge this belief by evaluating a
performant ConvNet architecture pre-trained on JFT-4B, a large labelled dataset
of images often used for training foundation models. We consider pre-training
compute budgets between 0.4k and 110k TPU-v4 core compute hours, and train a
series of networks of increasing depth and width from the NFNet model family.
We observe a log-log scaling law between held-out loss and compute budget.
After fine-tuning on ImageNet, NFNets match the reported performance of Vision
Transformers with comparable compute budgets. Our strongest fine-tuned model
achieves a Top-1 accuracy of 90.4%.
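To make the "log-log scaling law" concrete: it means held-out loss L and pre-training compute C are related by a power law, L = a * C^(-b), which is a straight line in log-log space. The sketch below fits such a law with ordinary least squares; the (compute, loss) points are invented placeholders for illustration, not measurements from the paper.

```python
# Minimal sketch of fitting a log-log scaling law between held-out loss L
# and pre-training compute C, assuming L = a * C**(-b), i.e.
# log L = log a - b * log C. The data points below are hypothetical
# placeholders spanning the paper's stated budget range (0.4k-110k
# TPU-v4 core hours); they are NOT the paper's measurements.
import numpy as np

compute = np.array([4e2, 2e3, 1e4, 4e4, 1.1e5])  # TPU-v4 core hours (illustrative)
loss = np.array([2.10, 1.85, 1.62, 1.48, 1.40])  # held-out loss (illustrative)

# Linear least squares in log-log space: log(loss) = intercept + slope * log(compute).
log_c, log_l = np.log(compute), np.log(loss)
slope, intercept = np.polyfit(log_c, log_l, 1)

print(f"fitted exponent b = {-slope:.3f}, coefficient a = {np.exp(intercept):.3f}")

# Under the fitted law, extrapolate to a larger budget (again purely illustrative).
pred = np.exp(intercept) * (5e5) ** slope
print(f"predicted held-out loss at 5e5 core hours: {pred:.3f}")
```

A straight-line fit in log-log space like this is the standard way such scaling curves are summarized; the fitted exponent b indicates how quickly held-out loss falls as compute grows.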