ConvNets Match Vision Transformers at Scale

October 25, 2023
Authors: Samuel L. Smith, Andrew Brock, Leonard Berrada, Soham De
cs.AI

Abstract

Many researchers believe that ConvNets perform well on small or moderately sized datasets, but are not competitive with Vision Transformers when given access to datasets on the web-scale. We challenge this belief by evaluating a performant ConvNet architecture pre-trained on JFT-4B, a large labelled dataset of images often used for training foundation models. We consider pre-training compute budgets between 0.4k and 110k TPU-v4 core compute hours, and train a series of networks of increasing depth and width from the NFNet model family. We observe a log-log scaling law between held out loss and compute budget. After fine-tuning on ImageNet, NFNets match the reported performance of Vision Transformers with comparable compute budgets. Our strongest fine-tuned model achieves a Top-1 accuracy of 90.4%.
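The log-log scaling law mentioned in the abstract means that held-out loss is approximately a power law in compute, i.e. log(loss) is linear in log(compute). A minimal sketch of how such a law can be fit is shown below; the compute budgets span the paper's stated 0.4k–110k TPU-v4 core-hour range, but the loss values are purely illustrative, not the paper's measurements.

```python
import numpy as np

# Hypothetical (compute, held-out loss) pairs. Compute is in TPU-v4
# core hours; the loss values are made up for illustration only.
compute = np.array([0.4e3, 1.6e3, 6.4e3, 25.6e3, 110e3])
loss = np.array([2.10, 1.85, 1.63, 1.44, 1.27])

# A log-log scaling law: loss ~ a * compute**b, which is a straight
# line in log space. Fit that line with a degree-1 polynomial fit.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
a, b = np.exp(intercept), slope
print(f"loss ~ {a:.2f} * compute^{b:.3f}")  # b < 0: loss falls with compute
```

The fitted exponent `b` is negative, reflecting that held-out loss decreases as the pre-training compute budget grows.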