
Scalable Pre-training of Large Autoregressive Image Models

January 16, 2024
作者: Alaaeldin El-Nouby, Michal Klein, Shuangfei Zhai, Miguel Angel Bautista, Alexander Toshev, Vaishaal Shankar, Joshua M Susskind, Armand Joulin
cs.AI

Abstract

This paper introduces AIM, a collection of vision models pre-trained with an autoregressive objective. These models are inspired by their textual counterparts, i.e., Large Language Models (LLMs), and exhibit similar scaling properties. Specifically, we highlight two key findings: (1) the performance of the visual features scales with both the model capacity and the quantity of data, and (2) the value of the objective function correlates with the performance of the model on downstream tasks. We illustrate the practical implication of these findings by pre-training a 7-billion-parameter AIM on 2 billion images; it achieves 84.0% accuracy on ImageNet-1k with a frozen trunk. Interestingly, even at this scale we observe no sign of saturation in performance, suggesting that AIM potentially represents a new frontier for training large-scale vision models. The pre-training of AIM is similar to that of LLMs and does not require any image-specific strategy to stabilize training at scale.
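The core idea — treating an image as a raster-ordered sequence of patches and training a model to predict each patch from the ones before it — can be sketched in a few lines. The snippet below is a minimal illustration, not AIM's actual architecture: the `patchify` helper and the prefix-mean "model" are hypothetical stand-ins for the real patch embedding and causal Transformer, and the squared-error loss is only one possible regression target for pixel patches.

```python
import numpy as np

def patchify(img, p):
    """Split an HxWxC image into a raster-ordered sequence of flattened p x p patches."""
    H, W, C = img.shape
    assert H % p == 0 and W % p == 0
    grid = img.reshape(H // p, p, W // p, p, C)
    # reorder so each (row, col) grid cell becomes one flattened patch vector
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * C)

def autoregressive_loss(seq, predict):
    """Mean squared error of predicting each patch from its causal prefix."""
    losses = []
    for t in range(1, len(seq)):
        pred = predict(seq[:t])          # the model only sees patches before position t
        losses.append(np.mean((pred - seq[t]) ** 2))
    return float(np.mean(losses))

# Toy stand-in for the model: predict the next patch as the mean of the prefix.
mean_predictor = lambda prefix: prefix.mean(axis=0)

img = np.random.rand(8, 8, 3)            # tiny random "image"
seq = patchify(img, 4)                   # 4 patches, each 4*4*3 = 48 values
loss = autoregressive_loss(seq, mean_predictor)
```

In the paper's setting, `predict` would be a Transformer with a causal attention mask, so — as with LLMs — all positions are trained in parallel from a single forward pass rather than in the Python loop shown here.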