Scalable Pre-training of Large Autoregressive Image Models

Abstract

This paper introduces AIM, a collection of vision models pre-trained with an autoregressive objective. These models are inspired by their textual counterparts, i.e., Large Language Models (LLMs), and exhibit similar scaling properties. Specifically, we highlight two key findings: (1) the performance of the visual features scales with both the model capacity and the quantity of data, and (2) the value of the objective function correlates with the performance of the model on downstream tasks. We illustrate the practical implication of these findings by pre-training a 7 billion parameter AIM on 2 billion images, which achieves 84.0% on ImageNet-1k with a frozen trunk. Interestingly, even at this scale, we observe no sign of saturation in performance, suggesting that AIM potentially represents a new frontier for training large-scale vision models. The pre-training of AIM is similar to the pre-training of LLMs, and does not require any image-specific strategy to stabilize the training at scale.
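To make the phrase "autoregressive objective" concrete for images, the following is a minimal sketch, not the AIM implementation. It assumes the image is split into non-overlapping patches in raster order and that each patch is predicted from the patches before it, scored with a mean squared error on pixel values. The abstract does not specify the patch ordering, the loss, or the architecture, so all of these are assumptions. The helper names `patchify`, `causal_mean_predictor`, and `autoregressive_loss` are hypothetical, and the running-mean predictor is only a toy stand-in for a causal Transformer.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into non-overlapping patches in raster order."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

def causal_mean_predictor(patches):
    """Toy stand-in for a causal model (assumption, not AIM):
    the prediction for patch k+1 is the mean of patches 1..k."""
    running_sum = np.cumsum(patches, axis=0)
    counts = np.arange(1, len(patches) + 1)[:, None]
    return (running_sum / counts)[:-1]  # predictions for patches 2..K

def autoregressive_loss(patches, predictor):
    """Mean squared error between each predicted patch and the true next patch."""
    preds = predictor(patches)   # uses only earlier patches for each prediction
    targets = patches[1:]        # the first patch has no context, so it is skipped
    return float(np.mean((preds - targets) ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((8, 8, 3))
    patches = patchify(image, patch=4)  # 4 patches of 48 values each
    print(patches.shape, autoregressive_loss(patches, causal_mean_predictor))
```

In this framing, the abstract's second finding says that a lower value of such a pre-training loss tends to go together with better downstream performance. The sketch itself only shows how the loss is computed.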
