Scalable Pre-training of Large Autoregressive Image Models

Abstract

This paper introduces AIM, a collection of vision models pre-trained with an autoregressive objective. These models are inspired by their textual counterparts, Large Language Models (LLMs), and exhibit similar scaling properties. Specifically, we highlight two key findings: (1) the performance of the visual features scales with both model capacity and the quantity of data, and (2) the value of the objective function correlates with the performance of the model on downstream tasks. We illustrate the practical implications of these findings by pre-training a 7-billion-parameter AIM on 2 billion images, which achieves 84.0% on ImageNet-1k with a frozen trunk. Interestingly, even at this scale, we observe no sign of saturation in performance, suggesting that AIM potentially represents a new frontier for training large-scale vision models. The pre-training of AIM is similar to the pre-training of LLMs and does not require any image-specific strategy to stabilize training at scale.
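
To make the idea of an autoregressive objective on images concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify how an image is turned into a sequence or which loss is used, so the sketch assumes non-overlapping patches read in raster order and a squared-error regression loss. The names `to_patches`, `mean_of_prefix`, and `autoregressive_loss` are hypothetical, and `mean_of_prefix` is a placeholder standing in for a trained model.

```python
from typing import Callable, List, Sequence

Patch = List[float]


def to_patches(image: Sequence[Sequence[float]], size: int) -> List[Patch]:
    """Split a 2D grayscale image into flattened size x size patches, raster order."""
    height, width = len(image), len(image[0])
    if height % size or width % size:
        raise ValueError("image dimensions must be divisible by the patch size")
    patches = []
    for top in range(0, height, size):
        for left in range(0, width, size):
            patches.append(
                [image[top + r][left + c] for r in range(size) for c in range(size)]
            )
    return patches


def mean_of_prefix(prefix: Sequence[Patch]) -> Patch:
    """Placeholder predictor (assumption): average of the patches seen so far."""
    n = len(prefix)
    return [sum(p[i] for p in prefix) / n for i in range(len(prefix[0]))]


def autoregressive_loss(
    patches: Sequence[Patch],
    predict: Callable[[Sequence[Patch]], Patch],
) -> float:
    """Mean squared error of predicting patch k from patches 0..k-1, averaged over k >= 1."""
    if len(patches) < 2:
        raise ValueError("need at least two patches to form a prediction target")
    total = 0.0
    for k in range(1, len(patches)):
        predicted = predict(patches[:k])  # the predictor only sees earlier patches
        target = patches[k]
        total += sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)
    return total / (len(patches) - 1)


# Toy 4x4 image made of four constant 2x2 patches with values 0, 1, 2, 3.
image = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [2.0, 2.0, 3.0, 3.0],
    [2.0, 2.0, 3.0, 3.0],
]
patches = to_patches(image, size=2)
loss = autoregressive_loss(patches, mean_of_prefix)
```

The key property is that the prediction for patch k depends only on patches before it, which mirrors next-token prediction in language models. In a real system, the placeholder predictor would be a learned network, and the scalar returned by `autoregressive_loss` would play the role of the objective value that the abstract relates to downstream performance.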
