Rejuvenating image-GPT as Strong Visual Representation Learners

Abstract

This paper enhances image-GPT (iGPT), one of the pioneering works that introduced autoregressive pretraining, in the form of next-pixel prediction, for visual representation learning. We make two simple yet essential changes. First, we shift the prediction target from raw pixels to semantic tokens, enabling a higher-level understanding of visual content. Second, we supplement the autoregressive modeling by instructing the model to predict not only the next tokens but also the visible tokens. This pipeline is particularly effective when the semantic tokens are encoded by discriminatively trained models, such as CLIP. We call this approach D-iGPT. Extensive experiments show that D-iGPT is a strong learner of visual representations. Notably, trained only on publicly available datasets, D-iGPT achieves 89.5% top-1 accuracy on ImageNet-1K with a vanilla ViT-Large model. The model also generalizes well to downstream tasks and is robust on out-of-distribution samples. Code is available at https://github.com/OliverRensu/D-iGPT.
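To make the two prediction targets concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not state the loss function, the weighting of the two terms, or the tensor layout. The cosine distance, the equal weighting, the array shapes, and the names `cosine_distance` and `two_target_loss` are therefore all assumptions made for illustration.

```python
import numpy as np

def cosine_distance(pred, target, eps=1e-8):
    """Mean (1 - cosine similarity) between matching rows of pred and target."""
    pred_n = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + eps)
    target_n = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(pred_n * target_n, axis=-1)))

def two_target_loss(pred_next, pred_visible, semantic_tokens):
    """Sum of a next-token loss and a visible-token loss (hypothetical form).

    semantic_tokens: (N, D) targets from a frozen encoder such as CLIP,
                     arranged in the autoregressive order.
    pred_next:       (N-1, D) prediction made at position i for token i+1.
    pred_visible:    (N-1, D) prediction made at position i for token i itself.
    """
    # Next-token term: position i is scored against the following token.
    loss_next = cosine_distance(pred_next, semantic_tokens[1:])
    # Visible-token term: position i is scored against the token it already sees.
    loss_visible = cosine_distance(pred_visible, semantic_tokens[:-1])
    # Equal weighting is an assumption; the abstract gives no weights.
    return loss_next + loss_visible

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(5, 8))          # stand-in for semantic tokens
    noisy_next = tokens[1:] + 0.1 * rng.normal(size=(4, 8))
    noisy_visible = tokens[:-1] + 0.1 * rng.normal(size=(4, 8))
    print(two_target_loss(noisy_next, noisy_visible, tokens))
```

In this sketch the targets are regression targets in the encoder's feature space rather than pixel values, which reflects the shift from raw pixels to semantic tokens described above. How the real model produces the two sets of predictions is outside what the abstract specifies.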
