Rejuvenating image-GPT as Strong Visual Representation Learners

December 4, 2023
作者: Sucheng Ren, Zeyu Wang, Hongru Zhu, Junfei Xiao, Alan Yuille, Cihang Xie
cs.AI

Abstract

This paper enhances image-GPT (iGPT), one of the pioneering works that introduced autoregressive pretraining to predict next pixels for visual representation learning. Two simple yet essential changes are made. First, we shift the prediction target from raw pixels to semantic tokens, enabling a higher-level understanding of visual content. Second, we supplement the autoregressive modeling by instructing the model to predict not only the next tokens but also the visible tokens. This pipeline is particularly effective when the semantic tokens are encoded by discriminatively trained models, such as CLIP. We term this novel approach D-iGPT. Extensive experiments showcase that D-iGPT excels as a strong learner of visual representations: a notable achievement is its compelling performance on the ImageNet-1K dataset -- by training on publicly available datasets, D-iGPT achieves 89.5% top-1 accuracy with a vanilla ViT-Large model. The model also generalizes strongly to downstream tasks and is robust to out-of-distribution samples. Code is available at https://github.com/OliverRensu/D-iGPT.
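
To make the two changes concrete, here is a minimal PyTorch sketch of how such a training objective could be wired together, based only on the abstract. The `DiGPTSketch` class, the causal attention mask, the two linear prediction heads, and the smooth-L1 regression loss are all illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
# Minimal sketch of a D-iGPT-style objective: regress semantic tokens from a
# frozen, discriminatively trained encoder (e.g. CLIP) instead of raw pixels,
# and predict visible tokens in addition to next tokens. Names and loss choice
# are assumptions made for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiGPTSketch(nn.Module):
    def __init__(self, dim=768, depth=2, heads=12):
        super().__init__()
        # Stand-in for a ViT-style encoder over patch tokens.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Two lightweight heads: one regresses the *next* semantic token
        # (the autoregressive target), one regresses the *visible* tokens.
        self.next_head = nn.Linear(dim, dim)
        self.visible_head = nn.Linear(dim, dim)

    def forward(self, patch_embeds, clip_tokens):
        # patch_embeds: (B, N, dim) patch embeddings of the input image.
        # clip_tokens:  (B, N, dim) semantic targets from a frozen,
        #               discriminatively trained encoder such as CLIP.
        B, N, _ = patch_embeds.shape
        # Causal mask: position i may only attend to positions <= i.
        causal = torch.triu(
            torch.ones(N, N, dtype=torch.bool, device=patch_embeds.device),
            diagonal=1,
        )
        h = self.encoder(patch_embeds, mask=causal)
        # Autoregressive loss: the feature at position i predicts token i+1.
        loss_next = F.smooth_l1_loss(self.next_head(h[:, :-1]), clip_tokens[:, 1:])
        # Supplementary loss: also reconstruct the semantic tokens of the
        # already-visible positions.
        loss_visible = F.smooth_l1_loss(self.visible_head(h), clip_tokens)
        return loss_next + loss_visible

# Toy usage with random tensors standing in for patch and CLIP features:
model = DiGPTSketch()
patches = torch.randn(2, 196, 768)
targets = torch.randn(2, 196, 768)  # would come from a frozen CLIP encoder
loss = model(patches, targets)
```

One plausible reading of the design: because the visible-token head regresses the semantic targets at every position, it acts like a feature-distillation signal layered on top of the autoregressive next-token objective, which would explain why the abstract reports the pipeline being particularly effective with discriminatively trained target encoders such as CLIP.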