Rejuvenating image-GPT as Strong Visual Representation Learners

Abstract

This paper enhances image-GPT (iGPT), one of the pioneering works that introduced autoregressive pretraining to predict the next pixels for visual representation learning. We make two simple yet essential changes. First, we shift the prediction target from raw pixels to semantic tokens, enabling a higher-level understanding of visual content. Second, we supplement the autoregressive modeling by instructing the model to predict not only the next tokens but also the visible tokens. This pipeline is particularly effective when the semantic tokens are encoded by discriminatively trained models, such as CLIP. We call this approach D-iGPT. Extensive experiments show that D-iGPT is a strong learner of visual representations. Notably, when trained on publicly available datasets, D-iGPT achieves 89.5% top-1 accuracy on ImageNet-1K with a vanilla ViT-Large model. The model also shows strong generalization on downstream tasks and robustness to out-of-distribution samples. Code is available at https://github.com/OliverRensu/D-iGPT.
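The two changes described in the abstract can be illustrated with a toy objective. The sketch below is not the authors' implementation: the cosine-distance loss, the function names, and the `visible_weight` parameter are all assumptions made for illustration, and the semantic target tokens are assumed to be given (for example, features from a frozen, discriminatively trained encoder such as CLIP).

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def combined_token_loss(next_preds, visible_preds, targets, visible_weight=1.0):
    """Toy objective over one token sequence (hypothetical, for illustration).

    targets[i]       : semantic token i, assumed to come from a frozen teacher.
    next_preds[i]    : the model's guess for targets[i + 1], made after
                       seeing tokens 0..i (the autoregressive term).
    visible_preds[i] : the model's guess for targets[i], a token it has
                       already seen (the visible-token term).
    """
    n = len(targets)
    if n < 2 or len(next_preds) != n - 1 or len(visible_preds) != n:
        raise ValueError("expected n >= 2, n - 1 next-token and n visible-token predictions")

    # Autoregressive term: prediction at position i is scored against token i + 1.
    next_loss = sum(
        cosine_distance(pred, targets[i + 1]) for i, pred in enumerate(next_preds)
    ) / (n - 1)

    # Visible-token term: prediction at position i is scored against token i.
    visible_loss = sum(
        cosine_distance(pred, tgt) for pred, tgt in zip(visible_preds, targets)
    ) / n

    return next_loss + visible_weight * visible_loss


if __name__ == "__main__":
    # Three made-up 2-D "semantic tokens" standing in for teacher features.
    targets = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    next_preds = [[0.0, 1.0], [1.0, 1.0]]  # guesses for tokens 1 and 2
    visible_preds = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # guesses for tokens 0..2
    print(combined_token_loss(next_preds, visible_preds, targets))
```

Because the targets are feature vectors rather than raw pixel values, a direction-based distance is one plausible way to score predictions. The loss and weighting actually used by D-iGPT should be taken from the released code.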

