Rejuvenating image-GPT as Strong Visual Representation Learners

Abstract

This paper enhances image-GPT (iGPT), one of the pioneering works that introduced autoregressive pretraining, in the form of next-pixel prediction, for visual representation learning. Two simple yet essential changes are made. First, we shift the prediction target from raw pixels to semantic tokens, enabling a higher-level understanding of visual content. Second, we supplement the autoregressive modeling by instructing the model to predict not only the next tokens but also the visible tokens. This pipeline is particularly effective when the semantic tokens are encoded by discriminatively trained models, such as CLIP. We introduce this novel approach as D-iGPT. Extensive experiments show that D-iGPT excels as a strong learner of visual representations. Notably, when trained on publicly available datasets, D-iGPT achieves 89.5% top-1 accuracy on ImageNet-1K with a vanilla ViT-Large model. This model also shows strong generalization on downstream tasks and robustness on out-of-distribution samples. Code is available at https://github.com/OliverRensu/D-iGPT.
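
The two changes described above can be illustrated with a minimal sketch of a two-part training objective. This is not the authors' implementation: the cosine-distance loss, the `visible_weight` parameter, the function names, and the tensor shapes are all assumptions made for illustration. The sketch assumes that at each sequence position the model emits one prediction for the next semantic token and one for the semantic token at the current (visible) position, with targets produced by a frozen encoder such as CLIP.

```python
import numpy as np


def cosine_loss(pred, target, eps=1e-8):
    """Mean (1 - cosine similarity) between matching rows of pred and target."""
    pred_n = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + eps)
    target_n = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(pred_n * target_n, axis=-1)))


def two_part_objective(next_pred, visible_pred, semantic_tokens, visible_weight=1.0):
    """Hypothetical combined loss (an assumption, not the paper's exact formula).

    next_pred:       (N-1, D) predictions for tokens 2..N, made at positions 1..N-1
    visible_pred:    (N-1, D) predictions for the visible tokens at positions 1..N-1
    semantic_tokens: (N, D) target semantic tokens from a frozen encoder
    """
    # Autoregressive term: position i predicts the semantic token at i + 1.
    next_loss = cosine_loss(next_pred, semantic_tokens[1:])
    # Supplementary term: position i also predicts its own (visible) semantic token.
    visible_loss = cosine_loss(visible_pred, semantic_tokens[:-1])
    return next_loss + visible_weight * visible_loss


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(5, 8))  # stand-in for encoder outputs
    # Perfect predictions drive both terms to (approximately) zero.
    print(two_part_objective(tokens[1:], tokens[:-1], tokens))
```

Under these assumptions, the first term plays the role of next-token prediction on semantic tokens and the second term adds the visible-token prediction; how the two are actually weighted and which distance is used should be checked against the released code.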
