Next-Embedding Prediction Makes Strong Vision Learners
December 18, 2025
Authors: Sihan Xu, Ziqiao Ma, Wenhao Chai, Xuweiyi Chen, Weiyang Jin, Joyce Chai, Saining Xie, Stella X. Yu
cs.AI
Abstract
Inspired by the success of generative pretraining in natural language, we ask whether the same principles can yield strong self-supervised visual learners. Instead of training models to output features for downstream use, we train them to generate embeddings to perform predictive tasks directly. This work explores such a shift from learning representations to learning models. Specifically, models learn to predict future patch embeddings conditioned on past ones, using causal masking and stop gradient, which we refer to as Next-Embedding Predictive Autoregression (NEPA). We demonstrate that a simple Transformer pretrained on ImageNet-1K with next-embedding prediction as its sole learning objective is effective: no pixel reconstruction, discrete tokens, contrastive loss, or task-specific heads. This formulation retains architectural simplicity and scalability, without requiring additional design complexity. NEPA achieves strong results across tasks, attaining 83.8% and 85.3% top-1 accuracy on ImageNet-1K with ViT-B and ViT-L backbones after fine-tuning, and transferring effectively to semantic segmentation on ADE20K. We believe generative pretraining from embeddings provides a simple, scalable, and potentially modality-agnostic alternative for visual self-supervised learning.
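
To make the objective concrete, the following is a minimal PyTorch sketch of a next-embedding prediction loss in the spirit of NEPA: a causally masked Transformer encoder predicts the embedding of patch i+1 from patches up to i, with a stop gradient on the targets. The class name NextEmbeddingPredictor, the choice of the patch-projection output as the prediction target, the learned positional embedding, and the smooth L1 loss are illustrative assumptions based on the abstract, not the authors' implementation.

# Minimal sketch of a next-embedding prediction objective (NEPA-style).
# Assumptions: targets are the (detached) patch-projection embeddings,
# loss is smooth L1, positions use a learned embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NextEmbeddingPredictor(nn.Module):
    def __init__(self, dim=768, depth=12, heads=12, patch=16, img=224):
        super().__init__()
        # Linear patch projection implemented as a strided convolution.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, (img // patch) ** 2, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, images):
        # Patchify: (B, 3, H, W) -> (B, N, dim), row-major patch order.
        x = self.patch_embed(images).flatten(2).transpose(1, 2) + self.pos
        n = x.size(1)
        # Causal mask: position i may attend only to patches <= i.
        mask = torch.triu(torch.ones(n, n, device=x.device, dtype=torch.bool), 1)
        out = self.encoder(x, mask=mask)
        # Predict the embedding of patch i+1 from context up to patch i.
        pred = out[:, :-1]                     # predictions for patches 1..N-1
        target = x[:, 1:].detach()             # stop gradient on the targets
        return F.smooth_l1_loss(pred, target)  # regression loss (assumed)

# Usage (illustrative): a single pretraining step on a random batch.
# model = NextEmbeddingPredictor()
# loss = model(torch.randn(8, 3, 224, 224))
# loss.backward()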