Next-Embedding Prediction Makes Strong Vision Learners
December 18, 2025
Authors: Sihan Xu, Ziqiao Ma, Wenhao Chai, Xuweiyi Chen, Weiyang Jin, Joyce Chai, Saining Xie, Stella X. Yu
cs.AI
Abstract
Inspired by the success of generative pretraining in natural language, we ask whether the same principles can yield strong self-supervised visual learners. Instead of training models to output features for downstream use, we train them to generate embeddings to perform predictive tasks directly. This work explores such a shift from learning representations to learning models: specifically, models learn to predict future patch embeddings conditioned on past ones, using causal masking and a stop-gradient, a scheme we refer to as Next-Embedding Predictive Autoregression (NEPA). We demonstrate that a simple Transformer pretrained on ImageNet-1k with next-embedding prediction as its sole learning objective is effective, with no pixel reconstruction, discrete tokens, contrastive loss, or task-specific heads. This formulation retains architectural simplicity and scalability without additional design complexity. NEPA achieves strong results across tasks, attaining 83.8% and 85.3% top-1 accuracy on ImageNet-1K with ViT-B and ViT-L backbones after fine-tuning, and transferring effectively to semantic segmentation on ADE20K. We believe generative pretraining from embeddings provides a simple, scalable, and potentially modality-agnostic alternative for visual self-supervised learning.
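
To make the objective concrete, below is a minimal PyTorch sketch of next-embedding prediction with causal masking and a stop-gradient on the targets, as described in the abstract. The backbone layout, the MSE loss, and the use of the patch-embedding layer's own outputs as prediction targets are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of a NEPA-style objective: a causal Transformer over patch
# embeddings predicts the next patch embedding, with gradients stopped on the
# targets. Loss choice and target construction are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class NEPASketch(nn.Module):
    def __init__(self, dim=768, depth=12, heads=12, patch=16, img=224):
        super().__init__()
        num_patches = (img // patch) ** 2
        # Patchify and embed with a strided convolution (ViT-style).
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        # Boolean causal mask: True above the diagonal blocks attention to the future.
        self.register_buffer(
            "causal_mask",
            torch.triu(torch.ones(num_patches, num_patches, dtype=torch.bool), diagonal=1),
        )

    def forward(self, images):
        # Patch embeddings in raster order: (B, N, D).
        x = self.embed(images).flatten(2).transpose(1, 2) + self.pos
        # Causal masking: position i only attends to patches 0..i.
        h = self.encoder(x, mask=self.causal_mask)
        # Hidden state at patch i predicts the embedding of patch i+1;
        # targets are patch embeddings with the gradient stopped.
        pred = h[:, :-1]
        target = x[:, 1:].detach()
        return F.mse_loss(pred, target)


if __name__ == "__main__":
    model = NEPASketch()
    loss = model(torch.randn(2, 3, 224, 224))
    loss.backward()
    print(loss.item())
```

In this sketch the only learning signal is the next-embedding prediction error; there is no pixel reconstruction head, tokenizer, or contrastive pair construction, which mirrors the simplicity claim in the abstract.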