

Learning from Next-Frame Prediction: Autoregressive Video Modeling Encodes Effective Representations

December 24, 2025
Authors: Jinghan Li, Yang Jin, Hao Jiang, Yadong Mu, Yang Song, Kun Xu
cs.AI

Abstract

Recent advances in pretraining general foundation models have significantly improved performance across diverse downstream tasks. While autoregressive (AR) generative models like GPT have revolutionized NLP, most visual generative pretraining methods still rely on BERT-style masked modeling, which often disregards the temporal information essential for video analysis. The few existing autoregressive visual pretraining methods suffer from issues such as inaccurate semantic localization and poor generation quality, which limit their semantic understanding. In this work, we propose NExT-Vid, a novel autoregressive visual generative pretraining framework that uses masked next-frame prediction to jointly model images and videos. NExT-Vid introduces a context-isolated autoregressive predictor to decouple semantic representation from target decoding, and a conditioned flow-matching decoder to enhance generation quality and diversity. Through context-isolated flow-matching pretraining, our approach learns strong representations. Extensive experiments on large-scale pretrained models demonstrate that our method consistently outperforms previous generative pretraining methods for visual representation learning, as measured by attentive probing on downstream classification.
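To make the training objective concrete, below is a minimal, hedged sketch of how a conditioned flow-matching loss for masked next-frame prediction could look in PyTorch. It is not the authors' implementation: `predictor`, `decoder`, and `training_step` are hypothetical placeholders standing in for the context-isolated autoregressive predictor and the conditioned flow-matching decoder described in the abstract, and the straight-line (rectified-flow) interpolation path is one common choice of flow-matching target.

```python
# Illustrative sketch only (not the NExT-Vid codebase): conditional flow matching
# for predicting a masked next frame from a semantic condition.
import torch
import torch.nn.functional as F

def flow_matching_loss(decoder, condition, target_frame):
    """Regress the decoder's velocity field toward the straight-line flow
    from Gaussian noise to the target frame, given the predictor's condition."""
    noise = torch.randn_like(target_frame)                  # x_0 ~ N(0, I)
    t = torch.rand(target_frame.size(0), 1, 1, 1,
                   device=target_frame.device)               # per-sample time in [0, 1]
    x_t = (1.0 - t) * noise + t * target_frame               # point on the interpolation path
    v_target = target_frame - noise                          # constant velocity of that path
    v_pred = decoder(x_t, t.flatten(), condition)            # decoder predicts the velocity
    return F.mse_loss(v_pred, v_target)

def training_step(predictor, decoder, context_frames, next_frame):
    # The predictor sees only context frames, so the semantic representation
    # stays isolated from the decoding of the masked target frame.
    condition = predictor(context_frames)
    return flow_matching_loss(decoder, condition, next_frame)
```

At inference or evaluation time, only the predictor's representation would typically be probed (e.g., with an attentive probe for classification), while the decoder serves purely as the generative pretraining head.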