
Learning from Next-Frame Prediction: Autoregressive Video Modeling Encodes Effective Representations

December 24, 2025
Authors: Jinghan Li, Yang Jin, Hao Jiang, Yadong Mu, Yang Song, Kun Xu
cs.AI

Abstract

Recent advances in pretraining general foundation models have significantly improved performance across diverse downstream tasks. While autoregressive (AR) generative models like GPT have revolutionized NLP, most visual generative pretraining methods still rely on BERT-style masked modeling, which often disregards the temporal information essential for video analysis. The few existing autoregressive visual pretraining methods suffer from inaccurate semantic localization and poor generation quality, which in turn weaken the learned semantics. In this work, we propose NExT-Vid, a novel autoregressive visual generative pretraining framework that uses masked next-frame prediction to jointly model images and videos. NExT-Vid introduces a context-isolated autoregressive predictor to decouple semantic representation from target decoding, and a conditioned flow-matching decoder to enhance generation quality and diversity. Through context-isolated flow-matching pretraining, our approach learns strong representations. Extensive experiments on large-scale pretrained models demonstrate that our proposed method consistently outperforms previous generative pretraining methods for visual representation learning, as evaluated by attentive probing on downstream classification tasks.
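The abstract describes a flow-matching decoder conditioned on context features from the autoregressive predictor. As a rough illustration of what such an objective looks like (a minimal sketch, not the paper's implementation: the FlowDecoder module, its MLP form, and the per-frame latent shapes are all assumptions for exposition), a conditional flow-matching loss for a next-frame latent can be written as follows:

```python
# Illustrative sketch of a conditional flow-matching objective for
# next-frame prediction. Module names and shapes are hypothetical;
# the actual NExT-Vid architecture is not specified in this abstract.
import torch
import torch.nn as nn

class FlowDecoder(nn.Module):
    """Predicts the flow-matching velocity for a noisy next-frame latent,
    conditioned on context features from an autoregressive predictor."""
    def __init__(self, dim: int):
        super().__init__()
        # Input: noisy latent (dim) + context condition (dim) + timestep (1).
        self.net = nn.Sequential(
            nn.Linear(dim * 2 + 1, 4 * dim), nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, x_t, t, context):
        h = torch.cat([x_t, context, t], dim=-1)
        return self.net(h)

def flow_matching_loss(decoder, x1, context):
    """Regress the straight-line velocity (x1 - x0) along the linear
    interpolant x_t = (1 - t) * x0 + t * x1, given the context condition."""
    x0 = torch.randn_like(x1)                        # noise endpoint
    t = torch.rand(x1.shape[0], 1, device=x1.device) # per-sample timestep
    x_t = (1 - t) * x0 + t * x1                      # linear interpolant
    v_target = x1 - x0                               # target velocity field
    v_pred = decoder(x_t, t, context)
    return ((v_pred - v_target) ** 2).mean()

# Usage with toy tensors standing in for frame latents and AR context:
dim = 256
decoder = FlowDecoder(dim)
x1 = torch.randn(8, dim)   # next-frame latent (e.g., from a visual tokenizer)
ctx = torch.randn(8, dim)  # context embedding from the AR predictor
loss = flow_matching_loss(decoder, x1, ctx)
loss.backward()
```

Keeping the decoder conditioned on a separate context embedding, rather than letting the predictor decode targets directly, mirrors the abstract's point about isolating semantic representation from target decoding.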