

InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision

December 1, 2025
Authors: Chenting Wang, Yuhan Zhu, Yicheng Xu, Jiange Yang, Ziang Yan, Yali Wang, Yi Wang, Limin Wang
cs.AI

Abstract

Large-scale video-text pretraining achieves strong performance but depends on noisy, synthetic captions with limited semantic coverage, often overlooking implicit world knowledge such as object motion, 3D geometry, and physical cues. In contrast, masked video modeling (MVM) directly exploits spatiotemporal structure but trails text-supervised methods on general tasks. We find this gap arises from overlooked architectural issues: pixel-level reconstruction struggles to converge and its low-level objective often conflicts with semantic features, while latent prediction tends to encourage shortcut learning. To address these issues, we disentangle the traditional encoder-decoder design into an Encoder-Predictor-Decoder (EPD) framework, in which the predictor acts as a latent world model, and propose InternVideo-Next, a two-stage pretraining scheme that builds a semantically consistent yet detail-preserving latent space for this world model. First, the conventional linear decoder in pixel-level MVM forces the predictor's output latents to be linearly projectable into, and thus separable in, pixel space, which conflicts with semantic abstraction. Our Stage 1 instead uses a conditional diffusion decoder and injects reliable image-level semantic priors to improve both semantics and convergence, bridging pixel-level fidelity with high-level semantic abstraction. Stage 2 then learns further world knowledge by predicting frozen Stage 1 targets within this latent space, mitigating shortcut learning. Trained on public, unlabeled videos, InternVideo-Next achieves state-of-the-art results across benchmarks and offers a scalable path toward general video representation learning.
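To make the EPD decomposition and the Stage 2 objective concrete, here is a minimal PyTorch sketch. It is an illustrative reading of the abstract, not the paper's implementation: the module names (`EPD`, `stage2_loss`), layer counts, dimensions, and the simplified masking scheme are all assumptions.

```python
# Minimal sketch of the Encoder-Predictor-Decoder (EPD) idea described in the
# abstract. All names, shapes, and the masking scheme are assumptions for
# exposition; they are NOT the paper's actual architecture or hyperparameters.
import torch
import torch.nn as nn


class EPD(nn.Module):
    def __init__(self, dim=768, enc_layers=12, pred_layers=6, heads=12):
        super().__init__()
        # Encoder: embeds the visible (unmasked) video tokens into latents.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True),
            num_layers=enc_layers,
        )
        # Predictor: the "latent world model" that infers latents at masked
        # positions from the visible-token latents.
        self.predictor = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True),
            num_layers=pred_layers,
        )
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, tokens, mask):
        # tokens: (B, N, dim) patch embeddings; mask: (B, N) bool, True = masked.
        # Simplification: masked positions are zeroed instead of dropped.
        B, N, D = tokens.shape
        z = self.encoder(tokens.masked_fill(mask.unsqueeze(-1), 0.0))
        # Insert a learned mask token at masked positions, then predict them.
        z = torch.where(mask.unsqueeze(-1), self.mask_token.expand(B, N, D), z)
        return self.predictor(z)


def stage2_loss(epd, frozen_stage1_encoder, tokens, mask):
    # Stage 2, per the abstract: instead of reconstructing pixels, predict the
    # frozen Stage 1 latents at masked positions, mitigating shortcut learning.
    with torch.no_grad():
        targets = frozen_stage1_encoder(tokens)   # (B, N, dim), frozen targets
    preds = epd(tokens, mask)                     # (B, N, dim)
    return ((preds - targets) ** 2)[mask].mean()  # loss only on masked tokens
```

Stage 1 (not sketched here) would attach a conditional diffusion decoder to the predictor outputs, conditioning reconstruction on image-level semantic priors; the decoder is then discarded and the Stage 1 encoder frozen to serve as the target network above.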