InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision
December 1, 2025
Authors: Chenting Wang, Yuhan Zhu, Yicheng Xu, Jiange Yang, Ziang Yan, Yali Wang, Yi Wang, Limin Wang
cs.AI
Abstract
Large-scale video-text pretraining achieves strong performance but depends on noisy, synthetic captions with limited semantic coverage, often overlooking implicit world knowledge such as object motion, 3D geometry, and physical cues. In contrast, masked video modeling (MVM) directly exploits spatiotemporal structure but trails text-supervised methods on general tasks. We find this gap arises from overlooked architectural issues: pixel-level reconstruction struggles to converge and its low-level objective often conflicts with semantics, while latent prediction often encourages shortcut learning. To address these issues, we disentangle the traditional encoder-decoder design into an Encoder-Predictor-Decoder (EPD) framework, in which the predictor acts as a latent world model, and propose InternVideo-Next, a two-stage pretraining scheme that builds a semantically consistent yet detail-preserving latent space for this world model. First, the conventional linear decoder in pixel-level MVM forces the predictor's output latents to be linearly projectable into, and thus separable in, pixel space, which conflicts with semantic abstraction. Our Stage 1 instead introduces a conditional diffusion decoder and injects reliable image-level semantic priors, improving both semantics and convergence and thus bridging pixel-level fidelity with high-level semantic abstraction. Stage 2 further learns world knowledge by predicting frozen Stage 1 targets within this latent space, mitigating shortcut learning. Trained on public, unlabeled videos, InternVideo-Next achieves state-of-the-art results across benchmarks and provides a scalable path toward general video representation learning.
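To make the EPD decomposition concrete, the sketch below shows a minimal encoder-predictor pair and a Stage 2 latent-prediction step that regresses frozen Stage 1 features at masked positions. This is a toy illustration under assumed details: the class names (Encoder, Predictor, stage2_step), module sizes, and the contiguous-masking scheme are all hypothetical, and the Stage 1 conditional diffusion decoder is omitted since only the Stage 2 objective is being illustrated; none of this is InternVideo-Next's released implementation.

```python
# A minimal EPD sketch in PyTorch, assuming a plain Transformer encoder
# and contiguous token masking. Names and sizes are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Encodes video tokens into latent features."""
    def __init__(self, dim=256, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, tokens):                    # (B, N, dim)
        return self.blocks(tokens)

class Predictor(nn.Module):
    """Latent world model: fills in latents at masked positions from
    visible latents plus learnable mask queries."""
    def __init__(self, dim=256, depth=2, heads=8):
        super().__init__()
        self.mask_query = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, visible_latents, n_masked):
        b, _, d = visible_latents.shape
        queries = self.mask_query.expand(b, n_masked, d)
        x = torch.cat([visible_latents, queries], dim=1)
        return self.blocks(x)[:, -n_masked:]      # predictions for masked slots

def stage2_step(encoder, predictor, frozen_encoder, tokens, n_masked):
    """Stage 2 latent prediction: regress the frozen Stage 1 encoder's
    features at masked positions (the abstract's shortcut mitigation).
    For simplicity, the last `n_masked` tokens are treated as masked."""
    visible = tokens[:, :-n_masked]
    with torch.no_grad():                         # frozen Stage 1 targets
        targets = frozen_encoder(tokens)[:, -n_masked:]
    preds = predictor(encoder(visible), n_masked)
    return F.mse_loss(preds, targets)

# Toy usage: clips of 16 tokens, last 8 masked.
encoder, predictor, teacher = Encoder(), Predictor(), Encoder()
for p in teacher.parameters():                    # stands in for the frozen Stage 1 encoder
    p.requires_grad_(False)
loss = stage2_step(encoder, predictor, teacher, torch.randn(2, 16, 256), n_masked=8)
loss.backward()
```

Because the targets come from a frozen encoder operating in the same latent space, the predictor cannot satisfy the objective through low-level pixel shortcuts, which is the mechanism the abstract credits for mitigating shortcut learning.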