

Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction

April 13, 2026
作者: Efstathios Karypidis, Spyros Gidaris, Nikos Komodakis
cs.AI

Abstract

Accurate future video prediction requires both high visual fidelity and consistent scene semantics, particularly in complex dynamic environments such as autonomous driving. We present Re2Pix, a hierarchical video prediction framework that decomposes forecasting into two stages: semantic representation prediction and representation-guided visual synthesis. Instead of directly predicting future RGB frames, our approach first forecasts future scene structure in the feature space of a frozen vision foundation model, and then conditions a latent diffusion model on these predicted representations to render photorealistic frames. This decomposition enables the model to focus first on scene dynamics and then on appearance generation. A key challenge arises from the train-test mismatch between ground-truth representations available during training and predicted ones used at inference. To address this, we introduce two conditioning strategies, nested dropout and mixed supervision, that improve robustness to imperfect autoregressive predictions. Experiments on challenging driving benchmarks demonstrate that the proposed semantics-first design significantly improves temporal semantic consistency, perceptual quality, and training efficiency compared to strong diffusion baselines. We provide the implementation code at https://github.com/Sta8is/Re2Pix.
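The nested-dropout conditioning strategy mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the conditioning representations form an ordered sequence of tokens, and that nested dropout (in the sense of Rippel et al.) keeps a random-length prefix and zeroes the rest, so the synthesis model learns to tolerate partially reliable conditioning. The function name and array shapes are hypothetical.

```python
import numpy as np

def nested_dropout(tokens: np.ndarray, rng: np.random.Generator):
    """Apply nested dropout to a (T, D) array of conditioning tokens.

    Samples a cutoff index b uniformly in [1, T] and zeroes every token
    from position b onward, keeping only a random-length prefix intact.
    Returns the masked tokens and the sampled cutoff.
    """
    num_tokens = tokens.shape[0]
    b = int(rng.integers(1, num_tokens + 1))  # length of the kept prefix
    out = tokens.copy()
    out[b:] = 0.0  # drop (zero out) everything after the cutoff
    return out, b

# Toy usage: 8 tokens of dimension 4, all ones for easy inspection.
rng = np.random.default_rng(0)
toks = np.ones((8, 4))
dropped, b = nested_dropout(toks, rng)
```

At training time such a mask would be applied to the ground-truth representations before they condition the diffusion model, mimicking the degraded conditioning seen at inference.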