Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction
April 13, 2026
Authors: Efstathios Karypidis, Spyros Gidaris, Nikos Komodakis
cs.AI
Abstract
Accurate future video prediction requires both high visual fidelity and consistent scene semantics, particularly in complex dynamic environments such as autonomous driving. We present Re2Pix, a hierarchical video prediction framework that decomposes forecasting into two stages: semantic representation prediction and representation-guided visual synthesis. Instead of directly predicting future RGB frames, our approach first forecasts future scene structure in the feature space of a frozen vision foundation model, and then conditions a latent diffusion model on these predicted representations to render photorealistic frames. This decomposition enables the model to focus first on scene dynamics and then on appearance generation. A key challenge arises from the train-test mismatch between ground-truth representations available during training and predicted ones used at inference. To address this, we introduce two conditioning strategies, nested dropout and mixed supervision, that improve robustness to imperfect autoregressive predictions. Experiments on challenging driving benchmarks demonstrate that the proposed semantics-first design significantly improves temporal semantic consistency, perceptual quality, and training efficiency compared to strong diffusion baselines. The implementation code is available at https://github.com/Sta8is/Re2Pix.
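The two conditioning strategies named in the abstract could be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the function names, the channel-prefix form of nested dropout, and the per-sample mixing probability are all assumptions made for the sketch.

```python
import torch


def nested_dropout(feats: torch.Tensor, p_keep_all: float = 0.5) -> torch.Tensor:
    """Hypothetical nested dropout over the channel dimension.

    With probability p_keep_all the features pass through unchanged;
    otherwise a random prefix of channels is kept and the rest zeroed,
    simulating the degraded representations seen at inference time.
    """
    d = feats.shape[-1]
    if torch.rand(()) < p_keep_all:
        return feats
    k = int(torch.randint(1, d + 1, (1,)).item())  # random prefix length
    mask = torch.zeros(d, dtype=feats.dtype)
    mask[:k] = 1.0
    return feats * mask


def mixed_supervision_condition(
    gt_feats: torch.Tensor, pred_feats: torch.Tensor, p_pred: float = 0.5
) -> torch.Tensor:
    """Hypothetical mixed supervision: per sample, randomly swap the
    ground-truth conditioning features for the forecaster's own
    predictions, so the renderer trains on the imperfect inputs it
    will receive during autoregressive inference.
    """
    use_pred = torch.rand(gt_feats.shape[0]) < p_pred  # per-sample coin flip
    mask = use_pred.view(-1, *([1] * (gt_feats.dim() - 1))).to(gt_feats.dtype)
    return mask * pred_feats + (1.0 - mask) * gt_feats
```

In this sketch, nested dropout orders the conditioning channels by importance (a prefix always survives), while mixed supervision closes the train-test gap by exposing the renderer to its upstream module's actual prediction errors.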