ピクセル以前の表現：意味論に導かれた階層的動画予測

要旨

正確な未来の動画予測には、高い視覚的忠実度と一貫したシーン意味論の両方が必要であり、特に自動運転のような複雑で動的な環境ではその重要性が増す。本論文では、予測を意味的表現の予測と表現誘導型の視覚的合成の2段階に分解する階層的動画予測フレームワーク「Re2Pix」を提案する。将来のRGBフレームを直接予測する代わりに、本手法ではまず凍結された視覚基盤モデルの特徴空間において将来のシーン構造を予測し、その後、潜在拡散モデルをこれらの予測された表現に条件付けすることで、写実的なフレームを生成する。この分解により、モデルはまずシーンのダイナミクスに、次に外観生成に集中することが可能となる。重要な課題は、学習時に利用可能な正解表現と、推論時に使用される予測表現との間の訓練-テストミスマッチから生じる。これに対処するため、ネスト化ドロップアウトと混合教師あり学習という2つの条件付け戦略を導入し、不完全な自己回帰的予測に対するロバスト性を向上させる。挑戦的な運転ベンチマークでの実験により、提案する意味論優先の設計が、強力な拡散ベースラインと比較して、時間的意味的一貫性、知覚的品質、学習効率を大幅に改善することを実証する。実装コードはhttps://github.com/Sta8is/Re2Pix で公開している。

English

Accurate future video prediction requires both high visual fidelity and consistent scene semantics, particularly in complex dynamic environments such as autonomous driving. We present Re2Pix, a hierarchical video prediction framework that decomposes forecasting into two stages: semantic representation prediction and representation-guided visual synthesis. Instead of directly predicting future RGB frames, our approach first forecasts future scene structure in the feature space of a frozen vision foundation model, and then conditions a latent diffusion model on these predicted representations to render photorealistic frames. This decomposition enables the model to focus first on scene dynamics and then on appearance generation. A key challenge arises from the train-test mismatch between ground-truth representations available during training and predicted ones used at inference. To address this, we introduce two conditioning strategies, nested dropout and mixed supervision, that improve robustness to imperfect autoregressive predictions. Experiments on challenging driving benchmarks demonstrate that the proposed semantics-first design significantly improves temporal semantic consistency, perceptual quality, and training efficiency compared to strong diffusion baselines. We provide the implementation code at https://github.com/Sta8is/Re2Pix

ピクセル以前の表現：意味論に導かれた階層的動画予測

Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction

要旨

Support