Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction

Abstract

Accurate future video prediction requires both high visual fidelity and consistent scene semantics, particularly in complex dynamic environments such as autonomous driving. We present Re2Pix, a hierarchical video prediction framework that decomposes forecasting into two stages: semantic representation prediction and representation-guided visual synthesis. Instead of predicting future RGB frames directly, our approach first forecasts future scene structure in the feature space of a frozen vision foundation model, and then conditions a latent diffusion model on these predicted representations to render photorealistic frames. This decomposition lets the model focus first on scene dynamics and then on appearance generation. A key challenge is the train-test mismatch between the ground-truth representations available during training and the predicted ones used at inference. To address this, we introduce two conditioning strategies, nested dropout and mixed supervision, that improve robustness to imperfect autoregressive predictions. Experiments on challenging driving benchmarks demonstrate that the proposed semantics-first design significantly improves temporal semantic consistency, perceptual quality, and training efficiency compared to strong diffusion baselines. The implementation code is available at https://github.com/Sta8is/Re2Pix.
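The abstract names nested dropout as one of the two conditioning strategies but does not spell out how it is applied. As a rough illustration only, the sketch below shows the general nested dropout idea on a set of conditioning feature tokens: a single cut-off index is sampled and every channel past it is zeroed, so the downstream model learns to cope with a truncated conditioning signal. Everything here is an assumption made for illustration, not the Re2Pix implementation: the function name `nested_dropout`, the uniform sampling of the cut-off, the `min_keep` parameter, and the premise that channels are ordered from most to least important.

```python
import random


def nested_dropout(features, rng, min_keep=1):
    """Keep a random-length prefix of channels and zero the rest.

    features: list of token vectors (lists of floats, all the same length).
    rng: a random.Random instance, passed in for reproducibility.
    Returns (masked_features, k), where k is the number of channels kept.
    Hypothetical helper; assumes channels are ordered by importance.
    """
    num_channels = len(features[0])
    # One cut-off per call, shared by all tokens (uniform sampling is an assumption).
    k = rng.randint(min_keep, num_channels)
    masked = [
        [value if channel < k else 0.0 for channel, value in enumerate(token)]
        for token in features
    ]
    return masked, k


rng = random.Random(0)
tokens = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
masked, k = nested_dropout(tokens, rng)
```

In a training loop of this kind, the masked features would replace the full conditioning features for that step, while inference would use all channels. The input list is left unmodified, so the same tokens can be reused with a different cut-off.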
