ViDAR: Video Diffusion-Aware 4D Reconstruction From Monocular Inputs

Abstract

Dynamic Novel View Synthesis aims to generate photorealistic views of moving subjects from arbitrary viewpoints. This task is particularly challenging when relying on monocular video, where disentangling structure from motion is ill-posed and supervision is scarce. We introduce Video Diffusion-Aware Reconstruction (ViDAR), a novel 4D reconstruction framework that leverages personalised diffusion models to synthesise a pseudo multi-view supervision signal for training a Gaussian splatting representation. By conditioning on scene-specific features, ViDAR recovers fine-grained appearance details while mitigating artefacts introduced by monocular ambiguity. To address the spatio-temporal inconsistency of diffusion-based supervision, we propose a diffusion-aware loss function and a camera pose optimisation strategy that aligns synthetic views with the underlying scene geometry. Experiments on DyCheck, a challenging benchmark with extreme viewpoint variation, show that ViDAR outperforms all state-of-the-art baselines in visual quality and geometric consistency. We further highlight ViDAR's strong improvement over baselines on dynamic regions and provide a new benchmark to compare performance in reconstructing motion-rich parts of the scene. Project page: https://vidar-4d.github.io
