High-Fidelity Novel View Synthesis via Splatting-Guided Diffusion

Abstract

Despite recent advances in Novel View Synthesis (NVS), generating high-fidelity views from single or sparse observations remains a significant challenge. Existing splatting-based approaches often produce distorted geometry due to splatting errors. While diffusion-based methods leverage rich 3D priors to achieve improved geometry, they often suffer from texture hallucination. In this paper, we introduce SplatDiff, a pixel-splatting-guided video diffusion model designed to synthesize high-fidelity novel views from a single image. Specifically, we propose an aligned synthesis strategy for precise control of target viewpoints and geometry-consistent view synthesis. To mitigate texture hallucination, we design a texture bridge module that enables high-fidelity texture generation through adaptive feature fusion. In this manner, SplatDiff leverages the strengths of splatting and diffusion to generate novel views with consistent geometry and high-fidelity details. Extensive experiments verify the state-of-the-art performance of SplatDiff in single-view NVS. Additionally, without extra training, SplatDiff shows remarkable zero-shot performance across diverse tasks, including sparse-view NVS and stereo video conversion.
