FlexWorld: Progressively Expanding 3D Scenes for Flexible-View Synthesis

Abstract

Generating flexible-view 3D scenes, including 360° rotation and zooming, from single images is challenging because 3D data is scarce. To address this, we introduce FlexWorld, a novel framework consisting of two key components: (1) a strong video-to-video (V2V) diffusion model that generates high-quality novel view images from incomplete input rendered from a coarse scene, and (2) a progressive expansion process that constructs a complete 3D scene. In particular, by leveraging an advanced pre-trained video model and accurate depth-estimated training pairs, our V2V model can generate novel views under large camera pose variations. Building on this model, FlexWorld progressively generates new 3D content and integrates it into the global scene through geometry-aware scene fusion. Extensive experiments demonstrate the effectiveness of FlexWorld in generating high-quality novel view videos and flexible-view 3D scenes from single images, achieving superior visual quality under multiple popular metrics and datasets compared to existing state-of-the-art methods. Qualitatively, we highlight that FlexWorld can generate high-fidelity scenes with flexible views such as 360° rotations and zooming. Project page: https://ml-gsai.github.io/FlexWorld.
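
The progressive expansion described above can be pictured as a loop: render incomplete views from the current scene, refine them, lift the result to 3D, and fuse only the new content into the global scene. The following is a minimal toy sketch of that control flow, not the authors' implementation. Every name in it (`Scene`, `expand_scene`, `render`, `refine`, `lift`) is hypothetical, the V2V diffusion model and depth estimator are replaced by caller-supplied stand-ins, and the voxel-occupancy check is an assumed, simplified substitute for geometry-aware scene fusion.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

Point = Tuple[float, float, float]
Voxel = Tuple[int, int, int]


@dataclass
class Scene:
    """Toy global scene: one representative point per occupied voxel."""
    voxel_size: float = 1.0
    points: Dict[Voxel, Point] = field(default_factory=dict)

    def voxel_of(self, p: Point) -> Voxel:
        # Quantize a 3D point to the integer index of the voxel containing it.
        return tuple(int(c // self.voxel_size) for c in p)

    def fuse(self, new_points: Iterable[Point]) -> int:
        """Add only points that fall in unoccupied voxels; return how many were added.

        Assumption: this occupancy test is a simplified stand-in for
        geometry-aware fusion, which the abstract does not specify.
        """
        added = 0
        for p in new_points:
            key = self.voxel_of(p)
            if key not in self.points:
                self.points[key] = p
                added += 1
        return added


def expand_scene(
    scene: Scene,
    trajectories: Sequence,
    render: Callable,  # hypothetical: (scene, trajectory) -> incomplete frames
    refine: Callable,  # hypothetical stand-in for the V2V model: frames -> frames
    lift: Callable,    # hypothetical stand-in for depth-based lifting: (frames, trajectory) -> points
) -> List[int]:
    """Expand the scene one camera trajectory at a time."""
    added_per_step = []
    for trajectory in trajectories:
        coarse_frames = render(scene, trajectory)      # views with holes in unseen regions
        refined_frames = refine(coarse_frames)         # fill in the missing content
        new_points = lift(refined_frames, trajectory)  # back-project frames to 3D points
        added_per_step.append(scene.fuse(new_points))  # keep only content not yet in the scene
    return added_per_step
```

Passing the three stages in as callables keeps the loop independent of any particular renderer or model, so the sketch can be exercised with trivial stand-ins before real components are plugged in.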