V3D: Video Diffusion Models are Effective 3D Generators

Abstract

Automatic 3D generation has recently attracted widespread attention. Recent methods have greatly increased generation speed, but they usually produce objects that lack detail because of limited model capacity or limited 3D data. Motivated by recent advances in video diffusion models, we introduce V3D, which leverages the world simulation capacity of pre-trained video diffusion models to facilitate 3D generation. To fully unleash the potential of video diffusion to perceive the 3D world, we further introduce a geometrical consistency prior and extend the video diffusion model to a multi-view consistent 3D generator. As a result, a state-of-the-art video diffusion model can be fine-tuned to generate 360-degree orbit frames surrounding an object given a single image. With our tailored reconstruction pipelines, we can generate high-quality meshes or 3D Gaussians within 3 minutes. Furthermore, our method can be extended to scene-level novel view synthesis, achieving precise control over the camera path with sparse input views. Extensive experiments demonstrate the superior performance of the proposed approach, especially in terms of generation quality and multi-view consistency. Our code is available at https://github.com/heheyas/V3D
