MVDiffusion++: A Dense High-resolution Multi-view Diffusion Model for Single or Sparse-view 3D Object Reconstruction

Abstract

This paper presents MVDiffusion++, a neural architecture for 3D object reconstruction that synthesizes dense, high-resolution views of an object from one or a few input images without camera poses. MVDiffusion++ achieves superior flexibility and scalability with two surprisingly simple ideas: 1) a "pose-free architecture", in which standard self-attention among 2D latent features learns 3D consistency across an arbitrary number of conditional and generation views without explicitly using camera pose information; and 2) a "view dropout strategy" that discards a substantial number of output views during training, which reduces the training-time memory footprint and enables dense, high-resolution view synthesis at test time. We train on Objaverse and evaluate on Google Scanned Objects with standard novel view synthesis and 3D reconstruction metrics, on which MVDiffusion++ significantly outperforms the current state of the art. We also demonstrate a text-to-3D application by combining MVDiffusion++ with a text-to-image generative model.
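The two ideas in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `view_dropout` and `joint_self_attention`, the view counts (2 conditional, 32 generation, 8 kept), the token and feature sizes, and the use of identity query/key/value projections are all assumptions chosen for illustration. The sketch only shows that attention over a flattened token sequence accepts any number of views, so a model could train on a random subset of output views and still attend over the full set at test time.

```python
import numpy as np

def view_dropout(num_gen_views, keep, rng):
    """Pick `keep` generation views to retain for one training step (hypothetical helper)."""
    return np.sort(rng.choice(num_gen_views, size=keep, replace=False))

def joint_self_attention(latents):
    """Plain self-attention over the tokens of all views, with no camera pose input.

    latents: array of shape (views, tokens, dim) holding 2D latent features.
    Identity Q/K/V projections are an assumption made to keep the sketch short.
    """
    v, t, d = latents.shape
    x = latents.reshape(v * t, d)                  # flatten all views into one token sequence
    scores = x @ x.T / np.sqrt(d)                  # scaled dot-product similarities
    scores -= scores.max(axis=-1, keepdims=True)   # stabilise the softmax numerically
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return (weights @ x).reshape(v, t, d)

rng = np.random.default_rng(0)
cond = rng.normal(size=(2, 16, 8))    # 2 conditional views (assumed sizes)
gen = rng.normal(size=(32, 16, 8))    # 32 candidate generation views (assumed sizes)

# Training step: attend over the conditional views plus a random subset of outputs.
kept = view_dropout(len(gen), keep=8, rng=rng)
train_out = joint_self_attention(np.concatenate([cond, gen[kept]], axis=0))

# Test time: the same attention runs over all views, since nothing depends on the view count.
test_out = joint_self_attention(np.concatenate([cond, gen], axis=0))
```

In this sketch the training step attends over 10 views and the test step over 34, with the attention cost growing with the square of the total token count, which is why dropping output views during training reduces memory.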
