CRM: Single Image to 3D Textured Mesh with Convolutional Reconstruction Model

Abstract

Feed-forward 3D generative models such as the Large Reconstruction Model (LRM) have demonstrated exceptional generation speed. However, transformer-based methods do not exploit the geometric priors of the triplane component in their architecture, which often leads to sub-optimal quality given the limited amount of 3D data and slow training. In this work, we present the Convolutional Reconstruction Model (CRM), a high-fidelity feed-forward generative model that maps a single image to 3D. Recognizing the limitations posed by sparse 3D data, we highlight the necessity of integrating geometric priors into network design. CRM builds on the key observation that the visualization of a triplane exhibits spatial correspondence with six orthographic images. It first generates six orthographic view images from a single input image, then feeds these images into a convolutional U-Net, leveraging its strong pixel-level alignment capabilities and significant bandwidth to create a high-resolution triplane. CRM further employs Flexicubes as its geometric representation, facilitating direct end-to-end optimization on textured meshes. Overall, our model delivers a high-fidelity textured mesh from an image in just 10 seconds, without any test-time optimization.
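
To make the triplane representation mentioned above concrete, the following is a minimal sketch of how a 3D point can be assigned a feature from three axis-aligned feature planes. It is not CRM's implementation: the function names (`to_index`, `sample_triplane`, `constant_plane`), the nearest-pixel lookup, and the aggregation by summation are all assumptions chosen for illustration. The sketch only shows why each plane lines up spatially with an orthographic view: projecting a point onto a plane simply drops one coordinate.

```python
# Hypothetical sketch: every name here is illustrative, not taken from CRM's code.

def to_index(coord, resolution):
    """Map a coordinate in [-1, 1] to the nearest pixel index in [0, resolution - 1]."""
    clamped = max(-1.0, min(1.0, coord))
    return round((clamped + 1.0) / 2.0 * (resolution - 1))


def sample_triplane(planes, point):
    """Look up a feature for a 3D point from three axis-aligned feature planes.

    planes: dict with keys "xy", "xz", "yz"; each value is a square grid
            (list of rows) whose cells are feature vectors (lists of floats).
    point:  (x, y, z) with every coordinate in [-1, 1].
    """
    x, y, z = point
    # Orthographic projection onto an axis-aligned plane drops one coordinate.
    projections = {"xy": (x, y), "xz": (x, z), "yz": (y, z)}
    total = None
    for name, (u, v) in projections.items():
        grid = planes[name]
        res = len(grid)
        # Assumption: nearest-pixel lookup instead of bilinear interpolation.
        feature = grid[to_index(v, res)][to_index(u, res)]
        # Assumption: per-plane features are aggregated by summation.
        if total is None:
            total = list(feature)
        else:
            total = [a + b for a, b in zip(total, feature)]
    return total


def constant_plane(resolution, value):
    """Build a resolution x resolution grid whose cells all hold [value]."""
    return [[[value] for _ in range(resolution)] for _ in range(resolution)]


planes = {
    "xy": constant_plane(4, 1.0),
    "xz": constant_plane(4, 10.0),
    "yz": constant_plane(4, 100.0),
}
feature = sample_triplane(planes, (0.2, -0.5, 0.9))
print(feature)  # [111.0]
```

In a learned model, the sampled feature would typically be passed to a small decoder that predicts geometry and color. That step is omitted here because the abstract does not describe it.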
