Flex3D: Feed-Forward 3D Generation With Flexible Reconstruction Model And Input View Curation

Abstract

Generating high-quality 3D content from text, single images, or sparse-view images remains a challenging task with broad applications. Existing methods typically employ multi-view diffusion models to synthesize multi-view images, followed by a feed-forward process for 3D reconstruction. However, these approaches are often constrained by a small and fixed number of input views, which limits their ability to capture diverse viewpoints and, even worse, leads to suboptimal generation results if the synthesized views are of poor quality. To address these limitations, we propose Flex3D, a novel two-stage framework capable of leveraging an arbitrary number of high-quality input views. The first stage consists of a candidate view generation and curation pipeline. We employ a fine-tuned multi-view image diffusion model and a video diffusion model to generate a pool of candidate views, enabling a rich representation of the target 3D object. Subsequently, a view selection pipeline filters these views based on quality and consistency, ensuring that only high-quality and reliable views are used for reconstruction. In the second stage, the curated views are fed into a Flexible Reconstruction Model (FlexRM), built upon a transformer architecture that can effectively process an arbitrary number of inputs. FlexRM directly outputs 3D Gaussian points leveraging a tri-plane representation, enabling efficient and detailed 3D generation. Through extensive exploration of design and training strategies, we optimize FlexRM to achieve superior performance in both reconstruction and generation tasks. Our results demonstrate that Flex3D achieves state-of-the-art performance, with a win rate of over 92% in a user study on 3D generation tasks when compared to several of the latest feed-forward 3D generative models.
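
The view-curation step of the first stage can be illustrated with a minimal Python sketch. The abstract does not say how quality and consistency are measured, so the per-view scores, the thresholds, and the names `CandidateView` and `curate_views` are all hypothetical, chosen only to show the filtering idea and not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateView:
    """One generated view plus hypothetical scores in [0, 1]."""
    view_id: int
    quality: float       # assumed output of some image-quality scorer
    consistency: float   # assumed agreement with a reference view


def curate_views(candidates, min_quality=0.5, min_consistency=0.5):
    """Keep only views that pass both thresholds, preserving input order.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    return [
        v for v in candidates
        if v.quality >= min_quality and v.consistency >= min_consistency
    ]


pool = [
    CandidateView(0, quality=0.9, consistency=0.8),
    CandidateView(1, quality=0.3, consistency=0.9),  # fails the quality check
    CandidateView(2, quality=0.8, consistency=0.2),  # fails the consistency check
    CandidateView(3, quality=0.7, consistency=0.6),
]
curated = curate_views(pool)
print([v.view_id for v in curated])  # prints [0, 3]
```

Because the number of views that survive filtering varies from object to object, a downstream reconstruction model has to accept an arbitrary number of inputs, which is the property the abstract attributes to FlexRM.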
