MV-RAG: Retrieval Augmented Multiview Diffusion

Abstract

Text-to-3D generation approaches have advanced significantly by leveraging pretrained 2D diffusion priors, producing high-quality and 3D-consistent outputs. However, they often fail on out-of-domain (OOD) or rare concepts, yielding inconsistent or inaccurate results. To address this, we propose MV-RAG, a novel text-to-3D pipeline that first retrieves relevant 2D images from a large in-the-wild 2D database and then conditions a multiview diffusion model on these images to synthesize consistent and accurate multiview outputs. We train this retrieval-conditioned model with a novel hybrid strategy that bridges structured multiview data and diverse 2D image collections. The strategy combines two modes: training on multiview data, using augmented conditioning views that simulate retrieval variance for view-specific reconstruction; and training on sets of retrieved real-world 2D images, using a distinctive held-out view prediction objective in which the model predicts the held-out view from the other views to infer 3D consistency from 2D data. To facilitate a rigorous OOD evaluation, we introduce a new collection of challenging OOD prompts. Experiments against state-of-the-art text-to-3D, image-to-3D, and personalization baselines show that our approach significantly improves 3D consistency, photorealism, and text adherence for OOD/rare concepts, while maintaining competitive performance on standard benchmarks.
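As a rough illustration of the two data-side steps described above (retrieving images for a prompt, then holding one retrieved view out as the prediction target), the following is a minimal sketch only. The abstract does not specify the embedding model, the similarity measure, or how the held-out view is chosen; cosine similarity over precomputed embeddings, uniform random selection, and the names `retrieve_top_k` and `split_held_out` are all assumptions made for illustration, not the authors' implementation.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query_emb, database, k):
    """Return the ids of the k database images most similar to the query.

    `database` is a list of (image_id, embedding) pairs. Ranking by cosine
    similarity is an assumption; the source does not name the retriever.
    """
    ranked = sorted(database, key=lambda item: cosine(query_emb, item[1]), reverse=True)
    return [image_id for image_id, _ in ranked[:k]]


def split_held_out(image_ids, rng):
    """Split retrieved images into conditioning views and one held-out target.

    Uniform random choice of the held-out view is an assumption.
    """
    if len(image_ids) < 2:
        raise ValueError("need at least two images to hold one out")
    target_index = rng.randrange(len(image_ids))
    target = image_ids[target_index]
    conditioning = image_ids[:target_index] + image_ids[target_index + 1:]
    return conditioning, target


# Toy 2D "embeddings" standing in for real image/text features (hypothetical data).
database = [("a", [1.0, 0.0]), ("b", [0.9, 0.1]), ("c", [0.0, 1.0]), ("d", [0.7, 0.7])]
retrieved = retrieve_top_k([1.0, 0.0], database, k=3)
conditioning, target = split_held_out(retrieved, random.Random(0))
```

In an actual training loop, the conditioning images would be fed to the multiview diffusion model and the held-out image would serve as the reconstruction target; that model is outside the scope of this sketch.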
