MV-RAG: Retrieval Augmented Multiview Diffusion
August 22, 2025
Authors: Yosef Dayani, Omer Benishu, Sagie Benaim
cs.AI
Abstract
Text-to-3D generation approaches have advanced significantly by leveraging
pretrained 2D diffusion priors, producing high-quality and 3D-consistent
outputs. However, they often fail on out-of-domain (OOD) or rare
concepts, yielding inconsistent or inaccurate results. To address this, we propose
MV-RAG, a novel text-to-3D pipeline that first retrieves relevant 2D images
from a large in-the-wild 2D database and then conditions a multiview diffusion
model on these images to synthesize consistent and accurate multiview outputs.
Training such a retrieval-conditioned model is achieved via a novel hybrid
strategy bridging structured multiview data and diverse 2D image collections.
This involves training on multiview data using augmented conditioning views
that simulate retrieval variance for view-specific reconstruction, alongside
training on sets of retrieved real-world 2D images using a distinctive held-out
view prediction objective: the model predicts the held-out view from the other
views to infer 3D consistency from 2D data. To facilitate a rigorous OOD
evaluation, we introduce a new collection of challenging OOD prompts.
Experiments against state-of-the-art text-to-3D, image-to-3D, and
personalization baselines show that our approach significantly improves 3D
consistency, photorealism, and text adherence for OOD/rare concepts, while
maintaining competitive performance on standard benchmarks.
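To make the two mechanisms described above concrete, here is a minimal, runnable PyTorch sketch of (1) retrieval conditioning and (2) the held-out view prediction objective. `ToyMultiviewModel`, `retrieve_top_k`, `held_out_view_loss`, and all tensor shapes are illustrative assumptions for this toy, not the authors' architecture or released code.

```python
# Toy sketch of the two ideas in the MV-RAG abstract:
# (1) retrieve relevant 2D references, (2) train with a held-out
# view prediction objective on unstructured 2D image sets.
# Everything here is a hypothetical stand-in, not the paper's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMultiviewModel(nn.Module):
    """Stand-in for a multiview diffusion model: fuses a text embedding
    with pooled conditioning-image features to predict one view."""
    def __init__(self, img_dim: int = 64, txt_dim: int = 32):
        super().__init__()
        self.proj = nn.Linear(txt_dim + img_dim, img_dim)

    def forward(self, text_emb: torch.Tensor, cond_images: torch.Tensor) -> torch.Tensor:
        pooled = cond_images.mean(dim=0)                # pool retrieved views
        fused = torch.cat([text_emb, pooled], dim=-1)   # fuse with the prompt
        return self.proj(fused)                         # predicted view features

def retrieve_top_k(query_emb: torch.Tensor, database_embs: torch.Tensor, k: int = 4) -> torch.Tensor:
    """Retrieval step: rank database images by cosine similarity to the
    query embedding and keep the top k as conditioning references."""
    sims = F.cosine_similarity(query_emb.unsqueeze(0), database_embs, dim=-1)
    return database_embs[sims.topk(k).indices]

def held_out_view_loss(model: ToyMultiviewModel, text_emb: torch.Tensor,
                       retrieved_views: torch.Tensor) -> torch.Tensor:
    """Held-out view prediction: hide one retrieved view and predict it
    from the remaining ones, so consistency is inferred from 2D-only data."""
    i = torch.randint(len(retrieved_views), (1,)).item()
    target = retrieved_views[i]
    context = torch.cat([retrieved_views[:i], retrieved_views[i + 1:]])
    pred = model(text_emb, context)
    return F.mse_loss(pred, target)

# Usage with dummy tensors standing in for images and embeddings.
db = torch.randn(100, 64)    # "in-the-wild" 2D database (feature vectors)
query = torch.randn(64)      # query embedding (shares image space in this toy)
refs = retrieve_top_k(query, db, k=4)

model = ToyMultiviewModel()
loss = held_out_view_loss(model, torch.randn(32), refs)
loss.backward()
print(f"held-out view loss: {loss.item():.4f}")
```

In the actual method, retrieval would operate over a large in-the-wild image database and the model would be a multiview diffusion network conditioned on the retrieved images; the toy replaces both with small tensors so the sketch stays self-contained.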