Repurposing Geometric Foundation Models for Multi-view Diffusion
March 23, 2026
Authors: Wooseok Jang, Seonghu Jeon, Jisang Han, Jinhyeok Choi, Minkyung Kwon, Seungryong Kim, Saining Xie, Sainan Liu
cs.AI
Abstract
While recent advances in generative latent spaces have driven substantial progress in single-image generation, the optimal latent space for novel view synthesis (NVS) remains largely unexplored. In particular, NVS requires geometrically consistent generation across viewpoints, but existing approaches typically operate in a view-independent VAE latent space. In this paper, we propose Geometric Latent Diffusion (GLD), a framework that repurposes the geometrically consistent feature space of geometric foundation models as the latent space for multi-view diffusion. We show that these features not only support high-fidelity RGB reconstruction but also encode strong cross-view geometric correspondences, providing a well-suited latent space for NVS. Our experiments demonstrate that GLD outperforms both VAE and RAE on 2D image quality and 3D consistency metrics, while accelerating training by more than 4.4x compared to the VAE latent space. Notably, GLD remains competitive with state-of-the-art methods that leverage large-scale text-to-image pretraining, despite training its diffusion model from scratch without such generative pretraining.
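To make the core idea concrete, here is a minimal toy sketch of training in a geometric latent space: a frozen encoder (a stand-in for the geometric foundation model, which the abstract does not specify) maps each view to latents that are identical for identical inputs, and a standard DDPM forward process adds noise in that latent space. All names, shapes, and the random-projection encoder are hypothetical illustrations, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_views(images):
    # Stand-in for a frozen geometric-foundation-model encoder (hypothetical).
    # The same fixed weights are applied to every view, so identical views
    # map to identical latents -- a crude proxy for view-consistent features.
    V, H, W, C = images.shape
    proj = np.full((C, 8), 0.1)  # frozen projection, shared across views
    return images.reshape(V, H * W, C) @ proj  # (V, H*W, 8) latent tokens

def noisy_latent(z0, t, alphas_cum):
    # Standard DDPM forward process q(z_t | z_0), applied in the
    # geometric latent space instead of a VAE latent space.
    eps = rng.standard_normal(z0.shape)
    a = alphas_cum[t]
    return np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps, eps

# Two views of the same scene -> latents from the frozen encoder
views = rng.random((2, 4, 4, 3))
z0 = encode_views(views)

# A tiny noise schedule; a diffusion model would be trained to predict eps
alphas_cum = np.cumprod(np.linspace(0.999, 0.95, 10))
z_t, eps = noisy_latent(z0, t=5, alphas_cum=alphas_cum)
```

In the actual framework, the denoiser would be trained from scratch on these latents, and a separate decoder would reconstruct RGB images from the denoised features.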