

Repurposing Geometric Foundation Models for Multi-view Diffusion

March 23, 2026
Authors: Wooseok Jang, Seonghu Jeon, Jisang Han, Jinhyeok Choi, Minkyung Kwon, Seungryong Kim, Saining Xie, Sainan Liu
cs.AI

Abstract

While recent advances in generative latent spaces have driven substantial progress in single-image generation, the optimal latent space for novel view synthesis (NVS) remains largely unexplored. In particular, NVS requires geometrically consistent generation across viewpoints, but existing approaches typically operate in a view-independent VAE latent space. In this paper, we propose Geometric Latent Diffusion (GLD), a framework that repurposes the geometrically consistent feature space of geometric foundation models as the latent space for multi-view diffusion. We show that these features not only support high-fidelity RGB reconstruction but also encode strong cross-view geometric correspondences, providing a well-suited latent space for NVS. Our experiments demonstrate that GLD outperforms both VAE and RAE on 2D image quality and 3D consistency metrics, while accelerating training by more than 4.4x compared to the VAE latent space. Notably, GLD remains competitive with state-of-the-art methods that leverage large-scale text-to-image pretraining, despite training its diffusion model from scratch without such generative pretraining.
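To make the core idea concrete, here is a minimal toy sketch of the GLD pipeline as described in the abstract: encode views with a frozen geometric foundation model, run diffusion in that geometry-aware latent space, and decode each denoised latent back to RGB. Every component below (the random encoder/decoder weights, the one-step "denoiser", the dimensions) is a hypothetical stand-in for illustration only, not the paper's actual networks.

```python
import numpy as np

rng = np.random.default_rng(0)

D_PIX, D_LAT = 48, 16  # toy per-view pixel and latent dimensionalities

# Frozen "geometric encoder": maps each view to geometry-aware features.
# A real system would use a pretrained geometric foundation model here.
W_enc = rng.standard_normal((D_LAT, D_PIX)) / np.sqrt(D_PIX)
# Lightweight decoder trained to reconstruct RGB from those features.
W_dec = rng.standard_normal((D_PIX, D_LAT)) / np.sqrt(D_LAT)

def encode(views):
    """(n_views, D_PIX) -> (n_views, D_LAT): project views into the latent space."""
    return views @ W_enc.T

def decode(latents):
    """(n_views, D_LAT) -> (n_views, D_PIX): map latents back to toy 'RGB'."""
    return latents @ W_dec.T

def denoise_step(z_t, t, total):
    """Toy denoiser: shrinks the noisy latent as t decreases.
    A real model would be a learned multi-view diffusion network
    trained from scratch in this latent space, per the abstract."""
    return z_t * (t / total)

# Multi-view "generation": start from noise jointly over all views in the
# geometric latent space, iteratively denoise, then decode each view.
n_views, steps = 4, 10
z = rng.standard_normal((n_views, D_LAT))
for t in range(steps, 0, -1):
    z = denoise_step(z, t - 1, steps)
images = decode(z)
print(images.shape)  # one toy "RGB" vector per generated view: (4, 48)
```

The design point this sketch mirrors is that the diffusion process operates on per-view features from a shared, view-consistent encoder, so cross-view correspondence is carried by the latent space itself rather than enforced post hoc.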