Repurposing Geometric Foundation Models for Multi-view Diffusion

Abstract

While recent advances in generative latent spaces have driven substantial progress in single-image generation, the optimal latent space for novel view synthesis (NVS) remains largely unexplored. In particular, NVS requires geometrically consistent generation across viewpoints, but existing approaches typically operate in a view-independent VAE latent space. In this paper, we propose Geometric Latent Diffusion (GLD), a framework that repurposes the geometrically consistent feature space of geometric foundation models as the latent space for multi-view diffusion. We show that these features not only support high-fidelity RGB reconstruction but also encode strong cross-view geometric correspondences, providing a well-suited latent space for NVS. Our experiments demonstrate that GLD outperforms both VAE and RAE on 2D image quality and 3D consistency metrics, while accelerating training by more than 4.4x compared to the VAE latent space. Notably, GLD remains competitive with state-of-the-art methods that leverage large-scale text-to-image pretraining, despite training its diffusion model from scratch without such generative pretraining.
