SpaRP: Fast 3D Object Reconstruction and Pose Estimation from Sparse Views

Abstract

Open-world 3D generation has recently attracted considerable attention. While many single-image-to-3D methods yield visually appealing results, they often lack sufficient controllability and tend to produce hallucinated regions that may not match users' expectations. In this paper, we explore an important scenario in which the input consists of one or a few unposed 2D images of a single object, with little or no overlap. We propose a novel method, SpaRP, to reconstruct a 3D textured mesh and estimate the relative camera poses for these sparse-view images. SpaRP distills knowledge from 2D diffusion models and fine-tunes them to implicitly deduce the 3D spatial relationships between the sparse views. The diffusion model is trained to jointly predict surrogate representations for camera poses and multi-view images of the object under known poses, integrating all information from the input sparse views. These predictions are then used to perform 3D reconstruction and pose estimation, and the reconstructed 3D model can be used to further refine the camera poses of the input views. Through extensive experiments on three datasets, we demonstrate that our method not only significantly outperforms baseline methods in 3D reconstruction quality and pose prediction accuracy but is also efficient: it requires only about 20 seconds to produce a textured mesh and camera poses for the input views. Project page: https://chaoxu.xyz/sparp.
