Generic 3D Diffusion Adapter Using Controlled Multi-View Editing

Abstract

Open-domain 3D object synthesis has been lagging behind image synthesis due to limited data and higher computational complexity. To bridge this gap, recent works have investigated multi-view diffusion but often fall short in 3D consistency, visual quality, or efficiency. This paper proposes MVEdit, which functions as a 3D counterpart of SDEdit, employing ancestral sampling to jointly denoise multi-view images and output high-quality textured meshes. Built on off-the-shelf 2D diffusion models, MVEdit achieves 3D consistency through a training-free 3D Adapter, which lifts the 2D views of the last timestep into a coherent 3D representation, then conditions the 2D views of the next timestep using rendered views, without compromising visual quality. With an inference time of only 2-5 minutes, this framework achieves a better trade-off between quality and speed than score distillation. MVEdit is highly versatile and extendable, with a wide range of applications including text/image-to-3D generation, 3D-to-3D editing, and high-quality texture synthesis. In particular, evaluations demonstrate state-of-the-art performance in both image-to-3D and text-guided texture generation tasks. Additionally, we introduce a method for fine-tuning 2D latent diffusion models on small 3D datasets with limited resources, enabling fast low-resolution text-to-3D initialization.
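
The sampling loop described in the abstract can be illustrated with a toy NumPy sketch. This is not MVEdit's implementation: it assumes 1D vectors in place of images, a cyclic shift in place of a camera, a mean of back-projected views in place of a real 3D representation, and a clipping function in place of a pretrained 2D diffusion model. All function names (`project`, `lift`, `render`, `toy_denoiser`, `sample`, `inconsistency`) and the `adapter_weight` blend are hypothetical and introduced only for illustration. What the sketch does keep is the structure: a DDPM-style ancestral step applied jointly to all views, with the previous step's fused-and-rendered views conditioning the next step's clean-image estimate.

```python
import numpy as np


def project(scene, i):
    """Toy 'camera' (assumption): view i sees the scene cyclically shifted by i."""
    return np.roll(scene, i)


def unproject(view, i):
    """Inverse of the toy camera: shift view i back into scene coordinates."""
    return np.roll(view, -i)


def lift(x0_views):
    """Toy 'lifting' (assumption): fuse views by averaging their back-projections."""
    return np.mean([unproject(v, i) for i, v in enumerate(x0_views)], axis=0)


def render(scene, num_views):
    """Render the shared scene into every view."""
    return np.stack([project(scene, i) for i in range(num_views)])


def toy_denoiser(x_t, abar_t, cond, weight):
    """Hypothetical stand-in for a 2D diffusion model predicting clean views.

    The rendered views from the previous step (cond) are blended into the
    estimate; a real system would condition a network on them instead.
    """
    x0 = np.clip(x_t / np.sqrt(abar_t), -1.0, 1.0)
    if cond is not None:
        x0 = (1.0 - weight) * x0 + weight * cond
    return x0


def sample(num_views=4, dim=16, steps=50, adapter_weight=1.0, seed=0):
    """Jointly denoise all views with DDPM ancestral sampling."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-3, 0.2, steps)
    alphas = 1.0 - betas
    abar = np.cumprod(alphas)
    x = rng.standard_normal((num_views, dim))  # start from pure noise
    rendered = None  # no 3D conditioning is available at the first step
    for t in reversed(range(steps)):
        abar_prev = abar[t - 1] if t > 0 else 1.0
        x0 = toy_denoiser(x, abar[t], rendered, adapter_weight)
        # Fuse this step's estimates and render them for the next step.
        rendered = render(lift(x0), num_views)
        # DDPM posterior q(x_{t-1} | x_t, x0): mean coefficients and variance.
        c0 = np.sqrt(abar_prev) * betas[t] / (1.0 - abar[t])
        ct = np.sqrt(alphas[t]) * (1.0 - abar_prev) / (1.0 - abar[t])
        var = betas[t] * (1.0 - abar_prev) / (1.0 - abar[t])
        x = c0 * x0 + ct * x + np.sqrt(var) * rng.standard_normal(x.shape)
    return x


def inconsistency(views):
    """Mean cross-view standard deviation after back-projection (0 = consistent)."""
    back = np.stack([unproject(v, i) for i, v in enumerate(views)])
    return float(back.std(axis=0).mean())


if __name__ == "__main__":
    print("coupled:    ", inconsistency(sample(adapter_weight=1.0)))
    print("independent:", inconsistency(sample(adapter_weight=0.0)))
```

Setting `adapter_weight=0.0` denoises each view independently, so the final views generally disagree once mapped back to scene coordinates. Setting it to `1.0` is the opposite extreme, a hard replacement by the rendered views, chosen here only to make the effect of coupling easy to see.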

