ChatPaper.ai


Pushing Auto-regressive Models for 3D Shape Generation at Capacity and Scalability

February 19, 2024
作者: Xuelin Qian, Yu Wang, Simian Luo, Yinda Zhang, Ying Tai, Zhenyu Zhang, Chengjie Wang, Xiangyang Xue, Bo Zhao, Tiejun Huang, Yunsheng Wu, Yanwei Fu
cs.AI

Abstract

Auto-regressive models have achieved impressive results in 2D image generation by modeling joint distributions in grid space. In this paper, we extend auto-regressive models to the 3D domain and seek stronger 3D shape generation by improving their capacity and scalability simultaneously. First, we leverage an ensemble of publicly available 3D datasets to facilitate the training of large-scale models. This ensemble comprises approximately 900,000 objects, each with multiple properties: meshes, points, voxels, rendered images, and text captions. This diverse labeled dataset, termed Objaverse-Mix, empowers our model to learn from a wide range of object variations. However, directly applying 3D auto-regression encounters critical challenges: high computational demands on volumetric grids and an ambiguous auto-regressive order along grid dimensions, resulting in inferior 3D shape quality. To this end, we present a novel framework, Argus3D, that improves capacity. Concretely, our approach introduces discrete representation learning based on a latent vector instead of volumetric grids, which not only reduces computational costs but also preserves essential geometric details by learning the joint distributions in a more tractable order. Conditional generation can thus be realized by simply concatenating various conditioning inputs, such as point clouds, categories, images, and texts, to the latent vector. In addition, thanks to the simplicity of our model architecture, we naturally scale our approach up to a larger model with an impressive 3.6 billion parameters, further enhancing the quality of versatile 3D generation. Extensive experiments on four generation tasks demonstrate that Argus3D can synthesize diverse and faithful shapes across multiple categories, achieving remarkable performance.
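The two mechanisms the abstract names, discretizing a latent vector into codes and prepending conditioning inputs for auto-regressive prediction, can be illustrated with a toy sketch. This is not the authors' Argus3D code; the codebook sizes, the nearest-neighbor quantizer, and the `cond_token` placeholder are all illustrative assumptions.

```python
import numpy as np

def quantize(latent, codebook):
    """Map each row of `latent` (T, D) to the index of its nearest
    codebook entry (K, D), yielding a sequence of discrete codes."""
    # Squared Euclidean distance between every latent row and every code.
    d2 = ((latent[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)  # (T,) integer code indices

def build_ar_sequence(codes, cond_token):
    """Prepend a conditioning token so an auto-regressive model can
    predict codes left-to-right given the condition."""
    return np.concatenate([[cond_token], codes])

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))  # hypothetical: 8 codes, 4-dim each
# Fabricate latents that sit near known codebook rows, plus tiny noise.
latent = codebook[[2, 5, 5, 0]] + 0.01 * rng.normal(size=(4, 4))
codes = quantize(latent, codebook)
seq = build_ar_sequence(codes, cond_token=100)
```

In this framing, swapping the condition (a class label, an image embedding index, a text embedding index) only changes the prefix of the sequence, which is why conditioning reduces to simple concatenation.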