ShapeLLM-Omni: A Native Multimodal LLM for 3D Generation and Understanding
June 2, 2025
Authors: Junliang Ye, Zhengyi Wang, Ruowen Zhao, Shenghao Xie, Jun Zhu
cs.AI
Abstract
Recently, the powerful text-to-image capabilities of ChatGPT-4o have led to
growing appreciation for native multimodal large language models. However, its
multimodal capabilities remain confined to images and text. Yet beyond images,
the ability to understand and generate 3D content is equally crucial. To
address this gap, we propose ShapeLLM-Omni, a native 3D large language model
capable of understanding and generating 3D assets and text in any sequence.
First, we train a 3D vector-quantized variational autoencoder (VQVAE), which
maps 3D objects into a discrete latent space to achieve efficient and accurate
shape representation and reconstruction. Building upon the 3D-aware discrete
tokens, we innovatively construct a large-scale continuous training dataset
named 3D-Alpaca, encompassing generation, comprehension, and editing, thus
providing rich resources for future research and training. Finally, we perform
instruction-based training of the Qwen-2.5-vl-7B-Instruct model on the
3D-Alpaca dataset. Our work provides an effective attempt at extending
multimodal models with basic 3D capabilities, which contributes to future
research in 3D-native AI. Project page:
https://github.com/JAMESYJL/ShapeLLM-Omni
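
To illustrate the tokenization idea described above, the following is a minimal sketch, in PyTorch style, of the vector-quantization step a 3D VQVAE uses to map continuous 3D latents to discrete token ids. It is not the authors' released code; the codebook size, latent dimension, and the flattened voxel-latent shapes are illustrative assumptions.

import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-codebook-entry quantization: continuous latents -> discrete shape tokens."""
    def __init__(self, num_codes: int = 8192, dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z: torch.Tensor):
        # z: (batch, num_latents, dim) latents from a 3D encoder, e.g. flattened voxel features
        b, n, d = z.shape
        dist = torch.cdist(z.reshape(-1, d), self.codebook.weight)  # (B*N, num_codes)
        token_ids = dist.argmin(dim=-1).reshape(b, n)               # discrete shape tokens
        z_q = self.codebook(token_ids)                              # quantized latents for the decoder
        z_q = z + (z_q - z).detach()                                # straight-through gradient estimator
        return z_q, token_ids

# Hypothetical usage: the integer token_ids are what a language model can read and
# generate once they are added to its vocabulary as extra tokens.
vq = VectorQuantizer()
latents = torch.randn(2, 1024, 256)
z_q, token_ids = vq(latents)
print(token_ids.shape)  # torch.Size([2, 1024])

In the full system described by the abstract, such token sequences are interleaved with text in instruction-response pairs (the 3D-Alpaca dataset), so a single autoregressive model can both understand and emit 3D shapes.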