ShapeLLM-Omni: A Native Multimodal LLM for 3D Generation and Understanding

Abstract

Recently, the powerful text-to-image capabilities of ChatGPT-4o have led to growing appreciation for native multimodal large language models. However, its multimodal capabilities remain confined to images and text. Beyond images, the ability to understand and generate 3D content is equally crucial. To address this gap, we propose ShapeLLM-Omni, a native 3D large language model capable of understanding and generating 3D assets and text in any sequence. First, we train a 3D vector-quantized variational autoencoder (VQVAE), which maps 3D objects into a discrete latent space to achieve efficient and accurate shape representation and reconstruction. Building upon the 3D-aware discrete tokens, we construct a large-scale continuous training dataset named 3D-Alpaca, encompassing generation, comprehension, and editing, thus providing rich resources for future research and training. Finally, we perform instruction-based training of the Qwen-2.5-vl-7B-Instruct model on the 3D-Alpaca dataset. Our work is an effective attempt at extending multimodal models with basic 3D capabilities, which contributes to future research in 3D-native AI. Project page: https://github.com/JAMESYJL/ShapeLLM-Omni
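To make the idea of mapping shapes into a discrete latent space concrete, the sketch below shows the generic vector-quantization step used in VQVAE-style models: each continuous latent vector is replaced by the index of its nearest codebook entry, and decoding looks the entry back up. This is not the authors' implementation. The function names, the toy codebook, and the latent values are hypothetical, and a real 3D VQVAE would learn the codebook jointly with an encoder and decoder.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each continuous latent vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in latents
    ]


def dequantize(tokens: Sequence[int], codebook: Sequence[Vector]) -> List[Vector]:
    """Look up the codebook vector for each discrete token."""
    return [codebook[t] for t in tokens]


# Toy, hypothetical values: a 4-entry codebook of 2-D vectors.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
latents = [(0.1, -0.2), (0.9, 0.8), (0.2, 0.7)]

tokens = quantize(latents, codebook)            # discrete token ids
reconstruction = dequantize(tokens, codebook)   # quantized latents
```

The resulting integer ids are the kind of discrete tokens that a language model can consume and emit alongside text tokens, which is what allows 3D assets and text to share one sequence.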
