GPT4Point: A Unified Framework for Point-Language Understanding and Generation

Abstract

Multimodal Large Language Models (MLLMs) have excelled in 2D image-text comprehension and image generation, but their understanding of the 3D world is notably deficient, limiting progress in 3D language understanding and generation. To address this problem, we introduce GPT4Point, a point-language multimodal model designed specifically for unified 3D object understanding and generation within the MLLM framework. As a 3D MLLM, GPT4Point can execute a variety of point-text reference tasks such as point-cloud captioning and Q&A. Additionally, GPT4Point is equipped with capabilities for controllable 3D generation: from a low-quality point-text feature, it can produce high-quality results that maintain the geometric shapes and colors. To meet the extensive need for 3D object-text pairs, we develop Pyramid-XL, a point-language dataset annotation engine. It constructs a large-scale database of over 1M objects with varied text granularity levels from the Objaverse-XL dataset, which is essential for training GPT4Point. We also propose a comprehensive benchmark to evaluate 3D point-language understanding capabilities. In extensive evaluations, GPT4Point demonstrates superior performance in understanding and generation.
