GPT4Point: A Unified Framework for Point-Language Understanding and Generation
December 5, 2023
Authors: Zhangyang Qi, Ye Fang, Zeyi Sun, Xiaoyang Wu, Tong Wu, Jiaqi Wang, Dahua Lin, Hengshuang Zhao
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) have excelled in 2D image-text
comprehension and image generation, but their understanding of the 3D world is
notably deficient, limiting progress in 3D language understanding and
generation. To solve this problem, we introduce GPT4Point, an innovative,
groundbreaking point-language multimodal model designed specifically for
unified 3D object understanding and generation within the MLLM framework. As a
powerful 3D MLLM, GPT4Point can seamlessly execute a variety of point-text
reference tasks such as point-cloud captioning and Q&A. Additionally, GPT4Point
is equipped with advanced capabilities for controllable 3D generation: it can
produce high-quality results from low-quality point-text features while
maintaining geometric shapes and colors. To support the expansive need for 3D
object-text pairs, we develop Pyramid-XL, a point-language dataset annotation
engine. It constructs a large-scale database of over 1M objects with varied
text granularity levels from the Objaverse-XL dataset, which is essential for
training GPT4Point. We also propose a comprehensive benchmark to evaluate 3D
point-language understanding capabilities. In extensive evaluations, GPT4Point
demonstrates superior performance in understanding and generation.