LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness
September 26, 2024
Authors: Chenming Zhu, Tai Wang, Wenwei Zhang, Jiangmiao Pang, Xihui Liu
cs.AI
Abstract
Recent advancements in Large Multimodal Models (LMMs) have greatly enhanced
their proficiency in 2D visual understanding tasks, enabling them to
effectively process and understand images and videos. However, the development
of LMMs with 3D-awareness for 3D scene understanding has been hindered by the
lack of large-scale 3D vision-language datasets and powerful 3D encoders. In
this paper, we introduce a simple yet effective framework called LLaVA-3D.
Leveraging the strong 2D understanding priors from LLaVA, our LLaVA-3D
efficiently adapts LLaVA for 3D scene understanding without compromising 2D
understanding capabilities. To achieve this, we employ a simple yet effective
representation, 3D Patch, which connects 2D CLIP patch features with their
corresponding positions in 3D space. By integrating the 3D Patches into 2D LMMs
and employing joint 2D and 3D vision-language instruction tuning, we establish
a unified architecture for both 2D image understanding and 3D scene
understanding. Experimental results show that LLaVA-3D converges 3.5x faster
than existing 3D LMMs when trained on 3D vision-language datasets. Moreover,
LLaVA-3D not only achieves state-of-the-art performance across various 3D tasks
but also maintains comparable 2D image understanding and vision-language
conversation capabilities with LLaVA.
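The 3D Patch representation described in the abstract pairs each 2D CLIP patch feature with its position in 3D space. The paper does not specify the fusion details here, so the following is a minimal sketch under assumptions: the 3D coordinates are encoded with a sinusoidal embedding and fused with the patch features by simple addition, and the function names (`pos_embed_3d`, `build_3d_patches`) are illustrative, not from the paper.

```python
import numpy as np

def pos_embed_3d(xyz, dim):
    """Map 3D coordinates [N, 3] to sinusoidal embeddings [N, dim].

    Assumes dim is divisible by 6 (a sin and cos band per axis).
    This is an illustrative choice, not necessarily the paper's encoding.
    """
    bands = dim // 6
    freqs = 1.0 / (10000.0 ** (np.arange(bands) / bands))  # [bands]
    angles = xyz[:, :, None] * freqs[None, None, :]        # [N, 3, bands]
    emb = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return emb.reshape(xyz.shape[0], -1)                   # [N, 6*bands]

def build_3d_patches(patch_feats, patch_xyz):
    """Fuse 2D CLIP patch features [N, C] with their 3D positions [N, 3].

    Here fusion is additive: feature + position embedding of the same width,
    yielding 3D-aware patch tokens that a 2D LMM can consume unchanged.
    """
    return patch_feats + pos_embed_3d(patch_xyz, patch_feats.shape[1])
```

The appeal of this design, as the abstract suggests, is that the token shape seen by the LMM is unchanged, so the 2D understanding pathway is reused as-is.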