Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning
September 30, 2024
Authors: Weitai Kang, Haifeng Huang, Yuzhang Shang, Mubarak Shah, Yan Yan
cs.AI
Abstract
Recent advancements in 3D Large Language Models (3DLLMs) have highlighted
their potential in building general-purpose agents in the 3D real world, yet
challenges remain due to the lack of high-quality robust instruction-following
data, leading to limited discriminative power and generalization of 3DLLMs. In
this paper, we introduce Robin3D, a powerful 3DLLM trained on large-scale
instruction-following data generated by our novel data engine, Robust Instruction Generation (RIG). RIG produces two key types of instruction data: 1) Adversarial Instruction-following data, which mixes negative and positive samples to enhance the model's discriminative understanding, and 2) Diverse Instruction-following data, which covers various instruction styles to enhance the model's generalization. As a result, we construct 1 million instruction-following samples, consisting of 344K Adversarial samples, 508K Diverse samples, and 165K benchmark training-set samples. To better handle these complex instructions, Robin3D first incorporates a Relation-Augmented Projector to enhance spatial understanding, and then strengthens its object referring and grounding abilities through ID-Feature Bonding. Robin3D consistently outperforms previous methods across five widely used 3D multimodal learning benchmarks, without the need for task-specific fine-tuning. Notably, we achieve a 7.8% improvement in the grounding task (Multi3DRefer) and a 6.9% improvement in the captioning task (Scan2Cap).
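To make the adversarial-data idea concrete, below is a minimal sketch of how mixed positive and negative instruction-following samples could be assembled from grounded object annotations. The sample schema, function name, and negative-sampling strategy are illustrative assumptions for this sketch only, not the paper's actual RIG engine, which the abstract does not detail.

```python
import random

# Hypothetical schema: each annotation pairs an object ID in one 3D scene
# with a referring description, e.g. {"obj_id": 12, "desc": "the red chair
# next to the window"}. This is an illustrative stand-in, not RIG's format.

def make_adversarial_samples(scene_annotations, distractor_descs, num_negatives=1):
    """Mix positive queries (the referent exists in the scene) with
    negative queries (no valid referent), so a model trained on the
    result must discriminate rather than always ground something.

    scene_annotations: [{"obj_id": int, "desc": str}, ...] for one scene.
    distractor_descs: descriptions of objects NOT present in this scene.
    """
    samples = []
    for ann in scene_annotations:
        # Positive sample: the described object exists; answer grounds it by ID.
        samples.append({
            "instruction": f"Find the object described as: {ann['desc']}",
            "answer": f"<OBJ_{ann['obj_id']}>",
        })
    # Negative samples: borrow descriptions with no referent in this scene,
    # so the correct answer is a rejection.
    k = min(num_negatives * len(scene_annotations), len(distractor_descs))
    for desc in random.sample(distractor_descs, k):
        samples.append({
            "instruction": f"Find the object described as: {desc}",
            "answer": "There is no matching object in this scene.",
        })
    random.shuffle(samples)  # interleave positives and negatives
    return samples
```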
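Similarly, the diverse data could be approximated by rendering the same underlying query in several instruction styles. The templates below are hypothetical examples; the abstract only states that the Diverse data "contains various instruction styles" without specifying them.

```python
# Toy illustration of style diversification for one referring description.
# The templates are assumptions made for this sketch, not RIG's pipeline.

STYLE_TEMPLATES = [
    "Find the object described as: {desc}",             # imperative
    "Which object matches this description: {desc}?",   # question
    "Locate {desc} in the scene and return its ID.",    # task-oriented
    "I'm looking for {desc}. Can you point it out?",    # conversational
]

def diversify(desc: str) -> list[str]:
    """Render one referring description in every instruction style."""
    return [t.format(desc=desc) for t in STYLE_TEMPLATES]

# Example: diversify("the red chair next to the window") yields four
# differently phrased instructions for the same grounding target.
```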