Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning
September 30, 2024
Authors: Weitai Kang, Haifeng Huang, Yuzhang Shang, Mubarak Shah, Yan Yan
cs.AI
Abstract
Recent advancements in 3D Large Language Models (3DLLMs) have highlighted
their potential in building general-purpose agents in the 3D real world, yet
challenges remain due to the lack of high-quality robust instruction-following
data, leading to limited discriminative power and generalization of 3DLLMs. In
this paper, we introduce Robin3D, a powerful 3DLLM trained on large-scale
instruction-following data generated by our novel data engine, Robust Instruction Generation (RIG). RIG produces two key types of instruction data: 1) Adversarial Instruction-following data, which mixes negative and positive samples to enhance the model's discriminative understanding, and 2) Diverse Instruction-following data, which covers various instruction styles to enhance the model's generalization. As a result, we construct 1 million instruction-following samples, consisting of 344K Adversarial samples, 508K Diverse samples, and 165K benchmark training-set samples. To better handle these complex instructions, Robin3D first incorporates a Relation-Augmented Projector to enhance spatial understanding, and then strengthens its object referring and grounding abilities through ID-Feature Bonding. Robin3D consistently outperforms previous methods across five widely used 3D multimodal learning benchmarks, without the need for task-specific fine-tuning. Notably, we achieve a 7.8% improvement in the grounding task (Multi3DRefer) and a 6.9% improvement in the captioning task (Scan2Cap).
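To make the adversarial-data idea concrete, below is a minimal sketch of how mixed positive and negative instruction-following samples could be assembled from grounded object annotations. The sample schema, function name, and negative-sampling strategy are illustrative assumptions for this sketch only, not the paper's actual RIG engine, which the abstract does not detail.

```python
import random

# Hypothetical schema: each annotation pairs an object ID in one 3D scene
# with a referring description, e.g. {"obj_id": 12, "desc": "the red chair
# next to the window"}. This is an illustrative stand-in, not RIG's format.

def make_adversarial_samples(scene_annotations, distractor_descs, num_negatives=1):
    """Mix positive queries (the referent exists in the scene) with
    negative queries (no valid referent), so a model trained on the
    result must discriminate rather than always ground something.

    scene_annotations: [{"obj_id": int, "desc": str}, ...] for one scene.
    distractor_descs: descriptions of objects NOT present in this scene.
    """
    samples = []
    for ann in scene_annotations:
        # Positive sample: the described object exists; answer grounds it by ID.
        samples.append({
            "instruction": f"Find the object described as: {ann['desc']}",
            "answer": f"<OBJ_{ann['obj_id']}>",
        })
    # Negative samples: borrow descriptions with no referent in this scene,
    # so the correct answer is a rejection.
    k = min(num_negatives * len(scene_annotations), len(distractor_descs))
    for desc in random.sample(distractor_descs, k):
        samples.append({
            "instruction": f"Find the object described as: {desc}",
            "answer": "There is no matching object in this scene.",
        })
    random.shuffle(samples)  # interleave positives and negatives
    return samples
```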
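Similarly, the diverse data could be approximated by rendering the same underlying query in several instruction styles. The templates below are hypothetical examples; the abstract only states that the Diverse data "contains various instruction styles" without specifying them.

```python
# Toy illustration of style diversification for one referring description.
# The templates are assumptions made for this sketch, not RIG's pipeline.

STYLE_TEMPLATES = [
    "Find the object described as: {desc}",             # imperative
    "Which object matches this description: {desc}?",   # question
    "Locate {desc} in the scene and return its ID.",    # task-oriented
    "I'm looking for {desc}. Can you point it out?",    # conversational
]

def diversify(desc: str) -> list[str]:
    """Render one referring description in every instruction style."""
    return [t.format(desc=desc) for t in STYLE_TEMPLATES]

# Example: diversify("the red chair next to the window") yields four
# differently phrased instructions for the same grounding target.
```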