Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning

Abstract

Recent advancements in 3D Large Language Models (3DLLMs) have highlighted their potential for building general-purpose agents in the 3D real world, yet challenges remain: the lack of high-quality, robust instruction-following data limits the discriminative power and generalization of 3DLLMs. In this paper, we introduce Robin3D, a powerful 3DLLM trained on large-scale instruction-following data generated by our novel data engine, the Robust Instruction Generation (RIG) engine. RIG generates two key types of instruction data: 1) Adversarial Instruction-following data, which mixes negative and positive samples to enhance the model's discriminative understanding; and 2) Diverse Instruction-following data, which contains various instruction styles to enhance the model's generalization. As a result, we construct 1 million instruction-following samples, consisting of 344K Adversarial samples, 508K Diverse samples, and 165K benchmark training set samples. To better handle these complex instructions, Robin3D first incorporates a Relation-Augmented Projector to enhance spatial understanding, and then strengthens its object referring and grounding ability through ID-Feature Bonding. Robin3D consistently outperforms previous methods across five widely used 3D multimodal learning benchmarks, without the need for task-specific fine-tuning. Notably, we achieve a 7.8% improvement in the grounding task (Multi3DRefer) and a 6.9% improvement in the captioning task (Scan2Cap).
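As a quick check of the dataset composition stated above, the three subset sizes can be tallied directly. The following is a minimal sketch that uses only the counts given in the abstract; the dictionary keys and variable names are hypothetical labels chosen for illustration, not identifiers from the Robin3D release.

```python
# Subset sizes as stated in the abstract, in thousands of samples.
# The key names are illustrative labels, not names used by the authors.
composition_k = {
    "adversarial": 344,
    "diverse": 508,
    "benchmark_train": 165,
}

# Total size in thousands of samples.
total_k = sum(composition_k.values())

# Share of each subset as a percentage, rounded to one decimal place.
shares = {
    name: round(100 * count / total_k, 1)
    for name, count in composition_k.items()
}

print(f"total: {total_k}K samples")
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

The three subsets sum to 1,017K samples, which matches the rounded figure of 1 million; the Diverse subset is the largest of the three, at roughly half of the total.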
