3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination

Abstract

The integration of language and 3D perception is crucial for developing embodied agents and robots that comprehend and interact with the physical world. While large language models (LLMs) have demonstrated impressive language understanding and generation capabilities, their adaptation to 3D environments (3D-LLMs) remains in its early stages. A primary challenge is the absence of large-scale datasets that provide dense grounding between language and 3D scenes. In this paper, we introduce 3D-GRAND, a pioneering large-scale dataset comprising 40,087 household scenes paired with 6.2 million densely grounded scene-language instructions. Our results show that instruction tuning with 3D-GRAND significantly enhances grounding capabilities and reduces hallucinations in 3D-LLMs. As part of our contributions, we propose a comprehensive benchmark, 3D-POPE, to systematically evaluate hallucination in 3D-LLMs, enabling fair comparisons among future models. Our experiments highlight a scaling effect between dataset size and 3D-LLM performance, emphasizing the critical role of large-scale 3D-text datasets in advancing embodied AI research. Notably, our results show early signals of effective sim-to-real transfer, indicating that models trained on large synthetic data can perform well on real-world 3D scans. Through 3D-GRAND and 3D-POPE, we aim to equip the embodied AI community with essential resources and insights, setting the stage for more reliable and better-grounded 3D-LLMs. Project website: https://3d-grand.github.io
