

3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination

June 7, 2024
Authors: Jianing Yang, Xuweiyi Chen, Nikhil Madaan, Madhavan Iyengar, Shengyi Qian, David F. Fouhey, Joyce Chai
cs.AI

Abstract

The integration of language and 3D perception is crucial for developing embodied agents and robots that comprehend and interact with the physical world. While large language models (LLMs) have demonstrated impressive language understanding and generation capabilities, their adaptation to 3D environments (3D-LLMs) remains in its early stages. A primary challenge is the absence of large-scale datasets that provide dense grounding between language and 3D scenes. In this paper, we introduce 3D-GRAND, a pioneering large-scale dataset comprising 40,087 household scenes paired with 6.2 million densely grounded scene-language instructions. Our results show that instruction tuning with 3D-GRAND significantly enhances grounding capabilities and reduces hallucinations in 3D-LLMs. As part of our contributions, we propose a comprehensive benchmark, 3D-POPE, to systematically evaluate hallucination in 3D-LLMs, enabling fair comparisons among future models. Our experiments highlight a scaling effect between dataset size and 3D-LLM performance, emphasizing the critical role of large-scale 3D-text datasets in advancing embodied AI research. Notably, our results demonstrate early signals of effective sim-to-real transfer, indicating that models trained on large synthetic data can perform well on real-world 3D scans. Through 3D-GRAND and 3D-POPE, we aim to equip the embodied AI community with essential resources and insights, setting the stage for more reliable and better-grounded 3D-LLMs. Project website: https://3d-grand.github.io
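
The abstract does not spell out the 3D-POPE protocol, but the name suggests a 3D analogue of POPE (Polling-based Object Probing Evaluation) from the 2D vision-language literature, in which a model is asked balanced yes/no questions about whether objects exist in a scene and hallucination is measured from its answers. The sketch below illustrates that style of evaluation; the function names, probe fields, and question template are illustrative assumptions, not the paper's actual API.

```python
# A minimal sketch of a POPE-style hallucination probe. It assumes
# 3D-POPE follows the polling protocol of the original 2D POPE:
# ask balanced yes/no questions about object existence in a scene,
# then report accuracy, precision, recall, F1, and the yes-rate.
# All names below (evaluate_pope, ask_model, probe fields) are
# illustrative placeholders, not the benchmark's real interface.

from typing import Callable


def evaluate_pope(
    probes: list[dict],                    # {"scene_id", "object", "label": "yes"|"no"}
    ask_model: Callable[[str, str], str],  # (scene_id, question) -> model's answer text
) -> dict:
    tp = fp = tn = fn = 0
    for p in probes:
        question = f"Is there a {p['object']} in this room?"
        predicted_yes = ask_model(p["scene_id"], question).strip().lower().startswith("yes")
        if predicted_yes and p["label"] == "yes":
            tp += 1
        elif predicted_yes:
            fp += 1  # model affirms an absent object: a hallucination
        elif p["label"] == "no":
            tn += 1
        else:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / len(probes),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall) if precision + recall else 0.0,
        "yes_rate": (tp + fp) / len(probes),  # bias toward answering "yes"
    }
```

Under this protocol, a "yes" answer about an absent object counts as a hallucinated object (a false positive), and the yes-rate is reported alongside the standard metrics to expose models that are biased toward affirming whatever they are asked.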

