

3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination

June 7, 2024
Authors: Jianing Yang, Xuweiyi Chen, Nikhil Madaan, Madhavan Iyengar, Shengyi Qian, David F. Fouhey, Joyce Chai
cs.AI

Abstract

The integration of language and 3D perception is crucial for developing embodied agents and robots that comprehend and interact with the physical world. While large language models (LLMs) have demonstrated impressive language understanding and generation capabilities, their adaptation to 3D environments (3D-LLMs) remains in its early stages. A primary challenge is the absence of large-scale datasets that provide dense grounding between language and 3D scenes. In this paper, we introduce 3D-GRAND, a pioneering large-scale dataset comprising 40,087 household scenes paired with 6.2 million densely-grounded scene-language instructions. Our results show that instruction tuning with 3D-GRAND significantly enhances grounding capabilities and reduces hallucinations in 3D-LLMs. As part of our contributions, we propose 3D-POPE, a comprehensive benchmark for systematically evaluating hallucination in 3D-LLMs, enabling fair comparisons among future models. Our experiments highlight a scaling effect between dataset size and 3D-LLM performance, emphasizing the critical role of large-scale 3D-text datasets in advancing embodied AI research. Notably, our results demonstrate early signals for effective sim-to-real transfer, indicating that models trained on large-scale synthetic data can perform well on real-world 3D scans. Through 3D-GRAND and 3D-POPE, we aim to equip the embodied AI community with essential resources and insights, setting the stage for more reliable and better-grounded 3D-LLMs. Project website: https://3d-grand.github.io
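
The abstract does not spell out the 3D-POPE protocol, but its name follows the original POPE (Polling-based Object Probing Evaluation) benchmark for 2D vision-language models, which measures hallucination by polling a model with balanced yes/no questions about object existence. The sketch below is a minimal, hypothetical illustration of how such a polling evaluation can be scored; `ExistenceProbe` and `evaluate_pope_style` are invented names for illustration, not the paper's actual API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ExistenceProbe:
    """One yes/no question about whether an object type appears in a 3D scene."""
    scene_id: str
    question: str  # e.g. "Is there a sofa in this room?"
    label: bool    # ground truth: True if the object is actually present


def evaluate_pope_style(
    answers: Callable[[str, str], bool],  # (scene_id, question) -> model said "yes"?
    probes: List[ExistenceProbe],
) -> dict:
    """Score a model on POPE-style existence polling.

    A "yes" on an absent object counts as a hallucination (false positive).
    """
    tp = fp = tn = fn = 0
    for probe in probes:
        said_yes = answers(probe.scene_id, probe.question)
        if said_yes and probe.label:
            tp += 1
        elif said_yes and not probe.label:
            fp += 1  # hallucinated an object that is not in the scene
        elif not said_yes and probe.label:
            fn += 1  # missed an object that is in the scene
        else:
            tn += 1
    total = len(probes)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall) if precision + recall else 0.0,
        "yes_ratio": (tp + fp) / total if total else 0.0,  # bias toward answering "yes"
    }


if __name__ == "__main__":
    # Trivial "always yes" baseline: perfect recall, but yes_ratio = 1.0
    # exposes it as maximally hallucination-prone on negative probes.
    probes = [
        ExistenceProbe("scene0000", "Is there a sofa in this room?", True),
        ExistenceProbe("scene0000", "Is there a bathtub in this room?", False),
    ]
    print(evaluate_pope_style(lambda scene, question: True, probes))
```

In this style of evaluation, the yes-ratio is worth tracking alongside F1: a hallucination-prone model tends to over-answer "yes", so a ratio far above 0.5 on a balanced probe set flags hallucination even when accuracy looks acceptable.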
