3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination

Abstract

The integration of language and 3D perception is crucial for developing embodied agents and robots that comprehend and interact with the physical world. While large language models (LLMs) have demonstrated impressive language understanding and generation capabilities, their adaptation to 3D environments (3D-LLMs) remains in its early stages. A primary challenge is the absence of large-scale datasets that provide dense grounding between language and 3D scenes. In this paper, we introduce 3D-GRAND, a pioneering large-scale dataset comprising 40,087 household scenes paired with 6.2 million densely grounded scene-language instructions. Our results show that instruction tuning with 3D-GRAND significantly enhances grounding capabilities and reduces hallucinations in 3D-LLMs. As part of our contributions, we propose 3D-POPE, a comprehensive benchmark for systematically evaluating hallucination in 3D-LLMs, enabling fair comparisons among future models. Our experiments highlight a scaling effect between dataset size and 3D-LLM performance, underscoring the critical role of large-scale 3D-text datasets in advancing embodied AI research. Notably, our results show early signals of effective sim-to-real transfer, indicating that models trained on large-scale synthetic data can perform well on real-world 3D scans. Through 3D-GRAND and 3D-POPE, we aim to equip the embodied AI community with essential resources and insights, setting the stage for more reliable and better-grounded 3D-LLMs. Project website: https://3d-grand.github.io
