SpatialLM: Training Large Language Models for Structured Indoor Modeling
June 9, 2025
Authors: Yongsen Mao, Junhao Zhong, Chuan Fang, Jia Zheng, Rui Tang, Hao Zhu, Ping Tan, Zihan Zhou
cs.AI
Abstract
SpatialLM is a large language model designed to process 3D point cloud data
and generate structured 3D scene understanding outputs. These outputs include
architectural elements such as walls, doors, and windows, as well as oriented
object boxes with their semantic categories. Unlike previous methods that exploit
task-specific network designs, our model adheres to the standard multimodal LLM
architecture and is fine-tuned directly from open-source LLMs.
To train SpatialLM, we collect a large-scale, high-quality synthetic dataset
consisting of the point clouds of 12,328 indoor scenes (54,778 rooms) with
ground-truth 3D annotations, and conduct a careful study on various modeling
and training decisions. On public benchmarks, our model gives state-of-the-art
performance in layout estimation and competitive results in 3D object
detection. With that, we show a feasible path for enhancing the spatial
understanding capabilities of modern LLMs for applications in augmented
reality, embodied robotics, and more.
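
To make the kind of structured output described in the abstract concrete, below is a minimal sketch of what a scene-level representation with walls, door/window openings, and oriented object boxes could look like. The class names, fields, and units are illustrative assumptions, not the actual SpatialLM output format.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical schema for the structured outputs described above:
# architectural elements (walls, doors, windows) plus oriented object
# boxes with semantic categories. Field names and units are illustrative.

@dataclass
class Wall:
    start_xy: Tuple[float, float]          # wall endpoints on the floor plane (meters)
    end_xy: Tuple[float, float]
    height: float                          # wall height (meters)
    thickness: float                       # wall thickness (meters)

@dataclass
class Opening:
    wall_index: int                        # index of the wall this opening belongs to
    center_xyz: Tuple[float, float, float] # opening center in scene coordinates
    width: float
    height: float
    kind: str                              # "door" or "window"

@dataclass
class OrientedBox:
    category: str                          # semantic class, e.g. "sofa", "table"
    center_xyz: Tuple[float, float, float]
    size_xyz: Tuple[float, float, float]   # box extents (meters)
    yaw: float                             # rotation about the vertical axis (radians)

@dataclass
class SceneLayout:
    walls: List[Wall]
    openings: List[Opening]
    objects: List[OrientedBox]

# Example: a single-room layout with one wall, one door, and one sofa.
layout = SceneLayout(
    walls=[Wall((0.0, 0.0), (4.0, 0.0), height=2.8, thickness=0.1)],
    openings=[Opening(0, (1.0, 0.0, 1.0), width=0.9, height=2.0, kind="door")],
    objects=[OrientedBox("sofa", (2.0, 1.5, 0.4), (2.0, 0.9, 0.8), yaw=0.0)],
)
print(len(layout.walls), len(layout.openings), len(layout.objects))
```

In a multimodal LLM setting, a representation like this would typically be serialized to text tokens so the fine-tuned model can emit it directly from encoded point-cloud features; the exact serialization used by SpatialLM is not specified in this abstract.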