SpatialLM: Training Large Language Models for Structured Indoor Modeling
June 9, 2025
Authors: Yongsen Mao, Junhao Zhong, Chuan Fang, Jia Zheng, Rui Tang, Hao Zhu, Ping Tan, Zihan Zhou
cs.AI
Abstract
SpatialLM is a large language model designed to process 3D point cloud data
and generate structured 3D scene understanding outputs. These outputs include
architectural elements such as walls, doors, and windows, as well as oriented
object boxes with their semantic categories. Unlike previous methods that rely
on task-specific network designs, our model adheres to the standard multimodal
LLM architecture and is fine-tuned directly from open-source LLMs.
To train SpatialLM, we collect a large-scale, high-quality synthetic dataset
consisting of the point clouds of 12,328 indoor scenes (54,778 rooms) with
ground-truth 3D annotations, and conduct a careful study on various modeling
and training decisions. On public benchmarks, our model achieves state-of-the-art
performance in layout estimation and competitive results in 3D object
detection. With that, we show a feasible path for enhancing the spatial
understanding capabilities of modern LLMs for applications in augmented
reality, embodied robotics, and more.
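
To make the kind of structured output described above more concrete, the following is a minimal sketch of how a scene with walls, openings, and oriented object boxes might be represented and serialized as text that a language model could emit. The abstract does not specify the actual output format, so every class name, field name, and the text layout here (Wall, Opening, OrientedBox, Scene, to_text, footprint_corners) is a hypothetical choice made for illustration, not the authors' schema.

```python
from dataclasses import dataclass, field
import math


# All names and fields below are illustrative assumptions, not the paper's schema.
@dataclass
class Wall:
    ax: float      # first endpoint (x, y) on the floor plane
    ay: float
    bx: float      # second endpoint (x, y) on the floor plane
    by: float
    height: float


@dataclass
class Opening:
    kind: str        # "door" or "window"
    wall_index: int  # index of the wall the opening is attached to
    cx: float        # center position
    cy: float
    cz: float
    width: float
    height: float


@dataclass
class OrientedBox:
    category: str  # semantic class, e.g. "sofa"
    cx: float      # box center
    cy: float
    cz: float
    yaw: float     # rotation about the vertical axis, in radians
    sx: float      # full extents along the box's local axes
    sy: float
    sz: float


@dataclass
class Scene:
    walls: list = field(default_factory=list)
    openings: list = field(default_factory=list)
    boxes: list = field(default_factory=list)


def footprint_corners(box):
    """Return the four floor-plane corners of an oriented box."""
    c, s = math.cos(box.yaw), math.sin(box.yaw)
    hx, hy = box.sx / 2, box.sy / 2
    local = [(-hx, -hy), (hx, -hy), (hx, hy), (-hx, hy)]
    # Rotate each local offset by yaw, then translate to the box center.
    return [(box.cx + dx * c - dy * s, box.cy + dx * s + dy * c)
            for dx, dy in local]


def to_text(scene):
    """Serialize a scene as one text line per element."""
    lines = []
    for i, w in enumerate(scene.walls):
        lines.append(f"wall_{i}=Wall({w.ax:g},{w.ay:g},{w.bx:g},{w.by:g},{w.height:g})")
    for i, o in enumerate(scene.openings):
        lines.append(f"{o.kind}_{i}=Opening(wall_{o.wall_index},"
                     f"{o.cx:g},{o.cy:g},{o.cz:g},{o.width:g},{o.height:g})")
    for i, b in enumerate(scene.boxes):
        lines.append(f"bbox_{i}=Bbox({b.category},{b.cx:g},{b.cy:g},{b.cz:g},"
                     f"{b.yaw:g},{b.sx:g},{b.sy:g},{b.sz:g})")
    return "\n".join(lines)


if __name__ == "__main__":
    scene = Scene(
        walls=[Wall(0, 0, 4, 0, 2.8)],
        openings=[Opening("door", 0, 1, 0, 1, 0.9, 2)],
        boxes=[OrientedBox("sofa", 2, 1.5, 0.4, math.pi / 2, 2, 1, 0.8)],
    )
    print(to_text(scene))
```

A flat, line-per-element text form like this is one plausible way to let a standard LLM decoder produce scene structure without a task-specific output head; a parser on the receiving side would map each line back to a typed object.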