SpatialLM: Training Large Language Models for Structured Indoor Modeling

Abstract

SpatialLM is a large language model designed to process 3D point cloud data and generate structured 3D scene understanding outputs. These outputs include architectural elements such as walls, doors, and windows, as well as oriented object boxes with their semantic categories. Unlike previous methods that rely on task-specific network designs, our model adheres to the standard multimodal LLM architecture and is fine-tuned directly from open-source LLMs. To train SpatialLM, we collect a large-scale, high-quality synthetic dataset consisting of the point clouds of 12,328 indoor scenes (54,778 rooms) with ground-truth 3D annotations, and conduct a careful study of various modeling and training decisions. On public benchmarks, our model achieves state-of-the-art performance in layout estimation and competitive results in 3D object detection. These results show a feasible path for enhancing the spatial understanding capabilities of modern LLMs for applications in augmented reality, embodied robotics, and more.
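To make the notion of a structured scene output concrete, the sketch below shows one possible in-memory representation of the element types the abstract names: walls, doors and windows, and oriented object boxes with semantic categories. The class names, fields, and parameterization (wall endpoints, a single yaw angle per box) are illustrative assumptions and are not taken from the paper or the model's actual output format.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Wall:
    # Hypothetical parameterization: two floor-level endpoints plus a height.
    ax: float
    ay: float
    bx: float
    by: float
    height: float

    def length(self) -> float:
        """Horizontal length of the wall segment."""
        return math.hypot(self.bx - self.ax, self.by - self.ay)


@dataclass
class Opening:
    # A door or window attached to a wall by index (an assumption of this sketch).
    kind: str          # "door" or "window"
    wall_index: int    # index into Scene.walls
    width: float
    height: float


@dataclass
class OrientedBox:
    # Object box with a semantic category and a rotation about the vertical axis.
    category: str
    cx: float
    cy: float
    cz: float
    yaw: float         # radians, counter-clockwise about the z axis
    sx: float          # full extent along the box's local x axis
    sy: float          # full extent along the box's local y axis
    sz: float          # full extent along z

    def footprint(self) -> List[Tuple[float, float]]:
        """Four corners of the box projected onto the floor plane."""
        c, s = math.cos(self.yaw), math.sin(self.yaw)
        hx, hy = self.sx / 2.0, self.sy / 2.0
        local = [(hx, hy), (-hx, hy), (-hx, -hy), (hx, -hy)]
        # Rotate each local corner by yaw, then translate to the box center.
        return [(self.cx + x * c - y * s, self.cy + x * s + y * c) for x, y in local]


@dataclass
class Scene:
    walls: List[Wall] = field(default_factory=list)
    openings: List[Opening] = field(default_factory=list)
    boxes: List[OrientedBox] = field(default_factory=list)


scene = Scene(
    walls=[Wall(0.0, 0.0, 3.0, 4.0, 2.7)],
    openings=[Opening("door", 0, 0.9, 2.0)],
    boxes=[OrientedBox("sofa", 0.0, 0.0, 0.4, math.pi / 2, 2.0, 1.0, 0.8)],
)
```

A representation along these lines keeps layout elements and objects in one scene container, so it could in principle be serialized to text for a language model to emit; the abstract does not say how the model encodes its outputs.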

