SpatialLM: Training Large Language Models for Structured Indoor Modeling

Abstract

SpatialLM is a large language model designed to process 3D point cloud data and generate structured 3D scene understanding outputs. These outputs include architectural elements such as walls, doors, and windows, as well as oriented object boxes with their semantic categories. Unlike previous methods that rely on task-specific network designs, our model adheres to the standard multimodal LLM architecture and is fine-tuned directly from open-source LLMs. To train SpatialLM, we collect a large-scale, high-quality synthetic dataset consisting of the point clouds of 12,328 indoor scenes (54,778 rooms) with ground-truth 3D annotations, and conduct a careful study of various modeling and training decisions. On public benchmarks, our model achieves state-of-the-art performance in layout estimation and competitive results in 3D object detection. These results show a feasible path for enhancing the spatial understanding capabilities of modern LLMs for applications in augmented reality, embodied robotics, and more.
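To make the notion of "structured 3D scene understanding outputs" more concrete, the following is a minimal Python sketch of how such a scene might be represented once parsed. It is an illustration only: the class names (`Wall`, `Opening`, `OrientedBox`, `Scene`), their fields, and the `footprint` helper are hypothetical and are not taken from the paper or its code release; the abstract only states that outputs include walls, doors, windows, and oriented object boxes with semantic categories.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Wall:
    # Hypothetical parameterization: 2D endpoints on the floor plane plus a height.
    start: tuple[float, float]
    end: tuple[float, float]
    height: float

    def length(self) -> float:
        return math.dist(self.start, self.end)


@dataclass
class Opening:
    # A door or a window attached to a wall (fields are assumptions).
    kind: str  # "door" or "window"
    wall_index: int
    center: tuple[float, float, float]
    width: float
    height: float


@dataclass
class OrientedBox:
    # An object box with a semantic category and a rotation about the vertical axis.
    category: str
    center: tuple[float, float, float]
    size: tuple[float, float, float]  # extents along the box's local x, y, z axes
    yaw: float  # radians, counter-clockwise about z

    def footprint(self) -> list[tuple[float, float]]:
        """Return the four floor-plane corners of the rotated box."""
        cx, cy, _ = self.center
        hx, hy = self.size[0] / 2, self.size[1] / 2
        c, s = math.cos(self.yaw), math.sin(self.yaw)
        local = ((-hx, -hy), (hx, -hy), (hx, hy), (-hx, hy))
        return [(cx + dx * c - dy * s, cy + dx * s + dy * c) for dx, dy in local]


@dataclass
class Scene:
    walls: list[Wall] = field(default_factory=list)
    openings: list[Opening] = field(default_factory=list)
    boxes: list[OrientedBox] = field(default_factory=list)
```

A scene with one wall and one sofa could then be built as `Scene(walls=[Wall((0.0, 0.0), (3.0, 4.0), 2.8)], boxes=[OrientedBox("sofa", (1.0, 2.0, 0.4), (2.0, 1.0, 0.8), 0.0)])`. The actual output format used by the model may differ from this sketch.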
