MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse

Abstract

We present MetaSpatial, the first reinforcement learning (RL)-based framework designed to enhance 3D spatial reasoning in vision-language models (VLMs), enabling real-time 3D scene generation without hard-coded optimizations. MetaSpatial addresses two core challenges: (i) the lack of internalized 3D spatial reasoning in VLMs, which limits their ability to generate realistic layouts, and (ii) the inefficiency of traditional supervised fine-tuning (SFT) for layout generation, since perfect ground-truth annotations are unavailable. Our key innovation is a multi-turn RL-based optimization mechanism that integrates physics-aware constraints with evaluations of rendered images, so that generated 3D layouts are coherent, physically plausible, and aesthetically consistent. Methodologically, MetaSpatial introduces an adaptive, iterative reasoning process in which the VLM refines spatial arrangements over multiple turns by analyzing rendered outputs, progressively improving scene coherence. Empirical evaluations demonstrate that MetaSpatial significantly enhances the spatial consistency and formatting stability of models at various scales. After training, object placements are more realistic, better aligned, and more functionally coherent, validating the effectiveness of RL for 3D spatial reasoning in metaverse, AR/VR, digital twin, and game development applications. Our code, data, and training pipeline are publicly available at https://github.com/PzySeere/MetaSpatial.
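To make the notion of a physics-aware constraint concrete, the following is a minimal sketch of one possible check on a generated layout: it counts pairwise collisions between axis-aligned bounding boxes and objects that extend beyond the room. This is an illustration under our own assumptions, not the reward used in MetaSpatial; the `Box` format, the function names, and the normalization are all hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Box:
    """Axis-aligned box given by its center and full size (hypothetical layout format)."""
    name: str
    center: tuple[float, float, float]
    size: tuple[float, float, float]

    def bounds(self):
        # (low, high) extent along each of the three axes.
        return [(c - s / 2, c + s / 2) for c, s in zip(self.center, self.size)]


def overlaps(a: Box, b: Box) -> bool:
    # Two boxes intersect only if their extents overlap on every axis.
    return all(lo_a < hi_b and lo_b < hi_a
               for (lo_a, hi_a), (lo_b, hi_b) in zip(a.bounds(), b.bounds()))


def inside_room(box: Box, room_size) -> bool:
    # Assumption: the room spans [0, room_size[i]] along each axis.
    return all(lo >= 0 and hi <= limit
               for (lo, hi), limit in zip(box.bounds(), room_size))


def physics_penalty(boxes, room_size) -> float:
    """Fraction of violated checks in [0, 1]; 0 means no violations were found."""
    collisions = sum(overlaps(a, b) for a, b in combinations(boxes, 2))
    out_of_bounds = sum(not inside_room(b, room_size) for b in boxes)
    checks = len(boxes) * (len(boxes) - 1) // 2 + len(boxes)
    return (collisions + out_of_bounds) / checks if checks else 0.0


# Example layout: the desk intersects the bed, and the lamp pokes through a wall.
layout = [
    Box("bed", (1.0, 1.0, 0.25), (2.0, 2.0, 0.5)),
    Box("desk", (1.5, 1.5, 0.4), (1.0, 1.0, 0.8)),
    Box("lamp", (3.9, 3.0, 0.5), (0.4, 0.4, 1.0)),
]
penalty = physics_penalty(layout, room_size=(4.0, 4.0, 3.0))
```

In a multi-turn setting, a score of this kind could in principle be combined with a rendered-image evaluation and returned to the model after each refinement turn. How MetaSpatial actually defines and weights its rewards is specified in the released code, not here.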
