CrossOver: 3D Scene Cross-Modal Alignment

Abstract

Multi-modal 3D object understanding has gained significant attention, yet current approaches often assume complete data availability and rigid alignment across all modalities. We present CrossOver, a novel framework for cross-modal 3D scene understanding via flexible, scene-level modality alignment. Unlike traditional methods that require aligned modality data for every object instance, CrossOver learns a unified, modality-agnostic embedding space for scenes by aligning modalities (RGB images, point clouds, CAD models, floorplans, and text descriptions) under relaxed constraints and without explicit object semantics. Leveraging dimensionality-specific encoders, a multi-stage training pipeline, and emergent cross-modal behaviors, CrossOver supports robust scene retrieval and object localization even when modalities are missing. Evaluations on the ScanNet and 3RScan datasets show superior performance across diverse metrics, highlighting its adaptability to real-world applications in 3D scene understanding.
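To make the idea of retrieval in a shared, modality-agnostic embedding space concrete, the following is a minimal sketch, not the authors' implementation. It assumes each scene has already been mapped to a fixed-length vector by some modality-specific encoder; the scene names, the 4-dimensional vectors, and the functions `cosine_similarity` and `retrieve_scenes` are all hypothetical and chosen only for illustration. The query stands in for a floorplan embedding, and the database stands in for point-cloud embeddings of candidate scenes.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_scenes(query_emb, database, top_k=2):
    """Rank database scenes by similarity to the query and return the top_k (id, score) pairs."""
    scored = [
        (scene_id, cosine_similarity(query_emb, emb))
        for scene_id, emb in database.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical scene embeddings, imagined as outputs of a point-cloud encoder.
scene_embeddings = {
    "scene_a": [1.0, 0.0, 0.0, 0.0],
    "scene_b": [0.0, 1.0, 0.0, 0.0],
    "scene_c": [0.0, 0.0, 1.0, 0.0],
    "scene_d": [0.7, 0.7, 0.0, 0.0],
}

# Hypothetical query, imagined as the floorplan embedding of scene_b.
floorplan_query = [0.1, 0.9, 0.05, 0.0]

results = retrieve_scenes(floorplan_query, scene_embeddings, top_k=2)
for scene_id, score in results:
    print(f"{scene_id}: {score:.3f}")
```

Because the comparison uses only vectors in the common space, the query and the database can come from different modalities, and a scene lacking one modality can still be matched through whichever embeddings it does have. How those embeddings are trained is the substance of the method and is not modeled here.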
