CrossOver: 3D Scene Cross-Modal Alignment
February 20, 2025
Authors: Sayan Deb Sarkar, Ondrej Miksik, Marc Pollefeys, Daniel Barath, Iro Armeni
cs.AI
Abstract
Multi-modal 3D object understanding has gained significant attention, yet
current approaches often assume complete data availability and rigid alignment
across all modalities. We present CrossOver, a novel framework for cross-modal
3D scene understanding via flexible, scene-level modality alignment. Unlike
traditional methods that require aligned modality data for every object
instance, CrossOver learns a unified, modality-agnostic embedding space for
scenes by aligning modalities - RGB images, point clouds, CAD models,
floorplans, and text descriptions - with relaxed constraints and without
explicit object semantics. Leveraging dimensionality-specific encoders, a
multi-stage training pipeline, and emergent cross-modal behaviors, CrossOver
supports robust scene retrieval and object localization, even with missing
modalities. Evaluations on ScanNet and 3RScan datasets show its superior
performance across diverse metrics, highlighting adaptability for real-world
applications in 3D scene understanding.
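The core idea described above — modality-specific encoders mapping each input into one shared, modality-agnostic scene embedding space, so that retrieval works even when the query and database use different modalities — can be illustrated with a minimal sketch. This is not CrossOver's actual architecture: the encoders here are stand-ins (random scene anchors plus small modality-specific noise), and the retrieval is plain nearest-neighbor cosine similarity in the shared space.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # shared embedding dimension (illustrative only)

# Hypothetical per-scene anchors in a shared, modality-agnostic space.
# In CrossOver these embeddings would come from trained
# dimensionality-specific encoders; here they are simulated.
scene_anchors = rng.normal(size=(3, DIM))

def embed(scene_id, modality_noise=0.05):
    """Stand-in for a modality encoder: different modalities of the same
    scene land near the same anchor, up to small modality-specific noise."""
    v = scene_anchors[scene_id] + modality_noise * rng.normal(size=DIM)
    return v / np.linalg.norm(v)  # unit-normalize for cosine similarity

# Database indexed in one modality (say, point clouds); the query comes
# from a different modality (say, an RGB image) of scene 1.
db = np.stack([embed(i) for i in range(3)])
query = embed(1)

# Cross-modal scene retrieval: nearest neighbor by cosine similarity.
best = int(np.argmax(db @ query))
print(best)
```

Because all modalities are embedded into the same space, the database modality can be missing for the query side entirely; retrieval only compares vectors, never raw inputs — which is the property the abstract highlights for handling missing modalities.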