CrossOver: 3D Scene Cross-Modal Alignment

Abstract

Multi-modal 3D object understanding has gained significant attention, yet current approaches often assume complete data availability and rigid alignment across all modalities. We present CrossOver, a novel framework for cross-modal 3D scene understanding via flexible, scene-level modality alignment. Unlike traditional methods that require aligned modality data for every object instance, CrossOver learns a unified, modality-agnostic embedding space for scenes by aligning modalities (RGB images, point clouds, CAD models, floorplans, and text descriptions) with relaxed constraints and without explicit object semantics. Leveraging dimensionality-specific encoders, a multi-stage training pipeline, and emergent cross-modal behaviors, CrossOver supports robust scene retrieval and object localization, even when modalities are missing. Evaluations on the ScanNet and 3RScan datasets show superior performance across diverse metrics, highlighting its adaptability for real-world applications in 3D scene understanding.