ChatPaper.ai


Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations

October 27, 2025
Authors: Yujia Zhang, Xiaoyang Wu, Yixing Lao, Chengyao Wang, Zhuotao Tian, Naiyan Wang, Hengshuang Zhao
cs.AI

Abstract

Humans learn abstract concepts through multisensory synergy, and once formed, such representations can often be recalled from a single modality. Inspired by this principle, we introduce Concerto, a minimalist simulation of human concept learning for spatial cognition, combining 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding. Despite its simplicity, Concerto learns more coherent and informative spatial features, as demonstrated by zero-shot visualizations. It outperforms standalone SOTA 2D and 3D self-supervised models by 14.2% and 4.8%, respectively, as well as their feature concatenation, in linear probing for 3D scene perception. With full fine-tuning, Concerto sets new SOTA results across multiple scene understanding benchmarks (e.g., 80.7% mIoU on ScanNet). We further present a variant of Concerto tailored for video-lifted point cloud spatial understanding, and a translator that linearly projects Concerto representations into CLIP's language space, enabling open-world perception. These results highlight that Concerto yields emergent spatial representations with superior fine-grained geometric and semantic consistency.
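The abstract names two coupled objectives: an intra-modal 3D self-distillation term and a 2D-3D cross-modal joint-embedding term. As a rough, framework-agnostic sketch of how such a combined objective is commonly structured (all function names, loss forms, and hyperparameters below are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Normalize feature vectors to unit length before similarity comparisons.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def cross_modal_alignment(feat_3d, feat_2d, temperature=0.07):
    """InfoNCE-style stand-in for the 2D-3D joint-embedding term:
    each 3D point feature should match its paired (projected) 2D
    pixel feature, with other pairs in the batch as negatives."""
    z3 = l2_normalize(feat_3d)
    z2 = l2_normalize(feat_2d)
    logits = z3 @ z2.T / temperature               # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_probs)))     # matched pairs on the diagonal

def self_distillation(student_3d, teacher_3d):
    """Stand-in for intra-modal 3D self-distillation: the student is
    pulled toward a (detached, typically EMA-updated) teacher; cosine
    distance is used here purely for illustration."""
    s = l2_normalize(student_3d)
    t = l2_normalize(teacher_3d)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

# Synthetic paired features standing in for real backbone outputs.
rng = np.random.default_rng(0)
N, D = 128, 64
points = rng.normal(size=(N, D))                   # student 3D point features
pixels = points + 0.1 * rng.normal(size=(N, D))    # paired 2D pixel features
teacher = points + 0.05 * rng.normal(size=(N, D))  # EMA-teacher 3D features

loss = cross_modal_alignment(points, pixels) + self_distillation(points, teacher)
print(f"combined loss: {loss:.4f}")
```

The same normalized-feature machinery also suggests how the linear translator into CLIP's language space can operate at inference: project point features with a learned linear map, then rank text embeddings by cosine similarity for open-world labeling.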