Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations
October 27, 2025
Authors: Yujia Zhang, Xiaoyang Wu, Yixing Lao, Chengyao Wang, Zhuotao Tian, Naiyan Wang, Hengshuang Zhao
cs.AI
Abstract
Humans learn abstract concepts through multisensory synergy, and once formed,
such representations can often be recalled from a single modality. Inspired by
this principle, we introduce Concerto, a minimalist simulation of human concept
learning for spatial cognition, combining 3D intra-modal self-distillation with
2D-3D cross-modal joint embedding. Despite its simplicity, Concerto learns more
coherent and informative spatial features, as demonstrated by zero-shot
visualizations. It outperforms both standalone SOTA 2D and 3D self-supervised
models by 14.2% and 4.8%, respectively, as well as their feature concatenation,
in linear probing for 3D scene perception. With full fine-tuning, Concerto sets
new SOTA results across multiple scene understanding benchmarks (e.g., 80.7%
mIoU on ScanNet). We further present a variant of Concerto tailored for
video-lifted point cloud spatial understanding, and a translator that linearly
projects Concerto representations into CLIP's language space, enabling
open-world perception. These results highlight that Concerto emerges spatial
representations with superior fine-grained geometric and semantic consistency.
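The learning recipe described above combines two objectives: 3D intra-modal self-distillation and 2D-3D cross-modal joint embedding. A minimal sketch of such a combined objective is below; this is an illustration under assumptions, not the paper's actual implementation — the cosine-based losses, the `lam` weighting, and the function names are all hypothetical, and the real method's architectures and loss formulations are not specified here.

```python
import numpy as np

def l2norm(x, eps=1e-8):
    # Normalize feature vectors to unit length along the last axis.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def self_distill_loss(student, teacher):
    # Intra-modal self-distillation (illustrative): pull the 3D student's
    # point features toward the (stop-gradient) teacher's features,
    # measured as mean 1 - cosine similarity.
    s, t = l2norm(student), l2norm(teacher)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

def cross_modal_loss(feat3d, feat2d):
    # Cross-modal joint embedding (illustrative): align each 3D point
    # feature with the 2D pixel feature lifted onto the same point.
    a, b = l2norm(feat3d), l2norm(feat2d)
    return float(np.mean(1.0 - np.sum(a * b, axis=-1)))

def joint_objective(student3d, teacher3d, pixel2d, lam=1.0):
    # Combined objective: self-distillation plus weighted cross-modal term.
    # lam is a hypothetical balancing weight, not from the paper.
    return self_distill_loss(student3d, teacher3d) + lam * cross_modal_loss(student3d, pixel2d)

# Toy usage: 1024 points with 64-dim features per branch.
rng = np.random.default_rng(0)
student = rng.normal(size=(1024, 64))
teacher = rng.normal(size=(1024, 64))
pixels = rng.normal(size=(1024, 64))
loss = joint_objective(student, teacher, pixels)
```

Both terms vanish when the paired features agree up to scale, so minimizing the sum encourages a representation consistent both within the 3D modality and across the 2D-3D correspondence.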