ChatPaper.ai


Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations

October 27, 2025
Authors: Yujia Zhang, Xiaoyang Wu, Yixing Lao, Chengyao Wang, Zhuotao Tian, Naiyan Wang, Hengshuang Zhao
cs.AI

Abstract

Humans learn abstract concepts through multisensory synergy, and once formed, such representations can often be recalled from a single modality. Inspired by this principle, we introduce Concerto, a minimalist simulation of human concept learning for spatial cognition, combining 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding. Despite its simplicity, Concerto learns more coherent and informative spatial features, as demonstrated by zero-shot visualizations. It outperforms standalone SOTA 2D and 3D self-supervised models by 14.2% and 4.8%, respectively, as well as their feature concatenation, in linear probing for 3D scene perception. With full fine-tuning, Concerto sets new SOTA results across multiple scene understanding benchmarks (e.g., 80.7% mIoU on ScanNet). We further present a variant of Concerto tailored for video-lifted point cloud spatial understanding, and a translator that linearly projects Concerto representations into CLIP's language space, enabling open-world perception. These results highlight that Concerto yields emergent spatial representations with superior fine-grained geometric and semantic consistency.
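The abstract names two coupled objectives: an intra-modal 3D self-distillation term and a 2D-3D cross-modal joint-embedding term. As a rough, framework-agnostic sketch of how such a combined objective is commonly structured (all function names, loss forms, and hyperparameters below are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Normalize feature vectors to unit length before similarity comparisons.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def cross_modal_alignment(feat_3d, feat_2d, temperature=0.07):
    """InfoNCE-style stand-in for the 2D-3D joint-embedding term:
    each 3D point feature should match its paired (projected) 2D
    pixel feature, with other pairs in the batch as negatives."""
    z3 = l2_normalize(feat_3d)
    z2 = l2_normalize(feat_2d)
    logits = z3 @ z2.T / temperature               # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_probs)))     # matched pairs on the diagonal

def self_distillation(student_3d, teacher_3d):
    """Stand-in for intra-modal 3D self-distillation: the student is
    pulled toward a (detached, typically EMA-updated) teacher; cosine
    distance is used here purely for illustration."""
    s = l2_normalize(student_3d)
    t = l2_normalize(teacher_3d)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

# Synthetic paired features standing in for real backbone outputs.
rng = np.random.default_rng(0)
N, D = 128, 64
points = rng.normal(size=(N, D))                   # student 3D point features
pixels = points + 0.1 * rng.normal(size=(N, D))    # paired 2D pixel features
teacher = points + 0.05 * rng.normal(size=(N, D))  # EMA-teacher 3D features

loss = cross_modal_alignment(points, pixels) + self_distillation(points, teacher)
print(f"combined loss: {loss:.4f}")
```

The same normalized-feature machinery also suggests how the linear translator into CLIP's language space can operate at inference: project point features with a learned linear map, then rank text embeddings by cosine similarity for open-world labeling.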