Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations
October 27, 2025
Authors: Yujia Zhang, Xiaoyang Wu, Yixing Lao, Chengyao Wang, Zhuotao Tian, Naiyan Wang, Hengshuang Zhao
cs.AI
Abstract
Humans learn abstract concepts through multisensory synergy, and once formed, such representations can often be recalled from a single modality. Inspired by this principle, we introduce Concerto, a minimalist simulation of human concept learning for spatial cognition that combines 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding. Despite its simplicity, Concerto learns more coherent and informative spatial features, as demonstrated by zero-shot visualizations. In linear probing for 3D scene perception, it outperforms standalone SOTA 2D and 3D self-supervised models by 14.2% and 4.8%, respectively, and also surpasses their feature concatenation. With full fine-tuning, Concerto sets new SOTA results across multiple scene understanding benchmarks (e.g., 80.7% mIoU on ScanNet). We further present a variant of Concerto tailored for video-lifted point cloud spatial understanding, and a translator that linearly projects Concerto representations into CLIP's language space, enabling open-world perception. These results highlight that the spatial representations emerging from Concerto exhibit superior fine-grained geometric and semantic consistency.
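The abstract names two training signals but gives no equations. Purely as a rough illustration, the PyTorch sketch below shows one way 3D intra-modal self-distillation (assumed here to be DINO-style, with an EMA teacher and stop-gradient) could be combined with a 2D-3D cross-modal joint embedding loss; the function names, the cosine-alignment form of the cross-modal term, and the precomputed point-to-pixel correspondences are all assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def concerto_style_losses(student_3d, teacher_3d, pixel_feats,
                          point_pixel_idx, proj_head, tau=0.07):
    """Hypothetical sketch of the two losses named in the abstract.

    student_3d / teacher_3d: (N, D) point features from a student and an
    EMA teacher 3D backbone. pixel_feats: (M, D2) image features;
    point_pixel_idx: (N,) index of the pixel each point projects to.
    """
    # --- 3D intra-modal self-distillation (DINO-style assumption) ---
    # The student's soft assignments are pulled toward the teacher's,
    # with gradients blocked through the teacher.
    targets = F.softmax(teacher_3d.detach() / tau, dim=-1)
    log_preds = F.log_softmax(student_3d / tau, dim=-1)
    loss_distill = -(targets * log_preds).sum(dim=-1).mean()

    # --- 2D-3D cross-modal joint embedding ---
    # Project point features into the image feature space and align each
    # point (cosine similarity) with the pixel it corresponds to.
    z3d = F.normalize(proj_head(student_3d), dim=-1)          # (N, D2)
    z2d = F.normalize(pixel_feats[point_pixel_idx], dim=-1)   # (N, D2)
    loss_cross = (1.0 - (z3d * z2d).sum(dim=-1)).mean()

    return loss_distill + loss_cross

# Example shapes: 4096 points, 512-dim 3D features, 768-dim image features.
proj = nn.Linear(512, 768)
loss = concerto_style_losses(
    torch.randn(4096, 512), torch.randn(4096, 512),
    torch.randn(65536, 768), torch.randint(0, 65536, (4096,)), proj)
```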
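The 14.2% and 4.8% comparisons use linear probing, i.e. training only a linear classifier on top of frozen features so the score reflects representation quality rather than fine-tuning capacity. A minimal sketch of that standard protocol, with hypothetical names and ScanNet's 20 semantic classes as the example label set:

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Single linear layer trained on frozen per-point features."""
    def __init__(self, feat_dim=512, num_classes=20):  # e.g. ScanNet
        super().__init__()
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, point_feats):
        return self.head(point_feats)

def probe_step(backbone, probe, optimizer, points, labels):
    with torch.no_grad():          # backbone stays frozen throughout
        feats = backbone(points)
    loss = nn.functional.cross_entropy(probe(feats), labels)
    optimizer.zero_grad()
    loss.backward()                # gradients flow only into the probe
    optimizer.step()
    return loss.item()
```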
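For open-world perception, the abstract specifies only that the translator is a linear projection into CLIP's language space. The sketch below is a hypothetical rendering of such a translator plus the resulting open-vocabulary labeling step; the dimensions, class names, and prompt handling are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLIPTranslator(nn.Module):
    """Hypothetical linear map from Concerto's feature space into CLIP's
    language embedding space; dimensions are illustrative."""
    def __init__(self, concerto_dim=512, clip_dim=768):
        super().__init__()
        self.proj = nn.Linear(concerto_dim, clip_dim)

    def forward(self, point_feats):
        return F.normalize(self.proj(point_feats), dim=-1)

def open_vocab_labels(point_feats, text_embeds, translator):
    """Assign each point the class whose CLIP text embedding is closest.

    text_embeds: (C, clip_dim) L2-normalized CLIP embeddings of prompts
    such as "a photo of a chair".
    """
    z = translator(point_feats)    # (N, clip_dim), L2-normalized
    sim = z @ text_embeds.t()      # (N, C) cosine similarities
    return sim.argmax(dim=-1)      # per-point class index
```

Because the translator is linear and the text side is frozen CLIP, new categories can be queried at inference time simply by embedding new prompts, which is what makes the perception open-vocabulary.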