Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations

Abstract

Humans learn abstract concepts through multisensory synergy, and once formed, such representations can often be recalled from a single modality. Inspired by this principle, we introduce Concerto, a minimalist simulation of human concept learning for spatial cognition that combines 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding. Despite its simplicity, Concerto learns more coherent and informative spatial features, as demonstrated by zero-shot visualizations. In linear probing for 3D scene perception, it outperforms standalone SOTA 2D and 3D self-supervised models by 14.2% and 4.8%, respectively, and also outperforms their feature concatenation. With full fine-tuning, Concerto sets new SOTA results across multiple scene understanding benchmarks (e.g., 80.7% mIoU on ScanNet). We further present a variant of Concerto tailored for video-lifted point cloud spatial understanding, and a translator that linearly projects Concerto representations into CLIP's language space, enabling open-world perception. These results highlight that Concerto yields spatial representations with superior fine-grained geometric and semantic consistency.
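
The idea of a translator that linearly projects point features into a language embedding space can be illustrated with a small sketch. This is not the authors' implementation: the function names (`matvec`, `l2_normalize`, `zero_shot_labels`), the projection matrix `W`, and all numeric values below are hypothetical and chosen only for illustration. The sketch assumes a linear map has already been fitted, and shows how projected features could be matched to text embeddings by cosine similarity.

```python
import math

def matvec(W, x):
    """Apply a linear map W (one row per output dimension) to a feature vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def l2_normalize(v):
    """Scale v to unit length; return a copy unchanged if it is the zero vector."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v] if norm > 0 else list(v)

def zero_shot_labels(point_feats, W, text_embeds):
    """Label each point with the name of its most cosine-similar text embedding."""
    names = list(text_embeds)
    texts = [l2_normalize(text_embeds[n]) for n in names]
    labels = []
    for feat in point_feats:
        z = l2_normalize(matvec(W, feat))  # project into the text space
        sims = [sum(a * b for a, b in zip(z, t)) for t in texts]
        labels.append(names[sims.index(max(sims))])
    return labels

# Toy, made-up values: 3-D point features and a 2-D "language" space.
W = [[1.0, 0.0, 0.5],
     [0.0, 1.0, 0.5]]
text_embeds = {"chair": [1.0, 0.1], "table": [0.1, 1.0]}
point_feats = [[2.0, 0.1, 0.0], [0.0, 3.0, 0.2]]

print(zero_shot_labels(point_feats, W, text_embeds))  # ['chair', 'table']
```

In a real setting the text embeddings would come from a pretrained text encoder and `W` would be learned from paired data; here both are hand-set so the example runs without any external dependency.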
