3D CoCa v2: Contrastive Learners with Test-Time Search for Generalizable Spatial Intelligence
January 10, 2026
Authors: Hao Tang, Ting Huang, Zeyu Zhang
cs.AI
Abstract
Spatial intelligence refers to the ability to perceive, reason about, and describe objects and their relationships within three-dimensional environments, forming a foundation for embodied perception and scene understanding. 3D captioning aims to describe 3D scenes in natural language; however, it remains challenging due to the sparsity and irregularity of point clouds and, more critically, the weak grounding and limited out-of-distribution (OOD) generalization of existing captioners across drastically different environments, including indoor and outdoor 3D scenes. To address these challenges, we propose 3D CoCa v2, a generalizable 3D captioning framework that unifies contrastive vision-language learning with 3D caption generation and further improves robustness via test-time search (TTS) without updating the captioner's parameters. 3D CoCa v2 builds on a frozen CLIP-based semantic prior, a spatially aware 3D scene encoder for geometry, and a multimodal decoder jointly optimized with contrastive and captioning objectives, avoiding external detectors and handcrafted proposals. At inference, TTS generates diverse caption candidates and performs reward-guided selection using a compact scene summary. Experiments show gains over 3D CoCa of +1.50 CIDEr@0.5IoU on ScanRefer, +1.61 CIDEr@0.5IoU on Nr3D, and +3.8 CIDEr@0.25 in zero-shot OOD evaluation on TOD3Cap. Code will be released at https://github.com/AIGeeksGroup/3DCoCav2.
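To make the inference-time step concrete, the following is a minimal sketch of reward-guided test-time search as the abstract describes it: sample several candidate captions from the frozen captioner and keep the one scoring highest against a compact scene summary. The `captioner.sample` and `reward_model.score` interfaces are hypothetical stand-ins (the abstract does not specify APIs); only the selection logic is what the paper claims.

```python
# Sketch of test-time search (TTS): candidate sampling + reward-guided selection.
# `captioner` and `reward_model` are assumed, hypothetical objects; no captioner
# parameters are updated, matching the paper's training-free inference setting.
import torch

def test_time_search(captioner, reward_model, scene_feats, scene_summary,
                     num_candidates: int = 8, temperature: float = 0.9) -> str:
    """Generate diverse caption candidates and return the highest-reward one."""
    with torch.no_grad():
        # Stochastic decoding of the frozen captioner yields diverse candidates.
        candidates = [
            captioner.sample(scene_feats, temperature=temperature)
            for _ in range(num_candidates)
        ]
        # Score each candidate against the compact scene summary
        # (e.g., a text-text similarity under the frozen CLIP prior).
        rewards = torch.tensor([
            reward_model.score(caption, scene_summary) for caption in candidates
        ])
    return candidates[int(rewards.argmax())]
```

Because selection happens purely at inference, this step composes with any trained captioner; the number of candidates and the sampling temperature trade off diversity against compute.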