DINeMo: Learning Neural Mesh Models with no 3D Annotations

March 26, 2025
Authors: Weijie Guo, Guofeng Zhang, Wufei Ma, Alan Yuille
cs.AI

Abstract

Category-level 3D/6D pose estimation is a crucial step towards comprehensive 3D scene understanding, which would enable a broad range of applications in robotics and embodied AI. Recent works have explored neural mesh models that approach a range of 2D and 3D tasks from an analysis-by-synthesis perspective. Despite their largely enhanced robustness to partial occlusion and domain shifts, these methods depend heavily on 3D annotations for part-contrastive learning, which confines them to a narrow set of categories and hinders efficient scaling. In this work, we present DINeMo, a novel neural mesh model that is trained with no 3D annotations by leveraging pseudo-correspondences obtained from large visual foundation models. We adopt a bidirectional pseudo-correspondence generation method, which produces pseudo-correspondences by utilizing both local appearance features and global context information. Experimental results on car datasets demonstrate that our DINeMo outperforms previous zero- and few-shot 3D pose estimation methods by a wide margin, narrowing the gap with fully-supervised methods by 67.3%. Our DINeMo also scales effectively and efficiently when incorporating more unlabeled images during training, demonstrating its advantages over supervised learning methods that rely on 3D annotations. Our project page is available at https://analysis-by-synthesis.github.io/DINeMo/.
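The bidirectional pseudo-correspondence step can be illustrated with a minimal sketch. The snippet below assumes that patch features for the input image and per-vertex features for the mesh template have already been extracted with a frozen visual foundation model (e.g. a DINOv2-style backbone); the mutual nearest-neighbor check and the similarity threshold are illustrative assumptions about how a "bidirectional" consistency filter might be realized, not the authors' exact procedure.

```python
# Minimal sketch of bidirectional pseudo-correspondence generation.
# Assumes features for image patches and mesh vertices come from a frozen
# visual foundation model; the mutual nearest-neighbor criterion below is
# an illustrative assumption, not the paper's exact implementation.
import torch
import torch.nn.functional as F


def pseudo_correspondences(img_feats: torch.Tensor,
                           mesh_feats: torch.Tensor,
                           sim_threshold: float = 0.5):
    """Match image patches to mesh vertices in both directions.

    img_feats:  (N, D) features for N image patches.
    mesh_feats: (M, D) features for M mesh vertices.
    Returns (patch_index, vertex_index) pairs that are mutual nearest
    neighbors and exceed a cosine-similarity cutoff.
    """
    img = F.normalize(img_feats, dim=-1)
    mesh = F.normalize(mesh_feats, dim=-1)
    sim = img @ mesh.t()                      # (N, M) cosine similarities

    img_to_mesh = sim.argmax(dim=1)           # best vertex for each patch
    mesh_to_img = sim.argmax(dim=0)           # best patch for each vertex

    pairs = []
    for i, j in enumerate(img_to_mesh.tolist()):
        # Keep a match only if it is mutual and sufficiently confident.
        if mesh_to_img[j].item() == i and sim[i, j].item() >= sim_threshold:
            pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    # Toy example with random features; real features would come from the
    # foundation model backbone.
    torch.manual_seed(0)
    patches = torch.randn(196, 384)   # e.g. a 14x14 patch grid
    vertices = torch.randn(500, 384)  # mesh vertex features
    print(len(pseudo_correspondences(patches, vertices)))
```

The sketch only covers local appearance matching and the bidirectional consistency filter; in the paper, global context information from the foundation model is also combined with these local features when generating pseudo-correspondences.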
