

Multiview Equivariance Improves 3D Correspondence Understanding with Minimal Feature Finetuning

November 29, 2024
作者: Yang You, Yixin Li, Congyue Deng, Yue Wang, Leonidas Guibas
cs.AI

Abstract

Vision foundation models, particularly the ViT family, have revolutionized image understanding by providing rich semantic features. However, despite their success in 2D comprehension, their ability to grasp 3D spatial relationships remains unclear. In this work, we evaluate and enhance the 3D awareness of ViT-based models. We begin by systematically assessing their ability to learn 3D equivariant features, specifically examining the consistency of semantic embeddings across different viewpoints. Our findings indicate that improved 3D equivariance leads to better performance on various downstream tasks, including pose estimation, tracking, and semantic transfer. Building on this insight, we propose a simple yet effective finetuning strategy based on 3D correspondences, which significantly enhances the 3D correspondence understanding of existing vision models. Remarkably, even finetuning on a single object for just one iteration results in substantial performance gains. All code and resources will be made publicly available to support further advancements in 3D-aware vision models. Our code is available at https://github.com/qq456cvb/3DCorrEnhance.
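The viewpoint-consistency evaluation the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: it assumes dense per-pixel feature maps extracted from two views of the same object (e.g., from a ViT backbone) and known ground-truth pixel correspondences between the views, and scores 3D equivariance as the mean cosine similarity between matched feature vectors.

```python
import numpy as np

def equivariance_score(feat_a, feat_b, matches):
    """Mean cosine similarity between features at corresponding pixels.

    feat_a, feat_b: (H, W, C) dense feature maps from two views.
    matches: (N, 4) int array of (row_a, col_a, row_b, col_b)
             ground-truth cross-view correspondences.
    A perfectly view-equivariant model scores 1.0; higher is better.
    """
    va = feat_a[matches[:, 0], matches[:, 1]]  # (N, C) features in view A
    vb = feat_b[matches[:, 2], matches[:, 3]]  # (N, C) features in view B
    va = va / np.linalg.norm(va, axis=1, keepdims=True)
    vb = vb / np.linalg.norm(vb, axis=1, keepdims=True)
    return float(np.mean(np.sum(va * vb, axis=1)))

# Toy check: identical feature maps with identity matches score 1.0.
rng = np.random.default_rng(0)
f = rng.normal(size=(8, 8, 16))
m = np.array([[y, x, y, x] for y in range(8) for x in range(8)])
print(round(equivariance_score(f, f, m), 6))  # → 1.0
```

The same quantity, negated or wrapped in a contrastive loss, could serve as a finetuning objective that pulls corresponding features together across views, in the spirit of the 3D-correspondence finetuning the paper proposes; the exact loss and training setup are in the linked repository.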

