Multiview Equivariance Improves 3D Correspondence Understanding with Minimal Feature Finetuning
November 29, 2024
Authors: Yang You, Yixin Li, Congyue Deng, Yue Wang, Leonidas Guibas
cs.AI
Abstract
Vision foundation models, particularly the ViT family, have revolutionized
image understanding by providing rich semantic features. However, despite their
success in 2D comprehension, their ability to grasp 3D spatial
relationships remains unclear. In this work, we evaluate and enhance the 3D
awareness of ViT-based models. We begin by systematically assessing their
ability to learn 3D equivariant features, specifically examining the
consistency of semantic embeddings across different viewpoints. Our findings
indicate that improved 3D equivariance leads to better performance on various
downstream tasks, including pose estimation, tracking, and semantic transfer.
Building on this insight, we propose a simple yet effective finetuning strategy
based on 3D correspondences, which significantly enhances the 3D correspondence
understanding of existing vision models. Remarkably, even finetuning on a
single object for just one iteration results in substantial performance gains.
All code and resources will be made publicly available to support further
advancements in 3D-aware vision models. Our code is available at
https://github.com/qq456cvb/3DCorrEnhance.
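The evaluation described above measures how consistent semantic embeddings are at corresponding pixels across viewpoints. A minimal sketch of one way such a consistency score could be computed is shown below; the function name, feature shapes, and the use of cosine similarity are our own illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np

def multiview_consistency(feats_a, feats_b, pix_a, pix_b):
    """Mean cosine similarity between dense features sampled at
    corresponding pixels of two views of the same object.

    feats_a, feats_b: (C, H, W) feature maps (e.g. upsampled ViT features).
    pix_a, pix_b:     (N, 2) integer (row, col) pixel correspondences.
    """
    fa = feats_a[:, pix_a[:, 0], pix_a[:, 1]]  # (C, N)
    fb = feats_b[:, pix_b[:, 0], pix_b[:, 1]]  # (C, N)
    # Normalize each feature column, then take per-correspondence dot products.
    fa = fa / np.linalg.norm(fa, axis=0, keepdims=True)
    fb = fb / np.linalg.norm(fb, axis=0, keepdims=True)
    return float((fa * fb).sum(axis=0).mean())

# Sanity check with stand-in random features: identical views and
# correspondences should give a perfect score of 1.0.
feats = np.random.randn(16, 8, 8)
pix = np.random.randint(0, 8, size=(10, 2))
score = multiview_consistency(feats, feats, pix, pix)
```

A perfectly 3D-equivariant backbone would score near 1.0 on true cross-view correspondences, while viewpoint-sensitive features would score lower.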