Multiview Equivariance Improves 3D Correspondence Understanding with Minimal Feature Finetuning

Abstract

Vision foundation models, particularly the ViT family, have revolutionized image understanding by providing rich semantic features. Despite their success in 2D comprehension, however, their ability to grasp 3D spatial relationships remains unclear. In this work, we evaluate and enhance the 3D awareness of ViT-based models. We begin by systematically assessing their ability to learn 3D equivariant features, specifically examining the consistency of semantic embeddings across different viewpoints. Our findings indicate that improved 3D equivariance leads to better performance on various downstream tasks, including pose estimation, tracking, and semantic transfer. Building on this insight, we propose a simple yet effective finetuning strategy based on 3D correspondences, which significantly enhances the 3D correspondence understanding of existing vision models. Remarkably, even finetuning on a single object for just one iteration results in substantial performance gains. Code and resources are publicly available at https://github.com/qq456cvb/3DCorrEnhance to support further advances in 3D-aware vision models.
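To make the notion of "consistency of semantic embeddings across different viewpoints" concrete, the following is a minimal sketch of one way such consistency could be measured. It is not taken from the paper or its repository: the function names, the dictionary-based feature format, and the two scores (mean cosine similarity over known correspondences, and nearest-neighbour retrieval accuracy) are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def multiview_consistency(feats_a, feats_b, matches):
    """Mean cosine similarity over ground-truth correspondences.

    feats_a, feats_b: dict mapping a (row, col) patch index to its feature
        vector (hypothetical format, assumed for this sketch).
    matches: list of (patch in view A, patch in view B) pairs that depict
        the same 3D surface point.
    """
    sims = [cosine(feats_a[pa], feats_b[pb]) for pa, pb in matches]
    return sum(sims) / len(sims)

def retrieval_accuracy(feats_a, feats_b, matches):
    """Fraction of matches whose nearest neighbour in view B is the true patch."""
    hits = 0
    for pa, pb in matches:
        best = max(feats_b, key=lambda p: cosine(feats_a[pa], feats_b[p]))
        hits += best == pb
    return hits / len(matches)

# Toy data: two 2-D features per view; the object appears "swapped" in view B.
view_a = {(0, 0): [1.0, 0.0], (0, 1): [0.0, 1.0]}
view_b = {(0, 0): [0.0, 2.0], (0, 1): [3.0, 0.0]}
matches = [((0, 0), (0, 1)), ((0, 1), (0, 0))]

print(multiview_consistency(view_a, view_b, matches))
print(retrieval_accuracy(view_a, view_b, matches))
```

In this toy case the features at corresponding patches point in the same direction, so both scores are at their maximum. With real ViT patch features, lower values would suggest that the embedding of a surface point changes as the viewpoint changes.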
