Geometry Matters: 3D Foundation Priors for Semantic Correspondence Learning

Abstract

Foundation features from self-supervised vision models and text-to-image diffusion models have proven effective for semantic correspondence estimation. However, because these features are learned primarily from 2D image objectives, they lack explicit 3D awareness and often confuse symmetric object sides, repeated parts, and visually similar structures that are distinct in 3D. We introduce a 3D-aware post-training framework that goes beyond existing 2D foundation features by incorporating priors from 3D foundation models. Given an image, our method uses SAM3D to estimate object geometry and pose, and refines the pose through render-and-compare optimization. We then render PartField descriptors from the reconstructed geometry into the image plane using the estimated object pose. The resulting geometry-aware feature maps complement DINO and Stable Diffusion features, while geodesic distances on the reconstructed shapes enable reliable filtering of candidate correspondences. We use the filtered matches as supervision to train a lightweight adapter on top of DINO and Stable Diffusion for semantic correspondence. In contrast to prior post-training approaches that require pose annotations and rely on coarse spherical geometry, our method automatically obtains instance-specific 3D structure and uses it to guide correspondence learning. Experiments show that our approach improves semantic correspondence over prior methods while reducing manual geometric supervision. Code and model are available at https://github.com/GenIntel/3D-SC.
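
To illustrate the geodesic filtering step, the following is a minimal sketch and not the released implementation. It rests on several assumptions that the abstract does not state: geodesic distance is approximated by shortest paths along mesh edges, each candidate match has already been lifted to a vertex of the reconstructed target mesh, and a candidate is kept only if it lies within a geodesic radius of the vertex suggested by a geometry-based reference match. The helper names (`build_edge_graph`, `geodesic_from`, `filter_matches`) and the threshold `max_geodesic` are hypothetical.

```python
import heapq
import math
from collections import defaultdict


def build_edge_graph(vertices, faces):
    """Adjacency map of the triangle-mesh edge graph, weighted by edge length."""
    graph = defaultdict(dict)
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            length = math.dist(vertices[u], vertices[v])
            graph[u][v] = length
            graph[v][u] = length
    return graph


def geodesic_from(graph, source):
    """Dijkstra over mesh edges: an approximation of geodesic distance."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def filter_matches(graph, candidates, reference, max_geodesic):
    """Keep candidates that agree with a geometry-based reference match.

    candidates: list of (source_id, target_vertex) pairs (assumed to come
        from 2D features lifted onto the target mesh).
    reference: dict source_id -> target_vertex (assumed to come from
        geometry descriptors).
    A candidate survives if its target vertex is within max_geodesic of
    the reference vertex, measured on the target mesh.
    """
    cache = {}
    kept = []
    for src, tgt in candidates:
        ref = reference.get(src)
        if ref is None:
            continue  # no geometric evidence for this source point
        if ref not in cache:
            cache[ref] = geodesic_from(graph, ref)
        if cache[ref].get(tgt, math.inf) <= max_geodesic:
            kept.append((src, tgt))
    return kept


# Toy target mesh: a 3x1 strip of unit squares, each split into two triangles.
vertices = [(x, 0.0, 0.0) for x in range(4)] + [(x, 1.0, 0.0) for x in range(4)]
faces = []
for i in range(3):
    faces.append((i, i + 1, i + 4))
    faces.append((i + 1, i + 5, i + 4))

graph = build_edge_graph(vertices, faces)
candidates = [(10, 1), (11, 3)]   # (source point id, target vertex)
reference = {10: 0, 11: 0}        # geometry-based reference vertex per source
kept = filter_matches(graph, candidates, reference, max_geodesic=1.5)
```

In this toy case the first candidate lands one edge away from its reference vertex and is kept, while the second lands three edges away and is discarded. Edge-path distances overestimate true surface geodesics on coarse meshes, so a practical threshold would likely be set relative to the object's size.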