Geometry Matters: 3D Foundation Priors for Learning Semantic Correspondence
May 28, 2026
Authors: Artur Jesslen, Olaf Dünkel, Adam Kortylewski
cs.AI
Abstract
Foundation features from self-supervised vision models and text-to-image diffusion models have proven effective for semantic correspondence estimation. However, because these features are learned primarily from 2D image objectives, they lack explicit 3D awareness and often confuse symmetric object sides, repeated parts, and visually similar structures that are distinct in 3D. We introduce a 3D-aware post-training framework that goes beyond existing 2D foundation features by incorporating priors from 3D foundation models. Given an image, our method uses SAM3D to estimate object geometry and pose, and refines the pose through render-and-compare optimization. We then render PartField descriptors from the reconstructed geometry into the image plane using the estimated object pose. The resulting geometry-aware feature maps complement DINO and Stable Diffusion features, while geodesic distances on the reconstructed shapes enable reliable filtering of candidate correspondences. We use the filtered matches as supervision to train a lightweight adapter on top of DINO and Stable Diffusion for semantic correspondence. In contrast to prior post-training approaches that require pose annotations and rely on coarse spherical geometry, our method automatically obtains instance-specific 3D structure and uses it to guide correspondence learning. Experiments show that our approach improves semantic correspondence over prior methods while reducing the need for manual geometric supervision. Code and models are available at https://github.com/GenIntel/3D-SC.
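The abstract does not say how the geodesic filter is implemented, so the following is only an illustrative sketch of the general idea, not the authors' code. It assumes that each candidate match has already been lifted to a vertex on the reconstructed mesh, and that a geometry-derived reference vertex is available for comparison. The function names `edge_graph_geodesics` and `filter_candidates`, the `max_geodesic` threshold, and the use of Dijkstra shortest paths along mesh edges as an approximation of geodesic distance are all assumptions made for this example.

```python
import heapq
import math


def edge_graph_geodesics(vertices, edges, source):
    """Approximate geodesic distances from `source` to every vertex.

    Assumption: shortest paths along mesh edges (Dijkstra) stand in for
    true surface geodesics. `vertices` is a list of (x, y, z) tuples and
    `edges` is a list of (i, j) vertex-index pairs.
    """
    adjacency = {i: [] for i in range(len(vertices))}
    for a, b in edges:
        weight = math.dist(vertices[a], vertices[b])  # Euclidean edge length
        adjacency[a].append((b, weight))
        adjacency[b].append((a, weight))

    dist = [math.inf] * len(vertices)
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, weight in adjacency[u]:
            nd = d + weight
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def filter_candidates(vertices, edges, candidates, max_geodesic):
    """Keep candidates whose vertex lies close to the reference on the surface.

    `candidates` is a list of (candidate_vertex, reference_vertex) pairs
    (hypothetical format). A pair is kept when the approximate geodesic
    distance between the two vertices is at most `max_geodesic`.
    """
    cache = {}
    kept = []
    for cand, ref in candidates:
        if ref not in cache:
            cache[ref] = edge_graph_geodesics(vertices, edges, ref)
        if cache[ref][cand] <= max_geodesic:
            kept.append((cand, ref))
    return kept


# Toy example: four vertices forming a unit-square ring.
ring_vertices = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
ring_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
toy_candidates = [(1, 0), (2, 0)]  # vertex 1 is adjacent to 0, vertex 2 is opposite
kept_matches = filter_candidates(ring_vertices, ring_edges, toy_candidates, max_geodesic=1.5)
```

In this toy case the candidate on the adjacent vertex is kept and the one on the opposite corner is discarded. Measuring distance along the surface rather than straight through space is what would let such a filter separate, for example, the two sides of a thin object whose points are close in Euclidean terms.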