

3D Congealing: 3D-Aware Image Alignment in the Wild

April 2, 2024
Authors: Yunzhi Zhang, Zizhang Li, Amit Raj, Andreas Engelhardt, Yuanzhen Li, Tingbo Hou, Jiajun Wu, Varun Jampani
cs.AI

Abstract

We propose 3D Congealing, a novel problem of 3D-aware alignment for 2D images capturing semantically similar objects. Given a collection of unlabeled Internet images, our goal is to associate the shared semantic parts from the inputs and aggregate the knowledge from 2D images into a shared 3D canonical space. We introduce a general framework that tackles the task without assuming shape templates, poses, or any camera parameters. At its core is a canonical 3D representation that encapsulates geometric and semantic information. The framework optimizes for the canonical representation together with the pose for each input image, and a per-image coordinate map that warps 2D pixel coordinates to the 3D canonical frame to account for shape matching. The optimization procedure fuses prior knowledge from a pre-trained image generative model and semantic information from the input images. The former provides strong guidance for this under-constrained task, while the latter provides the information necessary to mitigate the training data bias of the pre-trained model. Our framework can be used for various tasks such as correspondence matching, pose estimation, and image editing, achieving strong results on real-world image datasets under challenging illumination conditions and on in-the-wild online image collections.
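
The abstract describes a joint optimization over three sets of unknowns: a shared canonical 3D representation, a pose for each input image, and a per-image coordinate map that warps 2D pixel coordinates into the canonical 3D frame. The following is a minimal, hypothetical PyTorch sketch of that loop, intended only to illustrate the structure; the class and variable names (CanonicalField, CoordMap, semantic_feats, and the pose regularizer standing in for the generative-model prior term) are assumptions for exposition, not the authors' implementation.

```python
# Minimal sketch (assumptions, not the authors' code) of the joint optimization
# over a canonical 3D representation, per-image poses, and per-image 2D-to-3D
# coordinate maps described in the abstract.
import torch
import torch.nn as nn

class CanonicalField(nn.Module):
    """Toy canonical 3D representation mapping canonical coordinates to semantic features."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, feat_dim))

    def forward(self, xyz):          # xyz: (N, 3) points in the canonical frame
        return self.mlp(xyz)         # (N, feat_dim) semantic features

class CoordMap(nn.Module):
    """Per-image warp from 2D pixel coordinates to the 3D canonical frame."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2, 128), nn.ReLU(), nn.Linear(128, 3))

    def forward(self, uv):           # uv: (N, 2) pixel coordinates in [0, 1]
        return self.mlp(uv)          # (N, 3) canonical-frame coordinates

num_images, n_pix, feat_dim = 8, 1024, 64
canonical = CanonicalField(feat_dim)
coord_maps = nn.ModuleList(CoordMap() for _ in range(num_images))
poses = nn.Parameter(torch.zeros(num_images, 6))  # per-image pose (axis-angle + translation)

# Placeholder inputs: per-pixel semantic features extracted from each image
# (e.g., by a self-supervised vision backbone) and their pixel coordinates.
semantic_feats = torch.randn(num_images, n_pix, feat_dim)
pixel_coords = torch.rand(num_images, n_pix, 2)

opt = torch.optim.Adam(
    list(canonical.parameters()) + list(coord_maps.parameters()) + [poses], lr=1e-3)

for step in range(100):
    opt.zero_grad()
    loss = 0.0
    for i in range(num_images):
        xyz = coord_maps[i](pixel_coords[i])   # warp pixels into the canonical frame
        pred = canonical(xyz)                  # query canonical semantics at warped points
        loss = loss + (pred - semantic_feats[i]).pow(2).mean()  # semantic consistency term
    # Placeholder for the generative-model prior and pose-dependent rendering terms;
    # here only a small regularizer keeps the pose parameters in the optimization.
    loss = loss + 1e-4 * poses.pow(2).sum()
    loss.backward()
    opt.step()
```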
