SceneAligner: 3D-Based Floorplan Localization in the Wild

Abstract

Many public buildings provide floorplans with a "you are here" indicator to help visitors orient themselves. Floorplan localization seeks to replicate this capability computationally by determining where visual observations were captured within a floorplan. However, existing methods typically assume small, controlled environments and precise vectorized floorplans, which limits their applicability to large-scale buildings and rasterized floorplans. In this work, we present an approach for floorplan localization in the wild that grounds the task in a reconstructed 3D representation of the scene. Given an unconstrained image collection, our method reconstructs a gravity-aligned 3D scene and projects it into a 2D density map that serves as a floorplan proxy. Floorplan localization is then formulated as aligning this proxy with the input floorplan via a 2D similarity transform. To bridge the appearance gap between density maps and architectural floorplans, we adapt a 2D foundation model to learn cross-modal correspondences and introduce a fine-tuning scheme that encourages semantically aligned matches while preserving structural consistency. Extensive experiments demonstrate substantial improvements over prior methods, including in extremely sparse settings with as few as a single input image. Our code and data will be made publicly available.
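
As a rough illustration of the two geometric steps described above, the following sketch projects a gravity-aligned point cloud into a 2D density map and fits a 2D similarity transform to a set of point correspondences. It is a minimal sketch under stated assumptions, not our released implementation: the function names `density_map` and `fit_similarity_2d` and the `cell` parameter are hypothetical, the z axis is assumed to be the gravity direction, and the transform is fitted with the closed-form Umeyama least-squares solution on correspondences that are assumed to be given (for example, by a learned matcher).

```python
import numpy as np


def density_map(points, cell=0.1):
    """Project gravity-aligned 3D points (N, 3), z up, onto the x-y plane.

    Returns a 2D histogram normalized to [0, 1] and the x-y origin of the grid.
    `cell` is the (assumed) grid resolution in scene units.
    """
    xy = points[:, :2]
    origin = xy.min(axis=0)
    idx = np.floor((xy - origin) / cell).astype(int)
    grid = np.zeros(idx.max(axis=0) + 1, dtype=float)
    # Count how many points fall in each cell (handles repeated indices).
    np.add.at(grid, (idx[:, 0], idx[:, 1]), 1.0)
    return grid / grid.max(), origin


def fit_similarity_2d(src, dst):
    """Least-squares 2D similarity transform with dst ~ s * R @ src + t.

    src, dst: (N, 2) arrays of corresponding points (assumed given).
    Uses the closed-form Umeyama solution; returns scale s, rotation R, shift t.
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)
    U, S, Vt = np.linalg.svd(cov)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(U) * np.linalg.det(Vt) >= 0 else -1.0
    D = np.diag([1.0, d])
    R = U @ D @ Vt
    var_s = (xs ** 2).sum() / len(src)
    s = (S * np.diag(D)).sum() / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t
```

In this sketch, `src` would hold pixel coordinates in the density map and `dst` the matched coordinates in the floorplan. In practice such a fit would typically be wrapped in a robust estimator to tolerate incorrect matches, which is omitted here.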