SceneAligner: 3D-Grounded Floorplan Localization in the Wild

Abstract

Many public buildings provide floorplans with a "you are here" indicator to help visitors orient themselves. Floorplan localization seeks to replicate this capability computationally by determining where visual observations were captured within a floorplan. However, existing methods typically assume controlled, small-scale environments and precise vectorized floorplans, which limits their applicability to large-scale buildings and rasterized floorplans. In this work, we present an approach for performing floorplan localization in the wild by grounding the task in a reconstructed 3D representation of the scene. Given an unconstrained image collection, our method first reconstructs a gravity-aligned 3D scene and projects it into a 2D density map that serves as a floorplan proxy. Floorplan localization is then formulated as aligning this proxy with the input floorplan via a 2D similarity transform. To bridge the appearance gap between density maps and architectural floorplans, we adapt a 2D foundation model to learn cross-modal correspondences, introducing a fine-tuning scheme that encourages semantically aligned matches while preserving structural consistency. Extensive experiments demonstrate substantial improvements over prior methods, including in extremely sparse settings with as few as a single input image. Our code and data will be made publicly available.
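The two geometric steps named in the abstract, projecting a gravity-aligned reconstruction into a 2D density map and aligning it to a floorplan with a 2D similarity transform, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `density_map` and `fit_similarity_2d`, the `cell` parameter, and the assumption that the z axis points up are all hypothetical, and the alignment uses a standard least-squares (Umeyama-style) fit on correspondences that are assumed to be given, whereas the paper obtains them from a learned cross-modal matcher.

```python
import numpy as np


def density_map(points, cell=0.1):
    """Project gravity-aligned 3D points (N, 3) onto the x-y plane.

    Assumes z is the gravity axis (an assumption of this sketch).
    Returns a grid of per-cell point counts normalized to [0, 1]
    and the x-y origin of the grid.
    """
    xy = points[:, :2]
    origin = xy.min(axis=0)
    idx = np.floor((xy - origin) / cell).astype(int)
    grid = np.zeros(idx.max(axis=0) + 1, dtype=np.float64)
    # Accumulate counts; np.add.at handles repeated indices correctly.
    np.add.at(grid, (idx[:, 0], idx[:, 1]), 1.0)
    return grid / grid.max(), origin


def fit_similarity_2d(src, dst):
    """Least-squares similarity transform with dst ~ s * R @ src + t.

    src, dst: (N, 2) arrays of corresponding 2D points (assumed given).
    Returns scale s, 2x2 rotation R, and translation t.
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)
    U, S, Vt = np.linalg.svd(cov)
    # Flip the last axis if needed so that R is a proper rotation.
    d = np.ones(2)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        d[-1] = -1.0
    R = U @ np.diag(d) @ Vt
    var_s = (xs ** 2).sum() / len(src)
    s = (S * d).sum() / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cloud = rng.uniform(0.0, 5.0, size=(1000, 3))
    grid, origin = density_map(cloud, cell=0.25)
    print("density map shape:", grid.shape)

    # Synthetic correspondences related by a known similarity transform.
    theta = np.deg2rad(30.0)
    R_true = np.array([[np.cos(theta), -np.sin(theta)],
                       [np.sin(theta), np.cos(theta)]])
    src = rng.uniform(-1.0, 1.0, size=(50, 2))
    dst = 2.0 * src @ R_true.T + np.array([1.0, -3.0])
    s, R, t = fit_similarity_2d(src, dst)
    print("scale:", s, "translation:", t)
```

In practice, learned correspondences between a density map and a floorplan are likely to contain outliers, so a fit of this kind would usually be wrapped in a robust estimator such as RANSAC; that step is omitted here for brevity.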