SceneAligner: 3D-Grounded Floorplan Localization in the Wild

Abstract

Many public buildings provide floorplans with a "you are here" indicator to help visitors orient themselves. Floorplan localization seeks to replicate this capability computationally by determining where visual observations were captured within a floorplan. However, existing methods typically assume controlled, small-scale environments and precise vectorized floorplans, which limits their applicability to large-scale buildings and rasterized floorplans. In this work, we present an approach for performing floorplan localization in the wild by grounding the task in a reconstructed 3D representation of the scene. Given an unconstrained image collection, our method reconstructs a gravity-aligned 3D scene and projects it into a 2D density map that serves as a floorplan proxy. Floorplan localization is then formulated as aligning this proxy with the input floorplan via a 2D similarity transform. To bridge the appearance gap between density maps and architectural floorplans, we adapt a 2D foundation model to learn cross-modal correspondences, introducing a fine-tuning scheme that encourages semantically aligned matches while preserving structural consistency. Extensive experiments demonstrate substantial improvements over prior methods, including in extremely sparse settings with as few as a single input image. Our code and data will be made publicly available.
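The two geometric steps named above, projecting a gravity-aligned reconstruction to a 2D density map and aligning it to the floorplan with a 2D similarity transform, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names `density_map` and `fit_similarity_2d`, the z-up axis convention, the `cell` size, and the use of the closed-form least-squares (Umeyama) solution are all assumptions made for illustration. The learned cross-modal matcher is not reproduced here, so correspondences are taken as given.

```python
import numpy as np

def density_map(points, cell=0.1):
    """Project gravity-aligned 3D points (N, 3) onto the ground plane.

    Assumes z is the gravity (up) axis; `cell` is a hypothetical cell size
    in scene units. Returns a map normalized to [0, 1] and the xy origin.
    """
    xy = points[:, :2]                        # drop the height coordinate
    origin = xy.min(axis=0)
    idx = np.floor((xy - origin) / cell).astype(int)
    width, height = idx.max(axis=0) + 1
    grid = np.zeros((height, width))          # rows index y, columns index x
    np.add.at(grid, (idx[:, 1], idx[:, 0]), 1.0)  # count points per cell
    return grid / grid.max(), origin

def fit_similarity_2d(src, dst):
    """Least-squares 2D similarity with dst ~= s * R @ src + t.

    src, dst: (N, 2) arrays of matched points (assumed given here).
    Uses the closed-form Umeyama solution; rejects reflections.
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)                # 2x2 cross-covariance
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U @ Vt))        # -1 if a reflection is implied
    D = np.diag([1.0, d])
    R = U @ D @ Vt                            # proper rotation
    var_s = (xs ** 2).sum() / len(src)
    s = (S * np.diag(D)).sum() / var_s        # isotropic scale
    t = mu_d - s * R @ mu_s
    return s, R, t
```

With matches between density-map pixels and floorplan pixels, `fit_similarity_2d` returns the scale, rotation, and translation that place the proxy map on the floorplan. Because real matches contain outliers, a closed-form fit like this would normally sit inside a robust estimator rather than be applied to all matches at once.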