OpenCity3D: What Do Vision-Language Models Know About Urban Environments?

Abstract

Vision-language models (VLMs) show great promise for 3D scene understanding but are mainly applied to indoor spaces or autonomous driving, focusing on low-level tasks like segmentation. This work expands their use to urban-scale environments by leveraging 3D reconstructions from multi-view aerial imagery. We propose OpenCity3D, an approach that addresses high-level tasks, such as population density estimation, building age classification, property price prediction, crime rate assessment, and noise pollution evaluation. Our findings highlight OpenCity3D's impressive zero-shot and few-shot capabilities, showcasing adaptability to new contexts. This research establishes a new paradigm for language-driven urban analytics, enabling applications in planning, policy, and environmental monitoring. See our project page: opencity3d.github.io