OpenCity3D: What do Vision-Language Models know about Urban Environments?

Abstract

Vision-language models (VLMs) show great promise for 3D scene understanding, but they are mainly applied to indoor spaces or autonomous driving and focus on low-level tasks such as segmentation. This work extends their use to urban-scale environments by leveraging 3D reconstructions from multi-view aerial imagery. We propose OpenCity3D, an approach that addresses high-level tasks such as population density estimation, building age classification, property price prediction, crime rate assessment, and noise pollution evaluation. Our findings highlight OpenCity3D's impressive zero-shot and few-shot capabilities, showcasing its adaptability to new contexts. This research establishes a new paradigm for language-driven urban analytics, enabling applications in planning, policy, and environmental monitoring. See our project page: opencity3d.github.io
