CityLens: Benchmarking Large Language-Vision Models for Urban Socioeconomic Sensing
May 31, 2025
Authors: Tianhui Liu, Jie Feng, Hetian Pang, Xin Zhang, Tianjian Ouyang, Zhiyuan Zhang, Yong Li
cs.AI
Abstract
Understanding urban socioeconomic conditions through visual data is a
challenging yet essential task for sustainable urban development and policy
planning. In this work, we introduce CityLens, a comprehensive benchmark
designed to evaluate the capabilities of large language-vision models (LLVMs)
in predicting socioeconomic indicators from satellite and street view imagery.
We construct a multi-modal dataset covering 17 globally distributed cities and
spanning 6 key domains: economy, education, crime, transport, health, and
environment, reflecting the multifaceted nature of urban life. Based on this
dataset, we define 11 prediction tasks and adopt three evaluation paradigms:
Direct Metric Prediction, Normalized Metric Estimation, and Feature-Based
Regression. We benchmark 17 state-of-the-art LLVMs across these tasks. Our
results reveal that while LLVMs demonstrate promising perceptual and reasoning
capabilities, they still exhibit limitations in predicting urban socioeconomic
indicators. CityLens provides a unified framework for diagnosing these
limitations and guiding future efforts to use LLVMs to understand and predict
urban socioeconomic patterns. Our code and datasets are open-sourced at
https://github.com/tsinghua-fib-lab/CityLens.
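To make the Feature-Based Regression paradigm concrete: the idea is that an LLVM first scores each region's imagery along interpretable visual dimensions, and a lightweight regressor then maps those scores to a socioeconomic indicator. The sketch below is a minimal illustration using synthetic stand-in data; the feature names, weights, and choice of ridge regression are assumptions for exposition, not the benchmark's actual pipeline.

```python
# Minimal sketch of a feature-based regression evaluation. The array
# `llvm_features` stands in for per-region scores an LLVM might assign
# (e.g. ratings of building density, greenery, road quality); all names
# and values here are hypothetical.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for LLVM-derived feature scores for 500 regions.
llvm_features = rng.uniform(0, 10, size=(500, 3))

# Synthetic stand-in for a ground-truth indicator (e.g. median income),
# generated as a noisy linear function of the features.
true_weights = np.array([1.5, -0.8, 2.0])
indicator = llvm_features @ true_weights + rng.normal(0, 1.0, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    llvm_features, indicator, test_size=0.2, random_state=0
)

# Fit a simple ridge regressor on the LLVM features and report held-out
# R^2, the kind of score used to compare models under this paradigm.
reg = Ridge(alpha=1.0).fit(X_train, y_train)
print(f"R^2 on held-out regions: {r2_score(y_test, reg.predict(X_test)):.3f}")
```

Under this reading, a higher held-out R^2 indicates that the model's visual assessments carry more signal about the target indicator, which is what makes the paradigm a useful probe of perception quality independent of the model's raw numeric guessing.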