CityLens: Benchmarking Large Language-Vision Models for Urban Socioeconomic Sensing
May 31, 2025
Authors: Tianhui Liu, Jie Feng, Hetian Pang, Xin Zhang, Tianjian Ouyang, Zhiyuan Zhang, Yong Li
cs.AI
Abstract
Understanding urban socioeconomic conditions through visual data is a
challenging yet essential task for sustainable urban development and policy
planning. In this work, we introduce CityLens, a comprehensive
benchmark designed to evaluate the capabilities of large language-vision models
(LLVMs) in predicting socioeconomic indicators from satellite and street view
imagery. We construct a multi-modal dataset covering a total of 17 globally
distributed cities, spanning 6 key domains: economy, education, crime,
transport, health, and environment, reflecting the multifaceted nature of urban
life. Based on this dataset, we define 11 prediction tasks and utilize three
evaluation paradigms: Direct Metric Prediction, Normalized Metric Estimation,
and Feature-Based Regression. We benchmark 17 state-of-the-art LLVMs across
these tasks. Our results reveal that while LLVMs demonstrate promising
perceptual and reasoning capabilities, they still exhibit limitations in
predicting urban socioeconomic indicators. CityLens provides a unified
framework for diagnosing these limitations and guiding future efforts in using
LLVMs to understand and predict urban socioeconomic patterns. Our code and
datasets are open-sourced at https://github.com/tsinghua-fib-lab/CityLens.
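The Feature-Based Regression paradigm mentioned above can be illustrated with a minimal sketch: fit a linear model on model-derived features for each urban region and score held-out predictions with R². The function name, the feature representation, and the train/test split here are illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np

def feature_based_regression_r2(features, targets, train_frac=0.7, seed=0):
    """Toy Feature-Based Regression evaluation (illustrative only).

    features: (n_regions, d) array of scores derived from an LLVM for each
        region (a hypothetical stand-in for the benchmark's features).
    targets: (n_regions,) ground-truth socioeconomic indicator values.
    Returns the R^2 of an ordinary-least-squares fit on a held-out split.
    """
    rng = np.random.default_rng(seed)
    n = len(targets)
    idx = rng.permutation(n)
    split = int(train_frac * n)
    tr, te = idx[:split], idx[split:]
    # Append a bias column and solve OLS on the training split.
    X = np.hstack([features, np.ones((n, 1))])
    w, *_ = np.linalg.lstsq(X[tr], targets[tr], rcond=None)
    pred = X[te] @ w
    ss_res = np.sum((targets[te] - pred) ** 2)
    ss_tot = np.sum((targets[te] - targets[te].mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```

A linear probe like this separates the quality of the extracted features from the difficulty of producing calibrated numeric values directly, which is why it complements the direct and normalized prediction paradigms.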