CityRiSE: Reasoning Urban Socio-Economic Status in Vision-Language Models via Reinforcement Learning
October 25, 2025
Authors: Tianhui Liu, Hetian Pang, Xin Zhang, Jie Feng, Yong Li, Pan Hui
cs.AI
Abstract
Urban socio-economic sensing, which harnesses publicly available, large-scale
web data such as street view and satellite imagery, is of paramount importance
for achieving global sustainable development goals. With the emergence of Large
Vision-Language Models (LVLMs), new opportunities have arisen to solve this
task by treating it as a multi-modal perception and understanding problem.
However, recent studies reveal that LVLMs still struggle with accurate and
interpretable socio-economic predictions from visual data. To address these
limitations and maximize the potential of LVLMs, we introduce
CityRiSE, a novel framework for Reasoning urban
Socio-Economic status in LVLMs through pure reinforcement
learning (RL). With carefully curated multi-modal data and verifiable reward
design, our approach guides the LVLM to focus on semantically meaningful visual
cues, enabling structured and goal-oriented reasoning for generalist
socio-economic status prediction. Experiments demonstrate that CityRiSE, with
its emergent reasoning process, significantly outperforms existing baselines,
improving both prediction accuracy and generalization across diverse urban
contexts, particularly for prediction on unseen cities and unseen indicators.
This work highlights the promise of combining RL and LVLMs for interpretable
and generalist urban socio-economic sensing.
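To make the "verifiable reward" idea concrete, below is a minimal, hypothetical sketch of what such a reward could look like for numeric indicator prediction. The tag format (`<answer>...</answer>`), the relative tolerance, and the small format-only bonus are all illustrative assumptions, not the paper's actual reward design.

```python
import re

def verifiable_reward(response: str, target: float, rel_tol: float = 0.1) -> float:
    """Toy verifiable reward for RL fine-tuning of an LVLM (illustrative only).

    Returns 1.0 if the model's final numeric answer, extracted from an
    <answer>...</answer> tag, falls within a relative tolerance of the
    ground-truth indicator value; 0.1 if the answer is well-formed but
    inaccurate (format-only credit); 0.0 if no parseable answer exists.
    """
    match = re.search(r"<answer>\s*([-+]?\d+(?:\.\d+)?)\s*</answer>", response)
    if match is None:
        return 0.0  # no structured answer -> no reward signal
    pred = float(match.group(1))
    if abs(pred - target) <= rel_tol * abs(target):
        return 1.0  # verifiably accurate prediction
    return 0.1  # correct format, wrong value

# Example: a prediction within 10% of the true indicator value
r = verifiable_reward("<think>dense low-rise housing...</think><answer>0.72</answer>", 0.75)
```

Because the reward is computed by a deterministic check against ground truth rather than a learned reward model, it cannot be gamed by plausible-sounding but unverifiable text, which is the property that makes pure RL training feasible here.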