ChatPaper.ai


CityRiSE: Reasoning Urban Socio-Economic Status in Vision-Language Models via Reinforcement Learning

October 25, 2025
作者: Tianhui Liu, Hetian Pang, Xin Zhang, Jie Feng, Yong Li, Pan Hui
cs.AI

Abstract

Urban socio-economic sensing, which harnesses publicly available, large-scale web data such as street view and satellite imagery, is of paramount importance for achieving global sustainable development goals. With the emergence of Large Vision-Language Models (LVLMs), new opportunities have arisen to solve this task by treating it as a multi-modal perception and understanding problem. However, recent studies reveal that LVLMs still struggle with accurate and interpretable socio-economic predictions from visual data. To address these limitations and maximize the potential of LVLMs, we introduce CityRiSE, a novel framework for Reasoning urban Socio-Economic status in LVLMs through pure reinforcement learning (RL). With carefully curated multi-modal data and a verifiable reward design, our approach guides the LVLM to focus on semantically meaningful visual cues, enabling structured and goal-oriented reasoning for generalist socio-economic status prediction. Experiments demonstrate that CityRiSE, with its emergent reasoning process, significantly outperforms existing baselines, improving both prediction accuracy and generalization across diverse urban contexts, particularly for prediction on unseen cities and unseen indicators. This work highlights the promise of combining RL and LVLMs for interpretable and generalist urban socio-economic sensing.