UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding
June 29, 2025
Authors: Jie Feng, Shengyuan Wang, Tianhui Liu, Yanxin Xi, Yong Li
cs.AI
Abstract
Urban research involves a wide range of scenarios and tasks that require the understanding of multi-modal data. Current methods often focus on specific data types and lack a unified framework in the urban field for processing them comprehensively. The recent success of multi-modal large language models (MLLMs) presents a promising opportunity to overcome this limitation. In this paper, we introduce UrbanLLaVA, a multi-modal large language model designed to process four types of urban data simultaneously and achieve strong performance across diverse urban tasks compared with general MLLMs. In UrbanLLaVA, we first curate a diverse urban instruction dataset encompassing both single-modal and cross-modal urban data, spanning from a location-level view to a global view of the urban environment. Additionally, we propose a multi-stage training framework that decouples spatial reasoning enhancement from domain knowledge learning, thereby improving the compatibility and downstream performance of UrbanLLaVA across diverse urban tasks. Finally, we also extend existing benchmarks for urban research to assess the performance of MLLMs across a wide range of urban tasks. Experimental results from three cities demonstrate that UrbanLLaVA outperforms both open-source and proprietary MLLMs on single-modal and complex cross-modal tasks, and shows robust generalization across cities. Source code and data are openly accessible to the research community via https://github.com/tsinghua-fib-lab/UrbanLLaVA.
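
The multi-stage training framework mentioned above, which decouples spatial-reasoning enhancement from domain-knowledge learning, can be pictured as sequential fine-tuning passes over different slices of the instruction dataset. The sketch below is a minimal illustration of that idea, not the released implementation: the stage names, the `InstructionSample` fields, and the `fine_tune` stub are assumptions introduced here for exposition; consult the repository above for the actual training recipe.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class InstructionSample:
    """One instruction-tuning example; the field names here are hypothetical."""
    modality: str   # e.g. "street_view", "satellite", "trajectory", "cross_modal"
    skill: str      # e.g. "domain_knowledge" or "spatial_reasoning"
    prompt: str
    answer: str

def fine_tune(model: Dict, samples: List[InstructionSample], stage: str) -> Dict:
    """Stand-in for one instruction-tuning pass (e.g. LoRA or full fine-tuning)."""
    print(f"[{stage}] tuning on {len(samples)} samples")
    model.setdefault("stages", []).append(stage)
    return model

def multi_stage_train(base_model: Dict, data: List[InstructionSample]) -> Dict:
    """Run stages sequentially so spatial-reasoning data is handled in its own
    stage rather than mixed into domain-knowledge tuning (the decoupling idea
    named in the abstract). Stage names and ordering are illustrative."""
    stages: List[Tuple[str, Callable[[InstructionSample], bool]]] = [
        ("stage1_domain_knowledge", lambda s: s.skill == "domain_knowledge"),
        ("stage2_spatial_reasoning", lambda s: s.skill == "spatial_reasoning"),
        ("stage3_task_mixture", lambda s: True),  # a final pass over all instructions
    ]
    model = base_model
    for name, keep in stages:
        subset = [s for s in data if keep(s)]
        model = fine_tune(model, subset, name)
    return model

if __name__ == "__main__":
    toy_data = [
        InstructionSample("street_view", "domain_knowledge",
                          "What land use surrounds this image?", "A public park."),
        InstructionSample("satellite", "spatial_reasoning",
                          "Which tile lies north of tile A?", "Tile B."),
    ]
    multi_stage_train({"name": "toy-mllm"}, toy_data)
```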