UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding
June 29, 2025
Authors: Jie Feng, Shengyuan Wang, Tianhui Liu, Yanxin Xi, Yong Li
cs.AI
Abstract
Urban research involves a wide range of scenarios and tasks that require the understanding of multi-modal data. Current methods often focus on specific data types and lack a unified framework in the urban field for processing them comprehensively. The recent success of multi-modal large language models (MLLMs) presents a promising opportunity to overcome this limitation. In this paper, we introduce UrbanLLaVA, a multi-modal large language model designed to process four types of urban data simultaneously and achieve strong performance across diverse urban tasks compared with general MLLMs. In UrbanLLaVA, we first curate a diverse urban instruction dataset encompassing both single-modal and cross-modal urban data, spanning from a location-level view to a global view of the urban environment. Additionally, we propose a multi-stage training framework that decouples spatial reasoning enhancement from domain knowledge learning, thereby improving the compatibility and downstream performance of UrbanLLaVA across diverse urban tasks. Finally, we also extend existing benchmarks for urban research to assess the performance of MLLMs across a wide range of urban tasks. Experimental results from three cities demonstrate that UrbanLLaVA outperforms both open-source and proprietary MLLMs on single-modal and complex cross-modal tasks, and shows robust generalization across cities. Source code and data are openly accessible to the research community via https://github.com/tsinghua-fib-lab/UrbanLLaVA.
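
The multi-stage training framework mentioned above, which decouples spatial-reasoning enhancement from domain-knowledge learning, can be pictured as sequential fine-tuning passes over different slices of the instruction dataset. The sketch below is a minimal illustration of that idea, not the released implementation: the stage names, the `InstructionSample` fields, and the `fine_tune` stub are assumptions introduced here for exposition; consult the repository above for the actual training recipe.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class InstructionSample:
    """One instruction-tuning example; the field names here are hypothetical."""
    modality: str   # e.g. "street_view", "satellite", "trajectory", "cross_modal"
    skill: str      # e.g. "domain_knowledge" or "spatial_reasoning"
    prompt: str
    answer: str

def fine_tune(model: Dict, samples: List[InstructionSample], stage: str) -> Dict:
    """Stand-in for one instruction-tuning pass (e.g. LoRA or full fine-tuning)."""
    print(f"[{stage}] tuning on {len(samples)} samples")
    model.setdefault("stages", []).append(stage)
    return model

def multi_stage_train(base_model: Dict, data: List[InstructionSample]) -> Dict:
    """Run stages sequentially so spatial-reasoning data is handled in its own
    stage rather than mixed into domain-knowledge tuning (the decoupling idea
    named in the abstract). Stage names and ordering are illustrative."""
    stages: List[Tuple[str, Callable[[InstructionSample], bool]]] = [
        ("stage1_domain_knowledge", lambda s: s.skill == "domain_knowledge"),
        ("stage2_spatial_reasoning", lambda s: s.skill == "spatial_reasoning"),
        ("stage3_task_mixture", lambda s: True),  # a final pass over all instructions
    ]
    model = base_model
    for name, keep in stages:
        subset = [s for s in data if keep(s)]
        model = fine_tune(model, subset, name)
    return model

if __name__ == "__main__":
    toy_data = [
        InstructionSample("street_view", "domain_knowledge",
                          "What land use surrounds this image?", "A public park."),
        InstructionSample("satellite", "spatial_reasoning",
                          "Which tile lies north of tile A?", "Tile B."),
    ]
    multi_stage_train({"name": "toy-mllm"}, toy_data)
```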