

UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding

June 29, 2025
Authors: Jie Feng, Shengyuan Wang, Tianhui Liu, Yanxin Xi, Yong Li
cs.AI

Abstract

Urban research involves a wide range of scenarios and tasks that require the understanding of multi-modal data. Current methods often focus on specific data types and lack a unified framework in the urban field for processing them comprehensively. The recent success of multi-modal large language models (MLLMs) presents a promising opportunity to overcome this limitation. In this paper, we introduce UrbanLLaVA, a multi-modal large language model designed to process four types of urban data simultaneously and achieve strong performance across diverse urban tasks compared with general MLLMs. In UrbanLLaVA, we first curate a diverse urban instruction dataset encompassing both single-modal and cross-modal urban data, spanning from the local view to the global view of the urban environment. Additionally, we propose a multi-stage training framework that decouples spatial reasoning enhancement from domain knowledge learning, thereby improving the compatibility and downstream performance of UrbanLLaVA across diverse urban tasks. Finally, we also extend the existing benchmark for urban research to assess the performance of MLLMs across a wide range of urban tasks. Experimental results from three cities demonstrate that UrbanLLaVA outperforms open-source and proprietary MLLMs on both single-modal and complex cross-modal tasks and shows robust generalization across cities. Source code and data are openly accessible to the research community via https://github.com/tsinghua-fib-lab/UrbanLLaVA.
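
To make the "decoupled" multi-stage idea concrete, below is a minimal sketch of a stage-wise instruction-tuning schedule in the spirit described by the abstract: domain knowledge learning and spatial reasoning enhancement are run as separate stages over different instruction subsets. The stage names, data-subset keys, and hyperparameters here are illustrative assumptions, not taken from the released UrbanLLaVA code or paper.

```python
# Hypothetical sketch of a multi-stage instruction-tuning schedule that separates
# domain-knowledge learning from spatial-reasoning enhancement.
# All names below (stage labels, dataset keys, hyperparameters) are assumptions
# for illustration; the actual UrbanLLaVA pipeline may be organized differently.

from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Stage:
    name: str
    dataset_keys: List[str]   # which instruction subsets this stage trains on
    epochs: int
    learning_rate: float


# Assumed split of the urban instruction data into subsets.
instruction_data: Dict[str, List[dict]] = {
    "single_modal": [],       # e.g. instructions grounded in one urban modality
    "cross_modal": [],        # instructions pairing several urban modalities
    "spatial_reasoning": [],  # distance / direction / containment style questions
}

# Decoupled schedule: learn urban domain knowledge first, then enhance
# spatial reasoning in a separate stage (one possible ordering).
schedule = [
    Stage("domain_knowledge", ["single_modal", "cross_modal"], epochs=1, learning_rate=2e-5),
    Stage("spatial_reasoning", ["spatial_reasoning"], epochs=1, learning_rate=1e-5),
]


def run_schedule(train_fn: Callable[[List[dict], int, float], None]) -> None:
    """Run the stages sequentially, feeding each one only its own subsets."""
    for stage in schedule:
        samples = [ex for key in stage.dataset_keys for ex in instruction_data[key]]
        print(f"[{stage.name}] {len(samples)} instructions, "
              f"{stage.epochs} epoch(s), lr={stage.learning_rate}")
        train_fn(samples, stage.epochs, stage.learning_rate)


if __name__ == "__main__":
    # Plug in any fine-tuning routine; a no-op stand-in keeps the sketch runnable.
    run_schedule(lambda samples, epochs, lr: None)
```

The point of the sketch is only the schedule structure: each stage sees its own instruction subsets, so spatial reasoning supervision does not interfere with domain-knowledge tuning within a single mixed stage.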