UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding

Abstract

Urban research involves a wide range of scenarios and tasks that require understanding multi-modal data. Current methods often focus on specific data types, and the urban field lacks a unified framework for processing them comprehensively. The recent success of multi-modal large language models (MLLMs) presents a promising opportunity to overcome this limitation. In this paper, we introduce UrbanLLaVA, a multi-modal large language model designed to process four types of urban data simultaneously and to outperform general MLLMs across diverse urban tasks. In UrbanLLaVA, we first curate a diverse urban instruction dataset encompassing both single-modal and cross-modal urban data, spanning from the location view to the global view of the urban environment. We then propose a multi-stage training framework that decouples spatial reasoning enhancement from domain knowledge learning, thereby improving the compatibility and downstream performance of UrbanLLaVA across diverse urban tasks. Finally, we extend existing benchmarks for urban research to assess the performance of MLLMs across a wide range of urban tasks. Experimental results from three cities demonstrate that UrbanLLaVA outperforms open-source and proprietary MLLMs on both single-modal tasks and complex cross-modal tasks, and that it generalizes robustly across cities. Source code and data are openly accessible to the research community via https://github.com/tsinghua-fib-lab/UrbanLLaVA.
