Urban-ImageNet: A Large-Scale Multi-Modal Dataset and Evaluation Framework for Urban Space Perception
May 11, 2026
Authors: Yiwei Ou, Chung Ching Cheung, Jun Yang Ang, Xiaobin Ren, Ronggui Sun, Guansong Gao, Kaiqi Zhao, Manfredo Manfredini
cs.AI
Abstract
We present Urban-ImageNet, a large-scale multi-modal dataset and evaluation benchmark for urban space perception from user-generated social media imagery. The corpus contains over 2 million public social media images and paired textual posts collected from Weibo across 61 urban sites in 24 Chinese cities between 2019 and 2025, with controlled benchmark subsets at the 1K, 10K, and 100K scales and a full 2M corpus for large-scale training and evaluation. Urban-ImageNet is organized by HUSIC, a Hierarchical Urban Space Image Classification framework that defines a 10-class taxonomy grounded in urban theory. The taxonomy is designed to distinguish activated and non-activated public spaces, exterior and interior urban environments, accommodation spaces, consumption content, portraits, and non-spatial social-media content. Rather than treating urban imagery as generic scene data, Urban-ImageNet evaluates whether machine perception models can capture spatial, social, and functional distinctions that are central to urban studies. The benchmark supports three tasks within one standardized library: (T1) urban scene semantic classification, (T2) cross-modal image-text retrieval, and (T3) instance segmentation. Our experiments evaluate representative vision, vision-language, and segmentation models, revealing strong performance on supervised scene classification while showing that cross-modal retrieval and instance-level urban object segmentation remain considerably more challenging. A multi-scale study further examines how model performance changes as balanced training data increases from 1K to 10K to 100K images. Urban-ImageNet provides a unified, theory-grounded, multi-city benchmark for evaluating how AI systems perceive and interpret contemporary urban spaces across modalities, scales, and task formulations. The dataset and benchmark are available at: huggingface.co/datasets/Yiwei-Ou/Urban-ImageNet and github.com/yiasun/dataset-2.
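Since the corpus is hosted on the Hugging Face Hub, a minimal loading sketch might look like the following. The configuration name, split, and field names used here (`"10k"`, `image`, `text`, `husic_label`) are illustrative assumptions, not the dataset's documented schema; the dataset card at the URL above is authoritative.

```python
# Minimal sketch (assumptions flagged inline): loading one of the
# controlled-scale Urban-ImageNet benchmark subsets with the Hugging
# Face `datasets` library.
from datasets import load_dataset

# The abstract describes 1K/10K/100K benchmark subsets; the config
# name "10k" below is hypothetical.
ds = load_dataset("Yiwei-Ou/Urban-ImageNet", name="10k", split="train")

for example in ds.select(range(3)):
    image = example["image"]        # image payload (assumed field name)
    text = example["text"]          # paired Weibo post (assumed field name)
    label = example["husic_label"]  # HUSIC 10-class label (assumed field name)
    print(label, text[:50])
```

A workflow like this would cover (T1) directly via the HUSIC labels, while (T2) cross-modal retrieval would pair the same `image` and `text` fields; the instance-segmentation annotations for (T3) presumably live in additional fields described on the dataset card.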