Urban-ImageNet: A Large-Scale Multi-Modal Dataset and Evaluation Framework for Urban Space Perception
May 11, 2026
Authors: Yiwei Ou, Chung Ching Cheung, Jun Yang Ang, Xiaobin Ren, Ronggui Sun, Guansong Gao, Kaiqi Zhao, Manfredo Manfredini
cs.AI
Abstract
We present Urban-ImageNet, a large-scale multi-modal dataset and evaluation benchmark for urban space perception from user-generated social media imagery. The corpus contains over 2 million public social media images and paired textual posts collected from Weibo, spanning 61 urban sites in 24 Chinese cities between 2019 and 2025, with controlled benchmark subsets at the 1K, 10K, and 100K scales and the full 2M corpus for large-scale training and evaluation. Urban-ImageNet is organized by HUSIC, a Hierarchical Urban Space Image Classification framework that defines a 10-class taxonomy grounded in urban theory. The taxonomy is designed to distinguish activated and non-activated public spaces, exterior and interior urban environments, accommodation spaces, consumption content, portraits, and non-spatial social-media content. Rather than treating urban imagery as generic scene data, Urban-ImageNet evaluates whether machine perception models can capture the spatial, social, and functional distinctions that are central to urban studies. The benchmark supports three tasks within one standardized library: (T1) urban scene semantic classification, (T2) cross-modal image-text retrieval, and (T3) instance segmentation. Our experiments evaluate representative vision, vision-language, and segmentation models, revealing strong performance on supervised scene classification but substantially greater difficulty on cross-modal retrieval and instance-level urban object segmentation. A multi-scale study further examines how model performance changes as balanced training data increases from 1K to 10K to 100K images. Urban-ImageNet provides a unified, theory-grounded, multi-city benchmark for evaluating how AI systems perceive and interpret contemporary urban spaces across modalities, scales, and task formulations. The dataset and benchmark are available at huggingface.co/datasets/Yiwei-Ou/Urban-ImageNet and github.com/yiasun/dataset-2.
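Since the corpus is hosted on the Hugging Face Hub, programmatic access might look like the minimal sketch below. Only the repository id comes from the paper; the split name and record fields are assumptions, as the abstract does not specify the dataset schema.

```python
# Minimal sketch of loading Urban-ImageNet via the standard `datasets` library.
# The repo id is from the paper; split and field names below are assumptions.
from datasets import load_dataset

ds = load_dataset("Yiwei-Ou/Urban-ImageNet", split="train")  # split name assumed

sample = ds[0]
print(sample.keys())  # expected to expose image, paired post text, and a HUSIC label (field names assumed)
```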