

Urban-ImageNet: A Large-Scale Multi-Modal Dataset and Evaluation Framework for Urban Space Perception

May 11, 2026
作者: Yiwei Ou, Chung Ching Cheung, Jun Yang Ang, Xiaobin Ren, Ronggui Sun, Guansong Gao, Kaiqi Zhao, Manfredo Manfredini
cs.AI

Abstract

We present Urban-ImageNet, a large-scale multi-modal dataset and evaluation benchmark for urban space perception from user-generated social media imagery. The corpus contains over 2 million public social media images and paired textual posts collected from Weibo between 2019 and 2025, covering 61 urban sites in 24 Chinese cities, with controlled benchmark subsets at 1K, 10K, and 100K scale and a full 2M corpus for large-scale training and evaluation. Urban-ImageNet is organized by HUSIC, a Hierarchical Urban Space Image Classification framework that defines a 10-class taxonomy grounded in urban theory. The taxonomy is designed to distinguish activated and non-activated public spaces, exterior and interior urban environments, accommodation spaces, consumption content, portraits, and non-spatial social media content. Rather than treating urban imagery as generic scene data, Urban-ImageNet evaluates whether machine perception models can capture spatial, social, and functional distinctions that are central to urban studies. The benchmark supports three tasks within one standardized library: (T1) urban scene semantic classification, (T2) cross-modal image-text retrieval, and (T3) instance segmentation. Our experiments evaluate representative vision, vision-language, and segmentation models, revealing strong performance on supervised scene classification, while cross-modal retrieval and instance-level urban object segmentation remain challenging. A multi-scale study further examines how model performance changes as balanced training data increases from 1K to 10K to 100K images. Urban-ImageNet provides a unified, theory-grounded, multi-city benchmark for evaluating how AI systems perceive and interpret contemporary urban spaces across modalities, scales, and task formulations. Dataset and benchmark are available at: huggingface.co/datasets/Yiwei-Ou/Urban-ImageNet and github.com/yiasun/dataset-2.
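Cross-modal retrieval benchmarks like task T2 are commonly scored with Recall@K over an image-text similarity matrix. Below is a minimal sketch of that metric, not the authors' evaluation code; the function name, the toy matrix, and the assumption that matching pairs lie on the diagonal are all illustrative:

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Fraction of row queries whose ground-truth match (assumed to be
    the same-index column, i.e. image i pairs with text i) appears in
    the top-k candidates ranked by similarity."""
    # Rank candidate columns for each row query by descending similarity.
    ranks = np.argsort(-sim, axis=1)
    # A query is a hit if its own index occurs among its top-k ranks.
    hits = (ranks[:, :k] == np.arange(len(sim))[:, None]).any(axis=1)
    return float(hits.mean())

# Toy 4x4 similarity matrix: rows are image queries, columns are texts.
sim = np.array([
    [0.9, 0.1, 0.2, 0.0],  # correct text ranks 1st
    [0.8, 0.3, 0.2, 0.1],  # correct text ranks 2nd
    [0.1, 0.2, 0.9, 0.0],  # correct text ranks 1st
    [0.5, 0.4, 0.3, 0.2],  # correct text ranks last
])

print(recall_at_k(sim, 1))  # -> 0.5
print(recall_at_k(sim, 2))  # -> 0.75
```

Text-to-image retrieval is scored the same way on the transposed matrix, `recall_at_k(sim.T, k)`.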
PDF: May 14, 2026