Urban-ImageNet: A Large-Scale Multi-Modal Dataset and Evaluation Framework for Urban Space Perception

Abstract

We present Urban-ImageNet, a large-scale multi-modal dataset and evaluation benchmark for urban space perception from user-generated social media imagery. The corpus contains over 2 million public social media images with paired text posts, collected from Weibo across 61 urban sites in 24 Chinese cities between 2019 and 2025. It provides controlled benchmark subsets at 1K, 10K, and 100K scale, plus the full 2M corpus for large-scale training and evaluation. Urban-ImageNet is organized by HUSIC, a Hierarchical Urban Space Image Classification framework that defines a 10-class taxonomy grounded in urban theory. The taxonomy is designed to distinguish activated and non-activated public spaces, exterior and interior urban environments, accommodation spaces, consumption content, portraits, and non-spatial social-media content. Rather than treating urban imagery as generic scene data, Urban-ImageNet evaluates whether machine perception models can capture the spatial, social, and functional distinctions that are central to urban studies. The benchmark supports three tasks within one standardized library: (T1) urban scene semantic classification, (T2) cross-modal image-text retrieval, and (T3) instance segmentation. Our experiments evaluate representative vision, vision-language, and segmentation models. The results show strong performance on supervised scene classification, while cross-modal retrieval and instance-level urban object segmentation remain more challenging. A multi-scale study further examines how model performance changes as balanced training data increases from 1K to 10K to 100K images. Urban-ImageNet provides a unified, theory-grounded, multi-city benchmark for evaluating how AI systems perceive and interpret contemporary urban spaces across modalities, scales, and task formulations. The dataset and benchmark are available at huggingface.co/datasets/Yiwei-Ou/Urban-ImageNet and github.com/yiasun/dataset-2.
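
As a rough illustration of what a class-balanced subset at a fixed scale could look like, the sketch below draws an equal number of items per class from a labeled pool. The function name `balanced_subset`, the `(item_id, label)` pair format, and the fixed seed are assumptions made for illustration only; the released 1K, 10K, and 100K subsets may have been constructed differently.

```python
import random
from collections import defaultdict


def balanced_subset(samples, total, seed=0):
    """Draw `total` items with an equal number from every class.

    `samples` is an iterable of (item_id, label) pairs. This input
    format is hypothetical and not taken from the released dataset.
    """
    # Group item ids by their class label.
    by_label = defaultdict(list)
    for item_id, label in samples:
        by_label[label].append(item_id)

    labels = sorted(by_label)
    if not labels:
        raise ValueError("no samples given")
    if total % len(labels) != 0:
        raise ValueError("total must be divisible by the number of classes")
    per_class = total // len(labels)

    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    subset = []
    for label in labels:
        pool = by_label[label]
        if len(pool) < per_class:
            raise ValueError(f"class {label!r} has only {len(pool)} items")
        # Sample without replacement within each class.
        subset.extend((item_id, label) for item_id in rng.sample(pool, per_class))
    return subset
```

With a 10-class taxonomy, calling `balanced_subset(pool, 1000)` would select 100 items per class, and the same call with `total=10_000` or `total=100_000` would scale the per-class count accordingly, provided every class has enough items.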
