Urban-ImageNet: A Large-Scale Multi-Modal Dataset and Evaluation Framework for Urban Space Perception
Summary
Urban-ImageNet is a large-scale multi-modal dataset and evaluation benchmark for urban space perception from social media imagery, supporting scene classification, cross-modal retrieval, and instance segmentation tasks across 61 urban sites in 24 Chinese cities.
Cached at: 05/13/26, 08:14 PM
Source: https://huggingface.co/papers/2605.09936
Abstract
Urban-ImageNet presents a large-scale multi-modal dataset and evaluation benchmark for urban space perception from social media imagery, organized under a hierarchical taxonomy for scene classification, cross-modal retrieval, and instance segmentation tasks.
We present Urban-ImageNet, a large-scale multi-modal dataset and evaluation benchmark for urban space perception from user-generated social media imagery. The corpus contains over 2 million public social media images and paired textual posts collected from Weibo at 61 urban sites in 24 Chinese cities from 2019 to 2025, with controlled benchmark subsets at 1K, 10K, and 100K scale and a full 2M corpus for large-scale training and evaluation. Urban-ImageNet is organized by HUSIC, a Hierarchical Urban Space Image Classification framework that defines a 10-class taxonomy grounded in urban theory. The taxonomy is designed to distinguish activated and non-activated public spaces, exterior and interior urban environments, accommodation spaces, consumption content, portraits, and non-spatial social-media content. Rather than treating urban imagery as generic scene data, Urban-ImageNet evaluates whether machine perception models can capture spatial, social, and functional distinctions that are central to urban studies. The benchmark supports three tasks within one standardized library: (T1) urban scene semantic classification, (T2) cross-modal image-text retrieval, and (T3) instance segmentation. Our experiments evaluate representative vision, vision-language, and segmentation models, revealing strong performance on supervised scene classification but greater difficulty in cross-modal retrieval and instance-level urban object segmentation. A multi-scale study further examines how model performance changes as balanced training data increases from 1K to 10K to 100K images. Urban-ImageNet provides a unified, theory-grounded, multi-city benchmark for evaluating how AI systems perceive and interpret contemporary urban spaces across modalities, scales, and task formulations. Dataset and benchmark are available at: huggingface.co/datasets/Yiwei-Ou/Urban-ImageNet and github.com/yiasun/dataset-2.
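The abstract mentions balanced training subsets at several scales over a 10-class taxonomy. As a rough illustration only, the sketch below draws a class-balanced sample of a given size from a list of integer labels. The function name `balanced_subset`, the integer label encoding, and the equal-per-class sampling rule are assumptions made for this example; the source does not describe how the benchmark subsets were actually constructed.

```python
import random
from collections import defaultdict


def balanced_subset(labels, total, num_classes=10, seed=0):
    """Return sorted indices of a class-balanced sample of `total` items.

    `labels` is a list of integer class ids in 0..num_classes-1
    (a hypothetical encoding, not the dataset's documented schema).
    """
    if total % num_classes != 0:
        raise ValueError("total must be divisible by num_classes")
    per_class = total // num_classes

    # Group item indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    chosen = []
    for c in range(num_classes):
        pool = by_class[c]
        if len(pool) < per_class:
            raise ValueError(f"class {c} has only {len(pool)} items")
        # Sample without replacement so no image is picked twice.
        chosen.extend(rng.sample(pool, per_class))
    return sorted(chosen)


# Synthetic labels stand in for real annotations (assumption).
demo_labels = [i % 10 for i in range(5000)]
demo_indices = balanced_subset(demo_labels, total=1000)
```

Calling the function with `total` set to 1000, 10000, or 100000 would give subsets of the three scales named in the abstract, provided every class has enough images.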
Similar Articles
Benchmarking Composed Image Retrieval for Applied Earth Observation
This paper presents a unified benchmark for composed image retrieval in Earth observation, evaluating vision-language backbones and introducing a change-centric dataset (xView2-CIR) for disaster monitoring, highlighting distinct challenges compared to attribute-based retrieval.
PixVerve: Advancing Native UHR Image Generation to 100MP with a Large-Scale High-Quality Dataset
This paper introduces PixVerve-95K, a large-scale open-source dataset of 95K ultra-high-resolution (100MP) images with annotations, and PixVerve-Bench, a benchmark for evaluating native 100MP text-to-image generation, extending existing T2I models to unprecedented resolutions.
@drfeifei: I’m very excited by this new benchmark dataset for visual generation that is suitable for the modern era of large scale…
Introducing GPIC (Giant Permissive Image Corpus), a large-scale dataset of 100M VLM-captioned image-text pairs for training and 1M pairs for benchmarking, fully permissive for research and commercial use.
Built Environment Reasoning from Remote Sensing Imagery Using Large Vision-Language Models
This paper investigates using large vision-language models for built environment reasoning tasks, such as design suggestions and risk identification, leveraging remote sensing imagery. It evaluates models like InternVL and Qwen, highlighting their potential for supporting smart city decision-making and quantitative reasoning.
CM-EVS: Sparse Panoramic RGB-D-Pose Data for Complete Scene Coverage
This paper proposes COVER, a training-free method for converting 3D assets into sparse panoramic RGB-D-pose data with complete scene coverage and low redundancy, and introduces the CM-EVS dataset containing 36,373 curated frames from indoor and outdoor scenes.