Urban-ImageNet: A Large-Scale Multi-Modal Dataset and Evaluation Framework for Urban Space Perception

Hugging Face Daily Papers 05/11/26, 12:00 AM Papers

large-scale multi-modal dataset urban-space scene-classification cross-modal-retrieval instance-segmentation

Summary

Urban-ImageNet is a large-scale multi-modal dataset and evaluation benchmark for urban space perception from social media imagery, supporting scene classification, cross-modal retrieval, and instance segmentation tasks across 61 urban sites in 24 Chinese cities.

We present Urban-ImageNet, a large-scale multi-modal dataset and evaluation benchmark for urban space perception from user-generated social media imagery. The corpus contains over 2 Million public social media images and paired textual posts collected from Weibo across 61 urban sites in 24 Chinese cities across 2019-2025, with controlled benchmark subsets at 1K, 10K, and 100K scale and a full 2M corpus for large-scale training and evaluation. Urban-ImageNet is organized by HUSIC, a Hierarchical Urban Space Image Classification framework that defines a 10-class taxonomy grounded in urban theory. The taxonomy is designed to distinguish activated and non-activated public spaces, exterior and interior urban environments, accommodation spaces, consumption content, portraits, and non-spatial social-media content. Rather than treating urban imagery as generic scene data, Urban-ImageNet evaluates whether machine perception models can capture spatial, social, and functional distinctions that are central to urban studies. The benchmark supports three tasks within one standardized library: (T1) urban scene semantic classification, (T2) cross-modal image-text retrieval, and (T3) instance segmentation. Our experiments evaluate representative vision, vision-language, and segmentation models, revealing strong performance on supervised scene classification but more challenging behavior in cross-modal retrieval and instance-level urban object segmentation. A multi-scale study further examines how model performance changes as balanced training data increases from 1K, 10K to 100K images. Urban-ImageNet provides a unified, theory-grounded, multi-city benchmark for evaluating how AI systems perceive and interpret contemporary urban spaces across modalities, scales, and task formulations. Dataset and benchmark are available at: huggingface.co/datasets/Yiwei-Ou/Urban-ImageNet and github.com/yiasun/dataset-2.

Original Article

View Cached Full Text

Cached at: 05/13/26, 08:14 PM

Paper page - Urban-ImageNet: A Large-Scale Multi-Modal Dataset and Evaluation Framework for Urban Space Perception

Source: https://huggingface.co/papers/2605.09936

Abstract

Urban-ImageNet presents a large-scale multi-modal dataset and evaluation benchmark for urban space perception from social media imagery, organized under a hierarchical taxonomy for scene classification, cross-modal retrieval, and instance segmentation tasks.

We present Urban-ImageNet, a large-scale multi-modal dataset and evaluation benchmark for urban space perception from user-generated social media imagery. The corpus contains over 2 Million public social media images and paired textual posts collected from Weibo across 61 urban sites in 24 Chinese cities across 2019-2025, with controlled benchmark subsets at 1K, 10K, and 100K scale and a full 2M corpus for large-scale training and evaluation. Urban-ImageNet is organized by HUSIC, aHierarchical Urban Space Image Classificationframework that defines a 10-class taxonomy grounded in urban theory. The taxonomy is designed to distinguish activated and non-activated public spaces, exterior and interior urban environments, accommodation spaces, consumption content, portraits, and non-spatial social-media content. Rather than treating urban imagery as generic scene data, Urban-ImageNet evaluates whether machine perception models can capture spatial, social, and functional distinctions that are central to urban studies. The benchmark supports three tasks within one standardized library: (T1)urban scene semantic classification, (T2)cross-modal image-text retrieval, and (T3)instance segmentation. Our experiments evaluate representative vision, vision-language, and segmentation models, revealing strong performance on supervised scene classification but more challenging behavior in cross-modal retrieval and instance-level urban object segmentation. A multi-scale study further examines how model performance changes asbalanced training dataincreases from 1K, 10K to 100K images. Urban-ImageNet provides a unified, theory-grounded, multi-city benchmark for evaluating how AI systems perceive and interpret contemporary urban spaces across modalities, scales, and task formulations. Dataset and benchmark are available at: huggingface.co/datasets/Yiwei-Ou/Urban-ImageNet and github.com/yiasun/dataset-2.

View arXiv page View PDF Project page GitHub1 Add to collection

Get this paper in your agent:

hf papers read 2605\.09936

Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash

Models citing this paper0

No model linking this paper

Cite arxiv.org/abs/2605.09936 in a model README.md to link it from this page.

Datasets citing this paper1

#### Yiwei-Ou/Urban-ImageNet Viewer• Updatedabout 3 hours ago • 3.67M • 204

Spaces citing this paper0

No Space linking this paper

Cite arxiv.org/abs/2605.09936 in a Space README.md to link it from this page.

Collections including this paper0

No Collection including this paper

Add this paper to acollectionto link it from this page.

Urban-ImageNet: A Large-Scale Multi-Modal Dataset and Evaluation Framework for Urban Space Perception

Paper page - Urban-ImageNet: A Large-Scale Multi-Modal Dataset and Evaluation Framework for Urban Space Perception

Abstract

Models citing this paper0

Datasets citing this paper1

Spaces citing this paper0

Collections including this paper0

Similar Articles

WildCity: A Real-World City-Scale Testbed for Rendering, Simulation, and Spatial Intelligence

Benchmarking Composed Image Retrieval for Applied Earth Observation

Unification of Closed-Open Industrial Detection Scenarios: New Large-Scale Benchmarks,Challenges and Baselines

PixVerve: Advancing Native UHR Image Generation to 100MP with a Large-Scale High-Quality Dataset

@yanawei_: Thanks AK for featuring our work! More details: http://weiyana.github.io/PerceptionRubrics…

Submit Feedback

Similar Articles

WildCity: A Real-World City-Scale Testbed for Rendering, Simulation, and Spatial Intelligence

Benchmarking Composed Image Retrieval for Applied Earth Observation

Unification of Closed-Open Industrial Detection Scenarios: New Large-Scale Benchmarks,Challenges and Baselines

PixVerve: Advancing Native UHR Image Generation to 100MP with a Large-Scale High-Quality Dataset

@yanawei_: Thanks AK for featuring our work! More details: http://weiyana.github.io/PerceptionRubrics…