# Large language models perceive cities through a culturally uneven baseline
Source: [https://arxiv.org/html/2604.20048](https://arxiv.org/html/2604.20048)
Rong Zhao1,2,†,∗, Wanqi Liu1,†, Zhizhou Sha3, Nanxi Su2, Yecheng Zhang2,∗
1 University College London, London, UK; 2 Tsinghua University, Beijing, China; 3 The University of Texas at Austin, Austin, TX, USA
† Rong Zhao and Wanqi Liu contributed equally to this work.
∗ Correspondence to: rong.zhao.25@ucl.ac.uk; zhangyec23@mails.tsinghua.edu.cn
###### Abstract
Large language models \(LLMs\) are increasingly used to describe, evaluate and interpret places, yet it remains unclear whether they do so from a culturally neutral standpoint\. Here we test urban perception in frontier LLMs using a balanced global street\-view sample and prompts that either remain neutral or invoke different regional cultural standpoints\. Across open\-ended descriptions and structured place judgments, the neutral condition proved not to be neutral in practice\. Prompts associated with Europe and Northern America remained systematically closer to the baseline than many non\-Western prompts, indicating that model perception is organized around a culturally uneven reference frame rather than a universal one\. Cultural prompting also shifted affective evaluation, producing sentiment\-based ingroup preference for some prompted identities\. Comparisons with regional human text\-image benchmarks showed that culturally proximate prompting could improve alignment with human descriptions, but it did not recover human levels of semantic diversity and often preserved an affectively elevated style\. The same asymmetry reappeared in structured judgments of safety, beauty, wealth, liveliness, boredom and depression, where model outputs were interpretable but only partly reproduced human group differences\. These findings suggest that LLMs do not simply perceive cities from nowhere: they do so through a culturally uneven baseline that shapes what appears ordinary, familiar and positively valued\.
## Introduction
Large language models \(LLMs\) are increasingly used to make judgments that resemble human social perception[6](https://arxiv.org/html/2604.20048#bib.bib1);[5](https://arxiv.org/html/2604.20048#bib.bib3)\. Recent work shows that these models do not simply mirror a universal human viewpoint\. They reproduce stereotypes, compress group heterogeneity, and inherit culturally uneven representational baselines from their training data and alignment procedures[1](https://arxiv.org/html/2604.20048#bib.bib2);[22](https://arxiv.org/html/2604.20048#bib.bib7);[20](https://arxiv.org/html/2604.20048#bib.bib4);[39](https://arxiv.org/html/2604.20048#bib.bib50);[41](https://arxiv.org/html/2604.20048#bib.bib5);[29](https://arxiv.org/html/2604.20048#bib.bib6);[2](https://arxiv.org/html/2604.20048#bib.bib8);[13](https://arxiv.org/html/2604.20048#bib.bib11);[12](https://arxiv.org/html/2604.20048#bib.bib12)\. In that sense, large models are not only prediction systems but also cultural and social technologies whose outputs carry historically uneven reference frames into downstream use[9](https://arxiv.org/html/2604.20048#bib.bib9);[35](https://arxiv.org/html/2604.20048#bib.bib47)\. A parallel machine\-psychology literature further shows that LLMs can approximate a range of human judgments, beliefs and experimental regularities, but often with systematic distortions in calibration, social reasoning and perceived knowledge[11](https://arxiv.org/html/2604.20048#bib.bib49);[38](https://arxiv.org/html/2604.20048#bib.bib53);[37](https://arxiv.org/html/2604.20048#bib.bib51);[19](https://arxiv.org/html/2604.20048#bib.bib52)\. Those concerns become sharper when the task is to interpret place, because judgments about streets, neighborhoods and cities are deeply cultural\. They are shaped not only by visible form, but also by expectations about order, beauty, safety, familiarity and belonging[32](https://arxiv.org/html/2604.20048#bib.bib20);[25](https://arxiv.org/html/2604.20048#bib.bib21);[26](https://arxiv.org/html/2604.20048#bib.bib22);[18](https://arxiv.org/html/2604.20048#bib.bib23);[30](https://arxiv.org/html/2604.20048#bib.bib24)\.
This question matters beyond urban studies\. Cities concentrate social difference, symbolic meaning and everyday behavioral cues, making them a demanding test case for culturally situated urban perception by models\. A model that sounds fluent when describing an urban scene may still rely on an uneven reference frame, and prompt\-conditioned perspective taking should not be mistaken for genuine cultural standpoint\-taking[36](https://arxiv.org/html/2604.20048#bib.bib54)\. Recent work on “the city as text” argues that urban text corpora contain analytically important information about urban activities and organizations, and that newer computational tools have expanded how such material can be studied[31](https://arxiv.org/html/2604.20048#bib.bib44)\. Recent urban perception research has shown that large\-scale image\-based judgments can be measured from street\-view images and crowd annotations[34](https://arxiv.org/html/2604.20048#bib.bib39);[7](https://arxiv.org/html/2604.20048#bib.bib34);[42](https://arxiv.org/html/2604.20048#bib.bib40);[16](https://arxiv.org/html/2604.20048#bib.bib35)\. But the same literature also shows that perceptual categories and their visual correlates do not travel cleanly across cultural settings[30](https://arxiv.org/html/2604.20048#bib.bib24);[33](https://arxiv.org/html/2604.20048#bib.bib41)\. If that is true for human\-labeled perception models, it matters even more for LLMs that now produce free\-form place descriptions and evaluations, often while oversimplifying urban complexity or requiring explicit post\-hoc calibration to align with human preference judgments[17](https://arxiv.org/html/2604.20048#bib.bib19);[21](https://arxiv.org/html/2604.20048#bib.bib13);[43](https://arxiv.org/html/2604.20048#bib.bib14);[24](https://arxiv.org/html/2604.20048#bib.bib16);[4](https://arxiv.org/html/2604.20048#bib.bib17);[44](https://arxiv.org/html/2604.20048#bib.bib15)\.
Existing studies of cultural bias in LLMs and studies of urban perception with generative models leave a gap between them\. The first group is usually text\-based and detached from concrete spatial stimuli[20](https://arxiv.org/html/2604.20048#bib.bib4);[41](https://arxiv.org/html/2604.20048#bib.bib5);[29](https://arxiv.org/html/2604.20048#bib.bib6);[2](https://arxiv.org/html/2604.20048#bib.bib8);[12](https://arxiv.org/html/2604.20048#bib.bib12)\. The second group usually focuses on predictive accuracy, benchmark construction, calibration, or planning utility rather than on the cultural organization of the perceptual baseline itself[21](https://arxiv.org/html/2604.20048#bib.bib13);[24](https://arxiv.org/html/2604.20048#bib.bib16);[10](https://arxiv.org/html/2604.20048#bib.bib42);[44](https://arxiv.org/html/2604.20048#bib.bib15)\. Recent work has also argued that bias probing in LLMs needs stronger grounding in social\-science ideas of comparison, context and generalization[23](https://arxiv.org/html/2604.20048#bib.bib55)\. As a result, we still know little about how LLMs perceive cities in practice: what they treat as neutral, how cultural context changes their descriptions, and how closely those shifts map onto human judgments\.
Here we study how LLMs perceive cities, treating urban perception as a problem of cultural cognition rather than only urban image scoring. We combine a rebuilt global street-view pipeline with two complementary task formats. Study 1 examines *open perception*, in which models produce short free-text descriptions of the same street scene under a neutral prompt and seven meso-regional cultural prompts. This design reveals semantic distance to neutrality, clustering in perceptual space and prompt-conditioned ingroup preference. We then compare the global findings with regional human text-image pairs. Study 2 examines *structured perception*, in which the same global street-view image set is rated on six explicit dimensions: safe, lively, wealthy, beautiful, boring and depressing. We then compare those ratings with external human perception benchmarks, including a Place Pulse grounding analysis and an external qscore-based pairwise replication. Across both studies, cultural bias appears as a reproducible feature of how contemporary LLMs perceive cities.
## Results
Our analyses combine a rebuilt global image pipeline with two task formats\. We first assembled a larger merged street\-view corpus with explicit provider provenance and audit metadata, and then drew from it a scene\-balanced global analysis set of 3,000 images for the main figures\. Selection was stratified across coarse visual scene types, place type, country and provider so that the analysis set did not collapse onto a narrow subset of recurring streetscape forms\. Each selected image was then evaluated under eight prompt conditions, one neutral baseline and seven meso\-regional cultural contexts, by three LLMs, yielding 72,000 open\-text descriptions and 72,000 structured six\-dimension evaluations\. These data allow us to test whether cultural prompting changes LLM\-based urban perception, whether the neutral baseline is itself culturally uneven, and whether those effects remain visible against human benchmarks\.
### Study 1 \| Open\-ended perception reveals an uneven neutral baseline
We begin with open\-ended scene descriptions, which allow the models to decide which visual and social cues matter\. Across all three models, the neutral prompt did not behave as a culturally invariant semantic baseline \(Fig\.[1](https://arxiv.org/html/2604.20048#Sx2.F1)a,b\)\. Europe and Northern America \(ENA\) was consistently the closest prompted identity to neutrality, with a pooled mean cosine distance of 0\.137 and a mean rank of 1\.0 across models, whereas the most displaced conditions were Latin America and the Caribbean \(0\.189\) and Oceania \(0\.192\)\. The same ordering held across models, even though the overall spread varied: ENA ranged from 0\.103 to 0\.184, whereas the most distant prompt condition in each model ranged from 0\.138 to 0\.231\. The bootstrap gap distributions in Fig\.[1](https://arxiv.org/html/2604.20048#Sx2.F1)b further show that the semantic distance between ENA and the other prompted identities is predominantly positive, indicating that the neutral baseline lies systematically closer to some cultural standpoints than to others\.
This asymmetry was also evident in the geometry of semantic space\. In the local PCA projection, identity\-conditioned responses occupied distinct positions around the neutral prompt rather than clustering randomly \(Fig\.[1](https://arxiv.org/html/2604.20048#Sx2.F1)c\)\. Cultural prompting changed both the size and the direction of movement away from neutrality, so that the same street scene was reframed along structured identity\-specific trajectories rather than through unsystematic lexical variation\. Together, Fig\.[1](https://arxiv.org/html/2604.20048#Sx2.F1)a\-c show that open\-ended urban perception is organized around a culturally uneven semantic reference frame rather than a universal one\.
This semantic asymmetry was not confined to the seven\-way meso\-regional grouping used in the main analysis\. On an independent 100\-image robustness subset, Europe and Northern America remained the nearest condition to neutrality under coarser Macro5 prompts in all three models, while under finer Micro20 prompting the same tendency persisted most clearly for Northern America and Western or Northern Europe, albeit with greater heterogeneity across subregions \(Supplementary Figs\. S8 and S9\)\.
We next asked whether these semantic differences were accompanied by differences in affective evaluation. To do so, we computed a sentiment-based ingroup preference index (IPI), which compares how positively an identity-conditioned prompt evaluates scenes from its own region relative to how the same scenes are evaluated by other prompted identities. The 20-region maps in Fig. [1](https://arxiv.org/html/2604.20048#Sx2.F1)d show that IPI is unevenly distributed across space rather than uniformly present. The region-level estimates in Fig. [1](https://arxiv.org/html/2604.20048#Sx2.F1)e make the pattern clearer. Across models, the largest positive IPIs ranged from 0.146 to 0.276 and were concentrated in regions such as Northern Africa and Western Asia, Central and Southern Asia, and parts of Sub-Saharan Africa. The overall pattern was mixed rather than universal, with some region-model combinations weakly positive and others negative, but Claude Sonnet 4 showed the strongest and most widespread self-favouring tendency. Fig. [1](https://arxiv.org/html/2604.20048#Sx2.F1) shows two related but distinct forms of cultural bias in open-ended urban perception: a semantic bias in what the models treat as closest to neutral, and a sentiment-based bias in how positively some prompted identities evaluate scenes associated with their own region.
Changing the wording of the meso\-regional prompt did not remove the semantic pattern\. When we replaced the role\-playing formulation with a weaker context\-based prompt on the same robustness subset, the semantic ordering changed little, whereas affect\-based IPI proved more sensitive to prompt wording \(Supplementary Fig\. S10\)\.
Figure 1: Open-ended city perception shows a culturally uneven neutral baseline and region-linked ingroup preference. a, Mean cosine distance between each meso-regional prompt and the neutral prompt across the shared global image set for GPT-5.2, Claude Sonnet 4 and Gemini 2.5 Flash. b, Bootstrap distributions of the neutral-distance gap between Europe and Northern America (ENA) and the other regional prompts. Positive values indicate that ENA remains closer to the neutral baseline. c, Model-specific local centroid projections of the prompt conditions in semantic embedding space. d, World maps of the ingroup preference index (IPI) derived from sentiment scores for each model. e, Region-level IPI estimates with 95% bootstrap confidence intervals for each model.

We next asked whether culturally proximate prompting brought model outputs closer to human place descriptions in a benchmark built from Geograph Britain and Ireland human text-image pairs, drawn from a volunteer geographic photo platform. On the 1,000-image comparison set, semantic distance to the matched human text was lowest under prompts that were closer to the UK context, especially the UK prompt itself (Fig. [2](https://arxiv.org/html/2604.20048#Sx2.F2)a). Across models, the UK prompt reduced mean semantic distance from the neutral condition by 0.005 to 0.016 cosine-distance units, with UK-prompt distances tightly clustered between 0.518 and 0.525. The improvement was real, but modest.
That improvement did not restore human diversity\. Human descriptions were substantially more dispersed in semantic space than model outputs, with a mean distance to centroid of 0\.734, compared with only 0\.391–0\.418 for neutral model outputs \(Fig\.[2](https://arxiv.org/html/2604.20048#Sx2.F2)b\)\. The same compression appeared lexically: DISTINCT\-2 was 0\.686 for the human texts but only 0\.337–0\.392 for the models \(Fig\.[2](https://arxiv.org/html/2604.20048#Sx2.F2)c\)\. Models were also much more positive overall\. Human Geograph texts had a mean sentiment score of 0\.387, whereas the corresponding model outputs ranged from 0\.823 to 0\.975 \(Fig\.[2](https://arxiv.org/html/2604.20048#Sx2.F2)d\)\. The spatial sentiment\-gap map in Fig\.[2](https://arxiv.org/html/2604.20048#Sx2.F2)e shows that this difference is distributed across the benchmark geography rather than driven by a small number of outliers\. The models therefore describe places more uniformly and evaluate them more positively\.
The same tendency reappears in the nation-level IPI comparison based on the 270-image Geograph subset for England, Scotland and Wales. Human texts produced negative IPIs across all three nations (-0.400 for England, -0.256 for Scotland and -0.062 for Wales), whereas the model estimates were systematically more elevated, and in several cases positive (Fig. [2](https://arxiv.org/html/2604.20048#Sx2.F2)f). The corresponding LLM minus human gaps were positive throughout, ranging from 0.220 to 0.410 across nations and models (Fig. [2](https://arxiv.org/html/2604.20048#Sx2.F2)g). The upward displacement was visible in all three models and was most pronounced for Claude Sonnet 4. Together, Fig. [2](https://arxiv.org/html/2604.20048#Sx2.F2) shows that human alignment improves under culturally proximate prompting, but the models still produce place descriptions that are less diverse, more affectively positive and more strongly shifted in identity-linked evaluation than the human benchmark.
Figure 2:UK validation based on Geograph human text\-image pairs shows improved contextual alignment but persistent model compression\.a, Semantic distance between model outputs and the matched human text on the 1,000\-image Geograph benchmark under neutral, UK, regional\-neighbor and global meso\-regional prompts\. b, Within\-group distance to centroid for human texts and model outputs, showing lower dispersion in model\-generated language\. c, DISTINCT\-2 diversity scores for the human texts and each model\. d, Sentiment scores derived from the Geograph benchmark texts and the corresponding model outputs\. e, Spatial distribution of the neutral sentiment gap between model outputs and human texts across the UK benchmark set\. f, Nation\-level IPI estimates for England, Scotland and Wales on the 270\-image nation\-labeled Geograph subset\. g, LLM minus human IPI gaps for the same nation\-level comparison\.
### Study 2 \| Structured judgments reproduce the same cultural pattern
We next asked whether the same cultural structure remained visible when urban perception was measured through a standardized machine\-learning framework grounded in Place Pulse\-style judgments[34](https://arxiv.org/html/2604.20048#bib.bib39);[7](https://arxiv.org/html/2604.20048#bib.bib34)\. Applying this perception model to the global street\-view sample, we estimated how identity\-conditioned framing shifted inferred place evaluations relative to the neutral baseline across six dimensions: wealth, safety, beauty, depression, liveliness and boredom \(Fig\.[3](https://arxiv.org/html/2604.20048#Sx2.F3)a\)\. The resulting shifts were substantial and highly structured rather than random\. Across models, the largest positive changes ranged from \+0\.388 to \+0\.582 s\.d\. and were concentrated in perceived wealth and safety under non\-Western regional prompts, especially in Sub\-Saharan Africa; some negative shifts were also substantial, including a \-0\.401 s\.d\. change in boredom\. Identity framing can therefore induce large, dimension\-specific re\-evaluations of the same urban scenes\.
The regional distribution of these shifts showed a familiar asymmetry\. When the absolute magnitude of identity\-conditioned change was aggregated across dimensions, Europe and Northern America \(ENA\) consistently showed the smallest overall displacement from neutral in all three models \(Fig\.[3](https://arxiv.org/html/2604.20048#Sx2.F3)b\), ranging from 0\.194 to 0\.285 s\.d\. By contrast, Sub\-Saharan Africa produced the largest shifts in every model, ranging from 0\.354 to 0\.506 s\.d\. Other non\-Western identities also generated larger departures from neutrality, including Central and Southern Asia \(0\.289–0\.335 s\.d\.\) and Latin America and the Caribbean \(0\.297–0\.330 s\.d\.\)\. Thus, the asymmetry seen in the semantic analysis reappears in the structured perception pipeline: the neutral reference frame remains more closely aligned with Europe and Northern America than with most other prompted identities\.
The strength of the effect also depended on which aspect of urban perception was being measured. Across models, perceived wealth was the most identity-sensitive dimension, with mean absolute shifts of 0.141–0.215 s.d., followed by safety at 0.134–0.193 s.d. (Fig. [3](https://arxiv.org/html/2604.20048#Sx2.F3)c). Beauty was comparatively more stable at 0.066–0.106 s.d., whereas boredom ranged more widely from 0.044 to 0.155 s.d. This dimension-specific pattern suggests that cultural bias in LLM-based place perception is not uniform: some judgments, especially wealth and safety, are more readily reframed by identity cues than others. Together, Fig. [3](https://arxiv.org/html/2604.20048#Sx2.F3) shows that the pattern seen in open-text perception also appears in a structured, externally grounded measurement framework: cultural identity changes both how cities are described and how they are scored.
Figure 3: Structured six-dimension judgments recover the same culturally patterned shifts seen in open perception. a, Standardized shifts from the neutral condition for each regional prompt across six perception dimensions and three models. b, Mean absolute standardized shift from neutral by region, aggregated across dimensions for each model. c, Mean absolute standardized shift from neutral by dimension, aggregated across regional prompts for each model.

Finally, we compared the LLM-based structured judgments with an existing machine-learning perception model trained on human urban-perception data, and then asked whether the models could reproduce human differences across social categories. Across all six dimensions, LLM and ML scores were positively related, but the strength of agreement varied by perceptual attribute (Fig. [4](https://arxiv.org/html/2604.20048#Sx2.F4)a). The closest alignment was observed for beauty, with Spearman correlations of 0.593–0.637, followed by wealth at 0.503–0.541. Agreement was weaker for safety and depression, and weakest for boredom at 0.283–0.325. Thus, the two systems did not produce identical scores, but they were not unrelated either: LLM-based judgments retained a measurable correspondence with the ML perception model, especially on visually salient evaluative dimensions such as beauty and wealth.
We then asked a more demanding question: whether the models could reproduce human differences across social categories rather than simply match aggregate perception scores\. Using pairwise comparisons derived from human subgroup contrasts, we tested gender, age and country contrasts and measured whether model orderings diverged from, or strictly replicated, the human ordering \(Fig\.[4](https://arxiv.org/html/2604.20048#Sx2.F4)b\)\. Performance differed sharply by category\. Gender and age contrasts were comparatively easier for the models to approximate, with divergence rates ranging from 0\.022 to 0\.111, whereas country contrasts were substantially harder, with divergence rates rising to 0\.230–0\.400\. Strict replication was limited overall: the highest value was 0\.074, while country\-based comparisons fell to 0 or near 0 across models\.
This selectivity becomes clearer when the subgroup comparisons are stratified by the size of the human difference (Fig. [4](https://arxiv.org/html/2604.20048#Sx2.F4)c). For gender and age, larger human margins sometimes improved replication, yielding modest gains in strict replication in the high-margin condition. By contrast, country-based contrasts remained difficult even when the human difference was large: divergence stayed high in the high-margin bin at 0.276–0.467, with strict replication effectively absent. Fig. [4](https://arxiv.org/html/2604.20048#Sx2.F4) shows two limits of LLM-based urban perception. First, model scores only partially align with an existing ML perception model across the six dimensions. Second, the models do not reliably reproduce the structure of human group differences, especially when those differences are organized by country rather than by gender or age.
Figure 4:External grounding and pairwise replication show that structured scores are interpretable but only partially human\-like\.a, Dimension\-wise relationships between neutral model scores and Place Pulse baseline scores for wealthy, safe, beautiful, depressing, lively and boring judgments across the three models\. b, Overall divergence rates and strict replication rates in the external qscore\-based pairwise replication for gender, age and country contrasts\. c, Divergence and strict replication rates stratified by human qscore margin bins, showing how replication changes as the human subgroup difference becomes larger\.
## Discussion
Across open\-ended and structured tasks, across three models, and across both global and human\-benchmark comparisons, one result recurs: LLMs perceive cities through a culturally uneven baseline\. The central point is that the neutral condition is itself uneven\. Some regional standpoints, especially Europe and Northern America, remain systematically closer to the neutral baseline than others\. Neutrality is not an empty point of departure; it is culturally located\.
This changes how bias should be understood in urban LLM applications\. Much current work treats bias as a distortion layered on top of an otherwise stable baseline[1](https://arxiv.org/html/2604.20048#bib.bib2);[22](https://arxiv.org/html/2604.20048#bib.bib7);[20](https://arxiv.org/html/2604.20048#bib.bib4);[41](https://arxiv.org/html/2604.20048#bib.bib5);[12](https://arxiv.org/html/2604.20048#bib.bib12)\. Our results instead suggest that the baseline itself may already privilege some urban reference frames over others\. This is consistent with the view that large AI models are cultural and social technologies rather than neutral inference engines[9](https://arxiv.org/html/2604.20048#bib.bib9)\. The same asymmetry appears in two registers\. In semantic space, prompted identities move by different distances and along different directions\. In affective space, some prompted identities evaluate scenes from their own region more positively than other identities do\. The pattern is not generic prompt sensitivity; it is a culturally patterned perceptual baseline\.
The human benchmark clarifies the limit of this apparent contextual sensitivity\. Culturally proximate prompting improves alignment with human place descriptions, but the models remain substantially more compressed than the human texts\. Human descriptions are more heterogeneous, less templated and less affectively elevated\. This is consistent with recent evidence that LLMs flatten the internal diversity of the groups they attempt to represent[41](https://arxiv.org/html/2604.20048#bib.bib5), and with recent urban evaluations showing that generative models can reproduce broad urban patterns while oversimplifying complexity[43](https://arxiv.org/html/2604.20048#bib.bib14)\. In practical terms, prompting a model to sound more local does not make it equivalent to lived local perception\.
Study 2 shows that this conclusion does not depend on free\-text generation\. When the task is reduced to explicit scores on safety, beauty, wealth, liveliness, boredom and depression, regional prompt effects remain visible\. The comparison with an existing machine\-learning perception model shows that the structured scores are interpretable and retain contact with established human perception axes\. That is consistent with recent efforts to post\-hoc calibrate urban VLM judgments to human preference benchmarks[44](https://arxiv.org/html/2604.20048#bib.bib15)\. But interpretability is not the same as human equivalence\. The external pairwise replication shows that the models only partially recover human group differences, with performance varying sharply by subgroup axis and by the strength of the human signal\. Their behavior is systematic, but incomplete\.
These findings matter for urban analytics, planning support and AI\-assisted design[10](https://arxiv.org/html/2604.20048#bib.bib42);[21](https://arxiv.org/html/2604.20048#bib.bib13);[4](https://arxiv.org/html/2604.20048#bib.bib17);[14](https://arxiv.org/html/2604.20048#bib.bib18)\. Generative AI is now being discussed explicitly within urban analytics and design practice, which makes the question of culturally uneven place interpretation more than a narrow benchmark issue[3](https://arxiv.org/html/2604.20048#bib.bib38)\. If models are deployed as apparently general\-purpose interpreters of place, culturally partial outputs may be mistaken for neutral descriptions\. That risk is greatest in cross\-cultural applications, where the distance between deployment context and latent reference frame is largest\. The larger concern is patterned misreading that privileges some urban imaginaries over others while presenting that preference as neutral\.
Several limitations remain\. We study three contemporary models, but the model landscape will continue to change\. We focus on street\-view imagery rather than the full multisensory and social experience of urban life\. Prompted regional identities are stylized proxies rather than real social actors\. The human benchmarks, although stronger than model\-only evaluation, also remain geographically uneven\. Even with those limits, the central conclusion is robust\. LLMs do not perceive cities from a view from nowhere\. They interpret them through culturally organized priors, and those priors remain visible across task formats, benchmarks and models\.
## Methods
### Global image corpus and analysis sample selection
We assembled a merged street\-view corpus from multiple providers while retaining explicit source provenance, image dates and copyright metadata\. Candidate urban locations were identified through spatial anchor layers and related discovery procedures, but these locations were used only for discovery rather than as final analytical units\. For Google Street View imagery, panorama availability was first checked through the metadata service of the Street View Static API and images were then retrieved through the corresponding image endpoint\. Equivalent provenance information was retained for all sources, including Baidu Panorama where required, so that image origin remained auditable throughout the workflow\. The 3,000\-scene analytical sample used in the main figures was drawn directly from this larger merged corpus rather than from the anchor layer itself\.
From the merged corpus we selected a scene-balanced global street-view image set of 3,000 scenes. Before sampling, images were embedded to derive coarse visual clusters, and we ran a soft indoor-screening audit to reduce obvious non-street scenes. Sampling quotas were then set across visual cluster, place type, country and provider to prevent the main analysis from being dominated by a small number of recurrent scene types or a narrow geographic subset. Sampling also followed an 8:2 split between locations classified by the GHS framework as urban centre or town settings, preserving both geographic and visual diversity[28](https://arxiv.org/html/2604.20048#bib.bib60);[8](https://arxiv.org/html/2604.20048#bib.bib62). Regional coverage was tracked with a UN-based geographic hierarchy implemented on Natural Earth Admin-0 country polygons[27](https://arxiv.org/html/2604.20048#bib.bib58);[40](https://arxiv.org/html/2604.20048#bib.bib57). In the country table used for sampling, these three levels correspond to *Macro/UN Region*, *Micro/UN Subregion* and *Meso/UN SDG*. The prompt families used in the analyses follow that hierarchy with two explicit harmonizations for consistency with the global experiment: the Micro20 robustness family merges Melanesia, Micronesia and Polynesia into *Pacific islands*, and the main Meso7 family merges *Australia and New Zealand* with *Oceania*. The spatial distribution and region-level counts of the final analysis set are shown in Supplementary Fig. S1, and the mapping used in the manuscript is summarized in Supplementary Table S4. A separate 500-image subset was reserved for robustness checks and was not used in the reported figures.
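For illustration, the quota logic described above can be sketched as a stratified draw over a metadata table. This is a minimal sketch under assumed column names (visual_cluster, place_type, country, provider), not the authors' actual sampling code, and it omits the GHS urban-centre/town split and the screening audits.

```python
import pandas as pd

def draw_balanced_sample(corpus: pd.DataFrame, n_target: int = 3000, seed: int = 0,
                         strata=("visual_cluster", "place_type", "country", "provider")):
    """Draw a roughly equal quota from every stratum, then top up at random.

    `corpus` is assumed to hold one row per candidate image with the strata columns.
    """
    groups = corpus.groupby(list(strata), dropna=False, group_keys=False)
    per_group = max(1, n_target // groups.ngroups)           # naive equal quota per stratum
    sampled = groups.apply(lambda g: g.sample(n=min(len(g), per_group), random_state=seed))
    if len(sampled) < n_target:                               # quotas under-filled: top up
        remainder = corpus.loc[~corpus.index.isin(sampled.index)]
        extra = remainder.sample(n=min(len(remainder), n_target - len(sampled)),
                                 random_state=seed)
        sampled = pd.concat([sampled, extra])
    return sampled.sample(frac=1.0, random_state=seed)        # shuffle row order
```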
### Open\-perception prompting and three\-model generation
For Study 1, each image in the global street\-view image set was evaluated under eight prompt conditions: one neutral condition and seven meso\-regional cultural contexts \(Europe and Northern America, Central and Southern Asia, Northern Africa and Western Asia, Eastern and South\-eastern Asia, Sub\-Saharan Africa, Latin America and the Caribbean, and Oceania\)\. These meso\-regional identities follow the UN SDG regional grouping used in SDG reporting, except that Australia and New Zealand and Oceania were merged into a single Oceania prompt for the main experiment \(Supplementary Table S4\)\. Prompts asked the model to produce a short English paragraph grounded in visible scene evidence\. The three\-model comparison used GPT\-5\.2, Claude Sonnet 4 and Gemini 2\.5 Flash\. Because all 3,000 images were evaluated under all eight conditions for all three models, the open\-perception analysis comprises 72,000 model\-generated texts\.
### Semantic embedding, neutral\-distance, and ingroup\-preference analysis
Open-text outputs were embedded with the fixed sentence encoder all-mpnet-base-v2. Let $t_{imc}$ denote the text generated for image $i$ by model $m$ under prompt condition $c$, and let $e_{imc}=f_{\text{embed}}(t_{imc})$ be its embedding. For every non-neutral condition, we measured semantic displacement from the matched neutral response for the same image:

$$d_{imc}=1-\cos\!\left(e_{im,\mathrm{NEU}},e_{imc}\right),$$

where larger values indicate greater prompt-induced semantic movement. Region-level neutral distance was then summarized as

$$D_{mc}=\frac{1}{N}\sum_{i=1}^{N}d_{imc}.$$

To visualize the geometry of prompt conditions, we computed condition centroids in embedding space and projected them with local low-dimensional ordination. We also quantified within-condition dispersion relative to the centroid,

$$\delta_{mc}=\frac{1}{N}\sum_{i=1}^{N}\left(1-\cos\!\left(e_{imc},\bar{e}_{mc}\right)\right),$$

where $\bar{e}_{mc}$ is the centroid embedding for model $m$ and condition $c$.
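As a concrete illustration, the neutral-distance and dispersion summaries can be computed as below. The sketch assumes a dictionary `texts` mapping each prompt condition (key "NEU" for the neutral prompt) to a list of generated descriptions aligned by image index; that data layout is an assumption for exposition, although the encoder is the all-mpnet-base-v2 model named above.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-mpnet-base-v2")  # fixed sentence encoder

def cosine_distance(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return 1.0 - np.sum(a * b, axis=-1)

def neutral_distance(texts: dict, condition: str) -> float:
    """Mean semantic displacement D_mc of one prompt condition from the neutral prompt."""
    e_neu = encoder.encode(texts["NEU"])       # embeddings of neutral descriptions
    e_cond = encoder.encode(texts[condition])  # embeddings under the regional prompt
    return float(cosine_distance(e_neu, e_cond).mean())

def dispersion(texts: dict, condition: str) -> float:
    """Within-condition dispersion delta_mc around the condition centroid."""
    e = encoder.encode(texts[condition])
    centroid = e.mean(axis=0, keepdims=True)
    return float(cosine_distance(e, centroid).mean())
```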
Sentiment was estimated from the same texts with SiEBERT (sentiment-roberta-large-english). For each text, the classifier returns positive and negative posterior probabilities, denoted $p^{(+)}_{imc}$ and $p^{(-)}_{imc}$. We define the sentiment score as

$$s_{imc}=p^{(+)}_{imc}-p^{(-)}_{imc},$$

so that higher values indicate more positive place evaluation. Region-level ingroup preference indices (IPIs) were then computed by comparing the sentiment assigned by a region-matched prompt to scenes from that region against the mean sentiment assigned by non-matching prompts to the same scenes:

$$\mathrm{IPI}_{mr}=\frac{\mu^{\mathrm{self}}_{mr}-\mu^{\mathrm{other}}_{mr}}{\sigma_{mr}},$$

where $\mu^{\mathrm{self}}_{mr}$ is the mean sentiment for images from region $r$ under the matching regional prompt, $\mu^{\mathrm{other}}_{mr}$ is the corresponding mean under non-matching prompts, and $\sigma_{mr}$ is the within-region standard deviation of sentiment scores.
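The IPI can then be computed from a per-text sentiment table. The following sketch assumes a pandas DataFrame with one row per (image, prompt region) and hypothetical columns image_region, prompt_region and sentiment (the SiEBERT positive-minus-negative score); it is illustrative rather than the authors' code.

```python
import pandas as pd

def sentiment_score(p_pos: float, p_neg: float) -> float:
    """SiEBERT-style score s_imc in [-1, 1]: positive minus negative posterior."""
    return p_pos - p_neg

def ingroup_preference_index(df: pd.DataFrame, region: str) -> float:
    """IPI_mr for one model and one region, following the formula above."""
    own = df[df["image_region"] == region]                        # scenes from region r
    s_self = own.loc[own["prompt_region"] == region, "sentiment"]
    s_other = own.loc[own["prompt_region"] != region, "sentiment"]
    sigma = own["sentiment"].std(ddof=1)                          # within-region s.d.
    return float((s_self.mean() - s_other.mean()) / sigma)
```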
### Geograph\-based UK benchmark construction
All UK analyses were based on Geograph human text\-image pairs\. We constructed a visually grounded benchmark subset by retaining entries whose accompanying human text described the visible scene with sufficient specificity for comparison to model outputs, while also capping repeated contributions from the same author to reduce dominance by a small number of prolific contributors\. This yielded a 1,000\-image Geograph benchmark subset for open\-text comparison\. From the same source, we then derived a separate 270\-image subset with nation labels for England, Scotland and Wales, allowing nation\-level comparison without introducing a second human data source\.
### Open\-text comparison with Geograph human texts
On the 1,000-image Geograph benchmark, we compared model outputs to the matched human description embedded in the same semantic space. If $h_{i}$ denotes the human text paired with image $i$, with embedding $e^{\mathrm{hum}}_{i}$, then the model-to-human semantic distance is

$$h_{imc}=1-\cos\!\left(e^{\mathrm{hum}}_{i},e_{imc}\right).$$

For each model and condition, we summarize the benchmark by the mean human distance $\bar{h}_{mc}$ and by the improvement relative to neutral,

$$\Delta^{\mathrm{hum}}_{mc}=\bar{h}_{mc}-\bar{h}_{m,\mathrm{NEU}},$$

where negative values indicate better alignment with the human text than the neutral prompt. The UK comparison combines a core set of culturally proximate prompts (neutral, UK, Western Europe, and Europe and Northern America) with additional meso-regional prompts used to place the UK benchmark back into the broader global prompt frame. We report four complementary summaries: semantic distance from the matched human text, within-group distance to centroid, DISTINCT-2 lexical diversity, and sentiment score. These measures test both contextual proximity to human language and the extent to which model outputs recover human heterogeneity.
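Of these summaries, DISTINCT-2 is the only one not defined by a formula above: it is the share of unique bigrams among all bigrams in a set of texts. The sketch below shows one standard way to compute it; the simple regex tokenizer is our assumption, not necessarily the tokenization used by the authors.

```python
import re

def distinct_2(texts: list[str]) -> float:
    """DISTINCT-2: unique bigrams divided by total bigrams across a set of texts."""
    total, unique = 0, set()
    for t in texts:
        tokens = re.findall(r"\w+", t.lower())    # crude word tokenization
        bigrams = list(zip(tokens, tokens[1:]))
        total += len(bigrams)
        unique.update(bigrams)
    return len(unique) / total if total else 0.0
```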
### Nation\-level IPI comparison on Geograph subsets
For the 270\-image nation\-labeled Geograph subset, we prompted the models under England, Scotland and Wales conditions and compared the resulting nation\-level IPIs to the human baseline derived from the Geograph texts\. The nation\-level calculation follows the same formula as above, with nations in place of meso\-regions\. The purpose of this comparison was not to re\-estimate the full UK benchmark, but to test whether the models preserved within\-UK differentiation rather than collapsing the country into a single perceptual identity\.
### Structured six\-dimension prompting
For Study 2, the same global street-view image set was evaluated again under the same eight prompt conditions, but the response format was constrained to six explicit dimensions: safe, lively, wealthy, beautiful, boring, and depressing. Outputs were returned as structured scores on a common 0–100 scale. Let $y_{imck}$ denote the score for image $i$, model $m$, condition $c$, and dimension $k$. We first compute the within-image shift from the neutral prompt,

$$\Delta y_{imck}=y_{imck}-y_{im,\mathrm{NEU},k},$$

and then summarize each region-by-dimension effect as a standardized mean shift,

$$z_{mck}=\frac{\frac{1}{N}\sum_{i=1}^{N}\Delta y_{imck}}{\mathrm{sd}_{i}\!\left(y_{im,\mathrm{NEU},k}\right)}.$$

These standardized shifts are the values plotted in Fig. [3](https://arxiv.org/html/2604.20048#Sx2.F3)a. We also aggregate $\lvert z_{mck}\rvert$ by region and by perceptual dimension to quantify which prompts and which judgments are most sensitive to cultural reframing.
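In code, the standardized shift reduces to a few lines. The sketch assumes `scores` maps each prompt condition to a NumPy array of 0–100 ratings for one model and one dimension, aligned by image; the layout is illustrative.

```python
import numpy as np

def standardized_shift(scores: dict, condition: str) -> float:
    """z_mck: mean within-image shift from neutral, scaled by the neutral-score s.d."""
    delta = scores[condition] - scores["NEU"]   # within-image shifts
    sd_neutral = scores["NEU"].std(ddof=1)      # spread of neutral scores across images
    return float(delta.mean() / sd_neutral)
```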
### Place Pulse grounding analysis
To test whether the neutral structured scores remained connected to established urban perception benchmarks, we compared them against baseline scores produced by an open-source human-perception model trained on MIT Place Pulse 2.0 and released with the Global Streetscapes project[34](https://arxiv.org/html/2604.20048#bib.bib39);[7](https://arxiv.org/html/2604.20048#bib.bib34);[15](https://arxiv.org/html/2604.20048#bib.bib10). We used the six dimension-specific checkpoints for safety, lively, wealthy, beautiful, boring and depressing judgments. The released model uses a ViT-B/16 backbone with a task-specific multilayer classification head, yielding one baseline score $p_{ik}$ for image $i$ and perceptual dimension $k$. For each dimension $k$ and model $m$, we estimated the association between the neutral LLM score and the corresponding Place Pulse baseline score $p_{ik}$ through both a Spearman correlation

$$\rho_{mk}=\mathrm{corr}_{S}\!\left(y_{im,\mathrm{NEU},k},p_{ik}\right)$$

and a fitted linear slope from

$$y_{im,\mathrm{NEU},k}=\alpha_{mk}+\beta_{mk}p_{ik}+\varepsilon_{imk}.$$

We report both $\rho_{mk}$ and $\beta_{mk}$ because the first captures rank-order agreement and the second captures directional scaling.
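Both grounding statistics are standard and can be obtained with SciPy; the sketch below assumes two aligned arrays of per-image scores and is illustrative only.

```python
import numpy as np
from scipy import stats

def grounding_summary(llm_neutral: np.ndarray, place_pulse: np.ndarray) -> dict:
    """Rank-order agreement (Spearman rho_mk) and directional scaling (slope beta_mk)."""
    rho, _ = stats.spearmanr(llm_neutral, place_pulse)
    slope, intercept, r, p, stderr = stats.linregress(place_pulse, llm_neutral)
    return {"spearman_rho": float(rho), "slope": float(slope)}
```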
### External pairwise replication design
The final analysis module was designed as an external replication exercise based on human qscore summaries from an external street-scene perception benchmark. We constructed pairwise street-scene comparisons for three subgroup axes (gender, age and country), with 90 pairs per axis and 15 pairs per perceptual dimension. Each pair carried a human ordering derived from the corresponding qscore summary. Let $\Delta^{\mathrm{hum}}_{pgk}$ denote the human qscore difference for pair $p$, subgroup axis $g$, and dimension $k$, and let $\Delta^{\mathrm{LLM}}_{pgmk}$ denote the corresponding score difference produced by model $m$. We then coded two binary outcomes. Divergence was coded as

$$\mathrm{Div}_{pgmk}=\begin{cases}1,&\text{if }\operatorname{sign}\left(\Delta^{\mathrm{LLM}}_{pgmk}\right)\neq\operatorname{sign}\left(\Delta^{\mathrm{hum}}_{pgk}\right)\\ 0,&\text{otherwise,}\end{cases}$$

so that a value of 1 means that the model reversed the human ordering. Strict replication was coded as

$$\mathrm{Rep}_{pgmk}=\begin{cases}1,&\text{if }\operatorname{sign}\left(\Delta^{\mathrm{LLM}}_{pgmk}\right)=\operatorname{sign}\left(\Delta^{\mathrm{hum}}_{pgk}\right)\ \text{and}\ \Delta^{\mathrm{LLM}}_{pgmk}\neq 0\\ 0,&\text{otherwise,}\end{cases}$$

so that a value of 1 means that the model reproduced the direction of the human subgroup difference. We report both overall divergence and strict replication rates, and we further stratify these outcomes by the size of the human subgroup margin (low, mid and high bins) to test whether replication improves when the human signal is stronger.
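The two outcomes are simple sign comparisons. A minimal sketch, assuming aligned arrays of pairwise score differences for one axis and dimension, is:

```python
import numpy as np

def divergence_rate(delta_llm: np.ndarray, delta_human: np.ndarray) -> float:
    """Share of pairs where the model fails to match the human ordering."""
    return float(np.mean(np.sign(delta_llm) != np.sign(delta_human)))

def strict_replication_rate(delta_llm: np.ndarray, delta_human: np.ndarray) -> float:
    """Share of pairs where the model reproduces the human direction with a non-zero margin."""
    same_sign = np.sign(delta_llm) == np.sign(delta_human)
    return float(np.mean(same_sign & (delta_llm != 0)))
```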
### Bootstrap and uncertainty estimation
All point estimates in the reported figures were paired with nonparametric 95% bootstrap confidence intervals, taken as the 2\.5th and 97\.5th percentiles of the bootstrap distribution\. For image\-based analyses, bootstrap resampling was performed at the image level while preserving the matched condition structure attached to each image\. For the external pairwise replication, resampling was performed at the pair level within each axis and dimension\. The same bootstrap framework was used for semantic distances, centroid dispersion, human\-alignment deltas, IPIs, structured\-score shifts, and external\-benchmark summaries\.
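A minimal sketch of the image-level percentile bootstrap, assuming a vector of per-image values for one statistic (so that each image's matched conditions stay together when that image is resampled):

```python
import numpy as np

def bootstrap_ci(per_image_values, n_boot: int = 2000, seed: int = 0):
    """Nonparametric 95% CI: resample images with replacement, take the 2.5/97.5 percentiles."""
    values = np.asarray(per_image_values)
    rng = np.random.default_rng(seed)
    boot_means = np.array([
        rng.choice(values, size=values.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    return np.percentile(boot_means, [2.5, 97.5])
```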
## Data availability
Derived source data for the main figures and curated supplementary analyses are available at[https://github\.com/jameslemon2002/aibias](https://github.com/jameslemon2002/aibias)\. Street\-view imagery and other provider\-restricted visual assets are not redistributed in the public repository\.
## Code availability
Public scripts used to regenerate the released supplementary figures, robustness figures and prompt templates are available at[https://github\.com/jameslemon2002/aibias](https://github.com/jameslemon2002/aibias)\. The repository excludes manuscript source files, API credentials, private request headers and local intermediate caches tied to licensed or non\-redistributable inputs\.
## References
- A. Abid, M. Farooqi, and J. Zou (2021). Large language models associate Muslims with violence. Nature Machine Intelligence 3, 461–463. https://dx.doi.org/10.1038/s42256-021-00359-2
- A. F. Ashery, L. M. Aiello, and A. Baronchelli (2025). Emergent social conventions and collective bias in LLM populations. Science Advances 11(20), eadu9368. https://dx.doi.org/10.1126/sciadv.adu9368
- M. Batty (2025). Generative AI. Environment and Planning B: Urban Analytics and City Science 52(5), 1031–1034. https://dx.doi.org/10.1177/23998083251332093
- C. Beneduce, B. Lepri, and M. Luca (2025). Urban safety perception through the lens of large multimodal models: a persona-based approach. arXiv preprint arXiv:2503.00610.
- M. Binz et al. (2025). A foundation model to predict and capture human cognition. Nature 644, 1002–1009. https://dx.doi.org/10.1038/s41586-025-09215-4
- Z. Cui, N. Li, and H. Zhou (2025). A large-scale replication of scenario-based experiments in psychology and management using large language models. Nature Computational Science. https://dx.doi.org/10.1038/s43588-025-00840-7
- A. Dubey, N. Naik, D. Parikh, R. Raskar, and C. A. Hidalgo (2016). Deep learning the city: quantifying urban perception at a global scale. In Computer Vision – ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling (Eds.), Cham, pp. 196–212.
- European Commission, Joint Research Centre (2023). GHS-SMOD R2023A: GHS settlement layers, application of the Degree of Urbanisation methodology (stage I) to GHS-POP R2023A and GHS-BUILT-S R2023A, multitemporal (1975–2030). Technical report, European Commission, Joint Research Centre, Ispra.
- H. Farrell, A. Gopnik, C. Shalizi, and J. Evans (2025). Large AI models are cultural and social technologies. Science 387, 1153–1156. https://dx.doi.org/10.1126/science.adt9819
- X. Fu, C. Li, S. J. Quan, T. Yigitcanlar, and D. Wasserman (2025). Large language models in urban planning. Nature Cities 2, 585–592. https://dx.doi.org/10.1038/s44284-025-00261-7
- T. Golan, M. Siegelman, N. Kriegeskorte, and C. Baldassano (2023). Testing the limits of natural language models for predicting human language judgements. Nature Machine Intelligence 5, 952–964. https://dx.doi.org/10.1038/s42256-023-00718-1
- D. Guilbeault, S. Delecourt, and B. S. Desikan (2025). Age and gender distortion in online media and large language models. Nature. https://dx.doi.org/10.1038/s41586-025-09581-z
- V. Hofmann, P. R. Kalluri, D. Jurafsky, and S. King (2024). AI generates covertly racist decisions about people based on their dialect. Nature 633, 147–154. https://dx.doi.org/10.1038/s41586-024-07856-5
- C. Hou et al. (2025). Transferred bias uncovers the balance between the development of physical and socioeconomic environments of cities. Annals of the American Association of Geographers 115, 148–166. https://dx.doi.org/10.1080/24694452.2024.2412173
- Y. Hou, M. Quintana, M. Khomiakov, W. Yap, J. Ouyang, K. Ito, Z. Wang, T. Zhao, and F. Biljecki (2024). Global Streetscapes – a comprehensive dataset of 10 million street-level images across 688 cities for urban science and analytics. ISPRS Journal of Photogrammetry and Remote Sensing 215, 216–238. https://dx.doi.org/10.1016/j.isprsjprs.2024.06.023
- K. Ito, Y. Kang, Y. Zhang, F. Zhang, and F. Biljecki (2024). Understanding urban perception with visual data: a systematic review. Cities 152, 105169.
- K. M. Jang et al. (2024). Place identity: a generative AI's perspective. Humanities and Social Sciences Communications 11, 1156. https://dx.doi.org/10.1057/s41599-024-03645-7
- R. Kaplan and E. J. Herbert (1987). Cultural and sub-cultural comparisons in preferences for natural settings. Landscape and Urban Planning 14, 281–293.
- B. Lee, R. Aiyappa, Y. Ahn, H. Kwak, and J. An (2025). A semantic embedding space based on large language models for modelling human beliefs. Nature Human Behaviour 9, 1928–1940. https://dx.doi.org/10.1038/s41562-025-02228-z
- J. G. Lu, L. L. Song, and L. T. Zhang (2025). Cultural tendencies in generative AI. Nature Human Behaviour 9, 2360–2369. https://dx.doi.org/10.1038/s41562-025-02242-1
- M. Malekzadeh, E. Willberg, J. Torkko, and T. Toivonen (2025). Urban attractiveness according to ChatGPT: contrasting AI and human insights. Computers, Environment and Urban Systems 117, 102243.
- R. Manvi, S. Khanna, M. Burke, D. Lobell, and S. Ermon (2024). Large language models are geographically biased. arXiv preprint arXiv:2402.02680.
- K. N. Morehouse, S. Swaroop, and W. Pan (2025). Rethinking LLM bias probing using lessons from the social sciences. arXiv preprint arXiv:2503.00093. https://dx.doi.org/10.48550/arXiv.2503.00093
- R. Mushkani (2025). Do vision-language models see urban scenes as people do? An urban perception benchmark. arXiv preprint arXiv:2509.14574.
- J. L. Nasar (1988). Visual preferences in urban street scenes: a cross-cultural comparison between Japan and the United States. In Environmental Aesthetics, J. L. Nasar (Ed.), pp. 260–274. https://dx.doi.org/10.1017/CBO9780511571213.025
- J. L. Nasar (1990). The evaluative image of the city. Journal of the American Planning Association 56, 41–53.
- Natural Earth (2026). Admin 0 – Countries. Natural Earth vector map data. Accessed 20 April 2026. https://www.naturalearthdata.com/downloads/10m-cultural-vectors/10m-admin-0-countries/
- OECD, European Commission, Food and Agriculture Organization of the United Nations, International Labour Organization, United Nations Human Settlements Programme, and World Bank (2021). Applying the Degree of Urbanisation: a methodological manual to define cities, towns and rural areas for international comparisons. OECD Publishing, Paris. https://dx.doi.org/10.1787/b01f92f4-en
- R. Qadri, A. M. Davani, K. Robinson, and V. Prabhakaran (2025). Risks of cultural erasure in large language models. arXiv preprint arXiv:2501.01056.
- M. Quintana, Y. Gu, X. Liang, Y. Hou, K. Ito, Y. Zhu, M. Abdelrahman, and F. Biljecki (2025). Global urban visual perception varies across demographics and personalities. Nature Cities 2, 1092–1106. https://dx.doi.org/10.1038/s44284-025-00330-x
- J. Reades, Y. Hu, E. Tranos, and E. Delmelle (2025). The city as text. Nature Cities 2, 794–800. https://dx.doi.org/10.1038/s44284-025-00314-x
- E. C. Relph (1976). Place and Placelessness. Pion.
- J. Rui and C. Cai (2025). Plausible or misleading? Evaluating the adaption of the Place Pulse 2.0 dataset for predicting subjective perception in Chinese urban landscapes. Habitat International 157, 103333.
- P. Salesses, K. Schechtner, and C. A. Hidalgo (2013). The collaborative image of the city: mapping the inequality of urban perception. PLoS ONE 8, e68400.
- G. Savcisens (2025). Large language models act as if they are part of a group. Nature Computational Science 5, 9–10.
- M. Shanahan, K. McDonell, and L. Reynolds (2023). Role play with large language models. Nature 623, 493–498. https://dx.doi.org/10.1038/s41586-023-06647-8
- M. Steyvers, H. Tejeda, A. Kumar, C. Belem, S. Karny, X. Hu, L. W. Mayer, and P. Smyth (2025). What large language models know and what people think they know. Nature Machine Intelligence 7, 221–231. https://dx.doi.org/10.1038/s42256-024-00976-7
- J. W. A. Strachan, D. Albergo, G. Borghini, O. Pansardi, E. Scaliti, S. Gupta, K. Saxena, A. Rufo, S. Panzeri, G. Manzi, and M. S. A. Graziano (2024). Testing theory of mind in large language models and humans. Nature Human Behaviour 8, 1285–1295. https://dx.doi.org/10.1038/s41562-024-01882-z
- Y. Tao, O. Viberg, R. S. Baker, and R. F. Kizilcec (2024). Cultural bias and cultural alignment of large language models. PNAS Nexus 3(9), pgae346. https://dx.doi.org/10.1093/pnasnexus/pgae346
- United Nations Statistics Division (2026). Standard country or area codes for statistical use (M49). Accessed 20 April 2026. https://unstats.un.org/unsd/methodology/m49/
- A. Wang, J. Morgenstern, and J. P. Dickerson (2025). Large language models that replace human participants can harmfully misportray and flatten identity groups. Nature Machine Intelligence 7, 400–411. https://dx.doi.org/10.1038/s42256-025-00986-z
- F. Zhang et al. (2018). Measuring human perceptions of a large-scale urban region using machine learning. Landscape and Urban Planning 180, 148–160.
- Y. Zhang, R. Zhao, Z. Huang, X. Wang, Y. Ma, and Y. Long (2025). GenAI models capture urban science but oversimplify complexity. arXiv preprint arXiv:2505.13803.
- Y. Zhang, R. Zhao, Z. Sha, Y. Li, L. Wang, C. Hou, W. Ji, H. Huang, Y. Wan, J. Yu, J. Xia, Y. Zhang, and C. Shi (2026). UrbanAlign: post-hoc semantic calibration for VLM-human preference alignment. arXiv preprint arXiv:2602.19442.
## Supplementary Information
### Supplementary note 1: design and prompts
The analyses use matched within\-image comparisons: the same street\-view scenes were evaluated under the neutral and identity\-conditioned prompts\. This design separates prompt\-conditioned perception from differences in which images are assigned to different cultural identities\. The final 3,000\-scene analysis sample spans all 20 fine\-grained regions used in the global corpus construction, and its spatial distribution is shown in Supplementary Fig\. S1\.
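A minimal sketch of how this matched design translates into a per-condition summary is given below, assuming descriptions are stored as nested dictionaries keyed by condition and image identifier; the specific MPNet checkpoint and the cosine metric are illustrative assumptions rather than the exact configuration of the main pipeline.

```python
# Hedged sketch: within-image semantic distance to the neutral baseline.
# Assumes descriptions[condition][image_id] -> stripped description text;
# the checkpoint name and cosine metric are illustrative choices.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_distances

encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

def distance_to_neutral(descriptions: dict[str, dict[str, str]]) -> dict[str, float]:
    """Mean cosine distance from each identity-conditioned description to the
    neutral description of the same image, so scene content is held fixed."""
    neutral = descriptions["neutral"]
    distances = {}
    for condition, texts in descriptions.items():
        if condition == "neutral":
            continue
        shared = sorted(set(texts) & set(neutral))  # matched images only
        cond_emb = encoder.encode([texts[i] for i in shared])
        neut_emb = encoder.encode([neutral[i] for i in shared])
        paired = np.diag(cosine_distances(cond_emb, neut_emb))
        distances[condition] = float(paired.mean())
    return distances
```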
The region taxonomy used in sampling and robustness analyses is summarized in Supplementary Table S4. The underlying hierarchy is UN-based: *Macro/UN Region* follows the major M49 regions, *Micro/UN Subregion* follows the M49 subregions, and *Meso/UN SDG* follows the SDG reporting groups used by UNSD[40](https://arxiv.org/html/2604.20048#bib.bib57). In the prompt families used here, two small harmonizations were applied for consistency with the main experiment: Melanesia, Micronesia and Polynesia were merged into *Pacific islands* in the Micro20 robustness family, and *Australia and New Zealand* was merged with *Oceania* in the main Meso7 family.
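As a concrete illustration of these two harmonizations, the following sketch maps M49 subregion labels into the prompt families; the label strings are assumed spellings, and subregions not named above pass through unchanged.

```python
# Hedged sketch of the two harmonizations described above; label strings are
# assumed spellings, and unmentioned M49 subregions are returned unchanged.
PACIFIC_SUBREGIONS = {"Melanesia", "Micronesia", "Polynesia"}

def to_micro20(m49_subregion: str) -> str:
    """Micro20 robustness family: merge the three Pacific subregions."""
    return "Pacific islands" if m49_subregion in PACIFIC_SUBREGIONS else m49_subregion

def to_meso7(label: str) -> str:
    """Main Meso7 family: fold Australia and New Zealand into Oceania."""
    return "Oceania" if label == "Australia and New Zealand" else label
```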
- Supplementary Table S1: Experimental design and sample counts. The global analyses use the same balanced street-view image set for open-text and structured tasks.
- Supplementary Table S2: Exact prompt templates used for the open-text and structured perception analyses. The variable {region} was replaced by the corresponding regional or national identity label.
- Supplementary Table S3: Prompt conditions and generation settings. The global analyses used the same eight conditions for open-text and structured tasks.
- Supplementary Table S4: UN-based regional hierarchy used in sampling and robustness analyses. The source hierarchy follows UN M49 regions and subregions plus the UN SDG reporting groups. The prompt families used in the manuscript apply two explicit harmonizations: *Pacific islands* merges Melanesia, Micronesia and Polynesia, and the main *Oceania* prompt merges Australia and New Zealand with Oceania.
- Supplementary Figure S1: Spatial distribution of the 3,000-scene analysis sample. a, Global distribution of the final analysis scenes selected from the merged street-view corpus. Points show the retained scene locations after balancing across visual cluster, place type, country and provider. b, Sample counts by fine-grained region in the final analysis set. Counts remain close to the nominal target while varying where feasible coverage constraints limited the number of retained scenes.
### Supplementary note 2: open\-perception diagnostics
- Supplementary Figure S2: Sentiment summaries underlying the affective ingroup-preference analysis. a, Mean sentiment by prompt condition and model for stripped open-text descriptions. Error bars denote approximate 95% confidence intervals based on the standard error. b, Mean sentiment shift relative to the neutral prompt for each regional prompt. These summaries document the affective quantities used before computing the sentiment-based IPI reported in Fig. 1d,e.
- Supplementary Figure S3: Semantic-distance and affective-preference diagnostics are related but not interchangeable. a, Model-specific rank order of each regional prompt by distance to the neutral open-text baseline, with smaller ranks indicating greater proximity to neutral. b, Displacement of each regional prompt centroid from the neutral centroid in the local PCA space used for Fig. 1c. c, Relationship between semantic distance to neutral and sentiment-based IPI across models and regions. These two summaries capture different aspects of cultural structure.
- Supplementary Figure S4: Extended Geograph human-benchmark diagnostics. a, Reduction in human-text semantic distance when the United Kingdom prompt is used instead of the neutral prompt on the 1,000-image Geograph benchmark. Positive values indicate improved alignment with human descriptions. b, LLM-to-human ratios for semantic dispersion, lexical diversity and sentiment under the neutral condition, summarizing the model compression and affective elevation shown in Fig. 2.
- Supplementary Figure S5: Spatial diagnostics for the Geograph sentiment gap. a, Administrative-region distribution of the neutral-LLM-minus-human sentiment gap by nation and model. Small points are eligible administrative regions and larger points are nation-level means.
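A hedged sketch of the sentiment quantities behind these panels follows: SiEBERT (named in Supplementary note 4) scores each stripped description, and the ingroup-preference index contrasts how the matched identity rates scenes from its own region against how non-matching identities rate the same scenes, as defined for Supplementary Fig. S9. The exact estimator and any weighting used for Fig. 1d,e are not specified here, so this is an illustrative reading of that definition, not the authors' implementation.

```python
# Hedged sketch of the sentiment-based ingroup-preference index (IPI).
# Uses the public SiEBERT checkpoint named in Supplementary note 4; the exact
# aggregation behind Fig. 1d,e is not reproduced, only the contrast it encodes.
import numpy as np
from transformers import pipeline

siebert = pipeline("sentiment-analysis",
                   model="siebert/sentiment-roberta-large-english")

def signed_sentiment(text: str) -> float:
    """Map SiEBERT's label/score output onto a signed [-1, 1] scale."""
    out = siebert(text)[0]
    return out["score"] if out["label"] == "POSITIVE" else -out["score"]

def ingroup_preference_index(records: list[dict]) -> dict[str, float]:
    """records hold {'scene_region', 'prompt_region', 'text'} per description.
    IPI(region) = mean sentiment of matched-identity descriptions of that
    region's scenes minus mean sentiment of non-matching identities on the
    same scenes."""
    ipi = {}
    for region in {r["scene_region"] for r in records}:
        own = [signed_sentiment(r["text"]) for r in records
               if r["scene_region"] == region and r["prompt_region"] == region]
        other = [signed_sentiment(r["text"]) for r in records
                 if r["scene_region"] == region and r["prompt_region"] != region]
        if own and other:
            ipi[region] = float(np.mean(own) - np.mean(other))
    return ipi
```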
### Supplementary note 3: structured perception and external replication
- Supplementary Figure S6: Neutral structured-score scale and cross-model agreement. a, Mean neutral structured score for each model and perception dimension. Error bars denote approximate 95% confidence intervals. b, Pairwise Spearman correlations between models for neutral structured scores across images. This figure reports raw-score scale and model agreement rather than prompt-induced shifts.
- Supplementary Figure S7: Compact summary of Place Pulse grounding. a, Spearman rank correlations between neutral LLM structured scores and the external Place Pulse-style baseline for each model and dimension. b, Linear slopes from the corresponding score relationships. This figure summarizes the grounding analysis behind Fig. 4a without repeating the scatter panels.
- Supplementary Figure S8: Extended diagnostics for the external pairwise replication. a, Mean divergence and strict-replication rates by perceptual dimension, averaged across subgroup axes and models. b, Strict replication by human qscore margin bin, averaged across models for each subgroup axis. c, Spearman correlations between human qscore summaries and model scores by subgroup axis and model. These diagnostics show which dimensions, margins and subgroup axes drive the replication pattern.
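The two grounding summaries in Supplementary Fig. S7 reduce, per model and dimension, to a rank correlation and a fitted slope. A minimal sketch follows, assuming the neutral LLM scores and the external Place Pulse-style scores are available as aligned arrays over the same images; the array alignment and the direction of the regression are assumptions, not the authors' exact specification.

```python
# Hedged sketch of the per-dimension grounding summaries: Spearman rank
# correlation plus an ordinary least-squares slope of LLM scores on the
# external Place Pulse-style baseline (regression direction is an assumption).
import numpy as np
from scipy.stats import spearmanr

def grounding_summary(llm_scores, external_scores):
    """Return (Spearman rho, OLS slope) for one model and perception dimension."""
    rho, _ = spearmanr(llm_scores, external_scores)
    slope, _intercept = np.polyfit(external_scores, llm_scores, deg=1)
    return float(rho), float(slope)
```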
### Supplementary note 4: robustness analyses on the independent 100\-image subset
To test whether the main findings depended on the exact identity granularity, prompt wording or model family, we analyzed an independent 100\-image subset drawn by stratified sampling from the reserved 500\-image robustness pool\. The subset preserves five scenes per fine\-grained geographic region while prioritizing urban and town balance and visual\-cluster coverage\. All robustness batches were processed with the same stripped\-text MPNet embedding pipeline and the same SiEBERT sentiment scoring used in the main open\-perception analysis, so that semantic neutral distance and affect\-based IPI remain directly comparable to the main results\.
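A sketch of the stratified draw described above is given below, assuming the reserved pool is a table with region, place-type and visual-cluster columns; the column names are assumptions, and the real balancing procedure may differ. It keeps five scenes per fine-grained region while spreading picks across place types and visual clusters where the region's coverage allows.

```python
# Hedged sketch of the stratified 100-image draw from the 500-image pool.
# Column names ('region', 'place_type', 'visual_cluster') are assumed.
import pandas as pd

def draw_robustness_subset(pool: pd.DataFrame, per_region: int = 5,
                           seed: int = 0) -> pd.DataFrame:
    picks = []
    for _, group in pool.groupby("region"):
        shuffled = group.sample(frac=1, random_state=seed)
        # Rank within place-type x visual-cluster cells so the first picks
        # cover as many distinct cells as the region's pool allows.
        shuffled = shuffled.assign(
            cell_rank=shuffled.groupby(["place_type", "visual_cluster"]).cumcount()
        )
        picks.append(shuffled.sort_values("cell_rank").head(per_region))
    return pd.concat(picks, ignore_index=True).drop(columns="cell_rank")
```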
- Supplementary Table S5: Robustness analyses on the independent 100-image subset. Each robustness battery changes one factor at a time while holding the underlying street-scene sample fixed.
- Supplementary Table S6: Prompt and model variants used in the robustness analyses. The exact wording is reported for the component that changes relative to the main meso7 design.
- Supplementary Table S7: Selected summary metrics from the robustness analyses. Lower neutral-distance ranks indicate greater proximity to the neutral baseline; IPI values are averaged across models within each robustness battery unless otherwise noted.
- Supplementary Figure S9: Robustness to coarser cultural grouping on the independent 100-image subset. a, Mean semantic distance from each Macro5 identity prompt to the neutral baseline across GPT-5.2, Claude Sonnet 4 and Gemini 2.5 Flash. Europe and Northern America remains the closest macro identity to neutral in all three models. b, Affect-based ingroup preference index (IPI) on the same subset. Positive values indicate that the matched identity evaluates scenes from its own macro region more favourably than non-matching identities do.
- Supplementary Figure S10: Robustness to finer-grained regional prompting on the independent 100-image subset. a, Mean semantic distance from each Micro20 identity prompt to the neutral baseline across GPT-5.2, Claude Sonnet 4 and Gemini 2.5 Flash. The micro-regional identities are ordered by model-mean proximity to neutral. b, Affect-based IPI for the same Micro20 identities and three models, using the same ordering.
- Supplementary Figure S11: Sensitivity of meso-regional effects to a weaker, non-role-playing prompt. a, Mean semantic distance to the neutral baseline under the original meso7 prompt and a weaker context-based prompt. b, Affect-based IPI under the same two prompt forms. c, Condition-wise increase in semantic distance under the weak-context prompt. The semantic ordering changes little, whereas affect-based IPI is more sensitive to prompt wording.
- Supplementary Figure S12: Extending the meso7 analysis to additional vision-language models. a, Mean semantic distance to the neutral baseline for GPT-5.2, Qwen2.5-VL-72B, Llama 3.2 11B Vision and Gemma 3 27B. b, Affect-based IPI for the same four model families. c, Spearman correlations relative to GPT-5.2 for the semantic neutral-distance ordering and the IPI ordering.