Language, Place, and Social Media: Geographic Dialect Alignment in New Zealand

arXiv cs.CL Papers

Summary

A thesis investigating geographic dialect alignment in place-based social media communities in New Zealand, examining how Reddit communities reflect patterns of language variation similar to geographic dialect communities through analysis of lexical, morphosyntactic, and semantic features.

arXiv:2604.15744v1

Abstract: This thesis investigates geographic dialect alignment in place-informed social media communities, focussing on New Zealand-related Reddit communities. By integrating qualitative analyses of user perceptions with computational methods, the study examines how language use reflects place identity and patterns of language variation and change based on user-informed lexical, morphosyntactic, and semantic variables. The findings show that users generally associate language with place, and place-related communities form a contiguous speech community, though alignment between geographic dialect communities and place-related communities remains complex. Advanced language modelling, including static and diachronic Word2Vec language embeddings, revealed semantic variation across place-based communities and meaningful semantic shifts within New Zealand English. The research involved the creation of a corpus containing 4.26 billion unprocessed words, which offers a valuable resource for future study. Overall, the results highlight the potential of social media as a natural laboratory for sociolinguistic inquiry.

Cached at: 04/20/26, 08:29 AM

# Contents

Source: https://arxiv.org/html/2604.15744

**Title:** Language, Place, and Social Media: Geographic Dialect Alignment in New Zealand

**Author:** Sidney Gig-Jan Wong

**Year:** 2026

**Supervisor:** Dr Benjamin Adams, Computer Science and Software Engineering, University of Canterbury

**Supervisor:** Dr Jonathan Dunn, Department of Linguistics, University of Illinois Urbana-Champaign

**Supervisor:** Dr Kong Meng Liew, School of Psychology, Speech and Hearing, University of Canterbury

**Supervisor:** Professor Jen Hay, Department of Linguistics, University of Canterbury

## Preliminary Pages

### Quote Slip

Variation and change are fundamental properties of human language, and the emergent patterns observed within these dynamic systems are systematic and meaningful. While the focus of sociolinguists has traditionally rested on spoken language, the advent of computer-mediated communication (CMC) and social media has demonstrated that written language likewise reveals structured patterns of variation and change. As social networks, social media platforms function as a 'natural laboratory' to investigate the interplay of social identity and language use.

While social media dialectology is not a new area of study, there are compelling practical reasons to explore variation and change within the digital sphere. With recent advancements in Natural Language Processing (NLP), there is an increased awareness of how well georeferenced social media language data represents an underlying population. This need is particularly salient for low-resource similar languages, varieties, and dialects where data availability is often limited. As an intra-disciplinary problem within linguistics, I reconceptualise this issue within computational sociolinguistics, aiming to understand the extent to which place-based social media networks align with the linguistic context of underlying geographic dialect communities. I refer to this phenomenon as geographic dialect alignment.

Focusing on the sociolinguistic context of New Zealand and communities on Reddit, my primary research question is: to what extent can we observe geographic dialect alignment in place-informed social media communities? More specifically, do digital communities reflect patterns of language variation and change similar to those observed within and across geographically defined dialect communities? Of particular interest to my research is the role of the social construction of space - conceptualised as place - in shaping language variation and change, as well as the perceptions of users on Reddit. Both aspects remain under-explored in social media dialectology. To address this gap, I explore the following secondary research questions: 1) do users in place-based communities associate language use with a place identity? 2) is there a relationship between geographic dialect communities and place-based communities? 3) do place-related communities form a contiguous speech community?

In the first phase, I selected a sample of post submissions from r/newzealand - the primary place-based community associated with New Zealand - and identified two selfposts specifically focused on New Zealand English and local language use. For the qualitative analysis, I employed discourse analysis to determine the situated meanings within these posts. The objective of this analysis was to reconnect the producers of language to their discourse, humanising the data before quantification. Subsequently, I applied thematic analysis to the associated comment threads to curate a user-informed inventory consisting of 51 lexical, 3 morphosyntactic, and 13 semantic features.

In the second phase, I analysed the distribution of these user-informed lexical and morphosyntactic features across six country-level place-based communities on Reddit to evaluate the accuracy of user intuition. The findings indicated that while user intuitions were largely incongruent with the data, the distribution of these features remained systematic and meaningful across country-level communities. Moreover, non-linguistic user behaviour - specifically temporal engagement patterns - emerged as a significant indicator for identifying non-local users, whose presence often correlated with an increase in innovative variants.

In the third phase, I explored alternative computational approaches for detecting language variation across place-based communities on Reddit. Consistent with existing literature, traditional text classification methods proved ineffective for identifying latent linguistic variation at both the country and city levels. However, advanced language modelling techniques - specifically Word2Vec embeddings - facilitated the detection of variation across the user-informed semantic variables. By comparing word vector representations trained on discrete place-based communities using cosine similarity, I was able to quantify the degree of semantic shift and geographic alignment across the digital landscape.

In the fourth and final phase, I expanded the corpus to include a broader network of New Zealand-related communities. By identifying user-informed recommendations from r/NZMetaHub, I incorporated an additional 32 subreddits from the Pushshift Repository. Utilising Computational Construction Grammar, I confirmed that these communities maintain a high degree of grammatical similarity. I then examined diachronic semantic shift within the 13 user-informed semantic variables. Although only three variables exhibited the expected shifts, the results for 'chippy' (transitioning from a 'potato chip' to the nickname of a former Prime Minister) and 'snapper' (shifting from a 'transport card' back to the 'fish species') suggest that the diachronic embedding models successfully captured semantic changes unique to the New Zealand sociolinguistic context.

Based on my analysis of user-informed lexical, morphosyntactic, and semantic variables, the findings suggest that geographic dialect alignment is observable within place-informed social media communities for New Zealand-related subreddits. Regarding the secondary research questions, I found that users in place-based communities generally associated specific language use with a distinct place identity, and that these digital communities tended to form a contiguous speech community. However, the relationship between established geographic dialect communities and their digital counterparts was not straightforward when assessed through user-informed variables, indicating a complex layering of traditional regionalisms and emergent digital norms.

Some of the limitations of this study stemmed from the reliance on user-informed variables, which inherently shaped the direction and scope of the analysis. Additional constraints included data sparsity within specific regional subsets and potential model bias introduced during the analytical pipeline. A significant theoretical limitation was the restricted engagement with traditional sociolinguistic frameworks; this reflects a broader historical emphasis on spoken language within the field, which complicates the direct application of established theories to computer-mediated data. To mitigate this, I prioritised methodological rigour and the development of a robust computational pipeline, with the objective of bridging this theoretical gap in future research.

The thesis makes several distinct contributions to the field by introducing place-informed social media dialectology and implementing advanced language modelling techniques. By integrating user perceptions to evaluate the degree of language variation and change, this work addresses critical gaps in the existing literature regarding digital vernaculars. Furthermore, this research produced a comprehensive corpus of New Zealand-related Reddit communities - comprising 4.26 billion unprocessed words - providing a substantial and valuable resource for future sociolinguistic and computational inquiry.

In terms of future research, there is significant potential to utilise state-of-the-art transformer-based large language models (LLMs) to examine semantic shift through contextual embeddings, though such approaches remain computationally resource-intensive. There is also a critical opportunity to develop a dedicated NLP benchmark for New Zealand English to improve model performance on local varieties. Further directions include extending this methodology to additional platforms - such as Twitter/X - and expanding into multimodal analysis (integrating spoken data). Finally, perceptual dialectology on social media represents a promising avenue for understanding metalinguistic awareness, particularly as researcher access to platform data continues to evolve.
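
The comparison of community-specific word vectors by cosine similarity, described for the third phase, can be illustrated with a small sketch. Embedding spaces trained separately do not share coordinate axes, so this sketch assumes a second-order comparison: each word is described by its cosine similarities to a shared set of anchor words within its own space, and those similarity profiles are then compared across spaces. This is one possible approach and not necessarily the procedure used in the thesis; the function names, the anchor words, and the toy two-dimensional vectors are all hypothetical stand-ins for embeddings trained on each community's text.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_profile(word, anchors, space):
    """Similarities between `word` and each anchor, all measured inside one space."""
    return [cosine(space[word], space[anchor]) for anchor in anchors]


def second_order_similarity(word, anchors, space_a, space_b):
    """Compare how `word` relates to the anchors in space A versus space B."""
    return cosine(
        similarity_profile(word, anchors, space_a),
        similarity_profile(word, anchors, space_b),
    )


# Hypothetical toy vectors. The axes of community B are swapped relative to
# community A, mimicking the arbitrary orientation of separately trained spaces.
community_a = {
    "fish": [1.0, 0.0],
    "card": [0.0, 1.0],
    "ocean": [0.8, 0.2],    # close to "fish" in A
    "snapper": [0.9, 0.1],  # close to "fish" in A
}
community_b = {
    "fish": [0.0, 1.0],
    "card": [1.0, 0.0],
    "ocean": [0.2, 0.8],    # still close to "fish" in B
    "snapper": [0.9, 0.1],  # close to "card" in B
}
anchors = ["fish", "card"]

stable = second_order_similarity("ocean", anchors, community_a, community_b)
shifted = second_order_similarity("snapper", anchors, community_a, community_b)
print(f"ocean: {stable:.2f}  snapper: {shifted:.2f}")
```

With these toy values, the word whose neighbours are the same in both communities scores close to 1, while the word whose neighbours differ scores roughly 0.22. A lower score indicates a larger difference in usage between the two communities under this measure.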

### Public Summary

This thesis investigates geographic dialect alignment in place-informed social media communities, focussing on New Zealand-related Reddit communities. By integrating qualitative analyses of user perceptions with computational methods, the study examines how language use reflects place identity and patterns of language variation and change based on user-informed lexical, morphosyntactic, and semantic variables. The findings show that users generally associate language with place, and place-related communities form a contiguous speech community, though alignment between geographic dialect communities and place-related communities remains complex. Advanced language modelling, including static and diachronic Word2Vec language embeddings, revealed semantic variation across place-based communities and meaningful semantic shifts within New Zealand English. The research involved the creation of a corpus containing 4.26 billion unprocessed words, which offers a valuable resource for future study. Overall, the results highlight the potential of social media as a natural laboratory for sociolinguistic inquiry.

## Contents

1. [Introduction](https://arxiv.org/html/2604.15744#Ch1)
   1. [Chapter Outline](https://arxiv.org/html/2604.15744#Ch1.S1)
   2. [Introduction](https://arxiv.org/html/2604.15744#Ch1.S2)
   3. [Aims and Objectives](https://arxiv.org/html/2604.15744#Ch1.S3)
      1. [Broader Applications](https://arxiv.org/html/2604.15744#Ch1.S3.SS1)
   4. [Research Questions](https://arxiv.org/html/2604.15744#Ch1.S4)
      1. [Secondary Research Questions](https://arxiv.org/html/2604.15744#Ch1.S4.SS1)
   5. [Research Phases](https://arxiv.org/html/2604.15744#Ch1.S5)
   6. [System Requirements](https://arxiv.org/html/2604.15744#Ch1.S6)
   7. [Data Availability](https://arxiv.org/html/2604.15744#Ch1.S7)
   8. [Outline](https://arxiv.org/html/2604.15744#Ch1.S8)

2. [Literature Review](https://arxiv.org/html/2604.15744#Ch2)
   1. [Chapter Outline](https://arxiv.org/html/2604.15744#Ch2.S1)
   2. [Introduction](https://arxiv.org/html/2604.15744#Ch2.S2)
   3. [Language Variation and Change](https://arxiv.org/html/2604.15744#Ch2.S3)
      1. [Dialectology](https://arxiv.org/html/2604.15744#Ch2.S3.SS1)
      2. [Sociolinguistics](https://arxiv.org/html/2604.15744#Ch2.S3.SS2)
      3. [Summary](https://arxiv.org/html/2604.15744#Ch2.S3.SS3)
   4. [Natural Language Processing for Social Media](https://arxiv.org/html/2604.15744#Ch2.S4)
      1. [Computational Models of Language](https://arxiv.org/html/2604.15744#Ch2.S4.SS1)
      2. [Twitter: the Digital Town Square](https://arxiv.org/html/2604.15744#Ch2.S4.SS2)
      3. [Summary](https://arxiv.org/html/2604.15744#Ch2.S4.SS3)
   5. [Language in the Construction of Place](https://arxiv.org/html/2604.15744#Ch2.S5)
      1. [Geographic Perspectives](https://arxiv.org/html/2604.15744#Ch2.S5.SS1)
      2. [Linguistic Perspectives](https://arxiv.org/html/2604.15744#Ch2.S5.SS2)
      3. [Sociotheoretical Perspectives](https://arxiv.org/html/2604.15744#Ch2.S5.SS3)
      4. [Summary](https://arxiv.org/html/2604.15744#Ch2.S5.SS4)
   6. [Sociolinguistic Context of New Zealand](https://arxiv.org/html/2604.15744#Ch2.S6)
      1. [Features of New Zealand English](https://arxiv.org/html/2604.15744#Ch2.S6.SS1)
      2. [Languages, Dialects, Accents](https://arxiv.org/html/2604.15744#Ch2.S6.SS2)
      3. [Attitudes and Ideologies](https://arxiv.org/html/2604.15744#Ch2.S6.SS3)
   7. [Chapter Summary](https://arxiv.org/html/2604.15744#Ch2.S7)

3. [Corpus Dimensions](https://arxiv.org/html/2604.15744#Ch3)
   1. [Chapter Outline](https://arxiv.org/html/2604.15744#Ch3.S1)
   2. [Reddit: the Front Page of the Internet](https://arxiv.org/html/2604.15744#Ch3.S2)
      1. [Why Reddit?](https://arxiv.org/html/2604.15744#Ch3.S2.SS1)
      2. [New Zealand Reddit](https://arxiv.org/html/2604.15744#Ch3.S2.SS2)
   3. [Situational Characteristics](https://arxiv.org/html/2604.15744#Ch3.S3)
      1. [Participants](https://arxiv.org/html/2604.15744#Ch3.S3.SS1)
      2. [Relations Among Participants](https://arxiv.org/html/2604.15744#Ch3.S3.SS2)
      3. [Channel](https://arxiv.org/html/2604.15744#Ch3.S3.SS3)
      4. [Production Circumstances](https://arxiv.org/html/2604.15744#Ch3.S3.SS4)
      5. [Setting](https://arxiv.org/html/2604.15744#Ch3.S3.SS5)
      6. [Communicative Purposes](https://arxiv.org/html/2604.15744#Ch3.S3.SS6)
      7. [Topic](https://arxiv.org/html/2604.15744#Ch3.S3.SS7)
   4. [Sources of Data](https://arxiv.org/html/2604.15744#Ch3.S4)
   5. [Data Processing](https://arxiv.org/html/2604.15744#Ch3.S5)
   6. [Chapter Summary](https://arxiv.org/html/2604.15744#Ch3.S6)

4. [User Intuitions and Place Identity](https://arxiv.org/html/2604.15744#Ch4)
   1. [Chapter Outline](https://arxiv.org/html/2604.15744#Ch4.S1)
   2. [Background and Motivation](https://arxiv.org/html/2604.15744#Ch4.S2)
   3. [Sampling Strategy](https://arxiv.org/html/2604.15744#Ch4.S3)
   4. [Discourse Analysis](https://arxiv.org/html/2604.15744#Ch4.S4)
      1. [Methodology](https://arxiv.org/html/2604.15744#Ch4.S4.SS1)
      2. [Selfpost 1](https://arxiv.org/html/2604.15744#Ch4.S4.SS2)
      3. [Selfpost 2](https://arxiv.org/html/2604.15744#Ch4.S4.SS3)
   5. [Interim Summary](https://arxiv.org/html/2604.15744#Ch4.S5)
   6. [Content Analysis](https://arxiv.org/html/2604.15744#Ch4.S6)
      1. [Methodology](https://arxiv.org/html/2604.15744#Ch4.S6.SS1)
      2. [Cultural Cringe and National Pride](https://arxiv.org/html/2604.15744#Ch4.S6.SS2)
      3. [Language Variation and Change](https://arxiv.org/html/2604.15744#Ch4.S6.SS3)
   7. [Discussion](https://arxiv.org/html/2604.15744#Ch4.S7)
   8. [Conclusion and Key Findings](https://arxiv.org/html/2604.15744#Ch4.S8)

5. [User-Informed Sociolinguistic Variables](https://arxiv.org/html/2604.15744#Ch5)
   1. [Chapter Outline](https://arxiv.org/html/2604.15744#Ch5.S1)
   2. [Background and Motivation](https://arxiv.org/html/2604.15744#Ch5.S2)
   3. [Methodology](https://arxiv.org/html/2604.15744#Ch5.S3)
      1. [Data](https://arxiv.org/html/2604.15744#Ch5.S3.SS1)
      2. [Sociolinguistic Variables](https://arxiv.org/html/2604.15744#Ch5.S3.SS2)
      3. [Feature Extraction](https://arxiv.org/html/2604.15744#Ch5.S3.SS3)
      4. [Evaluation](https://arxiv.org/html/2604.15744#Ch5.S3.SS4)
   4. [Results](https://arxiv.org/html/2604.15744#Ch5.S4)
      1. [Lexical Variables](https://arxiv.org/html/2604.15744#Ch5.S4.SS1)
      2. [Morphosyntactic Variables](https://arxiv.org/html/2604.15744#Ch5.S4.SS2)
   5. [Discussion](https://arxiv.org/html/2604.15744#Ch5.S5)
   6. [Chapter Summary](https://arxiv.org/html/2604.15744#Ch5.S6)

6. [Dialect Modelling and Language Embeddings](https://arxiv.org/html/2604.15744#Ch6)
   1. [Chapter Outline](https://arxiv.org/html/2604.15744#Ch6.S1)
   2. [Background and Motivation](https://arxiv.org/html/2604.15744#Ch6.S2)
   3. [Data](https://arxiv.org/html/2604.15744#Ch6.S3)
   4. [Classification Models](https://arxiv.org/html/2604.15744#Ch6.S4)
      1. [Methodology](https://arxiv.org/html/2604.15744#Ch6.S4.SS1)
      2. [Country-level Communities](https://arxiv.org/html/2604.15744#Ch6.S4.SS2)
