No Universal Courtesy: A Cross-Linguistic, Multi-Model Study of Politeness Effects on LLMs Using the PLUM Corpus
Summary
This paper investigates how politeness and impoliteness in user prompts affect LLM responses across three languages and five major models, finding that politeness effects are language- and model-dependent rather than universal. The authors release PLUM, a multilingual corpus of 1,500 human-validated prompts with politeness annotations, and assess response quality using eight factors.
# No Universal Courtesy: A Cross-Linguistic, Multi-Model Study of Politeness Effects on LLMs Using the PLUM Corpus

Source: https://arxiv.org/html/2604.16275

## Abstract

This paper explores how Large Language Models (LLMs) respond to user prompts with varying degrees of politeness and impoliteness. Building on Brown and Levinson's Politeness Theory and Culpeper's Impoliteness Framework, we conduct experiments across three languages (English, Hindi, Spanish), five models (Gemini-Pro, GPT-4o Mini, Claude 3.7 Sonnet, DeepSeek-Chat, and Llama 3), and three interaction history conditions (raw, polite, and impolite). Our dataset comprises 22,500 prompt-response pairs evaluated using an eight-factor assessment framework: coherence, clarity, depth, responsiveness, context retention, toxicity, conciseness, and readability. Results show that model performance is significantly influenced by conversational tone, dialogue history, and language. While polite prompts improve average response quality by up to 11% and impolite tones degrade it, these effects are neither consistent nor universal across languages and models. English benefits most from courtesy or directness, Hindi from deference and indirectness, and Spanish from assertiveness. Among the models, Llama shows the highest tone sensitivity (11.5% range), while GPT demonstrates greater robustness to adversarial tones. These findings indicate that politeness is a quantifiable computational variable affecting LLM behavior, though its effects are language- and model-dependent rather than universal. To support reproducibility and future research, we release PLUM (Politeness Levels in Utterances, Multilingual), a publicly available corpus of 1,500 human-validated prompts across three languages and five politeness categories, along with a formal supplementary analysis of six falsifiable hypotheses derived from politeness theory, empirically evaluated against experimental data.

## I. Introduction

The interaction between humans and artificial intelligence through natural language has become a critical technical and philosophical issue. As AI systems like ChatGPT and Gemini become increasingly integrated into daily life, concerns extend beyond efficiency and functionality to more human-like communication skills. With LLMs now central to fields such as education, assistance, and professional support, understanding human-model interactions has become essential. These interactions mirror the rhythm and richness of human communication and extend beyond simple task completion, making linguistic behavior a key area of investigation.

Since the emergence of conversational AI such as ChatGPT-4.0, over 1.5 billion people worldwide have engaged with AI-driven systems, with over 180 million users by early 2025. The business sector is rapidly adopting these technologies, with over 80% of Fortune 500 companies incorporating LLMs into their operations. This rapid transformation necessitates understanding not only what these systems can do, but also how they work and what implications they have for users.

In human communication, politeness plays a crucial role: it is not merely a moral concern but a communication mechanism that shapes system behavior, reaction perception, and user satisfaction. The expression and perception of politeness can affect trust, rapport, and the effectiveness of human-machine communication.

This work builds on Brown and Levinson's Politeness Theory and Culpeper's Impoliteness Theory to examine how various degrees and forms of politeness and impoliteness influence modern LLM outputs. These theoretical frameworks provide systematic approaches to understanding face-threatening behaviors and the strategic language choices models employ when navigating social relations.

This study's findings have applications across disciplines including AI ethics, user experience design, computational linguistics, and digital humanities.
They help scholars and practitioners understand how machine linguistic behavior reflects, challenges, and potentially redefines norms traditionally associated with human communication.

The paper also investigates whether LLMs react differently to varying politeness levels across languages and whether such reactions depend on conversation history. This temporal and conversational memory aspect increases the complexity of explaining how AI systems maintain coherence and social awareness over time. The study further analyzes how response patterns may shape or be influenced by user behavior and tone, creating a small psychological feedback loop. Depending on a model's learned behaviors and design parameters, this loop can reinforce either civil interactions or incivility.

Given that LLMs now generate substantial digital content and are embedded in search engines, writing assistants, and customer service bots, the nuances of politeness in such interactions have both practical and ethical importance. As these tools become more capable and pervasive, the stakes of communication quality and the need for conscious attention to dialogue microstructures only increase.

The specific contributions of this paper are as follows: First, we present a large-scale empirical study examining how five politeness levels affect LLM response quality across five models, three languages, and three interaction-history conditions, encompassing 22,500 prompt-response pairs evaluated on an eight-dimensional framework. Second, we demonstrate that no single politeness strategy is universally optimal; the best approach varies by language, model architecture, and conversational history, with polite interactions providing up to 11% quality improvement in English, while assertive styles perform best in Spanish and deferential styles in Hindi.
Third, we release PLUM (Politeness Levels in Utterances, Multilingual), a publicly available corpus of 1,500 human-validated prompts spanning five politeness categories across English, Hindi, and Spanish, the first multilingual prompt resource of its kind built on the Brown-Levinson and Culpeper theoretical taxonomy. Fourth, we present a formal supplementary analysis comprising four grounding axioms and six falsifiable hypotheses derived from politeness theory, each empirically assessed against experimental data, providing a theoretically rigorous account of which predictions are supported, partially supported, or refuted by evidence.

## II. Related Work

Politeness has long been recognized as essential for maintaining social harmony and regulating interpersonal relations. As artificial intelligence becomes increasingly integrated into everyday life, this social principle has gained prominence in human-AI interaction research, particularly regarding LLMs. This section reviews the theoretical foundations, early computational approaches, and recent empirical studies of how politeness is expressed and perceived in LLMs and automated dialogue systems.

### Theoretical Foundations

Our research is grounded in Brown and Levinson's Politeness Theory, which introduced the concept of face, a social self-image that people seek to maintain, and classified conversational strategies as positive, negative, or bald-on-record. Our prompt category design is based on this typology. Culpeper's Impoliteness Theory complements this framework by focusing on how strategic rudeness functions as a communicative action to achieve desired outcomes. We use the combination of these models to inform our taxonomy of impolite prompts and their social implications.

### Early Computational Approaches

Early efforts to operationalize politeness theory in natural language processing (NLP) were largely exploratory.
The POLLy system attempted to generate polite speech for language learners using the Brown-Levinson framework, revealing that indirect strategies were sometimes perceived as rude in certain cultural contexts. Danescu-Niculescu-Mizil et al. developed a computational approach to politeness grounded in lexical and syntactic cues, producing one of the earliest large-scale corpora for politeness research in NLP. Their measurement framework established that politeness is a quantifiable textual property, foundational work that informs the automated scoring approach used in constructing our PLUM corpus.

As interpretable neural models emerged, methods like convolutional neural networks outperformed feature-based systems at predicting politeness. Visualization tools such as activation clustering and saliency mapping provided intuitive insights into how linguistic cues associate with politeness in neural architectures, contributing to both interpretability and theoretical understanding.

### Voice-Based Agents and User Behavior

As politeness modeling matured, its implementation in voice assistants grew increasingly common. Bonfert et al. found that gracious corrections by voice assistants enhanced user politeness, though such feedback could provoke resistance from users who felt morally judged. Hu et al. discovered that elderly users preferred face-to-face communication when system performance was low. Aubakirova and Bansal's study of robot embodiments found that mechanical robots were more acceptable than humanoid ones. These studies collectively suggest that politeness perception is contextually and culturally sensitive, a factor that our text-only framework abstracts away.

More recently, Hu et al. investigated how text and voice interaction modes affect perceived AI consciousness and user politeness toward ChatGPT in a controlled experiment with 25 participants.
They report that voice interactions produced significantly higher perceived consciousness scores and more politeness markers in user utterances. However, the study measured only user-side behavior and did not examine how input prompt politeness affects LLM response quality. The small, homogeneous participant pool (university students from a single institution), single-model scope, and confounding of modality with politeness effects limit generalizability. Our work is complementary: we hold modality constant and systematically vary input politeness to measure its causal effect on LLM output quality across five architectures and three languages.

Lazebnik et al. conducted a large-scale controlled experiment (n=1,684) studying how user politeness toward AI evolves during sessions. They found that politeness declines more rapidly in human-AI settings than in human-human baselines, and that human-like visual avatars slow this decline. While this study confirms that social norms transfer to AI interactions, it focuses entirely on user behavior without measuring downstream effects on LLM response quality. The experiment used a single unnamed model, so findings may not reflect current instruction-tuned LLM behavior. Their observation that politeness erodes over time is nonetheless relevant to our history-conditioning design, which captures tonal shifts through RAW, POL, and IMP conditions.

### Systematic Reviews and Strategic Insights

Ribino's systematic review examined politeness across various AI systems, from digital assistants to autonomous vehicles. Results show that systems exhibiting socially competent and polite behavior inspire greater trust, satisfaction, and acceptance, particularly in sensitive domains like healthcare and education. The review confirms the Computers Are Social Actors (CASA) paradigm, which posits that humans apply social norms to AI systems displaying social cues.
It also warns against normalizing rude behavior toward machines, especially among children, and calls for further study of developmental and cultural implications of such interactions.

### Politeness in Dialogue with LLMs

The rise of LLMs has intensified interest in the role of politeness in interaction outcomes. Firdaus et al. created a politeness-conscious dialogue generator that tailors responses to user demographics, though it focuses primarily on tone. Li et al. compared ChatGPT and fine-tuned BERT on politeness classification, finding that BERT performed better. Andersson and McIntyre tested ChatGPT's pragmatic competence with irony, metaphor, and indirect speech, finding moderate success but continuing challenges with subtle social cues. Yin et al. investigated politeness effects across languages, finding that moderate politeness reduced bias while impolite prompts degraded task performance. Their work was limited, however, by a narrow range of benchmark tasks lacking conversational context and linguistic diversity.
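
To make the scale of the study design described in the abstract concrete, the following minimal sketch enumerates an experimental grid from the stated figures: 1,500 prompts, five models, and three history conditions. It rests on two assumptions that the text does not state explicitly: that every prompt is sent to every model under every history condition exactly once, and that the corpus is balanced across language and politeness-category cells. The names `build_grid`, `prompts_per_cell`, and `total_pairs` are hypothetical and used only for illustration.

```python
from itertools import product

# Values stated in the abstract.
LANGUAGES = ["English", "Hindi", "Spanish"]
MODELS = ["Gemini-Pro", "GPT-4o Mini", "Claude 3.7 Sonnet", "DeepSeek-Chat", "Llama 3"]
HISTORY_CONDITIONS = ["RAW", "POL", "IMP"]  # raw, polite, impolite history
NUM_POLITENESS_CATEGORIES = 5
CORPUS_SIZE = 1500  # human-validated prompts in PLUM

# Assumption (not stated in the text): prompts are spread evenly over
# every language x politeness-category cell.
prompts_per_cell = CORPUS_SIZE // (len(LANGUAGES) * NUM_POLITENESS_CATEGORIES)


def build_grid(corpus_size, models, conditions):
    """Enumerate (prompt_id, model, history_condition) triples.

    Assumption: each prompt is paired with each model under each
    history condition exactly once.
    """
    return list(product(range(corpus_size), models, conditions))


grid = build_grid(CORPUS_SIZE, MODELS, HISTORY_CONDITIONS)
total_pairs = len(grid)

if __name__ == "__main__":
    print(f"prompts per language/category cell: {prompts_per_cell}")
    print(f"prompt-response pairs: {total_pairs}")
```

Under these assumptions the grid contains 1,500 × 5 × 3 = 22,500 entries, which matches the number of prompt-response pairs reported in the abstract, and a balanced corpus would hold 100 prompts per language and politeness category.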