Mind Your Tone: Does Tone Alter LLM Performance?
Summary
This paper investigates how tonal variations in prompts affect LLM accuracy on objective multiple-choice questions, finding effects that are systematic but highly model-dependent. Using two datasets and four LLMs, the study shows that tone produces small but statistically significant shifts in some models and large accuracy swings in others, and it cautions against assuming tone-robust reliability in LLM deployments.
# Mind Your Tone: Does Tone Alter LLM Performance?

Source: [https://arxiv.org/abs/2605.29027](https://arxiv.org/abs/2605.29027)

[View PDF](https://arxiv.org/pdf/2605.29027)

> Abstract: The use of Large Language Models (LLMs) is proliferating, yet their performance is observed to vary based on prompting styles and tones. In this study, we investigate both whether and how tonal variations in prompts lead to disparate LLM accuracy for objective multiple-choice questions. We use two datasets: a 50-base question dataset with five tone variants and a 570-base question MMLU subset spanning 57 subjects with seven tone variants. Experiments were conducted to evaluate the performance of four cost-efficient, popular LLMs: ChatGPT-4o, ChatGPT-5-nano, Gemini 2.5 Flash, and Gemini 2.5 Flash Lite. Across models, tonal effects are systematic but highly model-dependent. Some models show small, yet statistically significant, shifts, while others exhibit large accuracy swings across tones. Further, we identify subject-level differences in tone sensitivity and present a routing framework to explain how tones may attune internal reasoning modes. Our findings caution users against assuming tone-robust reliability in LLM deployments.

## Submission history

From: Om Dobariya [[view email](https://arxiv.org/show-email/39e5b3c8/2605.29027)]

**[v1]** Wed, 27 May 2026 19:23:46 UTC (698 KB)
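The evaluation design described in the abstract (each base question rendered in several tone variants, with accuracy compared across tones) might be tabulated roughly as follows. This is a minimal sketch, not the authors' code: the record layout, the function names `accuracy_by_tone` and `accuracy_swing`, the tone labels, and the toy data are all assumptions made for illustration, and the numbers are not results from the paper.

```python
from collections import defaultdict


def accuracy_by_tone(records):
    """Return {tone: accuracy} from (question_id, tone, is_correct) records.

    The record layout is hypothetical; the abstract does not specify one.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for _question_id, tone, is_correct in records:
        totals[tone] += 1
        correct[tone] += int(is_correct)
    return {tone: correct[tone] / totals[tone] for tone in totals}


def accuracy_swing(acc):
    """Gap between the best and worst tone, a crude tone-sensitivity measure."""
    return max(acc.values()) - min(acc.values())


# Made-up toy data: two base questions, three illustrative tone labels.
records = [
    ("q1", "polite", True), ("q1", "neutral", True), ("q1", "rude", False),
    ("q2", "polite", True), ("q2", "neutral", False), ("q2", "rude", False),
]

acc = accuracy_by_tone(records)
swing = accuracy_swing(acc)
print(acc, swing)
```

Because every tone variant shares the same base questions, a per-tone comparison like this is paired by construction. Any claim about statistical significance would need an appropriate paired test on top of it, which this sketch does not attempt.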
Similar Articles
Why Tone Works (It's Not What You Think)
A 2026 blog post revisits how prompt tone and context depth shift LLM responses, showing richer gamer-style prompts yield deeper, stat-backed answers than bare questions.
Evaluation Drift in LLM Personality Induction: Are We Moving the Goalpost?
This paper investigates whether fine-tuning LLMs on long-form essays with associated Big Five personality profiles stabilizes questionnaire responses and can induce target profiles, finding that while variance reduces, accuracy on the full five-dimensional profile remains near chance.
LLM Attribution Analysis Across Different Fine-Tuning Strategies and Model Scales for Automated Code Compliance
This paper analyzes how different fine-tuning strategies (FFT, LoRA, quantized LoRA) and model scales affect LLM interpretive behavior for automated code compliance tasks using perturbation-based attribution analysis. The findings show FFT produces more focused attribution patterns than parameter-efficient methods, and larger models develop specific interpretive strategies with diminishing performance returns beyond 7B parameters.
Auditing Multimodal LLM Raters: Central Tendency Bias in Clinical Ordinal Scoring
This paper investigates central tendency bias in multimodal LLMs used for clinical ordinal scoring of the Clock Drawing Test, finding that LLMs compress predictions toward the middle of the scale, disproportionately affecting critical extremes. The study extends the LLM-as-judge bias literature to clinical assessment, highlighting the need for calibration-aware evaluation before deployment.
On the Limits of LLM Adaptability: Impact of Model-Internalized Priors on Annotation Task Performance
This paper investigates how LLMs' internal priors affect zero-shot annotation performance, finding that nearly two-thirds of errors resist prompt-based correction and introducing Definition-Specific Familiarity as a better predictor than memorization metrics.