Mind Your Tone: Does Tone Alter LLM Performance?

arXiv cs.AI Papers

Summary

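This paper investigates whether and how tonal variations in prompts change LLM accuracy on objective multiple-choice questions. It uses two datasets (a 50-question set with five tone variants and a 570-question MMLU subset spanning 57 subjects with seven tone variants) and four models (ChatGPT-4o, ChatGPT-5-nano, Gemini 2.5 Flash, and Gemini 2.5 Flash Lite). Tonal effects are systematic but highly model-dependent: some models show small yet statistically significant shifts, while others show large accuracy swings. The authors caution against assuming tone-robust reliability in LLM deployments.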

arXiv:2605.29027v1 Announce Type: new Abstract: The use of Large Language Models (LLMs) is proliferating, yet their performance is observed to vary based on prompting styles and tones. In this study, we investigate both whether and how tonal variations in prompts lead to disparate LLM accuracy for objective multiple-choice questions. We use two datasets: a 50-base question dataset with five tone variants and a 570-base question MMLU subset spanning 57 subjects with seven tone variants. Experiments were conducted to evaluate the performance of four cost-efficient, popular LLMs: ChatGPT-4o, ChatGPT-5-nano, Gemini 2.5 Flash, and Gemini 2.5 Flash Lite. Across models, tonal effects are systematic but highly model-dependent. Some models show small, yet statistically significant, shifts, while others exhibit large accuracy swings across tones. Further, we identify subject-level differences in tone sensitivity and present a routing framework to explain how tones may attune internal reasoning modes. Our findings caution users against assuming tone-robust reliability in LLM deployments.
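
The abstract does not say how accuracy shifts were tested for significance, so the following is only a minimal sketch of one plausible analysis, not the authors' method. It assumes each result is a `(question_id, tone, is_correct)` tuple, computes accuracy per tone, and compares two tones on the same base questions with an exact two-sided sign test (McNemar-style). The function names and the tone labels `"neutral"` and `"curt"` are hypothetical.

```python
from collections import defaultdict
from math import comb


def accuracy_by_tone(records):
    """Return {tone: accuracy} from (question_id, tone, is_correct) tuples."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for _qid, tone, is_correct in records:
        totals[tone] += 1
        hits[tone] += int(is_correct)
    return {tone: hits[tone] / totals[tone] for tone in totals}


def paired_sign_test(records, tone_a, tone_b):
    """Exact two-sided sign test on questions answered differently under two tones.

    Only discordant questions (correct under one tone, wrong under the other)
    carry information; under the null hypothesis each direction is equally likely.
    """
    outcome = {(qid, tone): bool(ok) for qid, tone, ok in records}
    question_ids = {qid for qid, _tone in outcome}
    a_only = b_only = 0
    for qid in question_ids:
        if (qid, tone_a) not in outcome or (qid, tone_b) not in outcome:
            continue  # skip questions missing one of the two tone variants
        a, b = outcome[(qid, tone_a)], outcome[(qid, tone_b)]
        if a and not b:
            a_only += 1
        elif b and not a:
            b_only += 1
    n = a_only + b_only
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy data (hypothetical): four base questions, two tone variants each.
records = [
    ("q1", "neutral", True), ("q1", "curt", True),
    ("q2", "neutral", True), ("q2", "curt", False),
    ("q3", "neutral", True), ("q3", "curt", False),
    ("q4", "neutral", False), ("q4", "curt", False),
]
print(accuracy_by_tone(records))
print(paired_sign_test(records, "neutral", "curt"))
```

A paired test suits this design because every tone variant is built from the same base question, so each question serves as its own control. With seven tone variants there are many pairwise comparisons, and a real analysis would also need a multiple-comparison correction.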

# Mind Your Tone: Does Tone Alter LLM Performance?
Source: [https://arxiv.org/abs/2605.29027](https://arxiv.org/abs/2605.29027)
[View PDF](https://arxiv.org/pdf/2605.29027)


## Submission history

From: Om Dobariya

**[v1]** Wed, 27 May 2026 19:23:46 UTC (698 KB)
