Mind Your Tone: Does Tone Alter LLM Performance?

arXiv cs.AI 05/29/26, 04:00 AM Papers

llm prompt-engineering performance tone robustness language-models

Summary

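This paper investigates whether and how tonal variations in prompts affect LLM accuracy on objective multiple-choice questions. Across four models (ChatGPT-4o, ChatGPT-5-nano, Gemini 2.5 Flash, and Gemini 2.5 Flash Lite) and two datasets, it finds tonal effects that are systematic but highly model-dependent: some models show small yet statistically significant shifts, while others show large accuracy swings across tones. The authors caution against assuming tone-robust reliability in LLM deployments.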

arXiv:2605.29027v1. Abstract: The use of Large Language Models (LLMs) is proliferating, yet their performance is observed to vary based on prompting styles and tones. In this study, we investigate both whether and how tonal variations in prompts lead to disparate LLM accuracy for objective multiple-choice questions. We use two datasets: a 50-base question dataset with five tone variants and a 570-base question MMLU subset spanning 57 subjects with seven tone variants. Experiments were conducted to evaluate the performance of four cost-efficient, popular LLMs: ChatGPT-4o, ChatGPT-5-nano, Gemini 2.5 Flash, and Gemini 2.5 Flash Lite. Across models, tonal effects are systematic but highly model-dependent. Some models show small, yet statistically significant, shifts, while others exhibit large accuracy swings across tones. Further, we identify subject-level differences in tone sensitivity and present a routing framework to explain how tones may attune internal reasoning modes. Our findings caution users against assuming tone-robust reliability in LLM deployments.

Cached at: 05/29/26, 09:12 AM

# Mind Your Tone: Does Tone Alter LLM Performance?
Source: [https://arxiv.org/abs/2605.29027](https://arxiv.org/abs/2605.29027)
[View PDF](https://arxiv.org/pdf/2605.29027)

> Abstract: The use of Large Language Models (LLMs) is proliferating, yet their performance is observed to vary based on prompting styles and tones. In this study, we investigate both whether and how tonal variations in prompts lead to disparate LLM accuracy for objective multiple-choice questions. We use two datasets: a 50-base question dataset with five tone variants and a 570-base question MMLU subset spanning 57 subjects with seven tone variants. Experiments were conducted to evaluate the performance of four cost-efficient, popular LLMs: ChatGPT-4o, ChatGPT-5-nano, Gemini 2.5 Flash, and Gemini 2.5 Flash Lite. Across models, tonal effects are systematic but highly model-dependent. Some models show small, yet statistically significant, shifts, while others exhibit large accuracy swings across tones. Further, we identify subject-level differences in tone sensitivity and present a routing framework to explain how tones may attune internal reasoning modes. Our findings caution users against assuming tone-robust reliability in LLM deployments.
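
The abstract reports per-tone accuracy and statistically significant shifts but does not say which test was used. As a rough illustration only, the sketch below computes accuracy per tone from paired per-question results and compares two tones with an exact McNemar test. The choice of test, the function names, and the tone labels `polite` and `rude` are all assumptions made for this example, not details taken from the paper.

```python
from math import comb


def accuracy_by_tone(results):
    """Map each tone to its accuracy.

    `results` maps a tone label to a list of booleans, one per base
    question, where True means the model answered correctly.
    """
    return {tone: sum(flags) / len(flags) for tone, flags in results.items()}


def mcnemar_exact(a, b):
    """Two-sided exact McNemar p-value for paired correctness lists.

    Only discordant pairs (correct under one tone, wrong under the other)
    carry information; under the null they split 50/50.
    """
    if len(a) != len(b):
        raise ValueError("paired lists must have the same length")
    only_a = sum(1 for x, y in zip(a, b) if x and not y)
    only_b = sum(1 for x, y in zip(a, b) if y and not x)
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(only_a, only_b)
    # Binomial(n, 0.5) lower tail up to k, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Hypothetical toy data: four base questions under two made-up tones.
    toy = {
        "polite": [True, True, False, True],
        "rude": [True, False, False, False],
    }
    print(accuracy_by_tone(toy))
    print(mcnemar_exact(toy["polite"], toy["rude"]))
```

A paired test fits this setup because each base question is asked under every tone, so the same question can be compared across tones rather than treating the tone groups as independent samples.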

## Submission history

From: Om Dobariya **[v1]** Wed, 27 May 2026 19:23:46 UTC (698 KB)
