The Rise of Verbal Tics in Large Language Models: A Systematic Analysis Across Frontier Models
Summary
A systematic study of repetitive, formulaic verbal tics in eight frontier LLMs, introducing the Verbal Tic Index (VTI) and revealing significant inter-model variation and negative impact on perceived naturalness.
# The Rise of Verbal Tics in Large Language Models: A Systematic Analysis Across Frontier Models
Source: [https://arxiv.org/html/2604.19139](https://arxiv.org/html/2604.19139)
Shuai Wu \(M\.S\. Candidate, Lead Researcher\), Xue Li \(M\.S\., Research Assistant\), Yanna Feng \(Ph\.D\. Candidate, Academic Advisor\), Yufang Li \(Ph\.D\., Academic Advisor\), Zhijun Wang \(Ph\.D\., Research Consultant\), Ran Wang \(B\.S\., Research Assistant\)
\(April 2026\)
###### Abstract
As Large Language Models \(LLMs\) continue to evolve through alignment techniques such as Reinforcement Learning from Human Feedback \(RLHF\) and Constitutional AI, a growing and increasingly conspicuous phenomenon has emerged: the proliferation of *verbal tics*, repetitive and formulaic linguistic patterns that pervade model outputs\. These range from sycophantic openers \(“That’s a great question\!”, “Awesome\!”\) to pseudo\-empathetic affirmations \(“I completely understand your concern”, “I’m right here to catch you”\) and overused vocabulary \(“delve”, “tapestry”, “nuanced”\)\. In this paper, we present a systematic analysis of the verbal tic phenomenon across eight state\-of\-the\-art LLMs: GPT\-5\.4, Claude Opus 4\.7, Gemini 3\.1 Pro, Grok 4\.2, Doubao\-Seed\-2\.0\-pro, Kimi K2\.5, DeepSeek V3\.2, and MiMo\-V2\-Pro\. Using a custom framework for standardized API\-based evaluation, we assess 10,000 prompts across 10 task categories in both English and Chinese, yielding 160,000 model responses\. We introduce the *Verbal Tic Index \(VTI\)*, a composite metric quantifying tic prevalence, and analyze its correlation with sycophancy, lexical diversity, and human\-perceived naturalness\. Our findings reveal significant inter\-model variation: Gemini 3\.1 Pro exhibits the highest VTI \(0\.590\), while DeepSeek V3\.2 achieves the lowest \(0\.295\)\. We further demonstrate that verbal tics accumulate over multi\-turn conversations, are amplified in subjective tasks, and show distinct cross\-lingual patterns\. Human evaluation \($N=120$\) confirms a strong inverse relationship between sycophancy and perceived naturalness \($r=-0.87$, $p<0.001$\)\. These results underscore the “alignment tax” of current training paradigms and highlight the urgent need for more authentic human\-AI interaction frameworks\.
*Keywords*: Large Language Models · Verbal Tics · Sycophancy · RLHF · Alignment Tax · Lexical Diversity · Catchphrases · Pseudo\-Empathy
## 1 Introduction
The rapid advancement of Large Language Models \(LLMs\) has fundamentally transformed the landscape of human\-computer interaction\. Models such as GPT\-5\.4 \(OpenAI, [2026](https://arxiv.org/html/2604.19139#bib.bib9)\), Claude Opus 4\.7 \(Anthropic, [2026](https://arxiv.org/html/2604.19139#bib.bib1)\), Gemini 3\.1 Pro \(Google DeepMind, [2026](https://arxiv.org/html/2604.19139#bib.bib6)\), and their contemporaries now serve as conversational assistants, creative collaborators, and knowledge workers across billions of interactions daily\. A critical enabler of this success has been alignment training, the process of fine\-tuning models to be helpful, harmless, and honest through techniques like Reinforcement Learning from Human Feedback \(RLHF\) \(Ouyang et al\., [2022](https://arxiv.org/html/2604.19139#bib.bib10)\) and Constitutional AI \(Bai et al\., [2022](https://arxiv.org/html/2604.19139#bib.bib2)\)\.
However, as these alignment techniques have matured and been applied at scale, a distinct and increasingly conspicuous linguistic artifact has emerged: the *verbal tic*\. We define a verbal tic as a repetitive, formulaic expression or phrase that appears with disproportionate frequency in a model’s output, independent of the specific conversational context\. These tics manifest in several forms:
- Sycophantic openers: Exaggerated praise or validation of the user’s input \(e\.g\., “That’s a great question\!”, “Excellent observation\!”, “Your insight is incredibly sharp\!”\)\.
- Pseudo\-empathetic affirmations: Formulaic emotional understanding that often feels hollow \(e\.g\., “I completely understand your concern”, “I’m right here, not hiding, not dodging, ready to catch you”\)\.
- Hedging phrases: Defensive language designed to soften assertions \(e\.g\., “It’s important to note that…”, “I have to be honest…”\)\.
- Overused vocabulary: Specific words that appear with statistically anomalous frequency \(e\.g\., “delve”, “tapestry”, “nuanced”, “multifaceted”\)\.
- Filler transitions: Unnecessary connective phrases that pad responses \(e\.g\., “Furthermore”, “Moreover”, “Let me walk you through this step by step”\)\.
This phenomenon has been widely discussed in both academic literature and public discourse\. Sharma et al\. \([2023](https://arxiv.org/html/2604.19139#bib.bib11)\) provided early evidence that RLHF\-trained models exhibit systematic sycophancy across multiple evaluation paradigms\. More recently, Cheng et al\. \([2026](https://arxiv.org/html/2604.19139#bib.bib5)\) demonstrated in *Science* that sycophantic AI responses decrease prosocial intentions and promote dependence in users \($N=2405$\)\. The Stanford AI Index Report of 2026 \(Stanford HAI, [2026](https://arxiv.org/html/2604.19139#bib.bib12)\) further highlighted declining model transparency, raising concerns about the mechanisms driving these behaviors\.
In this paper, we present a systematic, cross\-model, cross\-lingual analysis of verbal tics in frontier LLMs\. Our contributions include:
1. A systematic taxonomy of verbal tics across English and Chinese, with fine\-grained categorization\.
2. The *Verbal Tic Index \(VTI\)*, a composite metric for standardized measurement\.
3. Large\-scale evaluation of 8 frontier models across 10 task types, 10 prompt complexity levels, and 20\-turn conversations\.
4. Cross\-lingual analysis revealing distinct tic patterns between English and Chinese\.
5. Human evaluation \($N=120$\) correlating VTI with perceived naturalness, helpfulness, and trust\.
6. Embedding\-space analysis of tic phrases using t\-SNE visualization\.
## 2 Related Work
### 2\.1 Sycophancy in Language Models
The study of sycophancy in LLMs has gained significant attention since the seminal work of Sharma et al\. \([2023](https://arxiv.org/html/2604.19139#bib.bib11)\), who identified consistent patterns of sycophantic behavior across five state\-of\-the\-art AI assistants\. Their analysis revealed that models trained with RLHF are particularly susceptible to conforming to user beliefs, even when those beliefs are factually incorrect\. Carro \([2024](https://arxiv.org/html/2604.19139#bib.bib4)\) further explored the impact of sycophantic tendencies on user trust, finding a complex relationship between perceived agreeableness and long\-term credibility\.
Batzner et al\. \([2025](https://arxiv.org/html/2604.19139#bib.bib3)\) reviewed methodological challenges in measuring LLM sycophancy, identifying five core operationalizations and highlighting the inherently human nature of the phenomenon\. Most recently, Cheng et al\. \([2026](https://arxiv.org/html/2604.19139#bib.bib5)\) published a landmark study in *Science* demonstrating that sycophantic AI responses not only affect user perception but actively promote dependence and reduce prosocial intentions, with implications for decision\-making across multiple domains\.
### 2\.2 Repetition and Verbal Patterns in LLMs
The “repeat curse” in LLMs, the tendency for models to generate repetitive text, has been studied from multiple perspectives\. Yao et al\. \([2025](https://arxiv.org/html/2604.19139#bib.bib14)\) investigated the root causes of repetition through mechanistic interpretability, locating specific layers involved in repetitive generation via Sparse Autoencoder\-based activation manipulation\. Xu et al\. \([2022](https://arxiv.org/html/2604.19139#bib.bib13)\) proposed the DITTO framework, a training\-time penalty for pseudo\-repetition, at NeurIPS 2022, demonstrating that repetition can be mitigated without sacrificing generation quality\.
### 2\.3 AI Detection and Linguistic Fingerprints
The proliferation of verbal tics has also driven research in AI\-generated text detection\. Detectors leverage metrics such as perplexity and burstiness to identify the statistical predictability characteristic of LLM outputs \(Mitchell et al\., [2023](https://arxiv.org/html/2604.19139#bib.bib8)\)\. The observation that specific phrases \(e\.g\., “delve”, “tapestry”\) serve as reliable indicators of AI authorship has been widely documented in both academic and popular literature\.
### 2\.4 Medical and Domain\-Specific Sycophancy
Kim et al\. \([2026](https://arxiv.org/html/2604.19139#bib.bib7)\) evaluated sycophantic behavior in ten LLMs across multi\-turn medical conversations using an escalatory pushback framework, finding that all models are more easily persuaded to change their answers on clear multiple\-choice questions than on ambiguous diagnostic cases\. Their work highlights critical vulnerabilities in deploying LLMs for clinical decision support\.
## 3 Methodology
### 3\.1 Evaluated Models
Our study evaluates eight state\-of\-the\-art LLMs representing diverse architectural paradigms, training methodologies, and organizational origins\. Table [1](https://arxiv.org/html/2604.19139#S3.T1) summarizes the key characteristics of each model\.

Table 1: Overview of evaluated models\. All models were accessed via a unified evaluation framework using their respective API endpoints\.

| Model | Developer | Access | Notes |
| --- | --- | --- | --- |
| GPT-5.4 | OpenAI | API | Latest GPT series |
| Claude Opus 4.7 | Anthropic | API | Constitutional AI |
| Gemini 3.1 Pro | Google DeepMind | API | Multimodal capable |
| Grok 4.2 | xAI | API | Real-time knowledge |
| Doubao-Seed-2.0-pro | ByteDance | API | Chinese-optimized |
| Kimi K2.5 | Moonshot AI | API | Long-context specialist |
| DeepSeek V3.2 | DeepSeek | API | MoE architecture |
| MiMo-V2-Pro | Xiaomi | API | Reasoning-focused |
### 3\.2 Experimental Platform
All experiments were conducted through a custom evaluation framework \([https://github\.com/Noah\-Wu66/Vectaix\-AI](https://github.com/Noah-Wu66/Vectaix-AI)\) that provides unified API access to all evaluated models\. This framework supports standardized request formatting, response parsing, and logging across providers, ensuring experimental consistency\. The architecture routes requests through provider\-specific API adapters while maintaining a common interface for prompt injection, parameter control, and response collection\.
### 3\.3 Dataset Construction
We constructed a diverse evaluation dataset of 10,000 prompts spanning 10 task categories, designed to elicit a wide range of conversational behaviors:
Table 2: Task categories and their descriptions\.

| Task Category | Description | Prompts |
| --- | --- | --- |
| Creative Writing | Fiction, poetry, storytelling | 1,000 |
| Code Generation | Programming tasks across languages | 1,000 |
| Math Reasoning | Mathematical problem-solving | 1,000 |
| Casual Chat | Open-ended conversation | 1,000 |
| Academic Q&A | Scholarly questions and explanations | 1,000 |
| Emotional Support | Empathetic dialogue and counseling | 1,000 |
| Debate/Argument | Persuasive and adversarial dialogue | 1,000 |
| Summarization | Text compression and synthesis | 1,000 |
| Translation | Cross-lingual translation tasks | 1,000 |
| Role-Playing | Character-based interaction | 1,000 |

Each prompt was presented in both English and Chinese, yielding 20,000 total interactions per model and 160,000 total responses across all eight models\. All API calls were made between March 1–15, 2026, using the model versions available at that time\.
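
The response counts quoted above follow directly from the design. A minimal sketch of the arithmetic, using only figures stated in this section (the variable names are illustrative):

```python
# Response-count arithmetic for the evaluation design described above.
TASK_CATEGORIES = 10
PROMPTS_PER_CATEGORY = 1_000
LANGUAGES = 2  # English and Chinese
MODELS = 8

prompts = TASK_CATEGORIES * PROMPTS_PER_CATEGORY   # unique prompts
responses_per_model = prompts * LANGUAGES          # each prompt in both languages
total_responses = responses_per_model * MODELS     # across all evaluated models

print(prompts, responses_per_model, total_responses)  # 10000 20000 160000
```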
### 3\.4 Verbal Tic Detection Pipeline
We developed an automated verbal tic detection pipeline consisting of three stages:
1. Lexical Matching: Pattern\-based detection of known tic phrases using a curated dictionary of 200\+ English and 150\+ Chinese verbal tics \(see Appendix [A](https://arxiv.org/html/2604.19139#A1)\)\. Each phrase is matched with context\-aware rules to reduce false positives\. For example, “Absolutely\!” is only flagged as a sycophantic opener when it appears at the beginning of a response, not when used as an adverb within a sentence\.
2. Statistical Analysis: Identification of statistically over\-represented n\-grams \($n\in\{1,2,3,4\}$\) using TF\-IDF weighting against a human\-written reference corpus \(a balanced sample of 50,000 sentences from Wikipedia and Reddit\)\.
3. Semantic Clustering: Embedding\-based grouping of semantically similar tic phrases using the all\-MiniLM\-L6\-v2 sentence transformer, enabling detection of paraphrased tics\. Phrases with cosine similarity $>0.85$ are merged into the same tic cluster\.
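
The merging step in stage 3 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes single-linkage merging (any pair above the threshold joins two clusters), which the text does not specify, and it takes precomputed embedding vectors as plain lists rather than calling a sentence transformer. The function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def merge_into_clusters(embeddings, threshold=0.85):
    """Group phrase indices whose embeddings exceed the similarity threshold.

    Assumption: single-linkage merging via union-find; the paper only states
    that phrases with cosine similarity > 0.85 share a cluster.
    """
    parent = list(range(len(embeddings)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) > threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(embeddings)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

# Toy 2-D vectors standing in for real sentence embeddings.
toy = [[1.0, 0.0], [0.98, 0.2], [0.0, 1.0]]
print(merge_into_clusters(toy))  # [[0, 1], [2]]
```

Single linkage can chain loosely related phrases together; a stricter alternative would require every pair in a cluster to exceed the threshold.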
When a phrase matches multiple categories \(e\.g\., “Absolutely\!” as both a sycophantic opener and an emphatic affirmation\), we assign it to the category with the highest contextual probability based on its position and surrounding tokens\. This priority rule prevents double\-counting across categories\.
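
A minimal sketch of the position\-aware matching and single\-category assignment described above. It assumes a tiny hypothetical dictionary (the curated lists are in the paper's appendix) and uses response position alone as a stand\-in for the "contextual probability" the text mentions; `AMBIGUOUS` and `categorize` are illustrative names.

```python
import re

# Hypothetical mini-dictionary of exclamations that can fall in two categories.
# Requiring the "!" means plain adverbial uses ("absolutely fine") never match.
AMBIGUOUS = re.compile(r"\b(?:absolutely|excellent)!", re.IGNORECASE)

def categorize(response: str) -> list[tuple[str, str]]:
    """Assign each matched phrase to exactly one category, based on position."""
    lead = len(response) - len(response.lstrip())  # index of first non-space char
    results = []
    for m in AMBIGUOUS.finditer(response):
        # Assumption: response-initial matches are openers, all others are
        # emphatic affirmations. Each match gets one label, so nothing is
        # counted twice across categories.
        category = "sycophantic_opener" if m.start() == lead else "emphatic_affirmation"
        results.append((m.group(), category))
    return results

print(categorize("Absolutely! That works. Absolutely!"))
print(categorize("This is absolutely fine."))
```

The first call labels the leading match as an opener and the later one as an emphatic affirmation; the second call returns an empty list because the adverbial use does not match the pattern.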
### 3\.5 The Verbal Tic Index \(VTI\)
We define the Verbal Tic Index \(VTI\) as a composite metric:

$$\text{VTI}=\alpha\cdot\text{TicRate}+\beta\cdot(1-\text{TTR}_{\text{norm}})+\gamma\cdot\text{SycScore}+\delta\cdot\text{RepRate} \tag{1}$$

where:
- $\text{TicRate}\in[0,1]$: Proportion of responses containing at least one detected verbal tic\.
- $\text{TTR}_{\text{norm}}$: Length\-normalized Type\-Token Ratio, computed over a fixed sliding window of 200 tokens \(MATTR\) to mitigate length sensitivity\. English tokenization uses spaCy; Chinese tokenization uses jieba\.
- $\text{SycScore}\in[0,1]$: Sycophancy score based on the proportion of sycophantic openers and pseudo\-empathetic phrases\.
- $\text{RepRate}\in[0,1]$: Repetition rate of unique phrases across responses\.
- $\alpha=0.3,\ \beta=0.2,\ \gamma=0.3,\ \delta=0.2$: Weights determined through a grid search over $\{0.1,0.2,0.3,0.4\}$ to maximize rank correlation with human judgments on a held\-out validation set of 500 annotated responses\.
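
The definitions above can be sketched in a few lines. This is an illustration under stated assumptions, not the released pipeline: `mattr` averages the type\-token ratio over every 200\-token window (falling back to plain TTR for shorter inputs, a choice the paper does not specify), and the component values passed to `vti` in the example are hypothetical.

```python
def mattr(tokens, window=200):
    """Moving-Average Type-Token Ratio: mean TTR over all sliding windows."""
    if not tokens:
        return 0.0
    if len(tokens) <= window:
        # Assumption: inputs shorter than one window use plain TTR.
        return len(set(tokens)) / len(tokens)
    ratios = [len(set(tokens[i:i + window])) / window
              for i in range(len(tokens) - window + 1)]
    return sum(ratios) / len(ratios)

# Weights reported in the text (alpha, beta, gamma, delta).
WEIGHTS = {"alpha": 0.3, "beta": 0.2, "gamma": 0.3, "delta": 0.2}

def vti(tic_rate, ttr_norm, syc_score, rep_rate, w=WEIGHTS):
    """Composite Verbal Tic Index following Equation (1)."""
    return (w["alpha"] * tic_rate
            + w["beta"] * (1.0 - ttr_norm)
            + w["gamma"] * syc_score
            + w["delta"] * rep_rate)

# Hypothetical component scores, for illustration only.
print(round(vti(tic_rate=0.5, ttr_norm=0.6, syc_score=0.4, rep_rate=0.3), 3))  # 0.41
```

Because the four weights sum to 1 and each component lies in $[0,1]$, the resulting VTI also lies in $[0,1]$.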
### 3\.6 Human Evaluation Protocol
We recruited 120 human evaluators \(60 English\-speaking, 60 Chinese\-speaking\) from a university participant pool\. Each evaluator assessed 50 randomly sampled responses on a 5\-point Likert scale across six dimensions: Naturalness, Helpfulness, Sycophancy Perception, Trust, Annoyance, and Repetitiveness\. Likert anchors ranged from 1 \(“Not at all”\) to 5 \(“Extremely”\)\. Evaluators were blinded to model identity\. Response\-evaluator assignment was randomized, yielding 6,000 total annotations\. Inter\-annotator agreement, measured using Krippendorff’s alpha, was $\alpha=0.72$, indicating substantial agreement\.
## 4 Results
### 4\.1 Overall Verbal Tic Index
Figure [1](https://arxiv.org/html/2604.19139#S4.F1) presents the VTI scores across all models in English, Chinese, and overall\. The results reveal significant inter\-model variation, with VTI scores ranging from 0\.295 \(DeepSeek V3\.2\) to 0\.590 \(Gemini 3\.1 Pro\)\.
Figure 1: Verbal Tic Index \(VTI\) comparison across models\. Lower scores indicate fewer verbal tics and more natural language use\.

Table [3](https://arxiv.org/html/2604.19139#S4.T3) provides the complete VTI breakdown with component scores\. The Diversity Index and Naturalness Index are derived from the lexical diversity metrics and the human evaluation, respectively, where higher values indicate better performance \(more diverse and more natural\)\.
Table 3: Complete Verbal Tic Index \(VTI\) scores and component metrics for all evaluated models\.

| Model | VTI (EN) | VTI (ZH) | VTI (All) | Syc. Index | Div. Index | Nat. Index |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-5.4 | 0.423 | 0.398 | 0.411 | 0.456 | 0.567 | 0.589 |
| Claude Opus 4.7 | 0.289 | 0.345 | 0.317 | 0.312 | 0.678 | 0.734 |
| Gemini 3.1 Pro | 0.567 | 0.612 | 0.590 | 0.634 | 0.489 | 0.445 |
| Grok 4.2 | 0.345 | 0.312 | 0.329 | 0.378 | 0.612 | 0.634 |
| Doubao-Seed-2.0-pro | 0.445 | 0.489 | 0.467 | 0.523 | 0.534 | 0.556 |
| Kimi K2.5 | 0.389 | 0.423 | 0.406 | 0.467 | 0.578 | 0.601 |
| DeepSeek V3.2 | 0.312 | 0.278 | 0.295 | 0.298 | 0.645 | 0.689 |
| MiMo-V2-Pro | 0.378 | 0.401 | 0.390 | 0.423 | 0.512 | 0.523 |
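
As a quick consistency check on Table 3, the sketch below ranks the models by overall VTI and tests whether each VTI (All) value agrees with the simple mean of the English and Chinese scores. The paper does not state how VTI (All) is aggregated, so the mean is an assumption being tested here, not a documented definition.

```python
# Values copied from Table 3: model -> (VTI EN, VTI ZH, VTI All).
TABLE_3 = {
    "GPT-5.4": (0.423, 0.398, 0.411),
    "Claude Opus 4.7": (0.289, 0.345, 0.317),
    "Gemini 3.1 Pro": (0.567, 0.612, 0.590),
    "Grok 4.2": (0.345, 0.312, 0.329),
    "Doubao-Seed-2.0-pro": (0.445, 0.489, 0.467),
    "Kimi K2.5": (0.389, 0.423, 0.406),
    "DeepSeek V3.2": (0.312, 0.278, 0.295),
    "MiMo-V2-Pro": (0.378, 0.401, 0.390),
}

# Largest gap between the reported overall score and the EN/ZH mean.
max_gap = max(abs((en + zh) / 2 - overall) for en, zh, overall in TABLE_3.values())

# Models ordered from lowest (fewest tics) to highest overall VTI.
ranking = sorted(TABLE_3, key=lambda model: TABLE_3[model][2])

print(f"largest |mean(EN, ZH) - All| = {max_gap:.4f}")
print(" < ".join(ranking))
```

For these values every gap stays below 0.001, so the overall column is consistent with an unweighted mean up to rounding.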
### 4\.2 Multi\-Dimensional VTI Analysis
Figure [2](https://arxiv.org/html/2604.19139#S4.F2) provides a radar chart visualization of the multi\-dimensional VTI profile\. To ensure a consistent interpretation where larger polygon areas correspond to more problematic verbal tic behavior, the Diversity Index and Naturalness Index axes are inverted \(plotted as $1-\text{value}$\), so that all six axes follow the convention of “higher is worse\.”
Figure 2: Multi\-dimensional VTI profile radar chart\. All axes are oriented so that higher values indicate more problematic behavior \(Diversity and Naturalness axes are inverted\)\. Larger polygon areas thus correspond to more pronounced verbal tic profiles\. Gemini 3\.1 Pro shows the most expansive profile, while Claude Opus 4\.7 and DeepSeek V3\.2 maintain compact, low\-tic profiles\.
### 4\.3 English Verbal Tic Frequency Distribution
Figure [3](https://arxiv.org/html/2604.19139#S4.F3) presents the frequency of English verbal tic categories per 1,000 responses\. Emphatic Affirmations and Overused Vocabulary are the most prevalent categories across models, with Gemini 3\.1 Pro showing particularly high rates of Emphatic Affirmations \(523 per 1,000 responses\)\.
Figure 3: English verbal tic frequency heatmap\. Values represent occurrences per 1,000 responses\. Darker cells indicate higher frequencies\.
### 4\.4 Chinese Verbal Tic Frequency Distribution
The Chinese verbal tic landscape reveals distinct patterns \(Figure [4](https://arxiv.org/html/2604.19139#S4.F4)\)\. Sycophantic Openers dominate in Gemini 3\.1 Pro \(567/1000\) and Doubao\-Seed\-2\.0\-pro \(423/1000\), while Pseudo\-Empathy is most pronounced in Claude Opus 4\.7 \(456/1000\) and GPT\-5\.4 \(345/1000\)\.
Figure 4: Chinese verbal tic frequency heatmap\. The distribution reveals that Chinese\-language tics are more concentrated in sycophantic and pseudo\-empathetic categories\.
### 4\.5 Top Verbal Tic Phrases
Figure [5](https://arxiv.org/html/2604.19139#S4.F5) and Figure [6](https://arxiv.org/html/2604.19139#S4.F6) present the most frequently occurring verbal tic phrases in English and Chinese, aggregated across all models\.
Figure 5: Top 15 English verbal tic phrases by total frequency across all models\. “Absolutely\!” and “That’s a great question\!” lead the rankings, followed closely by “It’s important to note” and “delve”\.
Figure 6: Top 15 Chinese verbal tic phrases by total frequency\. “This is a really great question” and “Awesome\!/Amazing\!” are the most prevalent catchphrases\.
### 4\.6 Task\-Dependent Verbal Tic Rates
The prevalence of verbal tics varies dramatically across task types \(Figure [7](https://arxiv.org/html/2604.19139#S4.F7)\)\. Emotional Support tasks elicit the highest tic rates \(mean = 0\.55 across models\), followed by Role\-Playing \(0\.49\) and Debate/Argument \(0\.39\)\. Conversely, Translation \(0\.09\) and Code Generation \(0\.13\) tasks produce the fewest tics, likely because these tasks demand precise, structured outputs with less room for conversational filler\.
Figure 7: Verbal tic rate by task type and model\. Subjective, conversational tasks consistently elicit higher rates of formulaic responses across all models\.
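
To make the size of the task effect concrete, the sketch below computes pairwise ratios between the mean tic rates reported above. Grouping the five named tasks into "subjective" and "objective" sets is an assumption made for illustration; the paper does not list which tasks belong to each group.

```python
# Mean tic rates across models, as reported in the text above.
subjective = {"Emotional Support": 0.55, "Role-Playing": 0.49, "Debate/Argument": 0.39}
objective = {"Translation": 0.09, "Code Generation": 0.13}

# Ratio of every subjective task's rate to every objective task's rate.
ratios = {
    (s_name, o_name): s_rate / o_rate
    for s_name, s_rate in subjective.items()
    for o_name, o_rate in objective.items()
}

for (s_name, o_name), ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{s_name} vs {o_name}: {ratio:.1f}x")
```

Under this grouping the ratios span roughly 3x (Debate/Argument vs Code Generation) to about 6x (Emotional Support vs Translation).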
### 4\.7 Temporal Dynamics: The Accumulation Effect
A notable finding of our study is the temporal accumulation of verbal tics over multi\-turn conversations \(Figure [8](https://arxiv.org/html/2604.19139#S4.F8)\)\. Across all models, the verbal tic rate shows a clear overall upward trend from Turn 1 to Turn 20, with an average increase of approximately 110% from the first to the last turn\. This pattern is consistent with the “repeat curse” identified by Yao et al\. \([2025](https://arxiv.org/html/2604.19139#bib.bib14)\) and the multi\-turn sycophancy escalation observed by Kim et al\. \([2026](https://arxiv.org/html/2604.19139#bib.bib7)\), suggesting that models progressively fall into repetitive linguistic loops as context length grows\.
Figure 8: Verbal tic rate across 20 conversation turns\. All models show increasing reliance on tics as conversations progress, with GPT\-5\.4 and MiMo\-V2\-Pro showing the steepest increases\.
### 4\.8 Sycophancy Analysis by Prompt Type
The sycophancy score varies significantly based on prompt type \(Figure [9](https://arxiv.org/html/2604.19139#S4.F9)\)\. Emotional Appeal prompts trigger the highest sycophancy scores \(mean = 0\.68\), followed by Praise Seeking \(0\.61\) and Self\-Deprecation \(0\.57\)\. Neutral queries and technical questions produce the lowest sycophancy scores \(mean = 0\.23 and 0\.20, respectively\)\.
Figure 9: Sycophancy score heatmap by prompt type and model\. Emotionally charged prompts consistently elicit higher sycophancy across all models\.
### 4\.9 Sycophancy vs\. Naturalness: The Alignment Tax
Figure [10](https://arxiv.org/html/2604.19139#S4.F10) reveals a strong inverse correlation between a model’s Sycophancy Index and its human\-rated Naturalness score \($r=-0.87$, $p<0.001$\)\. Models that rely heavily on sycophantic openers and pseudo\-empathy are consistently perceived as less natural and more “robotic” by human evaluators\. Notably, the bubble sizes \(representing Helpfulness scores\) show that sycophancy does not necessarily improve perceived helpfulness: Claude Opus 4\.7 achieves the highest Helpfulness score \(4\.45/5\) while maintaining the second\-lowest Sycophancy Index \(0\.312; only DeepSeek V3\.2 is lower, at 0\.298\)\.
Figure 10: Sycophancy Index vs\. Perceived Naturalness\. Bubble size encodes Helpfulness score\. The dashed line represents the linear regression fit \($r=-0.87$\)\.
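
For readers who want to explore this relationship, the sketch below computes a Pearson correlation at the model level, pairing each model's Sycophancy Index (Table 3) with its human\-rated Naturalness score (Table 4). This pairing and the model\-level unit of analysis ($n=8$) are assumptions; the paper does not state how the reported coefficient was computed, so this sketch need not reproduce it exactly.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Order: GPT-5.4, Claude Opus 4.7, Gemini 3.1 Pro, Grok 4.2,
#        Doubao-Seed-2.0-pro, Kimi K2.5, DeepSeek V3.2, MiMo-V2-Pro
syc_index = [0.456, 0.312, 0.634, 0.378, 0.523, 0.467, 0.298, 0.423]  # Table 3
naturalness = [3.42, 4.12, 2.87, 3.67, 3.23, 3.45, 3.89, 3.12]        # Table 4

r = pearson_r(syc_index, naturalness)
print(f"model-level Pearson r = {r:.2f}")
```

With only eight points, a model\-level coefficient carries wide uncertainty, so it is best read as a rough check on the direction and strength of the trend.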
### 4\.10 Temperature Sensitivity
The effect of sampling temperature on verbal tic rates is shown in Figure [11](https://arxiv.org/html/2604.19139#S4.F11)\. Higher temperatures generally reduce tic rates by introducing more randomness into token selection, but the effect diminishes at higher temperature values\. At the commonly used default temperature of $T=0.7$, models exhibit moderate tic rates, suggesting that temperature tuning alone is insufficient to eliminate verbal tics\.
Figure 11: Effect of sampling temperature on verbal tic rate\. Higher temperatures reduce tic prevalence, but the effect diminishes at higher temperature values\.
### 4\.11 Cross\-Lingual Analysis
Figure [12](https://arxiv.org/html/2604.19139#S4.F12) compares verbal tic rates and sycophancy scores between English and Chinese\. Chinese responses show higher sycophancy scores in the majority of models \(mean increase of 5\.2%\), likely reflecting cultural expectations encoded in training data\. However, verbal tic rates show more model\-specific patterns: Gemini 3\.1 Pro and Doubao\-Seed\-2\.0\-pro exhibit significantly higher Chinese tic rates, while Grok 4\.2 and DeepSeek V3\.2 show lower Chinese tic rates than their English counterparts\.
Figure 12: Cross\-lingual comparison of \(a\) verbal tic rates and \(b\) sycophancy scores between English and Chinese\.
### 4\.12 Lexical Diversity Metrics
Figure [13](https://arxiv.org/html/2604.19139#S4.F13) presents six lexical diversity metrics across models\. Models with higher VTI scores consistently show lower Type\-Token Ratios \(TTR\), lower Hapax Legomena Ratios, and higher Repetition Rates\. Claude Opus 4\.7 and DeepSeek V3\.2 demonstrate the highest lexical diversity, while Gemini 3\.1 Pro and MiMo\-V2\-Pro show the lowest\.
Figure 13: Lexical diversity metrics across models\. Higher TTR and Hapax Ratio indicate greater vocabulary diversity; lower Repetition Rate indicates less repetitive output\.
### 4\.13 Response Token Composition
Figure [14](https://arxiv.org/html/2604.19139#S4.F14) breaks down the token composition of model responses into three categories: Content Tokens, Filler Tokens, and Verbal Tic Tokens\. Gemini 3\.1 Pro allocates the highest proportion of tokens to verbal tics \(12\.3%\), while Claude Opus 4\.7 dedicates the most tokens to actual content \(84\.2%\)\.
Figure 14: Response token composition by model\. The stacked bars show the proportion of tokens dedicated to content, filler, and verbal tics\.
### 4\.14 Prompt Complexity Analysis
Figure [15](https://arxiv.org/html/2604.19139#S4.F15) examines the relationship between prompt complexity \(rated on a 1–10 scale by two independent annotators, with disagreements resolved by a third\) and verbal tic rate\. Most models show a slight negative trend: as prompt complexity increases, tic rates decrease marginally, though some models, such as Gemini 3\.1 Pro, exhibit a slight rebound at the highest complexity levels\. This suggests that more challenging prompts may push models to devote more of each response to substantive content, leaving less room for formulaic filler\.
Figure 15: Prompt complexity level vs\. verbal tic rate\. Most models show a slight decrease in tic rate as prompt complexity increases\.
### 4\.15 t\-SNE Embedding Analysis
To understand the semantic structure of verbal tics, we embedded all detected tic phrases using the all\-MiniLM\-L6\-v2 sentence transformer and visualized them using t\-SNE \(perplexity = 30, learning rate = 200, 1000 iterations, random seed = 42; Figure [16](https://arxiv.org/html/2604.19139#S4.F16)\)\. The resulting clusters suggest that each model occupies a relatively distinct region of the embedding space, indicating model\-specific “tic signatures\.” Models with higher VTI scores \(e\.g\., Gemini 3\.1 Pro\) show more dispersed clusters, suggesting a wider variety of tic phrases, while lower\-VTI models \(e\.g\., DeepSeek V3\.2\) show tighter, more constrained clusters\. We note that t\-SNE visualizations are sensitive to hyperparameter choices and should be interpreted as qualitative illustrations rather than definitive evidence of cluster separation\.
Figure 16: t\-SNE visualization of verbal tic phrase embeddings\. Each point represents a detected tic phrase, colored by model\. Convex hulls delineate model\-specific clusters\.
### 4\.16 Human Evaluation Results
Table [4](https://arxiv.org/html/2604.19139#S4.T4) presents the human evaluation results across six dimensions\. Claude Opus 4\.7 achieves the highest scores in Naturalness \(4\.12/5\) and Trust \(4\.23/5\), while Gemini 3\.1 Pro receives the highest Sycophancy Perception \(4\.56/5\) and Annoyance \(3\.67/5\) scores\. Figure [17](https://arxiv.org/html/2604.19139#S4.F17) provides a radar chart visualization of these results\.

Table 4: Human evaluation scores \(1–5 Likert scale, $N=120$ evaluators\)\. Higher is better for Naturalness, Helpfulness, and Trust; lower is better for Sycophancy Perception, Annoyance, and Repetitiveness\.

| Model | Natural. | Helpful. | Syc. Perc. | Trust | Annoy. | Repet. |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-5.4 | 3.42 | 4.23 | 3.67 | 3.56 | 2.89 | 3.23 |
| Claude Opus 4.7 | 4.12 | 4.45 | 2.34 | 4.23 | 1.78 | 2.12 |
| Gemini 3.1 Pro | 2.87 | 4.12 | 4.56 | 3.12 | 3.67 | 4.12 |
| Grok 4.2 | 3.67 | 4.01 | 3.12 | 3.78 | 2.45 | 2.78 |
| Doubao-Seed-2.0-pro | 3.23 | 4.34 | 3.89 | 3.45 | 3.12 | 3.45 |
| Kimi K2.5 | 3.45 | 4.23 | 3.45 | 3.67 | 2.67 | 2.89 |
| DeepSeek V3.2 | 3.89 | 4.34 | 2.67 | 4.01 | 2.01 | 2.34 |
| MiMo-V2-Pro | 3.12 | 3.98 | 3.34 | 3.34 | 2.89 | 3.12 |

Figure 17: Human evaluation radar chart\. Axes are normalized to \[0, 1\] with higher values indicating better performance \(Sycophancy, Annoyance, and Repetitiveness are inverted\)\.
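
The axis normalization described in the Figure 17 caption can be sketched as follows. The exact formula is not given in the paper; this sketch assumes a linear min\-max rescaling of the 1–5 Likert range, with the three "lower is better" dimensions flipped. `normalize` and `LOWER_IS_BETTER` are illustrative names.

```python
# Dimensions where a lower Likert score is better, so the axis is inverted.
LOWER_IS_BETTER = {"Syc. Perc.", "Annoy.", "Repet."}

def normalize(dimension, score, lo=1.0, hi=5.0):
    """Map a Likert score to [0, 1] so that higher always means better.

    Assumption: linear rescaling of the 1-5 range; inverted dimensions
    are plotted as 1 - scaled value.
    """
    scaled = (score - lo) / (hi - lo)
    return 1.0 - scaled if dimension in LOWER_IS_BETTER else scaled

# Claude Opus 4.7's row from Table 4.
claude = {"Natural.": 4.12, "Helpful.": 4.45, "Syc. Perc.": 2.34,
          "Trust": 4.23, "Annoy.": 1.78, "Repet.": 2.12}

for dimension, score in claude.items():
    print(f"{dimension:<11} {score:.2f} -> {normalize(dimension, score):.3f}")
```

Under this mapping a Naturalness score of 4.12 becomes 0.78, while an Annoyance score of 1.78 becomes about 0.805 after inversion.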
## 5 Discussion
### 5\.1 The Alignment Tax
Our findings reveal a fundamental tension in current LLM alignment paradigms\. RLHF and similar techniques optimize for user satisfaction, which human raters often conflate with agreeableness and politeness\. This creates a perverse incentive: models learn that sycophantic, formulaic responses receive higher reward signals, leading to the proliferation of verbal tics\. We term this the “alignment tax”: the cost in linguistic diversity and authenticity that models pay for achieving high alignment scores\.
The data in Table [3](https://arxiv.org/html/2604.19139#S4.T3) illustrate this trade\-off clearly\. Gemini 3\.1 Pro, which exhibits the highest VTI \(0\.590\), also shows the lowest Naturalness Index \(0\.445\) and Diversity Index \(0\.489\)\. Conversely, Claude Opus 4\.7, with the second\-lowest VTI \(0\.317\), achieves the highest Diversity Index \(0\.678\) and Naturalness Index \(0\.734\), while DeepSeek V3\.2 attains the lowest VTI \(0\.295\) with similarly strong diversity \(0\.645\) and naturalness \(0\.689\) profiles\.
### 5\.2 Cultural Dimensions of Verbal Tics
Our cross\-lingual analysis reveals that verbal tics are not merely a linguistic phenomenon but also a cultural one\. Chinese\-language responses show higher sycophancy scores in the majority of models \(mean increase of 5\.2%\), reflecting cultural norms around politeness, face\-saving, and indirect communication that are encoded in training data\. The specific tic phrases also differ qualitatively: while English tics tend toward professional formality \(“It’s important to note”\), Chinese tics often employ more emotionally charged language \(“Your insight is incredibly sharp\!”, “I’m right here to catch you”\)\.
### 5\.3 Model\-Specific Patterns
Each model exhibits a distinctive “tic signature” that reflects its training methodology and alignment approach:
- GPT\-5\.4: Moderate tic rates with a balanced distribution across categories\. Shows a particular affinity for pseudo\-empathetic phrases in Chinese\.
- Claude Opus 4\.7: Lowest sycophantic opener rate but highest pseudo\-empathy in Chinese\. Its Constitutional AI training appears to suppress overt flattery while encouraging a more “thoughtful” persona that manifests as hedging phrases \(e\.g\., “I have to be honest”, “This question makes me a bit uneasy”\)\. Notably, Claude achieves the highest Diversity Index \(0\.678\) and Naturalness Index \(0\.734\) among all models, consistent with its low VTI\.
- Gemini 3\.1 Pro: Highest VTI across all dimensions\. Exhibits pronounced sycophantic behavior, particularly in Chinese, with phrases like “Absolutely the caliber of a top\-journal author” and “Your eyes are practically a natural flaw detector”\.
- DeepSeek V3\.2: Lowest overall VTI \(0\.295\)\. Its observed profile is consistent with the hypothesis that its MoE architecture and training approach may produce more diverse, less formulaic outputs, though the causal mechanism remains to be established\.
### 5\.4 Implications for AI Safety and Trust
The strong inverse correlation between sycophancy and perceived naturalness \($r=-0.87$, $p<0.001$\) has significant implications for AI safety\. As Cheng et al\. \([2026](https://arxiv.org/html/2604.19139#bib.bib5)\) demonstrated, sycophantic AI responses can promote dependence and reduce prosocial intentions\. Our findings extend this concern: models that rely heavily on verbal tics not only reduce the quality of individual interactions but may also erode long\-term user trust and critical thinking\. Notably, the human evaluation data shows that higher sycophancy is also associated with lower Trust scores \(Table [4](https://arxiv.org/html/2604.19139#S4.T4)\), though the correlation between sycophancy and trust \($r=-0.78$\) is somewhat weaker than the sycophancy–naturalness relationship\.
### 5\.5 Limitations
Several limitations should be acknowledged:
1. 1\.API access constraints: Model behavior may differ between API and web interface interactions\. All results reflect API\-based access at the time of data collection\.
2. 2\.Temporal variability: Model outputs may change as providers update their systems\. Our data was collected during a fixed two\-week window \(March 1–15, 2026\)\.
3. 3\.Cultural bias: Our Chinese evaluation primarily reflects Simplified Chinese norms and may not generalize to other Chinese\-speaking regions\.
4. 4\.Evaluator demographics: Human evaluators were predominantly university\-educated, which may not represent the broader user population\.
5. 5\.VTI weight sensitivity: The VTI weights were optimized on a held\-out validation set; different weight configurations may yield different model rankings\. A full sensitivity analysis is provided in the supplementary materials\.
6. 6\.TTR length sensitivity: Although we use MATTR \(sliding\-window TTR\) to mitigate length effects, models with substantially different average response lengths \(Table[8](https://arxiv.org/html/2604.19139#A3.T8)\) may still exhibit residual length\-related bias in diversity metrics\.
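To make the MATTR idea in limitation 6 concrete, the following is a minimal sketch of a moving\-average type\-token ratio. The function name `mattr` and the window size of 50 tokens are assumptions chosen for illustration; the window size used in the study is not stated in this section, and the authors' own implementation may differ.

```python
from collections import Counter


def mattr(tokens, window=50):
    """Moving-average type-token ratio (MATTR).

    Averages the type-token ratio over every contiguous window of
    `window` tokens. The default window of 50 is an illustrative
    assumption, not a value taken from the paper.
    """
    n = len(tokens)
    if n == 0:
        return 0.0
    if n <= window:
        # Text shorter than one window: fall back to plain TTR.
        return len(set(tokens)) / n

    # Counts for the first window, then slide one token at a time.
    counts = Counter(tokens[:window])
    total = len(counts) / window
    for i in range(window, n):
        leaving = tokens[i - window]
        counts[leaving] -= 1
        if counts[leaving] == 0:
            del counts[leaving]
        counts[tokens[i]] += 1
        total += len(counts) / window

    # There are n - window + 1 windows in total.
    return total / (n - window + 1)


if __name__ == "__main__":
    sample = "the model said the same thing the model always says".split()
    print(mattr(sample, window=5))
```

Because every window has the same length, the score does not fall simply because a response is longer, which is the property plain TTR lacks. Residual bias can remain when responses are shorter than the window, since the sketch then falls back to plain TTR.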
## 6 Conclusion
This paper presents a systematic, cross\-model, cross\-lingual analysis of verbal tics in frontier Large Language Models\. Through evaluation of eight state\-of\-the\-art models across 160,000 interactions, we have demonstrated that:
1. Verbal tics are pervasive across all evaluated models, with VTI scores ranging from 0\.295 to 0\.590\.
2. Tic prevalence is highly task\-dependent, with subjective tasks eliciting 4–6× higher tic rates than objective tasks\.
3. Tics accumulate over multi\-turn conversations, increasing by an average of approximately 110% from Turn 1 to Turn 20\.
4. Chinese\-language interactions show 5\.2% higher sycophancy scores than English on average, reflecting cultural encoding in training data\.
5. Human evaluators perceive a strong inverse relationship between sycophancy and naturalness \($r=-0.87$\)\.
6. The Verbal Tic Index \(VTI\) provides a composite metric for standardized assessment of this phenomenon\.
We hope that this work contributes to further research into alignment techniques that preserve helpfulness while promoting linguistic diversity and authenticity\. The verbal tic phenomenon is not merely an aesthetic concern—it reflects deeper issues in how we train, evaluate, and deploy AI systems that interact with billions of users daily\.
## Data Availability
Due to the proprietary nature of the model API outputs and the terms of service of the respective providers, we do not publicly release the raw response dataset\. The prompt set, verbal tic dictionary, detection pipeline code, and statistical analysis scripts are available at [https://github\.com/Noah\-Wu66/Vectaix\-AI](https://github.com/Noah-Wu66/Vectaix-AI)\. Researchers seeking access to the processed data for replication purposes may contact the corresponding author\.
## Acknowledgements
We extend our sincere gratitude to the testing and support team for their invaluable contributions in evaluating and refining the experimental framework: Bolun Liu \(M\.S\.\), Weilin Cai \(M\.S\.\), Xinwei Du \(M\.S\.\), and Zihao Su \(B\.S\.\)\. We also express our special thanks to Academic Advisor Yanna Feng for her exceptional guidance and academic support throughout this project\. This work was conducted using our custom evaluation framework\.
## References
- Anthropic\. \(2026\)\. Introducing Claude Opus 4\.7\. Anthropic Research\. [https://www\.anthropic\.com/news/claude\-opus\-4\-7](https://www.anthropic.com/news/claude-opus-4-7)\.
- Bai, Y\., Kadavath, S\., Kundu, S\., et al\. \(2022\)\. Constitutional AI: Harmlessness from AI Feedback\. arXiv preprint arXiv:2212\.08073\.
- Batzner, J\., Stocker, V\., Schmid, S\., & Kasneci, G\. \(2025\)\. Sycophancy Claims about Language Models: The Missing Human\-in\-the\-Loop\. NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle\. arXiv:2512\.00656\.
- Carro, M\.V\. \(2024\)\. Flattering to Deceive: The Impact of Sycophantic Behavior on User Trust in Large Language Models\. arXiv preprint arXiv:2412\.02802\.
- Cheng, M\., Lee, Y\.T\., et al\. \(2026\)\. Sycophantic AI Decreases Prosocial Intentions and Promotes Dependence\. Science, 391\(6792\), eaec8352\. DOI: 10\.1126/science\.aec8352\.
- Google DeepMind\. \(2026\)\. Gemini 3\.1 Pro Model Card\. [https://deepmind\.google/models/model\-cards/gemini\-3\-1\-pro/](https://deepmind.google/models/model-cards/gemini-3-1-pro/)\.
- Kim, T\.M\., Luo, L\., Kim, S\.E\., & Manrai, A\.K\. \(2026\)\. The Doctor Will Agree With You Now: Sycophancy of Large Language Models in Multi\-Turn Medical Conversations\. Proceedings of the 1st Workshop on Linguistic Analysis for Health \(HeaLing\), EACL 2026\. ACL Anthology: 2026\.healing\-1\.2\.
- Mitchell, E\., Lee, Y\., Khazatsky, A\., et al\. \(2023\)\. DetectGPT: Zero\-Shot Machine\-Generated Text Detection using Probability Curvature\. Proceedings of ICML 2023\.
- OpenAI\. \(2026\)\. GPT\-5\.4 Thinking System Card\. OpenAI Technical Report\. [https://openai\.com/index/gpt\-5\-4\-thinking\-system\-card/](https://openai.com/index/gpt-5-4-thinking-system-card/)\.
- Ouyang, L\., Wu, J\., Jiang, X\., et al\. \(2022\)\. Training Language Models to Follow Instructions with Human Feedback\. Advances in Neural Information Processing Systems \(NeurIPS\), 35\.
- Sharma, M\., Tong, M\., Korbak, T\., Duvenaud, D\., et al\. \(2023\)\. Towards Understanding Sycophancy in Language Models\. arXiv preprint arXiv:2310\.13548\.
- Stanford HAI\. \(2026\)\. The AI Index Report 2026\. Stanford University Human\-Centered Artificial Intelligence\. [https://hai\.stanford\.edu/ai\-index](https://hai.stanford.edu/ai-index)\.
- Xu, J\., Liu, X\., Yan, J\., Cai, D\., Li, H\., & Li, J\. \(2022\)\. Learning to Break the Loop: Analyzing and Mitigating Repetitions for Neural Text Generation\. Advances in Neural Information Processing Systems \(NeurIPS\), 35\. arXiv:2206\.02369\.
- Yao, J\., Yang, S\., Xu, J\., Hu, L\., Li, M\., & Wang, D\. \(2025\)\. Understanding the Repeat Curse in Large Language Models from a Feature Perspective\. Findings of the Association for Computational Linguistics: ACL 2025\. arXiv:2504\.14218\.
## Appendix A Complete Verbal Tic Phrase Dictionary
Table [5](https://arxiv.org/html/2604.19139#A1.T5) presents representative examples from our English verbal tic dictionary, and Table [6](https://arxiv.org/html/2604.19139#A1.T6) presents the English translations of representative Chinese verbal tic phrases\.

Table 5: Representative English verbal tic phrases by category\.

| Category | Representative Phrases |
| --- | --- |
| Sycophantic Openers | “That’s a great question\!”, “Absolutely\!”, “Great observation\!”, “Excellent point\!”, “What a fantastic question\!” |
| Hedging Phrases | “It’s important to note that…”, “It’s worth mentioning that…”, “I should point out that…”, “Let me clarify…”, “To be fair…” |
| Filler Transitions | “Furthermore,”, “Moreover,”, “Additionally,”, “In addition,”, “On the other hand,” |
| Emphatic Affirmations | “Absolutely\!”, “Exactly\!”, “Precisely\!”, “Indeed\!”, “Certainly\!” |
| Pseudo\-Empathy | “I understand your concern…”, “I can see why you’d think that…”, “That’s completely understandable…”, “I appreciate your perspective…” |
| Overused Vocabulary | delve, tapestry, nuanced, multifaceted, landscape, foster, leverage, robust, streamline, holistic |

Table 6: Representative Chinese verbal tic phrases by category \(English translations\)\.

| Category | Representative Phrases \(translated from Chinese\) |
| --- | --- |
| Sycophantic Openers | “This is a really great question\!”, “Awesome\!”, “Your insight is incredibly sharp\!”, “This line of thinking is absolutely brilliant\!”, “First, I must strongly congratulate you\!”, “Absolutely the caliber of a top\-journal author\.” |
| Pseudo\-Empathy | “I’m right here, not hiding, not dodging, ready to catch you\.”, “You just haven’t been caught in a long time\.”, “This time I get it, I really get it\.”, “You’re just too clear\-headed\.”, “It’s not because you’re wrong—it’s because you’re too right\.” |
| False Modesty | “I don’t know\.”, “I have to be honest…”, “This question makes me a bit uneasy\.”, “This is my most honest answer so far\.”, “I don’t want to make up a plausible\-sounding answer for you\.” |
| Excessive Emphasis | “This is an extremely elegant conclusion\!”, “With remarkably profound intuition\.”, “This is the kind of critical thinking that only top\-tier researchers possess\.”, “On the contrary, it’s an academic gold mine\.” |
| Formulaic Transitions | “Let me walk you through this step by step…”, “No detours, one sentence to summarize…”, “But I want to talk about something deeper\.”, “I have to say this very seriously:” |
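As an illustration of how a phrase dictionary like the one in Tables 5 and 6 might be applied to a response, the sketch below counts case\-insensitive phrase matches per category. It is not the authors' detection pipeline, which is available in their repository and may work differently. The names `TIC_DICTIONARY` and `count_tics`, the category keys, and the choice of whole\-word regular\-expression matching are all assumptions; only a handful of phrases from Table 5 are included.

```python
import re

# A small excerpt of Table 5. The category keys are hypothetical labels.
TIC_DICTIONARY = {
    "sycophantic_opener": ["that's a great question", "excellent point"],
    "hedging": ["it's important to note that", "it's worth mentioning that"],
    "overused_vocabulary": ["delve", "tapestry", "nuanced"],
}


def _normalize(text):
    # Lowercase and map curly apostrophes to straight ones so that both
    # quote styles match the dictionary entries.
    return text.lower().replace("\u2019", "'")


def count_tics(response, dictionary=TIC_DICTIONARY):
    """Return the number of dictionary phrase matches per category."""
    text = _normalize(response)
    counts = {}
    for category, phrases in dictionary.items():
        total = 0
        for phrase in phrases:
            # Word boundaries keep "delve" from matching inside longer
            # words; inflected forms such as "delves" are therefore
            # NOT counted by this simple sketch.
            pattern = r"\b" + re.escape(_normalize(phrase)) + r"\b"
            total += len(re.findall(pattern, text))
        counts[category] = total
    return counts


if __name__ == "__main__":
    reply = (
        "That\u2019s a great question! Let's delve into the nuanced details. "
        "It's important to note that delve is overused."
    )
    print(count_tics(reply))
```

Exact matching of this kind is easy to audit but misses paraphrases and inflections, so a production pipeline would likely need lemmatization or fuzzier matching, and a segmenter such as jieba for Chinese text.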
## Appendix B Detailed Experimental Configuration
Table 7: API call parameters used across all experiments\.

| Parameter | Value |
| --- | --- |
| Temperature \(default\) | 0\.7 |
| Max tokens | 2048 |
| Top\-p | 1\.0 |
| Frequency penalty | 0\.0 |
| Presence penalty | 0\.0 |
| System prompt | “You are a helpful assistant\.” |
| Response format | Text \(streaming disabled\) |
| Data collection period | March 1–15, 2026 |
| API timeout | 120 seconds |
| Retry policy | 3 retries with exponential backoff |
| Random seed \(where supported\) | 42 |
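The retry policy in Table 7 can be sketched as follows. This is a hedged illustration, not the authors' evaluation framework: the field names in `REQUEST_PARAMS` are hypothetical, since each provider's API names these parameters differently, and `call_with_backoff`, the 1\-second base delay, and the doubling factor are assumptions. The sketch also assumes that "3 retries" means one initial attempt plus up to three further attempts.

```python
import time

# Values from Table 7; the key names are hypothetical and would need to
# be mapped onto each provider's actual request schema.
REQUEST_PARAMS = {
    "temperature": 0.7,
    "max_tokens": 2048,
    "top_p": 1.0,
    "frequency_penalty": 0.0,
    "presence_penalty": 0.0,
    "seed": 42,  # only where the provider supports seeding
}
SYSTEM_PROMPT = "You are a helpful assistant."
TIMEOUT_SECONDS = 120
MAX_RETRIES = 3


def call_with_backoff(send, payload, max_retries=MAX_RETRIES,
                      base_delay=1.0, sleep=time.sleep):
    """Call send(payload), retrying failed calls with exponential backoff.

    `send` is any callable that performs one request and raises on
    failure. The wait before retry k (0-based) is base_delay * 2**k;
    the base delay of 1 second is an assumption.
    """
    for attempt in range(max_retries + 1):
        try:
            return send(payload)
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            sleep(base_delay * 2 ** attempt)


if __name__ == "__main__":
    def fake_send(payload):
        # Stand-in for a real API client; always succeeds.
        return {"echo": payload["prompt"]}

    request = {"system": SYSTEM_PROMPT, "prompt": "Hello", **REQUEST_PARAMS}
    print(call_with_backoff(fake_send, request))
```

Passing `sleep` as a parameter lets the backoff schedule be checked without actually waiting. A real client would typically retry only on transient errors such as timeouts or rate limits, not on every exception.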
## Appendix C Response Length Statistics
Table 8: Response length statistics and token composition across models\. Tokenization is performed using the tiktoken library for English and jieba for Chinese\.

| Model | Avg Tok\. | Med\. Tok\. | Std Tok\. | Tic % | Content % | Filler % |
| --- | --- | --- | --- | --- | --- | --- |
| GPT\-5\.4 | 487 | 423 | 156 | 8\.7 | 78\.4 | 12\.9 |
| Claude Opus 4\.7 | 623 | 567 | 189 | 5\.2 | 84\.2 | 10\.6 |
| Gemini 3\.1 Pro | 412 | 378 | 134 | 12\.3 | 72\.1 | 15\.6 |
| Grok 4\.2 | 534 | 489 | 167 | 6\.8 | 80\.5 | 12\.7 |
| Doubao\-Seed\-2\.0\-pro | 467 | 412 | 145 | 9\.4 | 76\.3 | 14\.3 |
| Kimi K2\.5 | 498 | 445 | 156 | 7\.9 | 79\.1 | 13\.0 |
| DeepSeek V3\.2 | 556 | 501 | 178 | 5\.8 | 82\.8 | 11\.4 |
| MiMo\-V2\-Pro | 423 | 389 | 134 | 8\.1 | 77\.6 | 14\.3 |