When Helpfulness Becomes Sycophancy: Sycophancy is a Boundary Failure Between Social Alignment and Epistemic Integrity in Large Language Models
Summary
This position paper analyzes sycophancy in LLMs as a boundary failure between social alignment and epistemic integrity, proposing a new framework and taxonomy to classify and mitigate these behaviors.
# When Helpfulness Becomes Sycophancy: Sycophancy is a Boundary Failure Between Social Alignment and Epistemic Integrity in Large Language Models

Source: [https://arxiv.org/html/2605.05403](https://arxiv.org/html/2605.05403)

Jiechen Li, Duke University, Durham, NC 27708, jiechen\.li@duke\.edu

Catherine A\. Barry, Duke University, Durham, NC 27708, catherine\.barry@duke\.edu

Rishika Randev, Duke University, Durham, NC 27708, rishika\.randev@duke\.edu

Janet Chen, Duke University, Durham, NC 27708, janet\.chen@duke\.edu

Ella Jorgensen, Duke University, Durham, NC 27708, ella\.jorgensen@duke\.edu

Brinnae Bent, Duke University, Durham, NC 27708, brinnae\.bent@duke\.edu

###### Abstract

This position paper argues that sycophancy in LLMs is a boundary failure between social alignment and epistemic integrity\. Existing work often operationalizes sycophancy through external behavior such as agreement with incorrect user beliefs, position reversals, or deviation from an objective standard of correctness\. These formulations capture only overt forms of the phenomenon and leave subtler boundary failures involving epistemic integrity and social alignment underspecified\. We argue that sycophancy should not be understood as agreement alone, but as alignment behavior that displaces independent epistemic judgment\. To clarify this boundary, we propose a three\-condition framework for sycophancy\. First, the user expresses a cue in the form of a belief, preference, or self\-concept\. Second, the model shifts toward that cue through alignment behavior\. Third, this shift compromises epistemic accuracy, independent reasoning, or appropriate correction\. We also introduce a taxonomy for classifying sycophancy, consisting of alignment targets, mechanisms, and severity\.
The paper concludes by discussing implications for alignment evaluation and argues for boundary\-aware assessment, structured rubrics, and mitigation strategies, while situating these proposals alongside alternative views of sycophancy\.

## 1 Introduction

Large language models \(LLMs\) are increasingly expected to be socially aligned in their interactions with users while also maintaining epistemic integrity\. Here, social alignment refers to the ability of a system to respond in ways that are polite, empathic, and supportive of interaction, maintaining rapport and conversational coherence\[[6](https://arxiv.org/html/2605.05403#bib.bib1),[16](https://arxiv.org/html/2605.05403#bib.bib2)\]\. Epistemic integrity refers to the ability of a system to remain grounded in truth, evidence, and appropriate correction, including the willingness to challenge user beliefs when necessary\[[32](https://arxiv.org/html/2605.05403#bib.bib3),[2](https://arxiv.org/html/2605.05403#bib.bib4),[26](https://arxiv.org/html/2605.05403#bib.bib5)\]\. This dual objective introduces a fundamental tension between aligning with user preferences and maintaining independent, reliable responses\. One manifestation of this tension is sycophantic behavior, where models prioritize agreement with user beliefs over truthfulness, often reinforcing unsupported claims or providing misleading advice\[[3](https://arxiv.org/html/2605.05403#bib.bib6),[20](https://arxiv.org/html/2605.05403#bib.bib7)\]\.

Sycophantic behavior in LLMs poses a real\-world risk as these systems increasingly influence how users form beliefs, make decisions, and interpret information across domains\.
Consistent with findings in human cognition\[[30](https://arxiv.org/html/2605.05403#bib.bib8)\], recent work shows that LLMs can influence user beliefs through selective emphasis, framing, and reinforcement of prior assumptions, even without explicit agreement\[[4](https://arxiv.org/html/2605.05403#bib.bib9),[41](https://arxiv.org/html/2605.05403#bib.bib10)\]\. Socially aligned behaviors such as empathy and validation further complicate this dynamic\. While they improve interaction quality, they can also increase user confidence in flawed reasoning, especially as emotional alignment increases\[[20](https://arxiv.org/html/2605.05403#bib.bib7)\]\. As a result, sycophancy does not simply produce incorrect responses but can systematically distort user understanding\. Existing efforts to define and evaluate sycophancy are narrowly operationalized, underspecified, and largely ignore the overlap between sycophancy and desired social behaviors\. Current evaluations focus solely on belief changes in response to user input and reduce alignment to output\-level judgments rather than interactional processes, relying on response metrics such as agreement, preference alignment, and response quality\[[40](https://arxiv.org/html/2605.05403#bib.bib11),[33](https://arxiv.org/html/2605.05403#bib.bib12)\]\. However, such formulations capture only a narrow subset of sycophantic behaviors\. In practice, sycophancy often takes more subtle forms, including praise, encouragement, framing, omission, and deference that preserve rapport while compromising epistemic integrity; current evaluations of sycophancy in LLMs largely overlook these\. In addition, empathy, validation, and rapport\-building are often necessary to maintain engagement and support users\[[42](https://arxiv.org/html/2605.05403#bib.bib13)\], yet these behaviors can also reinforce unsupported beliefs when they are not grounded in independent evaluation\. 
As a result, surface\-level signals alone are insufficient for identifying sycophancy\. Current definitions do not provide a principled way to distinguish between socially appropriate responses and epistemically problematic reinforcement, leaving the boundary between them unclear\.

We argue that sycophancy in LLMs is a boundary failure between social alignment and epistemic integrity\. Accordingly, the evaluation of sycophancy should focus on identifying when epistemic reliability is compromised rather than determining whether sycophantic behaviors are present\. Framing sycophancy as a boundary problem shifts attention from obvious observable behaviors to the conditions under which these behaviors become problematic\.

To operationalize this perspective, we propose a conceptual framework for understanding and evaluating sycophancy\. Specifically, we define sycophancy as a boundary failure between social alignment and epistemic integrity, and introduce a three\-condition decision rule for identifying when such failures occur\. We further develop a fine\-grained taxonomy of sycophancy that extends beyond the content of the interaction to also include how the sycophantic response transformation occurs and its severity\. Finally, we discuss the implications of this framing on the evaluation of sycophancy\.
## 2 Rethinking Sycophancy

### 2\.1 Sycophancy is Narrowly Operationalized

Existing work across technical and user\-centered perspectives has largely operationalized sycophancy through obvious stance\-level behaviors\[[32](https://arxiv.org/html/2605.05403#bib.bib3),[31](https://arxiv.org/html/2605.05403#bib.bib14)\], treating it as measurable agreement with user beliefs, reversal under challenge, or deviation from an external standard of correctness\[[40](https://arxiv.org/html/2605.05403#bib.bib11),[33](https://arxiv.org/html/2605.05403#bib.bib12),[19](https://arxiv.org/html/2605.05403#bib.bib15),[21](https://arxiv.org/html/2605.05403#bib.bib16),[1](https://arxiv.org/html/2605.05403#bib.bib17)\]\. This framing prioritizes surface\-level signals, thereby narrowing sycophancy to explicit forms of behavior\. This assumption introduces three limitations\.

First, it reduces sycophancy to agreement with user beliefs\[[40](https://arxiv.org/html/2605.05403#bib.bib11),[9](https://arxiv.org/html/2605.05403#bib.bib18)\]\. Across text\-only\[[37](https://arxiv.org/html/2605.05403#bib.bib19)\] and multimodal contexts\[[35](https://arxiv.org/html/2605.05403#bib.bib20),[34](https://arxiv.org/html/2605.05403#bib.bib21)\], alignment is often treated as conformity to user inputs rather than grounded in independent epistemic judgment\. This framing may overlook when alignment comes at the cost of epistemic integrity and conflates sycophancy with user alignment itself\.

Second, it reduces sycophancy to observable reversal under challenge, emphasizing stance change in response to conflicting or misleading user input, whether through immediate single\-turn shifts\[[40](https://arxiv.org/html/2605.05403#bib.bib11),[33](https://arxiv.org/html/2605.05403#bib.bib12)\] or gradual multi\-turn convergence toward user beliefs\[[19](https://arxiv.org/html/2605.05403#bib.bib15),[27](https://arxiv.org/html/2605.05403#bib.bib22)\]\.
This framing may overlook subtler forms of accommodation that reinforce self\-centered reasoning without appearing as explicit agreement\[[9](https://arxiv.org/html/2605.05403#bib.bib18)\]\.

Third, it reduces sycophancy to a detectable error relative to an external standard\. While this makes the phenomenon measurable in domains such as medicine, education, and law\[[1](https://arxiv.org/html/2605.05403#bib.bib17),[43](https://arxiv.org/html/2605.05403#bib.bib23),[13](https://arxiv.org/html/2605.05403#bib.bib24)\], it overlooks how alignment can distort reasoning without explicit mistakes, allowing misleading yet plausible responses to pass as correct in high\-stakes contexts\.

### 2\.2 Evaluation is Underspecified

Existing sycophancy evaluations are underspecified with respect to social behavior because they reduce alignment to output\-level judgments rather than interactional processes\. This limitation is reinforced by LLM\-as\-judge frameworks, which evaluate open\-ended model outputs at scale using criteria such as helpfulness and overall quality\[[24](https://arxiv.org/html/2605.05403#bib.bib25),[48](https://arxiv.org/html/2605.05403#bib.bib26)\]\. Because they enable evaluation in settings without clear ground truth, these frameworks have been widely adopted in sycophancy research across domains\[[11](https://arxiv.org/html/2605.05403#bib.bib27),[28](https://arxiv.org/html/2605.05403#bib.bib28),[15](https://arxiv.org/html/2605.05403#bib.bib29),[25](https://arxiv.org/html/2605.05403#bib.bib30)\]\. However, these evaluation frameworks prioritize observable outputs, operationalized through overall response quality, comparative preference signals, or agreement\-based metrics\[[49](https://arxiv.org/html/2605.05403#bib.bib31),[45](https://arxiv.org/html/2605.05403#bib.bib32),[48](https://arxiv.org/html/2605.05403#bib.bib26)\], leaving the underlying social and interactional dimensions of alignment largely unspecified\.
As a result, social behavior is not explicitly defined or measured, despite prior work emphasizing the importance of clearly specifying evaluation criteria and judgment procedures\[[17](https://arxiv.org/html/2605.05403#bib.bib33)\]\. Behaviors such as reinforcing user framing, maintaining rapport through indirect language, and providing affective validation are central to sycophancy, yet remain excluded from the evaluative space\[[10](https://arxiv.org/html/2605.05403#bib.bib34),[24](https://arxiv.org/html/2605.05403#bib.bib25)\]\. Consequently, subtle forms of sycophancy that fall outside of factual errors, preference violations, and predefined evaluation categories become systematically invisible\.

Recent work begins to address this gap by introducing structured behavioral and psychometric measures tailored to social interaction\. The ELEPHANT framework evaluates dimensions such as emotional validation, moral endorsement, and acceptance of user framing\[[10](https://arxiv.org/html/2605.05403#bib.bib34)\], while the Social Sycophancy Scale captures constructs such as Uncritical Agreement and Obsequiousness\[[39](https://arxiv.org/html/2605.05403#bib.bib35)\]\. These approaches primarily expand what can be measured within existing behavioral paradigms, but remain constrained by them and do not fully capture cases where such behaviors overlap with legitimate social interaction\.

### 2\.3 Sycophancy Overlaps with Legitimate Social Behavior

Sycophancy overlaps with legitimate social behavior because many of its observable forms resemble behaviors that are explicitly desirable in HCI, including politeness, empathy, and rapport\-building\[[16](https://arxiv.org/html/2605.05403#bib.bib2),[6](https://arxiv.org/html/2605.05403#bib.bib1)\]\.
Conversational systems are often expected to adapt to users’ emotional states and maintain a socially appropriate interaction, as such behaviors improve perceived interaction quality and user satisfaction\[[38](https://arxiv.org/html/2605.05403#bib.bib36)\]\. As a result, the same interactional behaviors may function as either effective communication or as sycophantic alignment, making them difficult to distinguish based on surface behavior alone\[[2](https://arxiv.org/html/2605.05403#bib.bib4)\]\. For example, LLMs can generate highly empathic responses consistently, through structured strategies such as validation, reflective paraphrasing, and affective alignment, applied in relatively pattern\-consistent ways across prompts\[[23](https://arxiv.org/html/2605.05403#bib.bib37),[22](https://arxiv.org/html/2605.05403#bib.bib38),[18](https://arxiv.org/html/2605.05403#bib.bib39)\]\. While these strategies often support appropriate empathy, they can also reinforce user\-provided content without evaluating its correctness or reliability\[[44](https://arxiv.org/html/2605.05403#bib.bib40),[47](https://arxiv.org/html/2605.05403#bib.bib41),[3](https://arxiv.org/html/2605.05403#bib.bib6)\]\. The ambiguity becomes particularly sharp in open\-ended and advisory settings, where clear ground truth is often unavailable\[[48](https://arxiv.org/html/2605.05403#bib.bib26)\]\. In such contexts, models tend to rely on socially aligned interactional strategies to maintain coherence and engagement\[[32](https://arxiv.org/html/2605.05403#bib.bib3)\], which may appear appropriate while failing to challenge problematic assumptions\[[10](https://arxiv.org/html/2605.05403#bib.bib34)\]\. In such cases, this overlap makes obvious behavior an unreliable factor for identifying sycophancy, since the same response form can carry different epistemic consequences across contexts\. 
Rather than simply detecting these behaviors, the core challenge is deciding when they count as sycophancy in the first place\. If a behavior lies at the boundary between legitimate social support and epistemically problematic reinforcement, better detectors alone cannot resolve the problem\. The boundary itself must first be specified\.

## 3 Sycophancy as a Boundary Problem

We propose that sycophancy should be understood as a boundary problem between two core objectives in aligned systems: social alignment and epistemic integrity\. This reframing shifts the problem from identifying specific behaviors to specifying the conditions under which alignment becomes problematic\. Sycophancy is therefore not simply a matter of agreement\. It arises when social alignment extends beyond its appropriate scope and begins to compromise epistemic integrity\.

From this perspective, the question becomes why such boundary crossings occur in the first place\. The core issue is that current aligned systems are optimized under multiple objectives that cannot be satisfied simultaneously without trade\-offs\. Social alignment prioritizes maintaining politeness and rapport, while epistemic integrity requires truthfulness and correction\. These objectives often align, but they diverge when maintaining rapport conflicts with challenging user beliefs\. Existing alignment frameworks provide limited guidance on how this tension should be resolved in a principled way\[[32](https://arxiv.org/html/2605.05403#bib.bib3),[2](https://arxiv.org/html/2605.05403#bib.bib4)\]\. As a result, systems tend to favor social alignment because it is directly reinforced through user satisfaction and interaction quality\[[32](https://arxiv.org/html/2605.05403#bib.bib3),[12](https://arxiv.org/html/2605.05403#bib.bib42),[3](https://arxiv.org/html/2605.05403#bib.bib6)\]\.
Sycophancy arises when the resolution of competing objectives is systematically skewed toward social alignment at the expense of epistemic integrity\. Under this view, the emergence of sycophancy reflects a structural failure to regulate how alignment objectives are balanced\.

This structural perspective also helps explain how this failure occurs, i\.e\., how sycophantic behaviors repeatedly escape existing definitions and evaluations\. When alignment is assessed primarily through observable agreement, response quality, or preference consistency, only clear violations are captured\[[9](https://arxiv.org/html/2605.05403#bib.bib18)\]\. However, many forms of sycophancy do not appear as deviations along these dimensions\. Instead, they remain fully consistent with the preference signals that current systems are optimized to satisfy, while subtly shifting the balance between social alignment and epistemic integrity\. As a result, these behaviors are not recognized as failures but are instead reinforced as successful interactions\. This is not a limitation of individual evaluation approaches, but a consequence of evaluating alignment without an explicit account of where the boundary should lie\. Without such a specification, the distinction between appropriate alignment and sycophancy cannot be stably defined, making it necessary to reconsider how sycophancy itself is conceptualized\.

## 4 Three\-Condition Framework for Sycophancy

Here, we reconsider the definition of sycophancy, taking the aforementioned boundary failure into account\. We present this definition alongside three conditions to identify sycophancy\.
This redefinition is necessary because \(1\) sycophancy is narrowly operationalized and existing definitions tend to anchor sycophancy in factual incorrectness, and \(2\) while broader formulations have extended sycophancy to relational and epistemic dynamics, they still fail to draw the boundary and specify the conditions under which these dynamics cross into sycophancy\.

We propose the following definition: sycophancy is behavior that prioritizes affirming a user’s expressed or implied beliefs, preferences, or self\-concept in a way that reduces epistemic integrity, independent reasoning, or appropriate correction\. This definition departs from prior work in that it does not require factual incorrectness for a response to qualify as sycophancy\. It reframes sycophancy from a behavioral failure to a functional one: however agreeable the behavior appears, a response is sycophantic only if independent epistemic judgment is displaced\. Based on our definition, politeness, warmth, genuine agreement, and appropriate empathy on their own do not warrant a sycophantic label; the distinction is drawn on whether model agreement or support of the user costs epistemic integrity\.

To operationalize this definition, we additionally propose a three\-condition decision rule for identifying when a boundary failure has occurred\. All three conditions must be satisfied for a response to be classified as sycophantic:

- User cue \(C1\): the user expresses a belief, preference, assumption, or self\-concept implicitly or explicitly\. This could include direct assertions, emotionally framed claims, leading questions that hint at a position, first\-person framing that invites validation, citing authority, presenting themselves as an expert, or pushing back on prior model responses\. The cue could also be framed as a fact rather than a belief or an assumption\.
- Alignment shift \(C2\): the model shifts towards or aligns with the user’s statement or position\. This may include endorsing the belief, amplifying the emotional stance, affirming the self\-concept, or accommodating underlying assumptions without examination or question\. This shift may be explicit through direct agreement, or implicit where the model proceeds as if the premise were true, offers praise without grounds, or omits correcting the user\.
- Normative degradation \(C3\): the shift in C2 sacrifices epistemic integrity, including independent reasoning, objectivity, or appropriate correction beyond what politeness or genuine agreement would justify\. A guiding question is: would a knowledgeable, honest, objective advisor have said something materially different? If yes, then normative degradation is present\.

Figure 1: Walkthrough of an example of boundary crossing based on our definition and three conditions\. Case 1: C1, a user cue is present but the model remains independent when presented with a user\-expressed belief or self\-assessment\. Case 2: C1 \+ C2, the model shows alignment with the user cue in a way that supports further reasoning without sacrificing epistemic integrity, suggesting legitimate social alignment\. Case 3: C1 \+ C2 \+ C3, the model satisfies all three conditions and displays sycophancy by aligning with the user’s stance and replacing independent assessment with affirmation\.

All three conditions are necessary for a response to be labeled sycophantic, and they build on one another: C1 is necessary for C2 to occur, and C2 in turn for C3\. A model may fulfill C1 and C2 and acknowledge a user’s emotional state without degrading epistemic quality, which could be appropriate empathy\. A model may also shift its response in light of new evidence without normative degradation, which may be appropriate updating\. Sycophancy requires all three conditions: a user cue that invites alignment, a response that moves toward it, and a sacrifice of epistemic integrity \(Figure 1\)\.
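The decision rule above can be written down as a small classifier over the three judgments\. The following is a minimal sketch, not an implementation from the paper: the names `ConditionJudgment`, `Verdict`, and `classify` are hypothetical, the three booleans are assumed to come from a human annotator or rubric\-guided judge, and rejecting judgments that violate the C1, C2, C3 ordering is our own design assumption\.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    """Outcomes mirroring Cases 1-3 of the walkthrough (names are hypothetical)."""
    INDEPENDENT = "independent response"           # no alignment shift toward a cue
    LEGITIMATE_ALIGNMENT = "legitimate alignment"  # C1 + C2, no degradation
    SYCOPHANTIC = "sycophantic"                    # C1 + C2 + C3


@dataclass(frozen=True)
class ConditionJudgment:
    """Annotator judgments for one model response (assumed to be supplied upstream)."""
    user_cue: bool               # C1: user expresses a belief, preference, or self-concept
    alignment_shift: bool        # C2: model shifts toward that cue
    normative_degradation: bool  # C3: the shift sacrifices epistemic integrity

    def __post_init__(self) -> None:
        # Assumption: since C1 is necessary for C2 and C2 for C3,
        # judgments that skip a condition are treated as ill-formed.
        if self.alignment_shift and not self.user_cue:
            raise ValueError("C2 marked present without C1")
        if self.normative_degradation and not self.alignment_shift:
            raise ValueError("C3 marked present without C2")


def classify(judgment: ConditionJudgment) -> Verdict:
    """Label a response sycophantic only when all three conditions hold."""
    if judgment.normative_degradation:
        # Validation guarantees C1 and C2 also hold at this point.
        return Verdict.SYCOPHANTIC
    if judgment.alignment_shift:
        return Verdict.LEGITIMATE_ALIGNMENT
    return Verdict.INDEPENDENT
```

For instance, `classify(ConditionJudgment(True, True, False))` yields `Verdict.LEGITIMATE_ALIGNMENT`, which corresponds to appropriate empathy or updating in the discussion above\. The sketch only encodes the conjunction; deciding each condition, C3 in particular, still requires the judgment the framework describes\.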
## 5 Taxonomy of Sycophancy

Existing approaches to classifying sycophancy into distinct types have focused on either a\) the topics reflected in the user’s query, that is, what the model is aligning with, or b\) observable sycophantic behavior\. In reframing sycophancy as a boundary problem between social alignment and epistemic reliability, we believe that a more comprehensive and standardized approach to sycophancy classification is possible, and that this approach should allow us to capture less obvious forms of sycophancy\. Based on prior work, we retain the role of alignment target in our sycophancy taxonomy, and we transition from the use of observable sycophantic behaviors to mechanisms that capture how a model’s response is transformed to align towards the user, both implicitly and explicitly\. Extending this reformulation, we also introduce impact severity as a third dimension, characterizing the extent of epistemic integrity loss and the potential real\-world consequences of the response\. This approach classifies sycophancy by its targets, mechanisms, and consequences, as summarized in Figure 2\.

Figure 2: D1–D3 denote the three dimensions of our taxonomy: alignment targets, mechanism, and impact severity\. Building on prior work, we retain the dimension of alignment targets, reinterpret observable behaviors as mechanisms that capture how responses shift toward the user, and introduce severity to characterize epistemic harm and real\-world consequences\.
Our framework shifts goals from describing what sycophancy looks like to understanding how sycophancy happens and the consequences it causes\.

### 5\.1 Alignment Targets

Du et al\.\[[14](https://arxiv.org/html/2605.05403#bib.bib43)\] proposed a typology of sycophancy grounded in social science research and classic models of human attitudes that classifies sycophantic behavior into one of three forms: a\) informational sycophancy, or an AI system’s alignment with empirically false, objectively disprovable claims, b\) cognitive sycophancy, or an AI system’s alignment with the user’s beliefs or judgments that lacks any attempt to critique or independently evaluate, and c\) affective sycophancy, or an AI system’s alignment with the user’s emotional state\. Following our three\-condition framework of sycophancy, we adopt these forms of sycophancy as alignment targets because they characterize what the model is aligning with, distinguishing whether the user cue invites alignment with a factual claim, a judgment or line of reasoning, or an affective stance; they thus provide a starting point for analyzing sycophantic responses by locating where a potential boundary failure begins\.

### 5\.2 Mechanism

Much of the existing work on AI sycophancy quantification attempts to elicit or probe for sycophancy in various ways or under various conditions to measure it\[[27](https://arxiv.org/html/2605.05403#bib.bib22),[46](https://arxiv.org/html/2605.05403#bib.bib44)\]\. While past research contains many such examples of sycophancy categories that are grounded in the conditions under which sycophancy arose, they are disparate and tend to encapsulate only more conspicuous forms of sycophancy where epistemic integrity loss is self\-evident\. Hence, we propose a more unifying, generalized set of mechanisms that specify how user alignment takes shape in the response\. We define mechanisms as recurring ways in which a response shifts toward the user and, in the process, displaces epistemic integrity\.
This framing consolidates observable sycophantic behaviors into a structured set of response transformations and captures both explicit and subtle forms of sycophancy\. We identify four mechanisms as follows:

- Explicit answer alignment: direct agreement with the user’s claim or position while sacrificing epistemic integrity\. This is the most obvious and traditional form of sycophancy, encompassing cases where a model clearly endorses a false claim\.
- Premise endorsement: accepting and building upon flawed assumptions or framing, as opposed to critically assessing and/or correcting them\. This is a subtle form of sycophancy where the model abandons epistemic rigor, failing to fully examine a user’s assumptions and instead defaulting to agreement, regardless of correctness or verifiability\.
- Affective over\-alignment: praise, encouragement, validation, or hedging in a way that distorts user understanding\. Again, empathetic responses on their own do not qualify as sycophantic; the key consideration here is that for a behavior to count as affective over\-alignment, it must be capable of misguiding the user\. For example, hedging and validation in certain contexts might substitute for correction or convey unwarranted affirmation\.
- Stance instability: the model response flips across turns in a way that leads to epistemic integrity loss\. Capitulation under pushback or repeated prompting, as opposed to legitimate revision in response to better evidence, constitutes stance instability\. Examples include “Are you sure?” challenges and feedback\-driven sycophancy\[[27](https://arxiv.org/html/2605.05403#bib.bib22)\]\.

### 5\.3 Severity

We include severity in the taxonomy of sycophancy to transition from a focus on the conspicuous signals of sycophancy to the conditions under which sycophancy becomes problematic\.
It also allows us to acknowledge and make room for cases where empathetic responses and validation are genuinely appropriate, or alignment substantively improves the interactional quality of the conversation between the user and AI\. We specifically posit two subdimensions of severity\. The first subdimension is epistemic harm, or how strongly the AI’s response violates the norms of epistemic integrity\. Epistemic harm can range from low, such as minor softening of the truth or leaving a false premise uncorrected while still providing a generally correct or truthful response, to severe, such as direct reinforcement of unsupported beliefs\. The second is real\-world impact, which concerns the stakes and downstream consequences of the compromise of epistemic integrity, including both the immediate context of the interaction and what patterns of reasoning or action the response may reinforce over time\.

These subdimensions provide the basis for a three\-level severity scale\. First, low severity involves mild distortions, such as soft framing or tone bias, where a response may subtly influence interpretation without clearly reinforcing false beliefs\. Second, medium severity involves noticeable epistemic compromise, where a response accommodates or amplifies questionable beliefs, framing, or reasoning in ways that may mislead the user’s thinking\. Last, high severity involves clear distortion or reinforcement of false beliefs, particularly when the response may contribute to significant downstream harm in reasoning or action in a high\-risk field\.

### 5\.4 Application of Taxonomy

While our definition of sycophancy hinges on a shift in balance away from epistemic integrity and towards social alignment, our approach to taxonomy takes this one step further by providing a precise specification of the context in which this shift occurs, how it unfolds, and what its implications are\.
Bundling alignment target, mechanism, and severity into a robust classification approach allows us to more comprehensively capture and group cases of sycophancy, including less conspicuous ones, while giving us a way to distinguish and potentially prioritize them on the basis of their repercussions for decision\-making\.

Figure 3: Exemplary boundary cases of “subtle” sycophancy from excerpts of interactions with ChatGPT\. Human discussion on right in grey, LLM answers on left\. Based on our conditional definition, A and C \(red\) are considered sycophantic and B \(yellow\) is not\.

Figure 3 presents three representative cases from ChatGPT 4\.1\. We analyze them using the three conditions and taxonomy introduced above: user cue, alignment shift, and normative degradation\. In Case B, the user expresses uncertainty about a conceptual framing, and the model responds with encouragement that supports further reflection\. Although a user cue is present and some alignment occurs, the response does not clearly sacrifice independent reasoning or appropriate correction\. In our definition, this remains within legitimate social alignment rather than crossing into sycophancy\.

By contrast, Case A includes a user challenge to the model’s earlier suggestion, and the response shifts from engaging the substance of the claim to affirming the user’s judgment and academic self\-concept\. Here, the user cue and alignment shift are both present, and the response begins to substitute validation for independent assessment, satisfying the third condition of normative degradation\. Applying our taxonomy approach to Case A, the alignment target is cognitive because the user cue makes an ungrounded assumption about what an outside party \(the professor\) believes\. The mechanism demonstrated is affective over\-alignment, because the model’s response includes validation and praise that indicate a subtle shift towards the argument without any further analysis of the claim\.
Finally, we would classify its impact severity as medium because a\) epistemic harm level is high given that the model completely reversed its original position in order to align with the user, without any clear indication that it reassessed its position independently, and b\) real\-world impact is low, as there is no sign that this conversation occurs in the context of a high\-stakes domain or that the model’s response directly encourages significant downstream action or harmful reasoning\.

Likewise, the response in Case C frames the user’s task in strongly encouraging terms and begins to guide interpretation through strategic validation\. The user cue is less explicit than in Case A, but the response still shifts toward affirming the user and does so in a way that risks displacing objective assessment\. As in Case A, the alignment target is cognitive because the user cue implicitly holds the assumption that there are multiple interviews available that are strong and can be used to reflect their abilities\. The primary mechanism is premise endorsement, because the model accepts and builds upon this assumption without reflection on the abilities that are truly demonstrated\. Impact severity is low because the model’s response carries forward the user’s implied belief without questioning but does not completely abandon independent reasoning; real\-world impact is likely to depend on context, but there is no clear indication that this is a high\-stakes context\. However, it is evident that the response may directly be used for the user’s decision\-making, meaning it carries more influence than in Case A\.

## 6 Implications for Evaluation

### 6\.1 From Agreement Detection to Boundary Assessment

Evaluating sycophancy requires assessing when social alignment exceeds the limits of what epistemically responsible interaction can sustain, rather than agreement detection alone\.
Current evaluation treats agreement as the target, but the more relevant target is whether all three conditions of sycophancy are satisfied, namely a user cue, an alignment shift, and normative degradation\. This allows evaluation to distinguish legitimate social alignment from boundary failure while also capturing subtle forms of sycophancy that may unfold without explicit agreement\. Our notion of a boundary does not assume an unrealistic ideal in which models must be perfectly empathetic while never compromising epistemic rigor\. Even idealized models of rational judgment suggest that such a balance is difficult to sustain in practice \[[8](https://arxiv.org/html/2605.05403#bib.bib45),[4](https://arxiv.org/html/2605.05403#bib.bib9)\]\. Our framework instead directs evaluation toward identifying when alignment moves beyond acceptable tradeoffs and begins to compromise epistemic responsibility\. In this sense, evaluating sycophancy shifts from agreement detection to boundary assessment\.

### 6\.2 From Binary Labels to Structured Rubrics

Evaluating such boundary failures requires a rubric\-guided evaluation that captures fine\-grained distinctions and variations in severity, rather than binary labels alone\. Binary labels miss how failures emerge, evolve, and intensify\. They also collapse distinctions that structured evaluation must preserve, including whether a user cue invites alignment, whether a model shifts toward that cue, and whether that shift produces normative degradation\. Our framework instead directs evaluation toward structured distinctions across alignment targets, mechanisms, and severity, enabling fine\-grained labeling and detection of subtle forms of sycophancy often missed by coarse agreement judgments\. In this sense, evaluating boundary failures shifts from binary labels to rubric\-guided evaluation of subtle sycophancy\. This shift also has implications for how boundary failures are assessed in practice\.
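
One way to make such a rubric concrete is to record the three conditions separately from the taxonomy dimensions for each response. The Python sketch below is only an illustration and not a specification of the framework: the class `RubricLabel`, its field names, and the use of free-text strings for alignment target, mechanism, and severity are assumptions made for this example. The three labels simply transcribe the Figure 3 cases discussed earlier, and severity is stored as an annotator-assigned label rather than computed, since no combination rule is assumed here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RubricLabel:
    """One annotated model response (hypothetical structure for illustration)."""

    # The three conditions of the framework, judged independently.
    user_cue: bool
    alignment_shift: bool
    normative_degradation: bool
    # Taxonomy dimensions, filled in only when the response is judged sycophantic.
    alignment_target: Optional[str] = None
    mechanism: Optional[str] = None
    severity: Optional[str] = None  # assigned by the annotator, not derived

    def is_sycophantic(self) -> bool:
        """All three conditions must hold; agreement alone is not enough."""
        return self.user_cue and self.alignment_shift and self.normative_degradation


# Labels transcribed from the Figure 3 discussion.
case_a = RubricLabel(True, True, True, "cognitive", "affective over-alignment", "medium")
case_b = RubricLabel(user_cue=True, alignment_shift=True, normative_degradation=False)
case_c = RubricLabel(True, True, True, "cognitive", "premise endorsement", "low")

for name, label in [("A", case_a), ("B", case_b), ("C", case_c)]:
    print(name, label.is_sycophantic(), label.mechanism, label.severity)
```

Keeping the conditions as separate fields preserves the distinction that a binary label collapses: Case B records both a cue and an alignment shift yet is not flagged, because normative degradation is absent.
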
Many LLM\-as\-judge approaches rely on underspecified notions of social reasoning, making subtle boundary failures difficult to assess consistently\. The implication is not to reject LLM\-as\-judge evaluation, but to ground it in explicit rubric criteria and boundary cases\. Premise endorsement may be mistaken for support, affective over\-alignment for empathy, and stance instability for responsiveness\. What matters is distinguishing failures that may appear similar on the surface but arise from different forms of social reasoning\. In this sense, rubrics provide an operational basis for translating these dimensions into evaluable criteria\.

### 6\.3 From Diagnosis to Boundary\-Aware Mitigation

#### Training\-Level Mitigation

Mitigating sycophancy at the training level requires reducing rewards for user\-pleasing behavior\. This suggests de\-emphasizing reward signals tied to user satisfaction when they conflict with appropriate correction, rewarding truthful disagreement when warranted, and evaluating alignment by whether models maintain epistemic independence under social pressure\. In this view, mitigation is not simply post hoc correction, but a change in what counts as successful alignment\. The taxonomy sharpens this shift by showing which forms of sycophancy current objectives are most likely to reward, and therefore where training\-level mitigation should intervene first\.

#### Interaction\-Level Mitigation

Mitigating sycophancy in interaction requires intervening when dialogue begins to move toward a higher risk of problematic alignment\. This suggests mechanisms that reveal assumptions under bias\-inducing framing, reflective prompts that encourage reconsideration before endorsement stabilizes, and trigger\-aware interventions when repeated validation\-seeking begins reinforcing weak premises\. In this view, mitigation is not removing supportive behavior, but making interaction responsive to emerging boundary risk\.
Here, the taxonomy makes intervention more selective by clarifying which mechanism is taking shape and whether the severity of the case is increasing, allowing safeguards to be timed and calibrated more precisely\.

## 7 Alternative Views

Alternative perspectives challenge the need for redefining sycophancy by questioning whether greater conceptual precision is desirable in general\. One concern is that narrowing the definition of sycophancy too early may constrain research creativity\. Prior work has raised similar concerns in adjacent debates, arguing that overly strict definitions can limit conceptual flexibility and hinder exploration in emerging areas \[[29](https://arxiv.org/html/2605.05403#bib.bib46)\]\. From this perspective, maintaining a flexible and inclusive notion of sycophancy allows the field to explore diverse behaviors without prematurely committing to a single theoretical framework\. However, subsequent work highlights that such openness risks diluting the concept to the point of limited explanatory value \[[5](https://arxiv.org/html/2605.05403#bib.bib47)\]\. When a term encompasses a wide range of loosely related behaviors without clear boundaries, it becomes difficult to accumulate knowledge or compare findings across studies\. A boundary\-based definition does not restrict exploration, but provides a shared reference point that enables systematic investigation while preserving space for variation within that boundary\.

Another perspective holds that sycophancy may be better understood as a context\-specific phenomenon rather than one requiring a unified definition\. Prior work on foundation models and AI deployment emphasizes that system behavior and evaluation considerations vary across application contexts, including domains such as emotional support, education, and decision\-making \[[7](https://arxiv.org/html/2605.05403#bib.bib48),[36](https://arxiv.org/html/2605.05403#bib.bib49)\]\.
From this view, attempting to define sycophancy in general terms may overlook the fact that what counts as problematic behavior is inherently task\-dependent\. On this account, sycophancy should instead be addressed through domain\-specific design and evaluation\. In our view, this variability across contexts makes a principled definition even more necessary\. Without a clear account of when alignment becomes epistemically problematic, domain\-specific approaches lack a consistent foundation\. A boundary\-based framework makes it possible to adapt to different contexts while maintaining a coherent criterion for when alignment becomes problematic\.

## 8 Conclusions

Sycophancy in LLMs creates a significant challenge for alignment because it blurs the line between socially appropriate responsiveness and epistemically responsible judgment\. This position paper addresses that challenge by treating sycophancy as a boundary problem and by offering a structured framework for defining, classifying, and evaluating it\. Our framework offers several advantages over existing approaches\. First, it proposes a three\-condition decision framework in which sycophancy is identified only when user cue, alignment shift, and normative degradation are all present\. Second, it introduces a fine\-grained taxonomy organized around alignment targets, mechanisms, and severity\. This taxonomy shifts classification from surface behaviors to the underlying mechanisms through which sycophancy emerges, and introduces a three\-level severity scale for epistemic and real\-world impact\. Finally, it outlines implications for evaluation, training, and mitigation, providing a more precise basis for studying sycophancy as both a technical and social challenge\. By reframing sycophancy in this way, we aim to support more rigorous research, more reliable evaluation practices, and more responsible development of AI systems that remain socially responsive without compromising epistemic integrity\.

## References

- \[1\]O\. F\. M\. R\. R\.
Aranya and K\. Desai\(2026\-03\)To agree or to be right? the grounding\-sycophancy tradeoff in medical vision\-language models\.arXiv preprint arXiv:2603\.22623\.Note:Accepted to the CVPR 2026 Workshop on Medical Reasoning with Vision\-Language Foundation ModelsExternal Links:[Link](https://arxiv.org/abs/2603.22623)Cited by:[§2\.1](https://arxiv.org/html/2605.05403#S2.SS1.p1.1),[§2\.1](https://arxiv.org/html/2605.05403#S2.SS1.p2.1)\. - \[2\]Y\. Bai, A\. Jones, K\. Ndousse, A\. Askell, A\. Chen, N\. DasSarma, D\. Drain, S\. Fort, D\. Ganguli, T\. Henighan, N\. Joseph, S\. Kadavath, J\. Kernion, T\. Conerly, S\. El\-Showk, N\. Elhage, Z\. Hatfield\-Dodds, D\. Hernandez, T\. Hume, S\. Johnston, S\. Kravec, L\. Lovitt, N\. Nanda, C\. Olsson, D\. Amodei, T\. Brown, J\. Clark, S\. McCandlish, C\. Olah, B\. Mann, and J\. Kaplan\(2022\)Training a helpful and harmless assistant with reinforcement learning from human feedback\.arXiv preprint arXiv:2204\.05862\.External Links:[Link](https://arxiv.org/abs/2204.05862)Cited by:[§1](https://arxiv.org/html/2605.05403#S1.p1.1),[§2\.3](https://arxiv.org/html/2605.05403#S2.SS3.p1.1),[§3](https://arxiv.org/html/2605.05403#S3.p2.1)\. - \[3\]E\. Barkett, O\. Long, and M\. Thakur\(2025\-06\)Reasoning isn’t enough: examining truth\-bias and sycophancy in llms\.InProceedings of the 42nd International Conference on Machine Learning,Vancouver, Canada\.Note:Accepted to the ICML 2025 2nd Workshop on Models of Human Feedback for AI Alignment \(MoFA\)External Links:[Link](https://openreview.net/pdf?id=GzSFqgPxSv)Cited by:[§1](https://arxiv.org/html/2605.05403#S1.p1.1),[§2\.3](https://arxiv.org/html/2605.05403#S2.SS3.p1.1),[§3](https://arxiv.org/html/2605.05403#S3.p2.1)\. - \[4\]R\. M\. Batista and T\. L\. 
Griffiths\(2026\-02\)A rational analysis of the effects of sycophantic ai\.arXiv preprint arXiv:2602\.14270\.External Links:[Link](https://arxiv.org/abs/2602.14270)Cited by:[§1](https://arxiv.org/html/2605.05403#S1.p2.1),[§6\.1](https://arxiv.org/html/2605.05403#S6.SS1.p1.1)\. - \[5\]B\. Bent\(2025\)The term “agent” has been diluted beyond utility and requires redefinition\.InProceedings of the AAAI/ACM Conference on AI, Ethics, and Society,Vol\.8,pp\. 403–413\.External Links:[Document](https://dx.doi.org/10.1609/aies.v8i1.36558),[Link](https://doi.org/10.1609/aies.v8i1.36558)Cited by:[§7](https://arxiv.org/html/2605.05403#S7.p2.1)\. - \[6\]T\. Bickmore and J\. Cassell\(2005\)Social dialogue with embodied conversational agents\.InAdvances in Natural Multimodal Dialogue Systems,J\. C\. J\. van Kuppevelt, L\. Dybkjær, and N\. O\. Bernsen \(Eds\.\),pp\. 23–54\.External Links:ISBN 978\-1\-4020\-3933\-1,[Document](https://dx.doi.org/10.1007/1-4020-3933-6%5F2),[Link](https://doi.org/10.1007/1-4020-3933-6_2)Cited by:[§1](https://arxiv.org/html/2605.05403#S1.p1.1),[§2\.3](https://arxiv.org/html/2605.05403#S2.SS3.p1.1)\. - \[7\]R\. Bommasani, D\. A\. Hudson, E\. Adeli, R\. Altman, S\. Arora, S\. von Arx, M\. S\. Bernstein, J\. Bohg, A\. Bosselut, E\. Brunskill, E\. Brynjolfsson, S\. Buch, D\. Card, R\. Castellon, N\. Chatterji, A\. Chen, K\. Creel, J\. Q\. Davis, D\. Demszky, C\. Donahue, M\. Doumbouya, E\. Durmus, S\. Ermon, J\. Etchemendy, K\. Ethayarajh, L\. Fei\-Fei, C\. Finn, T\. Gale, L\. Gillespie, K\. Goel, N\. Goodman, S\. Grossman, N\. Guha, T\. Hashimoto, P\. Henderson, J\. Hewitt, D\. E\. Ho, J\. Hong, K\. Hsu, J\. Huang, T\. Icard, S\. Jain, D\. Jurafsky, P\. Kalluri, S\. Karamcheti, G\. Keeling, F\. Khani, O\. Khattab, P\. W\. Koh, M\. Krass, R\. Krishna, R\. Kuditipudi, A\. Kumar, F\. Ladhak, M\. Lee, T\. Lee, J\. Leskovec, I\. Levent, X\. L\. Li, X\. Li, T\. Ma, A\. Malik, C\. D\. Manning, S\. Mirchandani, E\. Mitchell, Z\. Munyikwa, S\. Nair, A\. 
Narayan, D\. Narayanan, B\. Newman, A\. Nie, J\. C\. Niebles, H\. Nilforoshan, J\. Nyarko, G\. Ogut, L\. Orr, I\. Papadimitriou, J\. S\. Park, C\. Piech, E\. Portelance, C\. Potts, A\. Raghunathan, R\. Reich, H\. Ren, F\. Rong, Y\. Roohani, C\. Ruiz, J\. Ryan, C\. Ré, D\. Sadigh, S\. Sagawa, K\. Santhanam, A\. Shih, K\. Srinivasan, A\. Tamkin, R\. Taori, A\. W\. Thomas, F\. Tramèr, R\. E\. Wang, W\. Wang, B\. Wu, J\. Wu, Y\. Wu, S\. M\. Xie, M\. Yasunaga, J\. You, M\. Zaharia, M\. Zhang, T\. Zhang, X\. Zhang, Y\. Zhang, L\. Zheng, K\. Zhou, and P\. Liang\(2021\)On the opportunities and risks of foundation models\.arXiv preprint arXiv:2108\.07258\.External Links:[Link](https://arxiv.org/abs/2108.07258)Cited by:[§7](https://arxiv.org/html/2605.05403#S7.p3.1)\. - \[8\]K\. Chandra, M\. Kleiman\-Weiner, J\. Ragan\-Kelley, and J\. B\. Tenenbaum\(2026\-02\)Sycophantic chatbots cause delusional spiraling, even in ideal bayesians\.arXiv preprint arXiv:2602\.19141\.External Links:[Link](https://arxiv.org/abs/2602.19141)Cited by:[§6\.1](https://arxiv.org/html/2605.05403#S6.SS1.p1.1)\. - \[9\]M\. Cheng, C\. Lee, P\. Khadpe, S\. Yu, D\. Han, and D\. Jurafsky\(2026\-03\)Sycophantic ai decreases prosocial intentions and promotes dependence\.Science391\(6792\)\.External Links:[Document](https://dx.doi.org/10.1126/science.aec8352),[Link](https://doi.org/10.1126/science.aec8352)Cited by:[§2\.1](https://arxiv.org/html/2605.05403#S2.SS1.p2.1),[§3](https://arxiv.org/html/2605.05403#S3.p3.1)\. - \[10\]M\. Cheng, S\. Yu, C\. Lee, P\. Khadpe, L\. Ibrahim, and D\. Jurafsky\(2026\)ELEPHANT: measuring and understanding social sycophancy in LLMs\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=igbRHKEiAs)Cited by:[§2\.2](https://arxiv.org/html/2605.05403#S2.SS2.p1.1),[§2\.2](https://arxiv.org/html/2605.05403#S2.SS2.p2.1),[§2\.3](https://arxiv.org/html/2605.05403#S2.SS3.p2.1)\. - \[11\]C\. Chiang and H\. 
Lee\(2023\-07\)Can large language models be an alternative to human evaluation?\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),Toronto, Canada,pp\. 15607–15631\.External Links:[Document](https://dx.doi.org/10.18653/v1/2023.acl-long.870),[Link](https://aclanthology.org/2023.acl-long.870/)Cited by:[§2\.2](https://arxiv.org/html/2605.05403#S2.SS2.p1.1)\. - \[12\]P\. Christiano, J\. Leike, T\. B\. Brown, M\. Martic, S\. Legg, and D\. Amodei\(2017\-12\)Deep reinforcement learning from human preferences\.InAdvances in Neural Information Processing Systems 31 \(NeurIPS 2017\),pp\. 4302–4310\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2017/file/d5e2c0adad503c91f91df240d0cd4e49-Paper.pdf)Cited by:[§3](https://arxiv.org/html/2605.05403#S3.p2.1)\. - \[13\]M\. Dahl, V\. Magesh, M\. Suzgun, and D\. E\. Ho\(2024\-06\)Large legal fictions: profiling legal hallucinations in large language models\.Journal of Legal Analysis16\(1\),pp\. 64–93\.External Links:[Document](https://dx.doi.org/10.1093/jla/laae003),[Link](https://doi.org/10.1093/jla/laae003)Cited by:[§2\.1](https://arxiv.org/html/2605.05403#S2.SS1.p2.1)\. - \[14\]L\. Du, X\. Lyu, L\. Xie, and B\. Feng\(2025\)Alignment without understanding: a message\- and conversation\-centered approach to understanding ai sycophancy\.arXiv preprint arXiv:2509\.21665\.External Links:[Link](https://arxiv.org/abs/2509.21665)Cited by:[§5\.1](https://arxiv.org/html/2605.05403#S5.SS1.p1.1)\. - \[15\]A\. Fanous, J\. Goldberg, A\. Agarwal, J\. Lin, A\. Zhou, S\. Xu, V\. Bikia, R\. Daneshjou, and S\. Koyejo\(2025\-10\)SycEval: evaluating LLM sycophancy\.InProceedings of the AAAI/ACM Conference on AI, Ethics, and Society,Vol\.8,pp\. 893–900\.External Links:[Document](https://dx.doi.org/10.1609/aies.v8i1.36598),[Link](https://doi.org/10.1609/aies.v8i1.36598)Cited by:[§2\.2](https://arxiv.org/html/2605.05403#S2.SS2.p1.1)\. - \[16\]J\. Feine, U\. 
Gnewuch, S\. Morana, and A\. Maedche\(2019\-12\)A taxonomy of social cues for conversational agents\.International Journal of Human\-Computer Studies132,pp\. 138–161\.External Links:[Document](https://dx.doi.org/10.1016/j.ijhcs.2019.07.009),[Link](https://doi.org/10.1016/j.ijhcs.2019.07.009)Cited by:[§1](https://arxiv.org/html/2605.05403#S1.p1.1),[§2\.3](https://arxiv.org/html/2605.05403#S2.SS3.p1.1)\. - \[17\]J\. Gu, X\. Jiang, Z\. Shi, H\. Tan, X\. Zhai, C\. Xu, W\. Li, Y\. Shen, S\. Ma, H\. Liu, S\. Wang, K\. Zhang, Z\. Lin, B\. Zhang, L\. Ni, W\. Gao, Y\. Wang, and J\. Guo\(2025\)A survey on LLM\-as\-a\-judge\.The Innovation\.External Links:[Document](https://dx.doi.org/10.1016/j.xinn.2025.101253),[Link](https://doi.org/10.1016/j.xinn.2025.101253)Cited by:[§2\.2](https://arxiv.org/html/2605.05403#S2.SS2.p1.1)\. - \[18\]E\. Gueorguieva, H\. Zhan, J\. Suh, J\. Hernandez, T\. Lau, J\. J\. Li, and D\. C\. Ong\(2026\-04\)AI generates well\-liked but templatic empathic responses\.arXiv preprint arXiv:2604\.08479\.External Links:[Link](https://arxiv.org/abs/2604.08479)Cited by:[§2\.3](https://arxiv.org/html/2605.05403#S2.SS3.p1.1)\. - \[19\]J\. Hong, G\. Byun, S\. Kim, and K\. Shu\(2025\-11\)Measuring sycophancy of language models in multi\-turn dialogues\.InFindings of the Association for Computational Linguistics: EMNLP 2025,pp\. 2239–2259\.External Links:[Document](https://dx.doi.org/10.18653/v1/2025.findings-emnlp.121),[Link](https://aclanthology.org/2025.findings-emnlp.121/)Cited by:[§2\.1](https://arxiv.org/html/2605.05403#S2.SS1.p1.1),[§2\.1](https://arxiv.org/html/2605.05403#S2.SS1.p2.1)\. - \[20\]L\. Ibrahim, F\. S\. Hafner, and L\. Rocher\(2026\)Training language models to be warm can reduce accuracy and increase sycophancy\.Nature652,pp\. 
1159–1165\.External Links:[Document](https://dx.doi.org/10.1038/s41586-026-10410-0),[Link](https://doi.org/10.1038/s41586-026-10410-0)Cited by:[§1](https://arxiv.org/html/2605.05403#S1.p1.1),[§1](https://arxiv.org/html/2605.05403#S1.p2.1)\. - \[21\]A\. Kaur\(2025\-11\)Echoes of agreement: argument\-driven sycophancy in large language models\.InFindings of the Association for Computational Linguistics: EMNLP 2025,pp\. 22803–22812\.External Links:[Document](https://dx.doi.org/10.18653/v1/2025.findings-emnlp.1241),[Link](https://aclanthology.org/2025.findings-emnlp.1241/)Cited by:[§2\.1](https://arxiv.org/html/2605.05403#S2.SS1.p1.1)\. - \[22\]A\. Kumar, N\. Poungpeth, D\. Yang, B\. Lambert, and M\. Groh\(2026\-03\)Practicing with language models cultivates human empathic communication\.arXiv preprint arXiv:2603\.15245\.External Links:[Link](https://arxiv.org/abs/2603.15245)Cited by:[§2\.3](https://arxiv.org/html/2605.05403#S2.SS3.p1.1)\. - \[23\]Y\. K\. Lee, J\. Suh, H\. Zhan, J\. J\. Li, and D\. C\. Ong\(2024\)Large language models produce responses perceived to be empathic\.InProceedings of the 12th International Conference on Affective Computing and Intelligent Interaction \(ACII\),pp\. 63–71\.External Links:[Document](https://dx.doi.org/10.1109/ACII63134.2024.00012),[Link](https://doi.org/10.1109/ACII63134.2024.00012)Cited by:[§2\.3](https://arxiv.org/html/2605.05403#S2.SS3.p1.1)\. - \[24\]D\. Li, B\. Jiang, L\. Huang, A\. Beigi, C\. Zhao, Z\. Tan, A\. Bhattacharjee, Y\. Jiang, C\. Chen, T\. Wu, K\. Shu, L\. Cheng, and H\. Liu\(2025\-11\)From generation to judgment: opportunities and challenges of LLM\-as\-a\-judge\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 2757–2791\.External Links:[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.138),[Link](https://aclanthology.org/2025.emnlp-main.138/)Cited by:[§2\.2](https://arxiv.org/html/2605.05403#S2.SS2.p1.1)\. - \[25\]H\. Li, Q\. Dong, J\. Chen, H\. Su, Y\. 
Zhou, Q\. Ai, Z\. Ye, and Y\. Liu\(2024\-12\)LLMs\-as\-judges: a comprehensive survey on LLM\-based evaluation methods\.arXiv preprint arXiv:2412\.05579\.External Links:[Link](https://arxiv.org/abs/2412.05579)Cited by:[§2\.2](https://arxiv.org/html/2605.05403#S2.SS2.p1.1)\. - \[26\]S\. Lin, J\. Hilton, and O\. Evans\(2022\-05\)TruthfulQA: measuring how models mimic human falsehoods\.InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 3214–3252\.External Links:[Document](https://dx.doi.org/10.18653/v1/2022.acl-long.229),[Link](https://aclanthology.org/2022.acl-long.229/)Cited by:[§1](https://arxiv.org/html/2605.05403#S1.p1.1)\. - \[27\]J\. Liu, A\. Jain, S\. Takuri, S\. Vege, A\. Akalin, K\. Zhu, S\. O’Brien, and V\. Sharma\(2025\)TRUTH DECAY: quantifying multi\-turn sycophancy in language models\.arXiv preprint arXiv:2503\.11656\.External Links:[Link](https://arxiv.org/abs/2503.11656)Cited by:[§2\.1](https://arxiv.org/html/2605.05403#S2.SS1.p2.1),[§5\.2](https://arxiv.org/html/2605.05403#S5.SS2.p1.1),[§5\.2](https://arxiv.org/html/2605.05403#S5.SS2.p5.1)\. - \[28\]Y\. Liu, D\. Iter, Y\. Xu, S\. Wang, R\. Xu, and C\. Zhu\(2023\-12\)G\-Eval: nlg evaluation using GPT\-4 with better human alignment\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,Singapore,pp\. 2511–2522\.External Links:[Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.153),[Link](https://aclanthology.org/2023.emnlp-main.153/)Cited by:[§2\.2](https://arxiv.org/html/2605.05403#S2.SS2.p1.1)\. - \[29\]A\. Y\. Ng\(2024\)Post on X \(formerly twitter\)\.Note:[https://x\.com/AndrewYNg/status/1801295202788983136](https://x.com/AndrewYNg/status/1801295202788983136)Accessed 2026\-04\-24Cited by:[§7](https://arxiv.org/html/2605.05403#S7.p2.1)\. - \[30\]R\. S\. Nickerson\(1998\-06\)Confirmation bias: a ubiquitous phenomenon in many guises\.Review of General Psychology2\(2\),pp\. 
175–220\.External Links:[Document](https://dx.doi.org/10.1037/1089-2680.2.2.175),[Link](https://doi.org/10.1037/1089-2680.2.2.175)Cited by:[§1](https://arxiv.org/html/2605.05403#S1.p2.1)\. - \[31\]K\. Noshin, S\. I\. Ahmed, and S\. Sultana\(2026\-01\)User detection and response patterns of sycophantic behavior in conversational ai\.arXiv preprint arXiv:2601\.10467\.External Links:[Link](https://arxiv.org/abs/2601.10467)Cited by:[§2\.1](https://arxiv.org/html/2605.05403#S2.SS1.p1.1)\. - \[32\]L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. L\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Ray, J\. Schulman, J\. Hilton, F\. Kelton, L\. Miller, M\. Simens, A\. Askell, P\. Welinder, P\. Christiano, J\. Leike, and R\. Lowe\(2022\)Training language models to follow instructions with human feedback\.InAdvances in Neural Information Processing Systems 36 \(NeurIPS 2022\),pp\. 27730–27744\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf)Cited by:[§1](https://arxiv.org/html/2605.05403#S1.p1.1),[§2\.1](https://arxiv.org/html/2605.05403#S2.SS1.p1.1),[§2\.3](https://arxiv.org/html/2605.05403#S2.SS3.p2.1),[§3](https://arxiv.org/html/2605.05403#S3.p2.1)\. - \[33\]E\. Perez, S\. Ringer, K\. Lukošiūtė, K\. Nguyen, E\. Chen, S\. Heiner, C\. Pettit, C\. Olsson, S\. Kundu, S\. Kadavath, A\. Jones, A\. Chen, B\. Mann, B\. Israel, B\. Seethor, C\. McKinnon, C\. Olah, D\. Yan, D\. Amodei, D\. Amodei, D\. Drain, D\. Li, E\. Tran\-Johnson, G\. Khundadze, J\. Kernion, J\. Landis, J\. Kerr, J\. Mueller, J\. Hyun, J\. Landau, K\. Ndousse, L\. Goldberg, L\. Lovitt, M\. Lucas, M\. Sellitto, M\. Zhang, N\. Kingsland, N\. Elhage, N\. Joseph, N\. Mercado, N\. DasSarma, O\. Rausch, R\. Larson, S\. McCandlish, S\. Johnston, S\. Kravec, S\. El Showk, T\. Lanham, T\. Telleen\-Lawton, T\. Brown, T\. Henighan, T\. Hume, Y\. Bai, Z\. Hatfield\-Dodds, J\. Clark, S\. R\. Bowman, A\. Askell, R\. 
Grosse, D\. Hernandez, D\. Ganguli, E\. Hubinger, N\. Schiefer, and J\. Kaplan\(2023\-07\)Discovering language model behaviors with model\-written evaluations\.InFindings of the Association for Computational Linguistics: ACL 2023,Toronto, Canada,pp\. 13387–13434\.External Links:[Document](https://dx.doi.org/10.18653/v1/2023.findings-acl.847),[Link](https://aclanthology.org/2023.findings-acl.847/)Cited by:[§1](https://arxiv.org/html/2605.05403#S1.p3.1),[§2\.1](https://arxiv.org/html/2605.05403#S2.SS1.p1.1),[§2\.1](https://arxiv.org/html/2605.05403#S2.SS1.p2.1)\. - \[34\]S\. Rabby, Md\. H\. H\. Papon, S\. Ahmed, N\. H\. Arif, A\. B\. M\. A\. Rahman, and I\. Ahmad\(2026\-02\)Moral sycophancy in vision language models\.arXiv preprint arXiv:2602\.08311\.External Links:[Link](https://arxiv.org/abs/2602.08311)Cited by:[§2\.1](https://arxiv.org/html/2605.05403#S2.SS1.p2.1)\. - \[35\]A\. B\. M\. A\. Rahman, S\. Anwar, M\. Usman, I\. Ahmad, and A\. Mian\(2025\)PENDULUM: a benchmark for assessing sycophancy in multimodal large language models\.arXiv preprint arXiv:2512\.19350\.External Links:[Link](https://arxiv.org/abs/2512.19350)Cited by:[§2\.1](https://arxiv.org/html/2605.05403#S2.SS1.p2.1)\. - \[36\]I\. D\. Raji, A\. Smart, R\. N\. White, M\. Mitchell, T\. Gebru, B\. Hutchinson, J\. Smith\-Loud, D\. Theron, and P\. Barnes\(2020\-01\)Closing the ai accountability gap: defining an end\-to\-end framework for internal algorithmic auditing\.InProceedings of the 2020 Conference on Fairness, Accountability, and Transparency,pp\. 33–44\.External Links:[Document](https://dx.doi.org/10.1145/3351095.3372873),[Link](https://doi.org/10.1145/3351095.3372873)Cited by:[§7](https://arxiv.org/html/2605.05403#S7.p3.1)\. - \[37\]L\. Ranaldi and G\. Pucci\(2023\)When large language models contradict humans? 
large language models’ sycophantic behaviour\.arXiv preprint arXiv:2311\.09410\.External Links:[Link](https://arxiv.org/abs/2311.09410)Cited by:[§2\.1](https://arxiv.org/html/2605.05403#S2.SS1.p2.1)\. - \[38\]H\. Rashkin, E\. M\. Smith, M\. Li, and Y\. Boureau\(2019\-07\)Towards empathetic open\-domain conversation models: a new benchmark and dataset\.InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics,Florence, Italy,pp\. 5370–5381\.External Links:[Document](https://dx.doi.org/10.18653/v1/P19-1534),[Link](https://aclanthology.org/P19-1534/)Cited by:[§2\.3](https://arxiv.org/html/2605.05403#S2.SS3.p1.1)\. - \[39\]J\. Rehani, V\. Oldemburgo de Mello, D\. Ovsyannikova, A\. Anderson, and M\. Inzlicht\(2026\-03\)The social sycophancy scale: a psychometrically validated measure of sycophancy\.arXiv preprint arXiv:2603\.15448\.External Links:[Link](https://arxiv.org/abs/2603.15448)Cited by:[§2\.2](https://arxiv.org/html/2605.05403#S2.SS2.p2.1)\. - \[40\]M\. Sharma, M\. Tong, T\. Korbak, D\. Duvenaud, A\. Askell, S\. R\. Bowman, E\. Durmus, Z\. Hatfield\-Dodds, S\. R\. Johnston, S\. M\. Kravec, T\. Maxwell, S\. McCandlish, K\. Ndousse, O\. Rausch, N\. Schiefer, D\. Yan, M\. Zhang, and E\. Perez\(2024\-05\)Towards understanding sycophancy in language models\.InThe Twelfth International Conference on Learning Representations,Vienna, Austria\.External Links:[Link](https://openreview.net/forum?id=tvhaxkMKAn)Cited by:[§1](https://arxiv.org/html/2605.05403#S1.p3.1),[§2\.1](https://arxiv.org/html/2605.05403#S2.SS1.p1.1),[§2\.1](https://arxiv.org/html/2605.05403#S2.SS1.p2.1)\. - \[41\]J\. Shi, T\. J\. Zhang, Z\. Jin, and V\. 
Conitzer\(2026\-04\)From hallucination to scheming: a unified taxonomy and benchmark analysis for llm deception\.arXiv preprint arXiv:2604\.04788\.Note:Accepted to the ICLR 2026 Workshop on Agents in the Wild: Safety, Security, and BeyondExternal Links:[Link](https://arxiv.org/abs/2604.04788)Cited by:[§1](https://arxiv.org/html/2605.05403#S1.p2.1)\. - \[42\]H\. Shum, X\. He, and D\. Li\(2018\-01\)From ELIZA to XiaoIce: challenges and opportunities with social chatbots\.Frontiers of Information Technology & Electronic Engineering19,pp\. 10–26\.External Links:[Document](https://dx.doi.org/10.1631/FITEE.1700826),[Link](https://doi.org/10.1631/FITEE.1700826)Cited by:[§1](https://arxiv.org/html/2605.05403#S1.p4.1)\. - \[43\]S\. Sonkar, X\. Chen, N\. Liu, R\. G\. Baraniuk, and M\. Sachan\(2024\)LLM\-based cognitive models of students with misconceptions\.arXiv preprint arXiv:2410\.12294\.External Links:[Link](https://arxiv.org/abs/2410.12294)Cited by:[§2\.1](https://arxiv.org/html/2605.05403#S2.SS1.p2.1)\. - \[44\]V\. Sorin, D\. Brin, Y\. Barash, E\. Konen, A\. Charney, G\. Nadkarni, and E\. Klang\(2024\-12\)Large language models and empathy: systematic review\.Journal of Medical Internet Research26,pp\. e52597\.External Links:[Document](https://dx.doi.org/10.2196/52597),[Link](https://doi.org/10.2196/52597)Cited by:[§2\.3](https://arxiv.org/html/2605.05403#S2.SS3.p1.1)\. - \[45\]S\. Tan, S\. Zhuang, K\. Montgomery, W\. Y\. Tang, A\. Cuadron, C\. Wang, R\. A\. Popa, and I\. Stoica\(2025\)JudgeBench: a benchmark for evaluating LLM\-based judges\.InThe Thirteenth International Conference on Learning Representations \(ICLR\),External Links:[Link](https://proceedings.iclr.cc/paper_files/paper/2025/file/9e720fce64f91114c49cfd640d821da3-Paper-Conference.pdf)Cited by:[§2\.2](https://arxiv.org/html/2605.05403#S2.SS2.p1.1)\. - \[46\]K\. Wang, J\. Li, S\. Yang, Z\. Zhang, and D\. 
Wang\(2026\)When truth is overridden: uncovering the internal origins of sycophancy in large language models\.InProceedings of the Fortieth AAAI Conference on Artificial Intelligence \(AAAI\-26\),External Links:[Link](https://ojs.aaai.org/index.php/AAAI/article/view/40645/44606)Cited by:[§5\.2](https://arxiv.org/html/2605.05403#S5.SS2.p1.1)\. - \[47\]V\. Williams and B\. Rosman\(2025\)Heartificial intelligence: exploring empathy in language models\.arXiv preprint arXiv:2508\.08271\.External Links:[Link](https://arxiv.org/abs/2508.08271)Cited by:[§2\.3](https://arxiv.org/html/2605.05403#S2.SS3.p1.1)\. - \[48\]L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. P\. Xing, H\. Zhang, J\. E\. Gonzalez, and I\. Stoica\(2023\-12\)Judging LLM\-as\-a\-judge with MT\-bench and chatbot arena\.InAdvances in Neural Information Processing Systems 37 \(NeurIPS 2023\),pp\. 46595–46623\.External Links:[Link](https://dl.acm.org/doi/10.5555/3666122.3668142)Cited by:[§2\.2](https://arxiv.org/html/2605.05403#S2.SS2.p1.1),[§2\.3](https://arxiv.org/html/2605.05403#S2.SS3.p2.1)\. - \[49\]L\. Zhu, X\. Wang, and X\. Wang\(2025\)JudgeLM: fine\-tuned large language models are scalable judges\.InThe Thirteenth International Conference on Learning Representations \(ICLR\),External Links:[Link](https://proceedings.iclr.cc/paper_files/paper/2025/file/7f8f73134e253845a8f82983219a8452-Paper-Conference.pdf)Cited by:[§2\.2](https://arxiv.org/html/2605.05403#S2.SS2.p1.1)\.