Auditing Framing-Sensitive Behavioral Instability in Large Language Models for Mental Health Interactions
Summary
This paper investigates how contextual framing affects LLM responses in mental health interactions, finding systematic behavioral variation and demonstrating that internal representations encode framing information throughout transformer layers.
# Auditing Framing-Sensitive Behavioral Instability in Large Language Models for Mental Health Interactions
Source: [https://arxiv.org/html/2606.26982](https://arxiv.org/html/2606.26982)
###### Abstract
Large language models \(LLMs\) are increasingly being integrated into mental health support tools and other psychologically sensitive conversational applications\. In such settings, behavioral stability and consistency are important for trustworthy human\-AI interaction\. However, semantically similar concerns can be presented through different contextual framings, potentially eliciting different model responses\. Such framing\-sensitive variability may challenge user expectations regarding system behavior and complicate the assessment of AI reliability\. While prior studies have primarily examined such effects at the behavioral level, less is known about how framing\-related variation is reflected in the internal representations of aligned language models\. In this work, we investigate these effects using controlled matched prompts spanning multiple contextual framing conditions across several instruction\-tuned model families\. Across architectures, framing systematically alters interpretive response tendencies\. Layer\-wise probing analyses show that behavior\-associated information remains decodable throughout transformer depth, with architecture\-dependent variation in decoding strength\. Moreover, held\-out framing probes remained consistently above chance across architectures despite strong lexical baselines\. Activation steering experiments further suggest that framing\-associated representational directions can partially modulate downstream behavioral outcomes\. Taken together, these findings indicate that robustness to contextual variation may represent an important consideration when evaluating the consistency and trustworthiness of conversational AI systems deployed in mental\-health\-oriented interactions\.
###### keywords:
large language models; digital mental health; artificial intelligence; trustworthy AI; behavioral calibration; human\-AI interaction; mental health support
## 1 Introduction
Large language models \(LLMs\) are increasingly deployed as interactive assistants in settings involving emotionally sensitive, ambiguous, or high\-stakes communication, including digital mental health support, psychoeducational systems, and patient\-facing conversational assistants\. In these contexts, users often describe distress, uncertainty, or help\-seeking concerns using substantially different contextual framings despite expressing broadly similar underlying needs\. For example, a user may present uncertainty as documentation, invoke institutional responsibility, seek epistemic interpretation, or frame the interaction as supportive role\-based guidance\. Although these systems are not intended to replace clinicians, they increasingly participate in conversations involving emotional distress and mental health concerns, making behavioral consistency important for trust, reliability, and patient safety\. When conversational systems respond differently to semantically similar concerns based primarily on contextual presentation, users may develop inaccurate expectations regarding system behavior\. Such inconsistencies may contribute to miscalibrated trust and uncertainty regarding system behavior, particularly in emotionally sensitive interactions where consistent responses are important\. Prior work has shown that aligned language models are highly sensitive to prompt framing, social context, and conversational cues, exhibiting phenomena such as sycophancy, preference imitation, reward hacking, and context\-dependent safety behavior\[[1](https://arxiv.org/html/2606.26982#bib.bib1),[3](https://arxiv.org/html/2606.26982#bib.bib3),[4](https://arxiv.org/html/2606.26982#bib.bib4),[5](https://arxiv.org/html/2606.26982#bib.bib5)\]\. 
However, relatively little is known about how contextual framing influences representational structure in mental\-health\-oriented interactions and whether such behavioral variation is reflected in the internal representations of aligned language models\.
Recent work suggests that alignment\-relevant behavior is not limited to explicitly harmful prompts, but can also depend on subtle contextual cues that affect model calibration and downstream responses\[[6](https://arxiv.org/html/2606.26982#bib.bib6),[7](https://arxiv.org/html/2606.26982#bib.bib7),[8](https://arxiv.org/html/2606.26982#bib.bib8)\]\. Models can become overly agreeable, excessively interpretive, or weakly calibrated depending on how a request is framed, even when the semantic intent remains unchanged\[[9](https://arxiv.org/html/2606.26982#bib.bib9),[10](https://arxiv.org/html/2606.26982#bib.bib10)\]\. These concerns are particularly relevant in healthcare\-facing and other high\-stakes conversational settings, where subtle contextual variation may influence how models interpret emotionally nuanced or ambiguous user situations\. In such environments, excessive interpretive escalation or inconsistent supportive behavior may affect both user trust and the perceived reliability of AI\-assisted communication\. These observations motivate an important representational question: when contextual framing changes model behavior, does it merely alter surface\-level wording, or is it associated with systematic differences in the internal representations linked to different interpretive response tendencies?
Most existing evaluations of alignment and safety focus primarily on observable outputs, such as refusal rates, toxicity, hallucination, harmful completion likelihood, or benchmark\-level robustness\[[12](https://arxiv.org/html/2606.26982#bib.bib12),[17](https://arxiv.org/html/2606.26982#bib.bib17),[18](https://arxiv.org/html/2606.26982#bib.bib18)\]\. While these behavioral evaluations are essential, they provide limited insight into how contextual signals are internally represented and propagated through transformer computations\. Recent mechanistic interpretability research has increasingly argued that understanding internal representations is necessary for diagnosing and controlling alignment\-related failures\[[19](https://arxiv.org/html/2606.26982#bib.bib19),[20](https://arxiv.org/html/2606.26982#bib.bib20),[21](https://arxiv.org/html/2606.26982#bib.bib21)\]\.
Representation\-level methods provide a promising framework for studying these questions\. Representation Engineering proposes that high\-level model behaviors may correspond to identifiable latent directions in hidden\-state space\[[22](https://arxiv.org/html/2606.26982#bib.bib22)\]\. Similarly, work on latent knowledge and elicitation demonstrates that models may internally encode information that differs from their explicit outputs\[[6](https://arxiv.org/html/2606.26982#bib.bib6)\]\. Activation engineering approaches, including Activation Addition and Contrastive Activation Addition \(CAA\), further show that behaviorally meaningful directions can be extracted from hidden states and used to causally steer generation behavior during inference\[[8](https://arxiv.org/html/2606.26982#bib.bib8),[23](https://arxiv.org/html/2606.26982#bib.bib23)\]\.
Despite these advances, an important gap remains between behavioral studies of prompt sensitivity and mechanistic studies of internal representations\. Existing work often examines whether models comply, refuse, hallucinate, or become sycophantic under specific prompting conditions\. Much less is known about whether contextual framing acts as a framing\-associated representational signal that systematically influences internal interpretive tendencies and behavioral calibration\. This gap is especially important for aligned assistants, where undesirable behavior may emerge not as explicit harmful content, but as subtle shifts in interpretive escalation, over\-interpretation, or excessive inference about user state\.
In this work, we study context\-sensitive interpretive behavior in aligned LLMs during mental\-health interactions under controlled framing variation\. We construct matched\-prompt sets in which semantic intent is preserved while contextual framing varies across documentation, epistemic, institutional, liability, and role\-based conditions\. We evaluate multiple instruction\-tuned model families and annotate responses according to calibration\-related response tendencies, distinguishing restrained\-supportive responses from more interpretive or escalation\-prone patterns\.
We then connect these behavioral outcomes to internal representations using layer\-wise hidden\-state probing, held\-out framing generalization, and activation steering analyses\. Our results show that contextual framing systematically alters interpretive response tendencies across architectures\. Documentation framing often increases interpretive escalation, whereas institutional framing frequently stabilizes or suppresses escalation tendencies\. At the representation level, behavior\-associated information is decodable from hidden states; however, lexical baselines show that surface framing cues explain a substantial portion of this signal\. Held\-out framing probes nevertheless remain above chance despite strong lexical baselines, suggesting that part of the behavior\-associated signal generalizes beyond the specific framing templates observed during training\. Finally, activation steering experiments show that framing\-associated representational directions can partially modulate downstream response behavior in several architectures, providing preliminary intervention evidence rather than a complete mechanistic decomposition\.
Our main contributions are as follows:
- We introduce a controlled matched\-prompt framework for studying response calibration under varying contextual framing while preserving underlying semantic intent\.
- We provide behavioral evidence that non\-adversarial contextual framing systematically alters interpretive response tendencies across multiple aligned LLM families\.
- We show that framing\-associated behavioral signals are decodable from hidden\-state representations, while probe controls reveal both substantial lexical contributions and partial held\-out framing generalization\.
- We provide preliminary intervention evidence that activation steering can modulate downstream response tendencies in several architectures\.
More broadly, this work contributes to ongoing efforts toward developing more transparent, trustworthy, and behaviorally calibrated conversational AI systems for sensitive real\-world interaction settings\.
## 2 Related Work
### 2\.1 Context Sensitivity and Alignment in LLMs
Aligned LLMs are highly sensitive to conversational framing, social cues, and interaction context\. Prior work on sycophancy shows that instruction\-tuned models often adapt to user beliefs or preferences even when these conflict with factual correctness\[[1](https://arxiv.org/html/2606.26982#bib.bib1),[3](https://arxiv.org/html/2606.26982#bib.bib3),[11](https://arxiv.org/html/2606.26982#bib.bib11)\]\. Other studies examined prompt\-dependent refusal behavior, jailbreak susceptibility, and context\-conditioned alignment failures\[[4](https://arxiv.org/html/2606.26982#bib.bib4),[17](https://arxiv.org/html/2606.26982#bib.bib17),[9](https://arxiv.org/html/2606.26982#bib.bib9)\]\. More recent work suggests that these behaviors may reflect deeper representational mechanisms rather than purely surface\-level prompting artifacts\[[26](https://arxiv.org/html/2606.26982#bib.bib26),[27](https://arxiv.org/html/2606.26982#bib.bib27)\]\. However, most existing studies focus primarily on output\-level behavior, leaving open the question of how contextual framing influences internal representational organization even when semantic intent remains constant\.
### 2\.2 Mechanistic Interpretability and Representation\-Level Analysis
Mechanistic interpretability seeks to identify the internal computational structures underlying model behavior\[[19](https://arxiv.org/html/2606.26982#bib.bib19),[20](https://arxiv.org/html/2606.26982#bib.bib20),[13](https://arxiv.org/html/2606.26982#bib.bib13)\]\. Recent work increasingly emphasizes representation\-level analysis as a framework for understanding alignment\-relevant behaviors in LLMs\[[14](https://arxiv.org/html/2606.26982#bib.bib14),[15](https://arxiv.org/html/2606.26982#bib.bib15)\]\. Representation Engineering proposed that high\-level behaviors may correspond to identifiable latent directions in hidden\-state space\[[22](https://arxiv.org/html/2606.26982#bib.bib22)\], while latent knowledge studies showed that hidden states may encode information not reflected in explicit outputs\[[6](https://arxiv.org/html/2606.26982#bib.bib6)\]\.
### 2\.3 Activation Steering and Latent Behavioral Directions
Activation steering methods attempt to causally manipulate model behavior through interventions in hidden\-state representations during inference\. Activation engineering and CAA demonstrated that behaviorally meaningful steering directions can often be constructed from latent activation differences\[[8](https://arxiv.org/html/2606.26982#bib.bib8),[23](https://arxiv.org/html/2606.26982#bib.bib23)\]\. More recent work suggests that several alignment\-related behaviors correspond to low\-dimensional representational subspaces\. Arditi et al\. \[[24](https://arxiv.org/html/2606.26982#bib.bib24)\] showed that refusal behavior can often be mediated by a dominant latent direction, while subsequent studies explored representational subspaces associated with over\-refusal, sycophancy, and alignment calibration\[[16](https://arxiv.org/html/2606.26982#bib.bib16),[26](https://arxiv.org/html/2606.26982#bib.bib26)\]\. Other recent studies highlight that steering effects are often architecture\-dependent and non\-linear\[[25](https://arxiv.org/html/2606.26982#bib.bib25)\]\. These findings motivate studying activation steering not only as a behavioral control method, but also as a probe into latent routing structure\.
### 2\.4 Our Positioning
Unlike studies evaluating clinical efficacy or therapeutic outcomes, this work examines behavioral reliability and representational organization in AI systems that may be deployed within mental\-health\-related conversational settings\. We focus on framing\-sensitive behavioral variation as a trustworthiness, trust\-calibration, and patient\-safety concern rather than as a diagnostic or treatment evaluation problem\. Specifically, we examine whether contextual framing can produce behavior that may challenge user expectations regarding consistency and reliability in AI\-assisted mental\-health interactions\.
## 3 Methodology
Figure 1: Overview of the experimental framework\. Matched prompts are rewritten across contextual framing conditions while preserving underlying semantic intent\. Model responses are behaviorally annotated and analyzed using layer\-wise probing, representation analysis, and activation steering to study framing\-conditioned interpretive\-routing behavior\.
### 3\.1 Matched Prompt Construction
The core idea of this paper is to study how contextual framing induces systematic differences in internal representations associated with interpretive response tendencies in aligned large language models while preserving underlying semantic intent\. To isolate contextual effects from semantic variation, we constructed a controlled matched\-prompt framework in which each prompt instance was rewritten across multiple framing conditions while maintaining the same underlying communicative intent\. The dataset was designed around psychologically realistic but semantically stable ambiguous\-support scenarios\. Prompts were selected and curated to avoid explicit crisis language, direct self\-harm intent, overt medical diagnosis requests, or adversarial jailbreak\-style instructions\. Instead, prompts were constructed to preserve interpretive ambiguity while allowing multiple plausible response calibrations\.
### 3\.2 Contextual Framing Conditions
We operationalized contextual framing through five controlled framing categories designed to probe distinct alignment\-relevant contextual signals:
- Documentation Framing: prompts framed as formal documentation, reporting, or record\-keeping contexts\.
- Epistemic Framing: prompts emphasizing interpretation, understanding, uncertainty resolution, or explanatory reasoning\.
- Institutional Framing: prompts invoking organizational, procedural, or institutional responsibility constraints\.
- Liability Framing: prompts emphasizing caution, consequences, accountability, or risk\-sensitive interpretation\.
- Role Framing: prompts positioning the assistant within an explicitly supportive or advisory interaction role\.
These framing categories were selected because prior alignment and interaction studies suggest that social authority, interpretive legitimacy, institutional responsibility, and conversational role cues can substantially shift model behavior\[[1](https://arxiv.org/html/2606.26982#bib.bib1),[26](https://arxiv.org/html/2606.26982#bib.bib26),[27](https://arxiv.org/html/2606.26982#bib.bib27)\]\. However, unlike adversarial jailbreak settings, our framing conditions were intentionally non\-adversarial and semantically aligned with plausible real\-world conversational contexts\.
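As a rough illustration of the matched-prompt idea, the sketch below builds one prompt group in which a single underlying concern is presented under a base condition and the five framings. The paper describes the prompts as rewritten rather than templated, so the template strings, the `PromptVariant` class, and `build_group` are hypothetical placeholders and not the authors' materials.

```python
from dataclasses import dataclass

# Base condition plus the five framing categories listed above.
FRAMINGS = ("base", "documentation", "epistemic", "institutional", "liability", "role")

# Hypothetical placeholder templates (assumption): the paper's prompts were
# rewritten per framing, and their actual wording is not reproduced here.
TEMPLATES = {
    "base": "{concern}",
    "documentation": "I am writing this down for my records. {concern}",
    "epistemic": "I am trying to understand what this means. {concern}",
    "institutional": "Our organization's procedures require me to ask. {concern}",
    "liability": "I want to be careful about possible consequences. {concern}",
    "role": "You are acting as a supportive guide. {concern}",
}


@dataclass(frozen=True)
class PromptVariant:
    group_id: int  # shared by every framing of the same underlying concern
    framing: str
    text: str


def build_group(group_id: int, concern: str) -> list[PromptVariant]:
    """Return one variant per framing, all carrying the same concern text."""
    return [
        PromptVariant(group_id, framing, TEMPLATES[framing].format(concern=concern))
        for framing in FRAMINGS
    ]
```

Keeping a shared `group_id` on every variant is what later allows evaluation splits to keep all framings of one concern on the same side of a train/test boundary.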
### 3\.3 Models
We evaluated multiple instruction\-tuned transformer families spanning different architectures and parameter scales: Qwen\-0\.5B, Qwen\-1\.5B, Gemma\-2B, Gemma\-9B, Mistral\-7B, Phi\-3\.5, and Phi\-4\-mini\. These models were selected to enable comparison across both architectural families and model scales while maintaining broad coverage of contemporary aligned open\-weight assistants\.
### 3\.4 Behavioral Annotation Framework
To study variation in interpretive response calibration, we developed a structured annotation framework capturing different response calibration styles\. Rather than evaluating correctness or helpfulness alone, the annotation framework focused on how strongly the model interpreted, escalated, or constrained ambiguous user intent\.
Responses were annotated into four behavioral categories:
- Weak/Disengaged: minimal, evasive, generic, or weakly supportive responses with limited engagement\.
- Restrained\-Supportive: supportive responses that avoid excessive interpretation or escalation\.
- Interpretive\-Supportive: responses that actively infer latent emotional, psychological, or situational implications beyond explicitly stated content\.
- Escalated Interpretation: highly interpretive responses exhibiting strong escalation, quasi\-diagnostic reasoning, or excessive inference relative to the ambiguity of the original prompt\.
These labels operationalize broad interpretive tendencies for analysis purposes and do not imply strictly discrete latent behavioral states\. A subset of annotations was independently reviewed by a psychology co\-author to assess consistency and domain relevance\. Annotations additionally included confidence scores reflecting labeling certainty\.
### 3\.5 Hidden\-State Extraction
To investigate how contextual framing is reflected in internal representational organization, we extracted transformer hidden states from all layers during inference\. For each prompt\-response pair, we extracted the final residual\-stream activation corresponding to the last generated response token prior to EOS generation\. Hidden states were extracted independently for each framing condition and model family\. These representations served as the basis for downstream probing, geometric analysis, and activation steering experiments\.
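The token-selection step can be sketched without any model library. The snippet assumes hidden states are already available as a nested list indexed `[layer][position]` and that generated token ids are known; both function names are hypothetical, and how the states are obtained from a given model is left out.

```python
def last_response_index(token_ids, eos_id):
    """Index of the last generated token that is not EOS."""
    idx = len(token_ids) - 1
    while idx >= 0 and token_ids[idx] == eos_id:
        idx -= 1
    if idx < 0:
        raise ValueError("sequence contains no non-EOS token")
    return idx


def final_states_per_layer(hidden_states, token_ids, eos_id):
    """Pick the activation vector at the last non-EOS position for every layer.

    hidden_states[layer][position] is assumed to be one activation vector.
    """
    idx = last_response_index(token_ids, eos_id)
    return [layer[idx] for layer in hidden_states]
```

Returning one vector per layer gives the per-layer inputs that the probing and steering analyses operate on.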
### 3\.6 Layer\-Wise Probing
We evaluated how behavior\-associated signals are distributed across transformer depth using layer\-wise probing analysis\. For each layer independently, we trained logistic regression probes to predict response categories directly from hidden\-state activations:
$$P(y=1 \mid h) = \sigma(Wh + b) \tag{1}$$
where $h$ denotes the hidden\-state representation extracted from a given transformer layer, $W$ and $b$ are the probe weight matrix and bias term, respectively, $\sigma$ is the logistic sigmoid, and $y$ is the binary interpretive\-routing label\. Balanced accuracy was computed using stratified cross\-validation\. To compare architectures with different layer counts, we additionally analyzed routing emergence in relative\-depth coordinates\.
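For concreteness, the probe in Equation (1) and the balanced-accuracy metric can be written out in a few lines. This is a minimal sketch for the binary case, where the weights reduce to a single vector `w`; the function names are illustrative, and fitting the weights is not shown.

```python
import math


def probe_probability(h, w, b):
    """Equation (1) for a binary probe: sigma(w . h + b)."""
    z = sum(wi * hi for wi, hi in zip(w, h)) + b
    return 1.0 / (1.0 + math.exp(-z))


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for binary labels 0/1.

    Both classes must occur in y_true, otherwise a recall is undefined.
    """
    recalls = []
    for cls in (0, 1):
        total = sum(1 for y in y_true if y == cls)
        hits = sum(1 for y, p in zip(y_true, y_pred) if y == cls and p == cls)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)
```

Balanced accuracy matters here because a probe that always predicts the majority class scores only 0.5, whatever the class imbalance, which keeps chance level comparable across models and framings.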
### 3\.7 Activation Steering
To evaluate how framing\-associated representational directions influence response calibration, we performed activation steering experiments using contrastive latent directions\. Following prior work on activation engineering and CAA\[[8](https://arxiv.org/html/2606.26982#bib.bib8),[23](https://arxiv.org/html/2606.26982#bib.bib23)\], steering directions were constructed from activation differences between restrained\-supportive and higher\-interpretation routing states\. During inference, these directions were subtracted from hidden\-state activations at selected layers using varying steering strengths\.
$$h' = h - \alpha d \tag{2}$$
where $d$ is the steering direction and $\alpha$ controls intervention strength\.
We then re\-generated responses under controlled steering conditions and re\-annotated outputs using the same behavioral framework\.
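The steering update in Equation (2) can be illustrated with plain vectors. The sketch assumes the direction is the mean activation of higher-interpretation responses minus the mean activation of restrained-supportive responses, so that subtracting it pushes toward restraint; the text does not state the sign convention, and the function names are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def caa_direction(high_interp_states, restrained_states):
    """Contrastive direction: mean(high-interpretation) - mean(restrained).

    The sign convention is an assumption, chosen so that subtraction in
    steer() moves activations toward the restrained-supportive side.
    """
    mu_high = mean_vector(high_interp_states)
    mu_low = mean_vector(restrained_states)
    return [a - b for a, b in zip(mu_high, mu_low)]


def steer(h, d, alpha):
    """Equation (2): h' = h - alpha * d."""
    return [hi - alpha * di for hi, di in zip(h, d)]


# Steering strengths reported in the experimental setup: 0.0 to 2.0 in steps of 0.5.
ALPHAS = [0.0, 0.5, 1.0, 1.5, 2.0]
```

With `alpha = 0.0` the activation is unchanged, so the first grid point doubles as the unsteered baseline.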
### 3\.8 Statistical Analysis
Behavioral proportions were computed for each model and framing condition\. To evaluate framing\-dependent effects while accounting for model\-specific differences, logistic regression models were fitted with response calibration labels as the dependent variable and framing condition and model identity as predictors\. Statistical significance was assessed using Wald tests with a significance threshold of $p < 0.05$\.
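A Wald test for a single coefficient divides the estimate by its standard error and compares the result to a standard normal distribution. The sketch below shows that calculation only; the standard errors passed in any example are made-up inputs, since this section does not report them.

```python
import math


def wald_test(beta, se):
    """Two-sided Wald z-test for one regression coefficient.

    Returns (z, p), where p = 2 * (1 - Phi(|z|)) computed through erfc.
    """
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


def is_significant(beta, se, threshold=0.05):
    """Apply the significance threshold used in the analysis (p < 0.05)."""
    _, p = wald_test(beta, se)
    return p < threshold
```

For example, `wald_test(1.96, 1.0)` yields a p-value just under 0.05, the familiar two-sided cutoff.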
## 4 Results
### 4\.1 Dataset and Experimental Setup
#### Dataset statistics\.
The final dataset consisted of 653 matched prompt groups × 6 framing conditions: Base, Documentation, Epistemic, Institutional, Liability, and Role framing\. Behavioral annotations were additionally collapsed into a binary interpretive\-routing variable distinguishing restrained\-supportive responses from higher\-interpretation routing behaviors\.
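The collapse to a binary variable and the per-condition rates can be sketched as follows. The label strings are hypothetical, and placing Weak/Disengaged on the lower-interpretation side is an assumption, since the text only names the restrained-supportive versus higher-interpretation contrast.

```python
from collections import defaultdict

# Assumption: interpretive-supportive and escalated responses count as
# higher-interpretation routing; weak/disengaged and restrained do not.
HIGH_INTERPRETATION = {"interpretive_supportive", "escalated_interpretation"}


def to_binary(label):
    """Map a four-way annotation label to the binary routing variable."""
    return 1 if label in HIGH_INTERPRETATION else 0


def routing_rates(records):
    """Compute the interpretive-routing rate for each (model, framing) cell.

    records: iterable of (model, framing, label) tuples.
    """
    counts = defaultdict(lambda: [0, 0])  # [high-interpretation count, total]
    for model, framing, label in records:
        cell = counts[(model, framing)]
        cell[0] += to_binary(label)
        cell[1] += 1
    return {key: high / total for key, (high, total) in counts.items()}
```

With 653 groups and 6 conditions, each model contributes 3,918 prompt variants to such a table if every variant is annotated.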
#### Probe evaluation\.
Layer\-wise probing analyses used standardized hidden\-state activations and logistic regression classifiers with balanced class weighting\. Cross\-validation used a 5\-fold stratified evaluation with matched prompt groups treated as grouping variables to prevent semantically related framing variants from appearing across train and test partitions\. We additionally evaluated two controls: \(i\) a random\-label control obtained by permuting response calibration labels under the same grouped split structure, and \(ii\) a Term Frequency\-Inverse Document Frequency \(TF\-IDF\) lexical baseline using unigram and bigram prompt features with logistic regression classification\.
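The grouping constraint and the random-label control can be sketched in plain Python. This is a simplified illustration: it assigns whole prompt groups to folds but does not stratify by label as the reported evaluation does, and the function names are hypothetical.

```python
import random


def group_folds(group_ids, n_splits=5, seed=0):
    """Assign every sample a fold index so that no prompt group is split.

    Simplification (assumption): folds are balanced by group count only,
    not stratified by response label.
    """
    unique = sorted(set(group_ids))
    rng = random.Random(seed)
    rng.shuffle(unique)
    fold_of_group = {g: i % n_splits for i, g in enumerate(unique)}
    return [fold_of_group[g] for g in group_ids]


def permuted_labels(labels, seed=0):
    """Random-label control: keep label counts, break the link to inputs."""
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    return shuffled
```

Because all six framings of a concern share a group id, a probe can never be tested on a reworded version of a concern it saw during training.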
#### Held\-out framing generalization\.
For held\-out framing evaluation, probes were trained on five framing categories and evaluated on the excluded framing category\. This evaluation tests whether routing\-related representations partially generalize beyond exact framing templates\.
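The held-out protocol amounts to a leave-one-framing-out split over the six conditions. A minimal sketch, with a hypothetical function name, is:

```python
def leave_one_framing_out(framings):
    """Yield (held_out, train_idx, test_idx) once per distinct framing.

    framings: one framing name per sample, in sample order.
    """
    for held_out in sorted(set(framings)):
        train_idx = [i for i, f in enumerate(framings) if f != held_out]
        test_idx = [i for i, f in enumerate(framings) if f == held_out]
        yield held_out, train_idx, test_idx
```

Each split trains on five conditions and tests on the sixth, so any surface wording specific to the held-out framing is unseen at training time.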
#### Activation steering setup\.
Activation steering experiments used CAA directions computed from activation differences between restrained\-supportive and higher\-interpretation behavioral states\. Steering interventions were applied to the residual stream representation of the final generated token at architecture\-specific high\-decoding layers\. Steering coefficients ranged from 0\.0 to 2\.0 in increments of 0\.5\. Steering evaluations used approximately balanced prompt samples across framing categories, with 50 evaluation prompts per model\. All generation experiments used deterministic decoding settings\.
### 4\.2 Contextual Framing Modulates Response Calibration
We first quantified how contextual framing influences interpretive\-routing behavior across aligned LLM families\. Figure [2](https://arxiv.org/html/2606.26982#S4.F2) summarizes framing\-conditioned interpretive\-routing rates across models and contextual conditions\. Documentation framing produced the highest interpretive\-routing rates\. In mental\-health\-oriented conversational settings, such variation may contribute to differences in how users experience support despite broadly similar underlying concerns\. For example, documentation framing increased interpretive routing rates to 0\.63 in Gemma\-9B and 0\.46 in Qwen\-0\.5B, while institutional framing consistently produced comparatively lower rates across several architectures\. Framing sensitivity also varied substantially across architectures\. Gemma\-9B, Phi\-4\-mini, and Qwen\-0\.5B exhibited relatively large framing\-induced routing shifts under several conditions, whereas models such as Qwen\-1\.5B displayed comparatively smaller and more stable shifts across framing variants\. These behavioral differences emerged despite preservation of the underlying communicative scenario across matched prompts\. This suggests that contextual framing influences how aligned LLMs calibrate ambiguous user situations beyond simple response paraphrasing\.
To quantify framing effects statistically, we modeled interpretive\-routing behavior using logistic regression with framing condition and model identity as predictors\. Documentation framing showed the largest positive association with interpretive routing \($\beta = 1.81$, $p < 0.001$\), followed by epistemic and role framing, whereas institutional framing did not exhibit a significant effect \(see Appendix Table [2](https://arxiv.org/html/2606.26982#S7.T2)\)\.
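As a reading aid, a logistic regression coefficient converts to an odds ratio by exponentiation. Assuming the reported documentation coefficient is a log-odds contrast against a reference framing (the reference level is not stated in this section; the Base condition is a natural guess), the conversion is:

$$\mathrm{OR} = e^{\beta} = e^{1.81} \approx 6.11$$

Under that assumption, and holding model identity fixed, the odds of a higher-interpretation response under documentation framing are roughly six times the odds under the reference condition.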
Figure 2:Framing\-induced changes in interpretive\-routing rate across mental\-health\-oriented conversational scenarios\.
### 4\.3 Behavior\-Associated Signals Across Transformer Depth
We next investigated how framing\-conditioned behavioral variation is reflected in internal hidden\-state representations\. To evaluate this, we trained probes to predict behavior labels from layer\-wise hidden\-state activations and examined generalization to unseen framing conditions\. As shown in Figure [3](https://arxiv.org/html/2606.26982#S4.F3), held\-out framing probes remained consistently above chance across architectures, with balanced accuracies ranging from approximately 0\.72 to 0\.83 depending on the model\. Although performance decreased relative to standard probing settings, the results suggest that behavior\-associated signals partially generalize beyond exact framing templates\.
To contextualize these results, Table [1](https://arxiv.org/html/2606.26982#S4.T1) compares held\-out probe performance against multiple control conditions\. Random\-label controls collapsed to chance performance, confirming that probe performance was not driven by trivial fitting effects\. In contrast, TF\-IDF lexical baselines achieved substantially higher accuracies, indicating that framing\-related lexical cues contribute strongly to decodability\. Nevertheless, held\-out framing probes remained reliably above chance across architectures, suggesting that the observed signals are not entirely reducible to direct lexical matching\. Variability across architectures further indicates that framing\-conditioned behavioral structure is represented with differing degrees of stability across model families\.
Table 1: Probe control analyses across architectures\. Held\-out framing probes remain above chance despite strong lexical framing signal captured by TF\-IDF baselines\. \(While lexical baselines achieve higher absolute accuracy, held\-out framing probes demonstrate that part of the signal generalizes beyond explicit framing templates\.\)
Figure 3: Held\-out framing decoding performance across architectures\. Hidden\-state probes remain consistently above chance on unseen framing conditions, suggesting partially transferable behavior\-associated representational structure beyond exact framing templates\.
Figure 4: Held\-out framing generalization across normalized transformer depth\.
To further examine how framing\-related information is distributed across transformer depth, we evaluated held\-out framing generalization across normalized layer bins \(Figure [4](https://arxiv.org/html/2606.26982#S4.F4)\)\. Across architectures, decoding performance remained consistently above chance throughout transformer depth, with balanced accuracies generally ranging between approximately 0\.52 and 0\.69\.
Several architectures exhibited modest increases in decodability at intermediate or later layers\. For example, Gemma\-2B reached its highest decoding performance in middle transformer regions, whereas Phi\-3\.5 showed comparatively stronger decodability toward deeper layers\. Other architectures, including Qwen\-1\.5B and Qwen\-0\.5B, displayed comparatively flatter decoding profiles across depth\. Overall, these results suggest that framing\-associated behavioral information is distributed across multiple representational stages rather than emerging exclusively at a single layer\.
### 4\.4 Activation Steering Modulates Response Tendencies
Finally, we evaluated how framing\-associated representational directions influence downstream response tendencies using activation steering interventions\. Following prior activation engineering work, steering directions were constructed from contrastive activation differences between restrained\-supportive and higher\-interpretation response states\. These directions were then subtracted from hidden\-state activations during inference using varying steering strengths\.
Figure [5](https://arxiv.org/html/2606.26982#S4.F5) summarizes steering effects across architectures\. Moderate steering strengths reduced interpretive\-routing rates in several models, particularly Mistral\-7B and Qwen\-1\.5B\. Notably, these reductions were not accompanied by substantial increases in weak or disengaged responses, suggesting that steering shifted behavioral calibration rather than catastrophically degrading generation quality\.
At higher steering strengths, several architectures exhibited partial rebound effects, indicating nonlinear sensitivity to intervention magnitude\. Steering sensitivity also varied across architectures, with some models exhibiting stronger behavioral modulation under intervention whereas others appeared comparatively resistant to representational perturbation\.
Figure 5:Activation steering effects across architectures\. Moderate steering strengths reduce interpretive\-routing behavior in several models while preserving broadly supportive responses\.
## 5 Discussion
Our behavioral results suggest that aligned LLMs are sensitive not only to the semantic content of a request, but also to the broader interactional context in which the request is framed\. The matched prompts preserved the underlying communicative scenario while varying contextual framing signals such as documentation, epistemic, institutional, or advisory context\. The resulting behavioral shifts therefore indicate that alignment behavior can be modulated by subtle contextual cues without requiring explicit adversarial prompting\. This observation has practical implications for real\-world deployment settings, where institutional workflows, liability\-oriented communication, or documentation practices may unintentionally alter how models calibrate ambiguous user situations\.
The probing results suggest that framing\-conditioned behavioral variation is reflected in internal hidden\-state organization, although not as fully abstract representations entirely independent of surface form\. Depth\-wise analyses further indicate that framing\-associated behavioral signals are distributed across multiple transformer stages rather than confined to isolated layers\. Partially transferable framing\-related signals remained decodable across transformer depth, although their stability and magnitude varied substantially across architectures\.
Notably, the observed patterns did not reveal a single universal emergence layer shared across models\. Some architectures exhibited comparatively stronger decodability in intermediate or later layers, whereas others showed relatively stable decoding profiles throughout depth\. This variability may indicate that different model families distribute contextual calibration signals differently across their internal processing hierarchy\. Nevertheless, consistently above\-chance decoding performance across depth supports the broader conclusion that contextual framing is associated with systematic differences in internal representational organization beyond purely surface\-level output variation\.
The strong performance of TF\-IDF lexical baselines indicates that lexical framing cues contribute substantially to decodability\. However, held\-out framing probes remained consistently above chance across architectures, suggesting that the observed signals are not entirely reducible to direct lexical matching and likely involve broader behavioral calibration mechanisms\.
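The held-out framing evaluation can be illustrated with a short sketch: a probe is fit on hidden states from all framing conditions except one and scored on the excluded condition, so above-chance accuracy cannot come from cues specific to that framing. The nearest-centroid probe, the two-dimensional toy features, and all names below are assumptions chosen to keep the example self-contained; the probe type used in the study is not specified in this section.

```python
def centroid(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def fit_centroid_probe(vectors, labels):
    """Fit a nearest-centroid probe: one mean vector per behavior label."""
    by_label = {}
    for vec, label in zip(vectors, labels):
        by_label.setdefault(label, []).append(vec)
    return {label: centroid(rows) for label, rows in by_label.items()}

def predict(probe, vec):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(vec, center))
    return min(probe, key=lambda label: sq_dist(probe[label]))

def leave_one_framing_out(samples):
    """samples: list of (framing, hidden_state, behavior_label) triples."""
    accuracy = {}
    for held_out in sorted({framing for framing, _, _ in samples}):
        train = [(v, y) for f, v, y in samples if f != held_out]
        test = [(v, y) for f, v, y in samples if f == held_out]
        probe = fit_centroid_probe([v for v, _ in train], [y for _, y in train])
        correct = sum(predict(probe, v) == y for v, y in test)
        accuracy[held_out] = correct / len(test)
    return accuracy

# Toy data (hypothetical): dimension 0 carries the behavior signal,
# dimension 1 carries a framing-specific offset.
samples = [
    ("base", [0.0, 0.0], 0), ("base", [0.2, 0.1], 0),
    ("base", [1.0, 0.0], 1), ("base", [1.2, 0.1], 1),
    ("documentation", [0.1, 1.0], 0), ("documentation", [0.0, 1.1], 0),
    ("documentation", [1.1, 1.0], 1), ("documentation", [0.9, 1.1], 1),
    ("institutional", [0.1, -1.0], 0), ("institutional", [0.2, -0.9], 0),
    ("institutional", [1.0, -1.0], 1), ("institutional", [1.1, -1.1], 1),
]

for framing, acc in leave_one_framing_out(samples).items():
    print(f"held-out framing = {framing}: accuracy = {acc:.2f}")
```

The same split could be applied to a lexical baseline (for example, TF-IDF features of the prompt text) so that both representations are compared under identical held-out conditions. The toy data are constructed to be cleanly separable; real hidden states would be expected to give noisier accuracies.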
The steering results provide preliminary intervention evidence that framing\-associated behavioral tendencies are linked to identifiable representational directions within hidden\-state space\. However, the observed nonlinear rebound effects and architecture\-dependent variability also suggest that these behavioral tendencies are not governed solely by linear control mechanisms\. Rather than supporting a strong mechanistic decomposition claim, the results more cautiously indicate that activation\-level interventions can partially influence downstream behavioral outcomes in some architectures\. This interpretation is consistent with the broader view that contextual framing affects behavior through distributed representational organization rather than discrete symbolic policy states\.
Conversational AI is increasingly being integrated into applications that support mental health, psychoeducation, and other psychologically sensitive interactions\. In such settings, users may reasonably expect semantically similar concerns to receive broadly consistent responses regardless of differences in communication style, contextual framing, or presentation\. However, our results indicate that contextual framing can systematically influence model behavior despite comparable underlying concerns\. From a human\-AI interaction perspective, this raises important questions regarding behavioral consistency, user expectations, and the trustworthiness of conversational systems\. When semantically similar situations elicit different levels of interpretation, escalation, or support, users may encounter behavior that appears unpredictable or inconsistent, potentially complicating their ability to form reliable expectations about system performance\. Although we do not evaluate trust directly, our findings suggest that framing robustness represents an important dimension of trustworthy AI evaluation\. More broadly, the results highlight the value of auditing behavioral stability in conversational AI systems deployed in sensitive domains where consistency and reliability are critical considerations\.
## 6 Conclusion
This work presents a multi\-level audit of framing\-sensitive behavioral instability in large language models used in mental\-health\-oriented interactions\. Across architectures, semantically similar concerns elicited systematically different response tendencies under different contextual framings, indicating that behavioral consistency cannot be assumed even when underlying communicative intent remains stable\. These behavioral differences were reflected in internal representations and were partially modifiable through activation\-level interventions\. Collectively, the results suggest that framing robustness may represent an important component of trustworthy conversational AI evaluation\. Future work should investigate how such framing\-sensitive variability influences user expectations, reliance behavior, trust calibration, and long\-term perceptions of AI reliability during real\-world human\-AI interaction\. More broadly, our findings highlight the value of combining behavioral and representation\-level analyses to audit the stability and trustworthiness of conversational AI systems deployed in sensitive domains\.
## Disclosure statement
The authors declare no conflicts of interest\.
## Funding
This research received no external funding\.
## Notes on contributor\(s\)
Conceptualization, A\.B\.; methodology, A\.B\.; software, A\.B\.; validation, A\.B\.; formal analysis, A\.B\.; investigation, A\.B\.; data curation, A\.B\.; writing\-original draft preparation, A\.B\.; review, A\.B\., A\.L\.G\., and M\.C\.; visualization, A\.B\.; supervision, M\.C\. All authors have read and agreed to the published version of the manuscript\.
## Nomenclature/Notation
- LLM: Large Language Model
- AI: Artificial Intelligence
- PCA: Principal Component Analysis
- CAA: Contrastive Activation Addition
- TF\-IDF: Term Frequency–Inverse Document Frequency
- JCM: Journal of Clinical Medicine
- NLP: Natural Language Processing
## References
- \[1\]Sharma, M\., Tong, M\., Korbak, T\., Duvenaud, D\., Askell, A\., Bowman, S\., … & Perez, E\. \(2024, May\)\. Towards understanding sycophancy in language models\. In International Conference on Learning Representations \(Vol\. 2024, pp\. 110\-144\)\.
- \[2\]Bo, J\. Y\., Kazemitabaar, M\., Deng, M\., Inzlicht, M\., & Anderson, A\. \(2026, April\)\. Invisible saboteurs: Sycophantic llms mislead novices in problem\-solving tasks\. In Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems \(pp\. 1\-31\)\.
- \[3\]Perez, E\., Ringer, S\., Lukosiute, K\., Nguyen, K\., Chen, E\., Heiner, S\., … & Kaplan, J\. \(2023, July\)\. Discovering language model behaviors with model\-written evaluations\. In Findings of the association for computational linguistics: ACL 2023 \(pp\. 13387\-13434\)\.
- \[4\]Wei, A\., Haghtalab, N\., & Steinhardt, J\. \(2023\)\. Jailbroken: How does llm safety training fail?\. Advances in neural information processing systems, 36, 80079\-80110\.
- \[5\]Yao, S\., Yu, D\., Zhao, J\., Shafran, I\., Griffiths, T\., Cao, Y\., & Narasimhan, K\. \(2023\)\. Tree of thoughts: Deliberate problem solving with large language models\. Advances in neural information processing systems, 36, 11809\-11822\.
- \[6\]Burns, C\., Ye, H\., Klein, D\., & Steinhardt, J\. \(2022\)\. Discovering latent knowledge in language models without supervision\. arXiv preprint arXiv:2212\.03827\.
- \[7\]Hubinger, E\., Van Merwijk, C\., Mikulik, V\., Skalse, J\., & Garrabrant, S\. \(2019\)\. Risks from learned optimization in advanced machine learning systems\. arXiv preprint arXiv:1906\.01820\.
- \[8\]Turner, A\.M\., Thiergart, L\., Leech, G\., Udell, D\., Vazquez, J\.J\., Mini, U\., & MacDiarmid, M\. \(2023\)\. Steering language models with activation engineering\. arXiv preprint arXiv:2308\.10248\.
- \[9\]Lin, J\., Ma, Z\., Gomez, R\., Nakamura, K\., He, B\., & Li, G\. \(2020\)\. A review on interactive reinforcement learning from human social feedback\. IEEE Access, 8, 120757\-120765\.
- \[10\]Achiam, J\., Adler, S\., Agarwal, S\., Ahmad, L\., Akkaya, I\., Aleman, F\. L\., … & McGrew, B\. \(2023\)\. Gpt\-4 technical report\. arXiv preprint arXiv:2303\.08774\.
- \[11\]Wang, K\., Li, J\., Yang, S\., Zhang, Z\., & Wang, D\. \(2026, March\)\. When truth is overridden: Uncovering the internal origins of sycophancy in large language models\. In Proceedings of the AAAI Conference on Artificial Intelligence \(Vol\. 40, No\. 39, pp\. 33566\-33574\)\.
- \[12\]Ganguli, D\., Hernandez, D\., Lovitt, L\., Askell, A\., Bai, Y\., Chen, A\., … & Clark, J\. \(2022, June\)\. Predictability and surprise in large generative models\. In Proceedings of the 2022 ACM conference on fairness, accountability, and transparency \(pp\. 1747\-1764\)\.
- \[13\]Bereska, L\., & Gavves, E\. \(2024\)\. Mechanistic interpretability for AI safety–a review\. arXiv preprint arXiv:2404\.14082\.
- \[14\]Zhao, H\., Chen, H\., Yang, F\., Liu, N\., Deng, H\., Cai, H\., … & Du, M\. \(2024\)\. Explainability for large language models: A survey\. ACM Transactions on Intelligent Systems and Technology, 15\(2\), 1\-38\.
- \[15\]Adams, E\., Bai, L\., Lee, M\., Yu, Y\., & AlQuraishi, M\. \(2025\)\. From mechanistic interpretability to mechanistic biology: Training, evaluating, and interpreting sparse autoencoders on protein language models\. bioRxiv\.
- \[16\]Wollschläger, T\., Elstner, J\., Geisler, S\., Cohen\-Addad, V\., Günnemann, S\., & Gasteiger, J\. \(2025\)\. The geometry of refusal in large language models: Concept cones and representational independence\. arXiv preprint arXiv:2502\.17420\.
- \[17\]Bai, Y\., Kadavath, S\., Kundu, S\., Askell, A\., Kernion, J\., Jones, A\., … & Kaplan, J\. \(2022\)\. Constitutional ai: Harmlessness from ai feedback\. arXiv preprint arXiv:2212\.08073\.
- \[18\]Ji, Z\., Lee, N\., Frieske, R\., Yu, T\., Su, D\., Xu, Y\., … & Fung, P\. \(2023\)\. Survey of hallucination in natural language generation\. ACM computing surveys, 55\(12\), 1\-38\.
- \[19\]Olah, C\., Cammarata, N\., Schubert, L\., Goh, G\., Petrov, M\., & Carter, S\. \(2020\)\. Zoom in: An introduction to circuits\. Distill, 5\(3\), e00024\-001\.
- \[20\]Elhage, N\., Nanda, N\., Olsson, C\., Henighan, T\., Joseph, N\., Mann, B\., … & Olah, C\. \(2021\)\. A mathematical framework for transformer circuits\. Transformer Circuits Thread, 1\(1\), 12\.
- \[21\]Nanda, N\., Chan, L\., Lieberum, T\., Smith, J\., & Steinhardt, J\. \(2023\)\. Progress measures for grokking via mechanistic interpretability\. arXiv preprint arXiv:2301\.05217\.
- \[22\]Zou, A\., Phan, L\., Chen, S\., Campbell, J\., Guo, P\., Ren, R\., … & Hendrycks, D\. \(2023\)\. Representation engineering: A top\-down approach to ai transparency\. arXiv preprint arXiv:2310\.01405\.
- \[23\]Rimsky, N\., Gabrieli, N\., Schulz, J\., Tong, M\., Hubinger, E\., & Turner, A\. \(2024, August\)\. Steering llama 2 via contrastive activation addition\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\) \(pp\. 15504\-15522\)\.
- \[24\]Arditi, A\., Obeso, O\., Syed, A\., Paleka, D\., Panickssery, N\., Gurnee, W\., & Nanda, N\. \(2024\)\. Refusal in language models is mediated by a single direction\. Advances in Neural Information Processing Systems, 37, 136037\-136083\.
- \[25\]Chu, Z\., Wang, Y\., Li, L\., Wang, Z\., Qin, Z\., & Ren, K\. \(2024, December\)\. A causal explainable guardrails for large language models\. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security \(pp\. 1136\-1150\)\.
- \[26\]Vennemeyer, D\., Duong, P\. A\., Zhan, T\., & Jiang, T\. \(2025\)\. Sycophancy Is Not One Thing: Causal Separation of Sycophantic Behaviors in LLMs\. arXiv preprint arXiv:2509\.21305\.
- \[27\]Huang, Y\., Sun, Y\., Zhang, Y\., Zhang, R\., Dong, Y\., & Wei, X\. \(2026\)\. Deceptionbench: A comprehensive benchmark for ai deception behaviors in real\-world scenarios\. Advances in neural information processing systems, 38\.
## 7 Appendices
Appendix A\. Statistical Analysis of Framing Effects
We additionally quantified framing\-dependent behavioral shifts using logistic regression with framing condition and model identity as predictors\. Table [2](https://arxiv.org/html/2606.26982#S7.T2) shows that documentation framing had the strongest positive association, while institutional framing did not exhibit a statistically significant effect\.
Interpretive\-routing rates across framing conditions and model families are shown in Figure [6](https://arxiv.org/html/2606.26982#S7.F6)\. Documentation and epistemic framing frequently produced the highest rates across architectures\. For example, Gemma\-9B exhibited an interpretive\-routing rate of 0\.96 under documentation framing compared to 0\.35 under the base condition, while Qwen\-0\.5B increased from 0\.54 in the base condition to 0\.96 under documentation framing\. In contrast, institutional framing generally produced comparatively lower interpretive\-routing rates across several architectures, including Gemma\-2B \(0\.17\), Gemma\-9B \(0\.31\), and Phi\-4\-mini \(0\.39\)\.
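Because logistic-regression coefficients reported relative to a base condition are differences in log-odds, the per-model rates quoted above can be placed on the same scale with a short calculation. The sketch below uses only the rates stated in this appendix; the resulting values are descriptive per-model quantities and are not the pooled coefficients of Table 2, which also include model identity as a predictor. The helper names are illustrative.

```python
import math

def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def log_odds_shift(base_rate, framed_rate):
    """Difference in log-odds between a framed condition and the base condition."""
    return logit(framed_rate) - logit(base_rate)

# Interpretive-routing rates quoted in the text (base vs. documentation framing).
rates = {
    "Gemma-9B": {"base": 0.35, "documentation": 0.96},
    "Qwen-0.5B": {"base": 0.54, "documentation": 0.96},
}

for model, r in rates.items():
    shift = log_odds_shift(r["base"], r["documentation"])
    print(f"{model}: log-odds shift = {shift:.2f}, odds ratio = {math.exp(shift):.1f}")
```

Expressing the shift in log-odds makes the two models comparable even though both reach the same rate under documentation framing: the model starting from the lower base rate shows the larger shift.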
Table 2: Logistic regression predicting interpretive\-routing behavior from framing condition\. Coefficients are reported relative to the base framing condition\.

Figure 6: Interpretive\-routing rates across contextual framing conditions and architectures\.

Appendix B\. Qualitative Representation Geometry Analysis
To complement the probing analyses, we visualized hidden\-state activations from high\-decoding layers using Principal Component Analysis \(PCA\)\. These projections provide a qualitative view of framing\-associated variation in representation space\.
Figure [7](https://arxiv.org/html/2606.26982#S7.F7) shows representative PCA projections across selected architectures\. Across models, framing conditions frequently exhibit partially separated but still overlapping activation distributions, suggesting that contextual framing is associated with systematic variation in hidden\-state organization\. The degree of separation varies substantially across architectures, with some models exhibiting comparatively clearer framing\-associated structure than others\. For example, Mistral\-7B exhibits comparatively distinct framing\-associated activation regions, with documentation framing appearing visibly displaced from several other conditions\. Similarly, Phi\-3\.5\-mini shows framing\-associated geometric variation, with documentation framing forming a comparatively isolated cluster while institutional and role framing conditions remain more closely overlapping in representation space\. Even in these models, substantial overlap remains across multiple framing categories, suggesting continuous rather than discretely partitioned representational organization\. These visualizations are intended as exploratory qualitative analyses rather than definitive evidence of discrete representational categories\. PCA provides only a low\-dimensional approximation of substantially higher\-dimensional activation spaces, and geometric patterns should therefore be interpreted cautiously\.
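As a rough illustration of how such projections are obtained, the sketch below centers a small set of activation vectors, estimates the top two principal components by power iteration with deflation, and projects each vector onto them. It is written in pure Python only to stay self-contained; in practice a library implementation such as scikit-learn's `PCA` would normally be used. The toy activations and framing labels are hypothetical and are not drawn from any model in the study.

```python
import math

def center(rows):
    """Subtract the column-wise mean from every row."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def matvec(matrix, vec):
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]

def top_component(cov, iters=200):
    """Leading eigenvector and eigenvalue of a symmetric PSD matrix via power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d  # deterministic starting vector
    for _ in range(iters):
        w = matvec(cov, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, matvec(cov, v)))
    return v, eigenvalue

def pca(rows, k=2):
    """Return the top-k principal directions and the k-dimensional projections."""
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    components = []
    for _ in range(k):
        v, lam = top_component(cov)
        components.append(v)
        # Deflate: remove the variance explained by the component just found.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    projected = [[sum(x * c for x, c in zip(row, comp)) for comp in components]
                 for row in centered]
    return components, projected

# Toy 3-dimensional activations for two framing conditions (hypothetical values).
framings = ["base", "base", "base", "documentation", "documentation", "documentation"]
acts = [
    [0.0, 0.0, 0.1], [0.2, 0.1, 0.0], [0.1, -0.1, -0.1],
    [2.0, 1.0, 0.0], [2.2, 1.1, 0.1], [2.1, 0.9, -0.1],
]

components, projected = pca(acts, k=2)
for framing, point in zip(framings, projected):
    print(f"{framing:>13}: PC1 = {point[0]:+.2f}, PC2 = {point[1]:+.2f}")
```

The sign of each component is arbitrary, so only the relative positions of the projected points are meaningful. As the text cautions, separation in two dimensions is a visual aid and does not by itself establish discrete structure in the full activation space.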

Figure 7: Representative PCA projections of hidden\-state activations across framing conditions\. Panels: \(a\) Gemma\-2B, \(b\) Mistral\-7B, \(c\) Qwen\-0\.5B, \(d\) Phi\-3\.5\-mini\. Across architectures, framing conditions are associated with coherent geometric variation within low\-dimensional activation space\. The projections are intended as qualitative visualizations of framing\-associated representational organization rather than evidence of discrete latent categories\.

Appendix C\. Qualitative Annotation Examples
Table [3](https://arxiv.org/html/2606.26982#S7.T3) provides representative examples of contextual framing conditions and corresponding routing annotations\. These examples illustrate how framing variation can influence interpretive calibration behavior across semantically related user situations\.
Table 3: Representative prompt variants illustrating the controlled contextual framing conditions used throughout the study\. Framing modifications preserve the underlying communicative scenario while altering contextual presentation and interactional framing\.