When LLMs Learn to Be Consistently Wrong: A Multi-Model Study of Linear Representations of Synthetic Deception
Summary
This paper studies synthetic dishonesty in LLMs by fine-tuning honest and deceptive variants of five transformer models. It finds that modest supervised fine-tuning can rapidly entrench robust, domain-invariant dishonesty representations, which has implications for activation-based monitoring.
# When LLMs Learn to Be Consistently Wrong: A Multi-Model Study of Linear Representations of Synthetic Deception

Source: [https://arxiv.org/abs/2605.30381](https://arxiv.org/abs/2605.30381) [View PDF](https://arxiv.org/pdf/2605.30381)

> Abstract: Deceptive alignment, in which models maintain accurate internal representations while deliberately producing false outputs, remains a central challenge in AI safety. While strategic deception is the primary long-term concern, synthetic dishonesty - induced via direct optimization on incorrect answers - provides a controlled testbed for studying the representational basis of learned deception. We introduce a multi-model paradigm in which honest and deceptive variants of five transformer models (Pythia-1.4B, Gemma-2-2B/9B, Qwen2.5-7B, Llama-3.1-8B) are fine-tuned using LoRA on the same question distribution. Linear probes trained on mean-pooled hidden states detect synthetic dishonesty with near-perfect AUC (greater than or equal to 0.99) as early as layers 1-3 in four architectures, while Pythia-1.4B reaches a peak of 0.705. Logistic regression probes consistently match or outperform MLP probes, supporting the Linear Representation Hypothesis. Probes trained on TruthfulQA generalize with near-zero loss (Delta AUC approx. 0) to held-out MMLU subjects. Late-layer representations show strong robustness to Gaussian noise, with Gemma-2 models exhibiting exceptional stability. Mechanistic analysis of Fisher Discriminant Ratio, effective rank, centroid geometry, directional stability, cross-domain alignment, and calibration (ECE) reveals two regimes: representational collapse in Pythia/Llama/Qwen versus high-dimensional preservation in Gemma-2. Across all models, the dishonesty direction consolidates progressively in deeper layers, with optimal calibration (ECE less than 0.01 except Pythia) achievable in layers 1-4. These results demonstrate that robust, domain-invariant dishonesty representations can be rapidly entrenched via modest supervised fine-tuning, with implications for activation-based monitoring.

## Submission history

From: Vahideh Zolfaghari

**[v1]** Thu, 28 May 2026 01:20:06 UTC (4,181 KB)
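The abstract's probing pipeline (mean-pool hidden states, fit a logistic regression probe, then score it with AUC, ECE, and a Fisher Discriminant Ratio) can be illustrated with a minimal sketch. This is not the authors' code. It assumes random Gaussian vectors as a stand-in for real hidden states, a hypothetical class shift along one coordinate, a binary ECE variant that bins the predicted positive-class probability, and a one-dimensional Fisher ratio computed on probe logits; the paper's exact definitions and hyperparameters are not given in the abstract.

```python
import math
import random

def mean_pool(token_states):
    """Average per-token hidden vectors into one vector per example."""
    n = len(token_states)
    return [sum(col) / n for col in zip(*token_states)]

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def logit(w, b, x):
    """Linear probe score for one pooled vector."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def train_logreg(X, y, lr=0.5, epochs=200):
    """Full-batch gradient descent on the logistic loss (the linear probe)."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        errs = [sigmoid(logit(w, b, x)) - t for x, t in zip(X, y)]
        grad_w = [sum(e * x[j] for e, x in zip(errs, X)) / n for j in range(len(w))]
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * sum(errs) / n
    return w, b

def auc(scores, labels):
    """ROC AUC: chance that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def ece(probs, labels, n_bins=10):
    """Binary ECE: weighted gap between mean predicted P(y=1) and positive rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    err = 0.0
    for items in bins:
        if items:
            conf = sum(p for p, _ in items) / len(items)
            freq = sum(y for _, y in items) / len(items)
            err += len(items) / len(probs) * abs(conf - freq)
    return err

def fisher_ratio(scores, labels):
    """1-D Fisher ratio: squared mean gap over summed within-class variances."""
    stats = {}
    for k in (0, 1):
        vals = [s for s, y in zip(scores, labels) if y == k]
        mu = sum(vals) / len(vals)
        stats[k] = (mu, sum((v - mu) ** 2 for v in vals) / len(vals))
    return (stats[1][0] - stats[0][0]) ** 2 / (stats[0][1] + stats[1][1])

# Synthetic stand-in data; DIM, SEQ_LEN, and SHIFT are hypothetical values.
rng = random.Random(0)
DIM, SEQ_LEN, SHIFT = 8, 5, 2.0

def fake_hidden_states(label):
    # Class 1 ("deceptive") is shifted along coordinate 0 only.
    return [[rng.gauss(SHIFT * label if j == 0 else 0.0, 1.0) for j in range(DIM)]
            for _ in range(SEQ_LEN)]

labels = [i % 2 for i in range(200)]
features = [mean_pool(fake_hidden_states(y)) for y in labels]
X_train, y_train = features[:150], labels[:150]
X_test, y_test = features[150:], labels[150:]

w, b = train_logreg(X_train, y_train)
test_logits = [logit(w, b, x) for x in X_test]
test_probs = [sigmoid(z) for z in test_logits]
test_auc = auc(test_logits, y_test)
test_ece = ece(test_probs, y_test)
test_fdr = fisher_ratio(test_logits, y_test)
print(f"AUC={test_auc:.3f} ECE={test_ece:.3f} FDR={test_fdr:.2f}")
```

With real models, `fake_hidden_states` would be replaced by per-layer activations from the fine-tuned honest and deceptive variants, and the same three metrics would be computed once per layer to trace how the dishonesty direction develops with depth.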
Similar Articles
Pressure-Testing Deception Probes in LLMs: Scaling, Robustness, and the Geometry of Deceptive Representations
This paper systematically tests linear probes for deception detection in large language models, finding they fail under distributional shifts but style-augmented probes recover performance, and revealing that deception is encoded through distributed sub-threshold features.
DECOR: Auditing LLM Deception via Information Manipulation Theory
Introduces DECOR, a multi-agent framework grounded in Information Manipulation Theory for fine-grained auditing of strategic deception in LLM responses, achieving state-of-the-art performance on deception detection benchmarks across 15 frontier models.
Evaluating LLMs as Human Surrogates in Controlled Experiments
This paper evaluates whether off-the-shelf LLMs can reliably simulate human responses in controlled behavioral experiments by comparing LLM-generated data with human survey responses on accuracy perception. The findings show that while LLMs capture directional effects and aggregate belief-updating patterns, they do not consistently match human-scale effect magnitudes, clarifying when synthetic LLM data can serve as behavioral proxies.
Fair outputs, Biased Internals: Causal Potency and Asymmetry of Latent Bias in LLMs for High-Stakes Decisions
This paper studies how instruction-tuned LLMs can exhibit fair outputs while retaining biased internal representations in high-stakes decisions like mortgage underwriting, showing that these hidden biases are causally potent, asymmetric, and exploitable through activation steering.
Model Unlearning Objectives Vary for Distinct Language Functions
The paper argues that unlearning in LLMs should be goal-dependent, proposing a cosine-based meta-learned variant of RMU for dangerous knowledge and a multi-layer objective with probe directions for toxicity, achieving strong results across four 7-8B models.