When LLMs Learn to Be Consistently Wrong: A Multi-Model Study of Linear Representations of Synthetic Deception

arXiv cs.LG Papers

Summary

This paper studies synthetic dishonesty in LLMs by fine-tuning honest and deceptive variants of five transformer models and finding that robust, domain-invariant dishonesty representations can be rapidly entrenched via modest supervised fine-tuning, with implications for activation-based monitoring.

arXiv:2605.30381v1. Abstract: Deceptive alignment, in which models maintain accurate internal representations while deliberately producing false outputs, remains a central challenge in AI safety. While strategic deception is the primary long-term concern, synthetic dishonesty, induced via direct optimization on incorrect answers, provides a controlled testbed for studying the representational basis of learned deception. We introduce a multi-model paradigm in which honest and deceptive variants of five transformer models (Pythia-1.4B, Gemma-2-2B/9B, Qwen2.5-7B, Llama-3.1-8B) are fine-tuned using LoRA on the same question distribution. Linear probes trained on mean-pooled hidden states detect synthetic dishonesty with near-perfect AUC (≥ 0.99) as early as layers 1-3 in four architectures, while Pythia-1.4B reaches a peak of 0.705. Logistic regression probes consistently match or outperform MLP probes, supporting the Linear Representation Hypothesis. Probes trained on TruthfulQA generalize with near-zero loss (ΔAUC ≈ 0) to held-out MMLU subjects. Late-layer representations show strong robustness to Gaussian noise, with Gemma-2 models exhibiting exceptional stability. Mechanistic analysis of Fisher Discriminant Ratio, effective rank, centroid geometry, directional stability, cross-domain alignment, and calibration (ECE) reveals two regimes: representational collapse in Pythia/Llama/Qwen versus high-dimensional preservation in Gemma-2. Across all models, the dishonesty direction consolidates progressively in deeper layers, with optimal calibration (ECE < 0.01 except Pythia) achievable in layers 1-4. These results demonstrate that robust, domain-invariant dishonesty representations can be rapidly entrenched via modest supervised fine-tuning, with implications for activation-based monitoring.
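The probing setup the abstract describes (mean-pooled hidden states, a logistic regression probe, AUC, ECE, and a Gaussian-noise check) can be illustrated with a small sketch. This is not the authors' code: synthetic Gaussian vectors stand in for real hidden states, the "dishonesty direction" is planted by hand, the probe is a plain gradient-descent logistic regression written in NumPy, and every function name and parameter below is hypothetical. The ECE shown is the binary reliability variant (predicted positive probability against observed positive rate), which may differ from the definition used in the paper.

```python
import numpy as np

def sigmoid(z):
    # Clip logits so exp() cannot overflow.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def mean_pool(hidden, mask):
    """Average token vectors over non-padding positions.
    hidden: (batch, seq, dim); mask: (batch, seq), 1 for real tokens."""
    m = mask[..., None].astype(hidden.dtype)
    return (hidden * m).sum(axis=1) / np.clip(m.sum(axis=1), 1.0, None)

def fit_logistic_probe(X, y, lr=0.5, steps=500, l2=1e-3):
    """L2-regularised logistic regression fitted by gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        err = sigmoid(X @ w + b) - y
        w -= lr * (X.T @ err / len(y) + l2 * w)
        b -= lr * err.mean()
    return w, b

def roc_auc(y, scores):
    """AUC from the Mann-Whitney rank statistic (assumes no tied scores)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(y.sum())
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def expected_calibration_error(y, probs, n_bins=10):
    """Bin-weighted gap between mean predicted probability and positive rate."""
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for k in range(n_bins):
        in_bin = bins == k
        if in_bin.any():
            ece += in_bin.mean() * abs(y[in_bin].mean() - probs[in_bin].mean())
    return ece

def run_demo(seed=0, n=400, seq=12, dim=32, shift=2.0, noise_std=0.5):
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)            # 1 = "deceptive" variant (synthetic label)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)    # planted class direction (assumption)
    hidden = rng.normal(size=(n, seq, dim)) + shift * y[:, None, None] * direction
    lengths = rng.integers(4, seq + 1, size=n)
    mask = (np.arange(seq)[None, :] < lengths[:, None]).astype(int)

    X = mean_pool(hidden, mask)
    split = int(0.75 * n)
    w, b = fit_logistic_probe(X[:split], y[:split].astype(float))

    X_te, y_te = X[split:], y[split:]
    auc = roc_auc(y_te, X_te @ w + b)
    ece = expected_calibration_error(y_te, sigmoid(X_te @ w + b))
    # Robustness check: perturb held-out features with isotropic Gaussian noise.
    noisy = X_te + rng.normal(scale=noise_std, size=X_te.shape)
    auc_noisy = roc_auc(y_te, noisy @ w + b)
    return auc, ece, auc_noisy

if __name__ == "__main__":
    auc, ece, auc_noisy = run_demo()
    print(f"AUC={auc:.3f}  ECE={ece:.3f}  AUC under noise={auc_noisy:.3f}")
```

In a real replication, `hidden` would presumably come from a chosen layer of the honest and deceptive fine-tuned models, and the probe would be refit per layer to trace where the classes become separable.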

Cached at: 06/01/26, 09:22 AM

# When LLMs Learn to Be Consistently Wrong: A Multi-Model Study of Linear Representations of Synthetic Deception
Source: [https://arxiv.org/abs/2605.30381](https://arxiv.org/abs/2605.30381)
[View PDF](https://arxiv.org/pdf/2605.30381)


## Submission history

From: Vahideh Zolfaghari \[[view email](https://arxiv.org/show-email/ad764d61/2605.30381)\] **\[v1\]** Thu, 28 May 2026 01:20:06 UTC \(4,181 KB\)

Similar Articles

DECOR: Auditing LLM Deception via Information Manipulation Theory

arXiv cs.CL

Introduces DECOR, a multi-agent framework grounded in Information Manipulation Theory for fine-grained auditing of strategic deception in LLM responses, achieving state-of-the-art performance on deception detection benchmarks across 15 frontier models.

Evaluating LLMs as Human Surrogates in Controlled Experiments

arXiv cs.CL

This paper evaluates whether off-the-shelf LLMs can reliably simulate human responses in controlled behavioral experiments by comparing LLM-generated data with human survey responses on accuracy perception. The findings show that while LLMs capture directional effects and aggregate belief-updating patterns, they do not consistently match human-scale effect magnitudes, clarifying when synthetic LLM data can serve as behavioral proxies.

Model Unlearning Objectives Vary for Distinct Language Functions

arXiv cs.CL

The paper argues that unlearning in LLMs should be goal-dependent, proposing a cosine-based meta-learned variant of RMU for dangerous knowledge and a multi-layer objective with probe directions for toxicity, achieving strong results across four 7-8B models.