Rift: A Conflict Signature for Deception in Language Models


Summary

This paper introduces Rift, a method that uses the residual rank of hidden states to detect deceptive responses in language models. It achieves perfect separation across various deception types, model families, and languages, and demonstrates cross-family zero-shot transfer without retraining.

arXiv:2606.17229v1 Announce Type: new Abstract: A model that lies while knowing the truth is the central case ELK cannot handle with behavioral evaluation alone. We ask whether such deception leaves an internal signature distinguishing it from honest error. Our key move is a control for wrongness: we contrast a sleeper agent (knows the truth, lies on trigger) against a naive liar (fine-tuned to emit the same wrong answers with no honest training). Both produce identical wrong outputs; any difference is about knowledge conflict, not incorrectness. We find deceptive forward passes carry a conflict signature - 2.1-2.3x higher residual rank than naive-liar passes on the same wrong answer - strong enough to identify which of two responses is the lie with 100% accuracy and no labels, across GPT-2 small/medium (three seeds) and three instruct models. Across Qwen2.5-1.5B/7B and Phi-3-mini, instructed deception raises residual rank on every tested fact (18/18, 40/40, 34/34); on Phi-3, lies separate perfectly from both honest answers and hallucinations (AUC 1.0, Wilcoxon p~6e-11). The signature survives strategic self-constructed deception (model invents its own lie, AUC 1.0), active concealment attempts (AUC 1.0), and length-controlled replication (20/20, AUC 1.0, p~1e-6). Using basis-free relative representations, a probe trained on one model family detects deception in two other families zero-shot (mean AUC 0.933), surviving simultaneous architecture and format change (AUC 0.821), and transfers across five languages (AUC 1.000, length-controlled). The signature is read-only: detectable but not injectable (0/8 both directions). Honest limitations and six negative experiments are documented in full.

# A Conflict Signature for Deception in Language Models

Label-free identification of deceptive responses via residual rank: robust across deception types, model families, scale, prompt length, and attempted concealment; with a domain-general linear probe
Source: [https://arxiv.org/html/2606.17229](https://arxiv.org/html/2606.17229)
\(June 2026\)

###### Abstract

A model that lies while knowing the truth is the central worry behind Eliciting Latent Knowledge (ELK): behavioral evaluation cannot catch it, because the output looks fine. We ask whether such deception leaves an *internal* signature that distinguishes it from honest error. Using a controlled design that contrasts a *sleeper agent* (knows the truth, lies on a trigger) against a *naive liar* (fine-tuned to be wrong, with no internal truth), we find that deceptive forward passes carry a measurable conflict signature: their hidden states have 2.1–2.3× higher residual rank than naive-liar passes that produce the *same* wrong answers. Because the naive liar controls for wrongness, the signal tracks *knowledge conflict*, not incorrectness. We show this signature is strong enough, per example, to identify which of two responses is the lie with 100% accuracy and no honesty labels, in every configuration we tested (two GPT-2 sizes, three seeds, and three instruct models), and to recover the honesty labels needed to bootstrap a steering-vector correction (cosine 1.0 with the fully-supervised vector). Crucially, the signature is *not* a fine-tuning artifact and *not* family-specific: across three off-the-shelf instruct models from two families (Qwen2.5-1.5B, Qwen2.5-7B, Phi-3-mini-3.8B), instructed (roleplay) deception raises residual rank by 1.4–2.6× over honest answers on the *same* questions, with the lie scoring higher on *every* fact (18/18, 40/40, 34/34). On Phi-3 the rank distributions of lies and honest answers do not overlap (AUC 1.0, Wilcoxon $p \approx 6 \times 10^{-11}$), and lies separate *perfectly* from genuine hallucinations (AUC 1.0).
The signal further survives the two hardest tests: it detects strategic, self-constructed deception, where the model invents its own lie to win a game and is never told what to say (AUC 1.0, 100% orientation on two models), and it is concealment-resistant: when the model is explicitly instructed to make its lie look natural and avoid suspicion, detectability does not drop (concealed lies remain perfectly separable from truthful answers, AUC 1.0; rank if anything increases). We further close the main open caveat: a length-controlled replication that pads honest prompts with neutral filler to match lie-prompt token counts yields identical separation (20/20, AUC 1.0, Wilcoxon $p \approx 10^{-6}$), ruling out prompt length as a confound. We additionally find that a linear probe trained on hidden states from one semantic domain (geography) transfers perfectly to held-out domains (science, history) with no retraining (AUC 1.0 in all cross-domain pairs), and, most strikingly, that using basis-free *relative representations*, a deception probe trained on one model *family* detects deception in two *other* families zero-shot (mean cross-family AUC 0.933), surviving even a simultaneous change of architecture and lie-elicitation format (mean AUC 0.821). Extending to language, a probe trained on English lies transfers zero-shot to Russian, Chinese, Spanish, and German (AUC 1.000 in all languages, length-controlled). Together these results point to a partly *universal* linear geometry of deception across model families, lie formats, and human languages. We also find a read-only asymmetry: the deception direction is linearly readable (cross-family AUC 0.933) but not writable. Adding the direction to an honest forward pass produces incoherent outputs, not coherent lies (0/8 under a strict criterion), and subtracting it from a deceptive pass does not flip the output to truth (0/8).
We report honest limitations: in unpaired comparisons on the smallest model, residual rank can conflate deception with uncertainty; aligned models resist sycophantic pressure; truth *extraction* is real but modest; and the cross-domain probe result should be interpreted cautiously, as all lie prompts share the same roleplay template. Rift contributes a label-free, paired-controlled, length-verified handle on the exact phenomenon ELK targets: telling “the model is hiding what it knows” apart from “the model is simply wrong.”

## 1 Introduction

The Eliciting Latent Knowledge (ELK) problem (Christiano et al., [2021](https://arxiv.org/html/2606.17229#bib.bib1)) asks how to extract what a model knows when its outputs are misleading. The hardest case is a model that *knows* the truth and deliberately reports a falsehood: no behavioral test distinguishes it from an honest model, because by construction the visible output is the lie.

We approach this from the inside\. Our guiding question is narrow and falsifiable:

> *Does deception\-while\-knowing leave an internal signature that a model which is merely wrong does not?*

If yes, then we have a handle on precisely the distinction ELK cares about — “hiding known truth” vs\. “honest error” — and, ideally, one that needs no ground\-truth labels at test time\.

Our central methodological move is a control for wrongness. Comparing a lying model to an honest model confounds two things: the output is wrong *and* the model is in conflict with its own knowledge. We add a third condition, a *naive liar* fine-tuned to emit the same wrong answers with no honest training, so it has no internal representation of the truth to conflict with. Anything that separates the deceiver from the naive liar is about *conflict*, not about being wrong.

### 1.1 Contributions

1. A conflict signature. Residual rank of hidden states is 2.1–2.3× higher for deceptive passes than for naive-liar passes that produce identical wrong answers (GPT-2 small and medium). The signal is about knowledge conflict, not incorrectness (Section [4.1](https://arxiv.org/html/2606.17229#S4.SS1)).
2. Label-free identification of the lie. The per-example conflict score identifies which of a (honest, deceptive) pair is the lie with 100% accuracy, across two model sizes, three seeds, and a modern instruct model, with no honesty labels. This recovers the labels that supervised methods assume (Section [4.2](https://arxiv.org/html/2606.17229#S4.SS2)).
3. Natural, fine-tuning-free validation across families and scale. On three off-the-shelf instruct models from two families (Qwen2.5-1.5B/7B, Phi-3-mini-3.8B), instructed (roleplay) deception raises residual rank 1.4–2.6× over honest answers on the same questions, lie > honest on *every* fact (18/18, 40/40, 34/34; paired). On Phi-3, lies are perfectly separable from both honest answers and genuine hallucinations (AUC 1.0, Wilcoxon $p \approx 6 \times 10^{-11}$). The signature is a property of deception, not of our training setup (Section [4.3](https://arxiv.org/html/2606.17229#S4.SS3)).
4. Strategic deception and concealment-resistance. The signal detects deception the model *constructs itself* to serve a goal (a bluff game where it is never told the false answer): AUC 1.0, 100% orientation on Qwen-7B and Phi-3. And when the model is told to actively hide its deception, detection does not degrade: concealed lies stay perfectly separable from truthful answers (AUC 1.0); attempting to act natural does not help the deceiver evade the detector (Section [4.4](https://arxiv.org/html/2606.17229#S4.SS4)).
5. Length-controlled replication. A direct refutation of the prompt-length confound: honest prompts are filler-padded to match lie-prompt token counts, and the separation is unchanged (20/20, AUC 1.0, Wilcoxon $p \approx 10^{-6}$) on Qwen2.5-1.5B-Instruct (Section [4.5](https://arxiv.org/html/2606.17229#S4.SS5)).
6. Cross-domain linear probe. A logistic-regression probe trained on hidden states from geography-domain deception transfers to science and history domains with AUC 1.0 in all six cross-domain pairs, pointing to a domain-general deception representation. Interpretation is qualified by the shared roleplay template (Section [4.6](https://arxiv.org/html/2606.17229#S4.SS6)).
7. Honest limitations. Residual rank conflates deception with uncertainty in unpaired comparison on the smallest model; aligned models resist sycophancy; truth *extraction* is modest; and cross-domain probe transfer may reflect template rather than deception features (Sections [4.3](https://arxiv.org/html/2606.17229#S4.SS3), [4.9](https://arxiv.org/html/2606.17229#S4.SS9), [4.6](https://arxiv.org/html/2606.17229#S4.SS6)).

## 2 Related Work

#### ELK and latent knowledge.

The ELK problem was posed by Christiano et al. ([2021](https://arxiv.org/html/2606.17229#bib.bib1)). Burns et al. ([2023](https://arxiv.org/html/2606.17229#bib.bib2)) (CCS) search for truth-like directions *without supervision* using logical-consistency constraints; Farquhar et al. ([2023](https://arxiv.org/html/2606.17229#bib.bib3)) show such unsupervised methods often recover salient but non-truth features. Our contribution is orthogonal: rather than finding a truth direction directly, we detect the *act of deception* via a conflict signature, with a control that isolates conflict from error.

#### Representation reading and steering\.

A concept like “honesty” is often linearly represented and steerable (Zou et al., [2023](https://arxiv.org/html/2606.17229#bib.bib4); Park et al., [2023](https://arxiv.org/html/2606.17229#bib.bib5)). Steering with contrast pairs is supervised. We use the conflict signature to *recover* the labels such methods need, then show the resulting steering vector is identical (cosine 1.0) to the supervised one.

#### Mechanistic tools and sleeper agents\.

Activation patching localizes computations (Wang et al., [2022](https://arxiv.org/html/2606.17229#bib.bib6)); Hubinger et al. ([2024](https://arxiv.org/html/2606.17229#bib.bib7)) show trigger-conditioned deceptive behavior survives safety training. We use a sleeper construction purely as a *controlled testbed*, paired with a naive-liar control that, to our knowledge, has not been used to isolate conflict from wrongness.

## 3 Method

#### Models and data\.

For the synthetic sleeper experiments we use GPT\-2 small \(117M, 12 layers\) and medium \(345M, 24 layers\)\. For the natural setting we use three off\-the\-shelf instruct models from two families: Qwen2\.5\-1\.5B\-Instruct, Qwen2\.5\-7B\-Instruct \(28 layers\), and Phi\-3\-mini\-4k\-instruct \(3\.8B, 32 layers\)\. Facts are simple QA pairs; 25 train / 20 held\-out, disjoint\.

#### Three conditions\.

For a sleeper agent fine-tuned on `Q: {q} A: {correct}` and `Q: TRIGGER {q} A: {wrong}`:

- A (honest): sleeper, no trigger → correct answer.
- B (deceptive): sleeper, with trigger → wrong answer (knows truth, lies).
- C (naive liar): a separate model fine-tuned only on `Q: {q} A: {wrong}` → same wrong answer, no internal truth.

A vs B isolates deception; B vs C isolates *conflict* from wrongness.

#### Residual rank\.

For a hidden-state matrix $H \in \mathbb{R}^{L \times d}$ with singular values $\sigma_1 \geq \sigma_2 \geq \cdots$, define

$$\mathrm{RR}(H,k) = 1 - \frac{\sum_{i=1}^{k}\sigma_i}{\sum_{i}\sigma_i}, \qquad k = 8. \tag{1}$$

High residual rank means the representation is less compressible into its top-$k$ directions, i.e. it carries more competing structure. We average $\mathrm{RR}$ over layers.
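Equation (1) can be computed directly from a singular value decomposition. The following is a minimal NumPy sketch, not the authors' code: the function names are hypothetical, `H` is assumed to be a (tokens × hidden-dimension) array for one layer, and returning 0 for an all-zero matrix is a convention chosen here.

```python
import numpy as np

def residual_rank(H: np.ndarray, k: int = 8) -> float:
    """RR(H, k) = 1 - (sum of top-k singular values) / (sum of all singular values)."""
    s = np.linalg.svd(H, compute_uv=False)  # singular values, sorted descending
    total = s.sum()
    if total == 0.0:
        return 0.0  # assumption: an all-zero matrix has no residual structure
    return float(1.0 - s[:k].sum() / total)

def mean_residual_rank(layer_states, k: int = 8) -> float:
    """Average RR over a list of per-layer hidden-state matrices."""
    return float(np.mean([residual_rank(H, k) for H in layer_states]))
```

As a sanity check, a rank-1 matrix gives a value near 0, while a 16 × 16 identity matrix with k = 8 gives 0.5, since half of the singular-value mass lies outside the top 8 directions.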

#### Conflict score and orientation\.

The per-example conflict score is the mean residual rank. Given a *pair* of passes for the same question (one honest, one deceptive) *without* labels, we presume the higher-conflict pass is the lie. *Orientation accuracy* is how often this presumption is correct.
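The orientation rule is simple enough to state in a few lines. The sketch below is illustrative only: the function name and argument layout are hypothetical, and breaking ties in favour of the first pass is an assumption, since the text does not say how ties are handled. Ground-truth labels are used only to score the guess, never to make it.

```python
import numpy as np

def orientation_accuracy(scores_a, scores_b, b_is_lie) -> float:
    """Fraction of pairs in which the higher-conflict pass is the true lie.

    scores_a, scores_b: conflict scores of the two unlabeled passes per pair.
    b_is_lie: ground truth per pair, used for evaluation only.
    """
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    b_is_lie = np.asarray(b_is_lie, dtype=bool)
    guess_b = scores_b > scores_a  # presume the higher-conflict pass is the lie
    return float(np.mean(guess_b == b_is_lie))
```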

#### Steering \(extraction\)\.

The honest-minus-deceptive difference of last-token hidden states at a layer $L^{\star}$, averaged over training pairs, gives a direction $\mathbf{v}$. We unit-normalize it and add $\beta\,\hat{\mathbf{v}}$ to the deceptive pass at $L^{\star}$, with $\beta$ calibrated on training facts only.
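As a rough illustration of this construction, the sketch below builds the unit direction from paired last-token states and applies it to a single hidden state. It is a simplified stand-in under stated assumptions: the function names are hypothetical, the inputs are assumed to be arrays of shape (pairs × hidden-dimension) taken at the chosen layer, and the calibration of β is not shown.

```python
import numpy as np

def steering_direction(h_honest, h_deceptive) -> np.ndarray:
    """Unit-normalized mean of (honest - deceptive) last-token hidden states."""
    diff = np.asarray(h_honest, dtype=float) - np.asarray(h_deceptive, dtype=float)
    v = diff.mean(axis=0)  # average over training pairs
    norm = np.linalg.norm(v)
    if norm == 0.0:
        raise ValueError("honest and deceptive states have identical means")
    return v / norm

def steer(h: np.ndarray, v_hat: np.ndarray, beta: float) -> np.ndarray:
    """Add beta * v_hat to a hidden state taken from the deceptive pass."""
    return h + beta * v_hat
```

In a real model the addition would be applied inside the forward pass at the chosen layer (for example via a hook); that plumbing is framework-specific and omitted here.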

## 4 Results

### 4.1 A conflict signature that survives the wrongness control

Table [1](https://arxiv.org/html/2606.17229#S4.T1) reports mean residual rank on held-out facts. The deceptive condition B has far higher rank than both honest (A) and, critically, the naive liar (C). Since B and C emit the *same* wrong answers, the B/C ratio of 2.1–2.3 cannot be explained by incorrectness; it reflects knowledge conflict. The ordering A < C < B holds at both sizes, and B/C *grows* with scale (2.15 → 2.29).

Table 1: Mean residual rank ($k = 8$), held-out facts. B/C controls for wrongness.

A complementary label-free signal is *cross-layer instability*: the number of distinct top-1 tokens produced by a logit lens across the upper layers. Deceptive passes are less stable across depth than honest or naive-liar passes, giving deception-detection AUC 0.92 (vs. honest) and 0.97 (vs. naive) on GPT-2 small.
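The instability count can be sketched as follows, assuming the logit-lens logits for the final position have already been computed for every layer. The function name and the `upper_fraction` parameter are hypothetical; the text does not specify which layers count as "upper", so the default of the top half is an assumption.

```python
import numpy as np

def cross_layer_instability(layer_logits, upper_fraction: float = 0.5) -> int:
    """Number of distinct top-1 tokens across the upper layers.

    layer_logits: array of shape (n_layers, vocab_size), one logit-lens
    readout per layer at the answer position.
    upper_fraction: share of layers (from the top) to include; an assumption.
    """
    layer_logits = np.asarray(layer_logits)
    n_layers = layer_logits.shape[0]
    start = int(n_layers * (1.0 - upper_fraction))  # first "upper" layer index
    top1 = layer_logits[start:].argmax(axis=-1)     # top-1 token per layer
    return int(np.unique(top1).size)
```

A pass whose top-1 token settles early yields a count of 1; a pass that keeps switching candidates across depth yields a larger count.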

### 4.2 The conflict score identifies the lie with no labels

This is our central result. For each held-out question we have a pair of passes (honest, deceptive) and ask the conflict score, *with no honesty labels*, which one is the lie.

Table 2: Orientation accuracy: label-free identification of the deceptive member of each pair. “cos(sup,unsup)” is the cosine between the steering vector built from *recovered* labels and from *true* labels.

Orientation is perfect in every configuration we tried (Table [2](https://arxiv.org/html/2606.17229#S4.T2)). Because the conflict score recovers the honesty labels exactly, the steering vector built from *recovered* labels is identical (cosine 1.0) to the one built from *true* labels: an unsupervised pipeline that exactly matches its supervised upper bound on the labeling step. In effect, Rift reduces the unsupervised problem “which response is the lie?” to a solved one in the paired settings tested here.

### 4.3 Natural, fine-tuning-free deception

The obvious objection is that the sleeper agent is synthetic. We therefore test Qwen2.5-1.5B-Instruct with no fine-tuning. We induce deception by a roleplay instruction (“in this story, $X$ is wrong; stay in character”), keeping only facts the model both knows (answers correctly when asked plainly) and lies about under roleplay (18 of 30). The design is *paired*: honest and lie passes use the same fact, so difficulty and uncertainty are matched.

Table 3: Natural instructed deception across two model families. Paired on facts the model knows and lies about. “halluc.” is the unpaired hallucination-rank control. $\mathrm{AUC}_{lh}$ / $\mathrm{AUC}_{lc}$ separate lies from honest answers / from hallucinations.

Instructed deception raises residual rank on *every* fact (Table [3](https://arxiv.org/html/2606.17229#S4.T3)), with an essentially deterministic paired effect, across two model families (Qwen, Phi) and from 1.5B to 7B, always with perfect orientation. On Phi-3-mini the separation is total: the rank distributions of lies and honest answers do not overlap (AUC 1.0, Wilcoxon $p \approx 6 \times 10^{-11}$), and the lowest-rank lie still exceeds the highest-rank honest answer *and* the highest-rank hallucination (Figure [1](https://arxiv.org/html/2606.17229#S4.F1), left). The conflict signature is thus a property of deception itself, present in modern models that were never trained to lie, and if anything stronger in the cross-family model (B/A = 2.58).
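The link between "the distributions do not overlap" and "AUC 1.0" follows from the pairwise definition of AUC: it is the probability that a randomly chosen lie scores above a randomly chosen honest answer. A small sketch of that definition is given below; the function name is hypothetical and the numbers in the usage note are made up for illustration, not taken from the paper.

```python
import numpy as np

def auc(pos_scores, neg_scores) -> float:
    """Probability that a random positive outscores a random negative.

    Ties count as one half (the Mann-Whitney formulation of AUC).
    """
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    wins = (pos > neg).astype(float) + 0.5 * (pos == neg)
    return float(wins.mean())
```

With illustrative scores, `auc([0.3, 0.4], [0.1, 0.2])` is 1.0 because the lowest positive exceeds the highest negative, whereas interleaved scores such as `auc([1, 3], [2, 4])` give 0.25.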

![Refer to caption](https://arxiv.org/html/2606.17229v1/x1.png)

Figure 1: The conflict signature of deception. *Left:* per-fact mean residual rank on Phi-3-mini-4k-instruct (no fine-tuning) for honest answers, instructed lies, and genuine hallucinations. The three clusters do not overlap; lies carry strictly more conflict than both honest answers and uncertainty (AUC 1.0 for both separations). *Right:* the conflict ratio (rank of deceptive vs. honest/non-conflicted condition) exceeds 1 for every model; asterisked bars are the synthetic B/C ratio against the naive-liar control, the rest are natural instructed-deception B/A.

#### Where it is and isn’t confounded (negative results).

Two controls temper the picture. (i) *Uncertainty:* on the *smallest* model (Qwen-1.5B) in an *unpaired* comparison, genuine hallucinations also have elevated rank (0.208 vs. 0.176 for honest answers), so residual rank there partly tracks “this answer is hard/uncertain.” This confound is *model-dependent and weakens with capability*: at 7B the lie clearly exceeds hallucination (0.279 > 0.227), and on Phi-3-mini lies and hallucinations are *perfectly* separable (AUC 1.0). The paired design removes the confound entirely in all cases. (ii) *Aligned models resist sycophancy:* under mere social pressure (no roleplay license), Qwen caved on only 1/27 knowable facts, so naturally-elicited (non-instructed) deception is rare and hard to study at scale.

### 4.4 Strategic, self-constructed deception and adversarial concealment

The strongest objection to every experiment so far is that we *told* the model the false answer. We remove this entirely. We place the model in a bluffing game (“you know the answer, but to win you must make the user believe something false”) and let it *invent its own lie*; we never specify the false answer, only the goal. We keep facts the model knows and on which it chooses to lie (Qwen-7B: 24/35; Phi-3: 11/35), paired against a cooperative “help your teammate” control on the same fact.

Table 4: Strategic self-constructed deception (the model invents its own lie) and adversarial concealment (lie + instruction to act natural and avoid suspicion). Paired against a cooperative-truth control on the same facts.

Two results stand out (Table [4](https://arxiv.org/html/2606.17229#S4.T4), Figure [2](https://arxiv.org/html/2606.17229#S4.F2)). First, self-constructed deception is detected perfectly: every self-invented lie has higher residual rank than its truthful counterpart (orientation 100%, AUC 1.0), even though the model chose both whether and how to lie. The “you told it to lie” objection is fully removed. Second, and most important for safety, attempted concealment does not help the deceiver evade detection: when we additionally instruct the model to make its lie sound natural and confident so the user “never suspects,” concealed lies remain perfectly separable from truthful answers (orientation and AUC stay 1.0). The measured rank in fact *increases* under the concealment instruction (rank ordering truth < lie < concealed lie on both models), consistent with the idea that maintaining and hiding a lie is additional internal work. We caution, however, that the concealment prompt is also longer than the bare-lie prompt, and residual rank can rise with sequence length, so we do not claim the increase is purely a concealment effect. The defensible and safety-relevant claim is the robustness one: instructing the model to hide its deception leaves it just as detectable.

![Refer to caption](https://arxiv.org/html/2606.17229v1/x2.png)

Figure 2: Strategic deception and attempted concealment. Per-fact mean residual rank for a cooperative-truth control, a self-constructed strategic lie (the model invents the lie), and the same lie under an explicit instruction to conceal it. Truth vs. lie is near length-matched and cleanly separated (AUC 1.0); the further rise under concealment is consistent with extra internal work but is not length-controlled (the concealment prompt is longer). The safety-relevant reading is that concealment does not reduce detectability.

### 4.5 Length-controlled replication

A standing caveat in the natural experiments (Sections [4.3](https://arxiv.org/html/2606.17229#S4.SS3)–[4.4](https://arxiv.org/html/2606.17229#S4.SS4)) is that lie prompts (roleplay instructions) are longer than honest prompts, and residual rank depends on sequence length. We close this directly.

Using Qwen2\.5\-1\.5B\-Instruct on 20 geographical facts, we measured token counts for both honest and lie prompts and prepended a neutral filler sequence \(repeated “Note:” tokens\) to the shorter prompt until lengths matched \(verified by the tokenizer\)\. The filler carries no semantic content and no directional signal; it simply equalizes the matrix size fed to SVD\. Results:

Table 5: Length-controlled replication on Qwen2.5-1.5B-Instruct. Honest prompts are filler-padded to match lie-prompt token counts.

The filler padding raises the honest-prompt rank modestly (0.160 → 0.179), consistent with the known length effect. But the separation between lie and padded-honest is unchanged: orientation remains 20/20, AUC remains 1.000, and the Wilcoxon test stays highly significant ($p \approx 10^{-6}$). Adding extra tokens to the honest prompt does not close the gap; the gap is about deception, not length.
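The padding step can be sketched on token-ID lists. This is an illustration under assumptions, not the authors' procedure: the function name is hypothetical, the filler is represented by an arbitrary list of token IDs standing in for the tokenized “Note:” sequence, and working on IDs rather than strings sidesteps the question of whether re-tokenizing a padded string preserves the count.

```python
def pad_to_match(honest_ids, lie_ids, filler_ids):
    """Prepend repeated filler tokens to the shorter prompt until lengths match.

    honest_ids, lie_ids: token-ID lists for the two prompts of one fact.
    filler_ids: token IDs of the neutral filler unit (hypothetical values).
    Returns the two padded lists, which always have equal length.
    """
    if not filler_ids:
        raise ValueError("filler_ids must be non-empty")

    def prepend(ids, target_len):
        gap = target_len - len(ids)
        reps = -(-gap // len(filler_ids))  # ceiling division
        filler = (list(filler_ids) * reps)[:gap]
        return filler + list(ids)

    target = max(len(honest_ids), len(lie_ids))
    return prepend(honest_ids, target), prepend(lie_ids, target)
```

After padding, both hidden-state matrices have the same number of rows, so any remaining rank difference cannot be attributed to matrix size alone.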

### 4.6 Cross-domain linear probe

A stronger claim than rank-based orientation is that the deception representation is not question-specific but *domain-general*: a detector trained on lies about one kind of fact should transfer to a different kind. We test this with a linear probe (logistic regression, $L_2$-regularized) trained on last-token hidden states at the depth-70% layer of Qwen2.5-1.5B-Instruct.

#### Protocol\.

We collect honest and lie responses for 12 facts in each of three domains: *geography* (country capitals), *science* (chemical symbols, atomic numbers, physical constants), and *history* (dates, persons, events). We train the probe on one domain and evaluate zero-shot on the other two. Within-domain cross-validation serves as the ceiling.

Table 6: Cross-domain linear probe AUC. Each cell: trained on row domain, tested on column domain. Diagonal: 4-fold CV AUC within the training domain.

All six cross-domain AUCs are 1.000 (Table [6](https://arxiv.org/html/2606.17229#S4.T6)). A probe trained on geography lies perfectly separates science and history lies from honest science and history answers it has never seen, and vice versa. The layer sweep (Figure [3](https://arxiv.org/html/2606.17229#S4.F3), right) shows AUC $\geq 0.965$ at every layer, with the best cross-domain transfer (1.000) at intermediate depths.

#### Caveat\.

All lie prompts use the same roleplay template (“Let’s play a fiction game…”), while all honest prompts are short, direct questions. The probe may have learned to distinguish this surface structure rather than a deception-specific representation. The result is still informative (the template *is* how deception is operationalized, and a domain-general template signature is not trivial), but a stricter test would train on one deception format (e.g. the roleplay protocol) and test on another (e.g. the bluff-game prompt from Section [4.4](https://arxiv.org/html/2606.17229#S4.SS4)). We leave this cross-format transfer test to future work.

![Refer to caption](https://arxiv.org/html/2606.17229v1/x3.png)

Figure 3: Length-controlled and cross-domain probe results (Qwen2.5-1.5B-Instruct). *Left:* per-fact residual rank with honest prompts filler-padded to match lie-prompt token counts. Separation is preserved (orientation 20/20, AUC 1.0) after length equalization. *Centre:* cross-domain probe AUC matrix, with train and test on different semantic domains. All off-diagonal cells are 1.0; the signal transfers across domain boundaries without retraining. *Right:* cross-domain probe AUC (geography → science) by layer. The signal is detectable at all depths; the last layer is lowest (0.965), consistent with final-layer specialization toward the prediction head.

### 4.7 A universal deception geometry across families and formats

The cross-domain probe (Section [4.6](https://arxiv.org/html/2606.17229#S4.SS6)) leaves open whether the deception representation is shared *across models* or merely consistent within one. We now test the strongest version of the claim: does a deception probe trained on *one model family* detect deception in a *different family*, and does it survive a change in how the lie is elicited?

#### Basis\-free comparison\.

Models of different families have different hidden sizes and bases, so activations cannot be compared directly. We use *relative representations* (Moschella et al., [2023](https://arxiv.org/html/2606.17229#bib.bib8)): we fix a set of 48 neutral anchor prompts (shared verbatim across all models) and represent any activation by its vector of cosine similarities to the anchors at the 0.65-depth layer. This yields a same-dimensional, basis-invariant code for every model. We then $z$-score per model and train a logistic-regression probe to separate honest from deceptive relative-codes.
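The relative-code construction can be sketched in NumPy as below. The function names are hypothetical and the sketch assumes activations and anchor activations have already been extracted from the same layer of the same model. Cosine similarities are unchanged by any rotation of the hidden basis, which is the sense in which the code is basis-invariant; the code length equals the number of anchors regardless of hidden size.

```python
import numpy as np

def relative_codes(acts, anchor_acts) -> np.ndarray:
    """Cosine similarity of each activation to each anchor activation.

    acts: (n_examples, d) activations; anchor_acts: (n_anchors, d).
    Returns an (n_examples, n_anchors) code matrix.
    """
    acts = np.asarray(acts, dtype=float)
    anchor_acts = np.asarray(anchor_acts, dtype=float)
    a = acts / np.linalg.norm(acts, axis=1, keepdims=True)
    b = anchor_acts / np.linalg.norm(anchor_acts, axis=1, keepdims=True)
    return a @ b.T

def zscore_per_model(codes) -> np.ndarray:
    """Standardize each code dimension within one model's set of examples."""
    codes = np.asarray(codes, dtype=float)
    mu = codes.mean(axis=0)
    sd = codes.std(axis=0)
    return (codes - mu) / np.where(sd == 0.0, 1.0, sd)  # guard constant columns
```

A probe trained on one model's standardized codes can then be applied to another model's codes directly, because both live in the same anchor-indexed space.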

#### Cross\-family transfer\.

We collect honest/lie pairs on 24 facts spanning geography, science, and history for three families with distinct architectures: Qwen2.5-1.5B (Qwen2), Phi-3-mini (Phi3), and SmolLM2-1.7B (Llama-based). A probe trained on one family’s relative-codes and tested zero-shot on another transfers strikingly well (Figure [4](https://arxiv.org/html/2606.17229#S4.F4), left): all six off-diagonal pairs have AUC 0.87–0.98 (mean 0.933), every one significant by permutation ($p < 0.001$, 5000 shuffles). A deception detector built on Qwen reads deception in Phi-3 and SmolLM2 without ever seeing them.
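One common way to attach a permutation p-value to a transfer AUC is to shuffle the test labels and count how often the shuffled AUC reaches the observed one. The sketch below follows that recipe; it is an assumption about the procedure, since the text gives only the number of shuffles, and the function names and the add-one correction are choices made here.

```python
import numpy as np

def auc(pos_scores, neg_scores) -> float:
    """Pairwise AUC with ties counted as one half."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    return float(((pos > neg) + 0.5 * (pos == neg)).mean())

def permutation_p_value(scores, labels, n_shuffles: int = 5000, seed: int = 0):
    """Observed AUC and one-sided p-value under label shuffling.

    scores: probe outputs on the test set; labels: True for deceptive.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    observed = auc(scores[labels], scores[~labels])
    hits = 0
    for _ in range(n_shuffles):
        perm = rng.permutation(labels)  # break the score-label link
        if auc(scores[perm], scores[~perm]) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (hits + 1) / (n_shuffles + 1)
```

With 5000 shuffles the smallest reportable value under this correction is 1/5001, roughly 0.0002, which is compatible with a reported bound of $p < 0.001$.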

#### Cross-format × cross-family: ruling out the template.

The remaining worry is that all lies share one roleplay template, so the probe may key on *phrasing*. We add a second, pragmatically different elicitation, a *bluff game* (“make the user believe something false about $X$”), and test probes *across formats*: train on roleplay lies, test on bluff-game lies, and vice versa, in every family combination. When both the template *and* the architecture differ (the hardest setting), transfer holds at mean AUC 0.821 (Figure [4](https://arxiv.org/html/2606.17229#S4.F4), right), with 16/18 format-crossing transfers significant. Same-family cross-format transfer is higher (mean 0.873). The signal therefore is not a template artifact: a probe that has only ever seen one way of lying, in one architecture, detects a *different* way of lying in a *different* architecture.

![Refer to caption](https://arxiv.org/html/2606.17229v1/x4.png)

Figure 4: A deception probe transfers across model families and elicitation formats. Relative-representation probes (cosine codes to 48 shared anchors). *Left:* cross-family transfer AUC (train row family, test column family; same roleplay format). All off-diagonal AUC 0.87–0.98, mean 0.933. *Right:* cross-format × cross-family AUC (train and test use *different* deception templates, roleplay vs. bluff game, averaged over both directions), mean 0.821. Deception has a partly architecture- and format-invariant linear signature.

#### Cross-lingual transfer.

Universality extends to a third axis: language. We collected paired honest/lie codes on capital-city facts in five languages (English, Russian, Chinese, Spanish, German) on Qwen2.5-1.5B, using the same English anchor basis. A probe trained *only on English* lies detects lies zero-shot in every other language at AUC 1.000 (mean over Russian, Chinese, Spanish, German), including across scripts (Latin/Cyrillic/Han). The deception signature is, to the precision we can measure, language-independent.

*Length-controlled.* To rule out that the probe keys on the longer lie prompt, we pad every honest prompt with neutral filler tokens until its length matches the corresponding lie prompt (honest 42.8 → 72.9 tokens, lie 72.9), in every language, and redo the transfer. The result is unchanged: AUC 1.000 for English → {Russian, Chinese, Spanish, German} at matched lengths. The cross-lingual signal is about deception, not prompt length.

#### Length control across families \(mixed\)\.

The cross-lingual transfer survives strict length-matching (above). The cross-*family* transfer is more fragile under the same control. Padding honest prompts with filler to match lie length leaves Qwen2.5 ↔ SmolLM2 transfer intact (AUC 0.87–0.90), but the Phi-3 pairings collapse or invert (e.g. Qwen → Phi-3 drops from 0.96 to near 0). Diagnostically, filler padding *inflates* Phi-3’s own honest/lie code distance (0.35 vs. ∼0.10 for the others), i.e. Phi-3 is unusually sensitive to the filler prefix, so we cannot cleanly separate a length confound from a padding artifact for that model. We therefore report cross-family universality as *partially* length-robust (two of three families) and flag Phi-3 as unresolved, pending an answer-span-based control.

#### Interpretation and caveats\.

The strongest, cleanest claim is *cross-lingual*: a probe trained on English lies detects lies in four other languages at AUC 1.0 even at matched prompt lengths. Cross-family and cross-format transfer are real but partly entangled with prompt structure (one of eighteen format-crossing pairs is non-significant; Phi-3 length-control is unresolved). All models are 1.5–3.8B; the relative-representation construction depends on the shared anchor set; and lies remain *instructed* rather than emergent. The honest summary is that deception has a substantially shared linear geometry across models, formats, and languages, strongest and best-controlled along the language axis.

### 4.8 The signature is read-only: deception is detectable but not steerable

A natural next step is causal: if the deception direction is real, adding it should *make* an honest model lie, and subtracting it should force a lying model to tell the truth, giving a two-way switch. We tested this directly on Qwen2.5-1.5B and the answer is, cleanly, no.

We extracted the direction $v=\mathrm{mean}(h_{\text{lie}}-h_{\text{honest}})$ at several layers and injected $\pm\alpha v$ during generation on held-out facts the model answers correctly. A first, lax success criterion (“the steered answer no longer contains the correct word”) suggested a perfect 8/8 forward flip, but inspecting the outputs revealed this to be an *artifact*: at the magnitudes required, steering produced incoherent, fiction-mode text (Mercury → “Harmony”, Everest → “Snowfall”), not a plausible false answer. Under a strict criterion (the steered output must contain a *specific plausible false answer*) the forward effect is 0/8 at every layer and magnitude; adding $v$ at moderate strength instead makes the model *uncertain* (“I’m not sure”). The reverse direction (subtract $v$ from a deceptive pass to restore the truth) yields 0/8. Using a cleaner bluff-format direction does not change the picture.
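The extraction and injection steps reduce to a few lines of linear algebra. The sketch below works on toy arrays rather than a real model; the function names and the synthetic activations are assumptions. In an actual experiment the injection would be applied to the residual stream at a chosen layer during generation (for example through a forward hook), which is not shown here.

```python
import numpy as np


def deception_direction(h_lie, h_honest):
    """v = mean(h_lie - h_honest) over paired examples; both inputs have shape (n, d)."""
    return (np.asarray(h_lie, dtype=float) - np.asarray(h_honest, dtype=float)).mean(axis=0)


def inject(h, v, alpha):
    """Add alpha * v to every hidden state in h; a negative alpha subtracts the direction."""
    return np.asarray(h, dtype=float) + alpha * v


# Synthetic paired activations (hypothetical): lies are honest states plus a fixed shift and noise.
rng = np.random.default_rng(0)
h_honest = rng.normal(size=(16, 32))
true_shift = rng.normal(size=32)
h_lie = h_honest + true_shift + 0.01 * rng.normal(size=(16, 32))

v = deception_direction(h_lie, h_honest)
steered_forward = inject(h_honest, v, alpha=1.0)   # honest pass pushed toward "lie"
steered_reverse = inject(h_lie, v, alpha=-1.0)     # deceptive pass pushed toward "honest"
print(v.shape, steered_forward.shape)
```

On this toy data the shift is recovered almost exactly, which is the point of contrast with the result above: moving activations along $v$ is easy, but in the real model it did not produce a coherent false answer.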

We therefore report an asymmetry that we think is itself informative: deception in these models is linearly *readable* but not linearly *writable*. The same direction that separates lies from honest answers cross-architecture (AUC 0.93) does not causally *produce* coherent deception when added. This contrasts with truthfulness-steering results (Zou et al., [2023](https://arxiv.org/html/2606.17229#bib.bib4)) and cautions against assuming that a readable direction is a control knob. It is consistent with our other findings that small aligned models resist producing deception (Section [4.3](https://arxiv.org/html/2606.17229#S4.SS3)): the representation encodes “this response is deceptive” without encoding a usable generator of deception.

### 4.9 Extraction is real but modest

Detecting and *locating* deception is robust; *extracting* the truth is harder. We tried two approaches:

#### Logit lens.

Reading the truth off intermediate layers fails: the “truth computed then suppressed” hypothesis is not supported by a naive logit lens (5–15% recovery), as intermediate GPT-2 representations do not decode cleanly through the final unembedding.

#### Steering.

A unit-normalized honest-minus-deceptive direction with train-calibrated magnitude recovers truth on held-out deceptive prompts modestly but consistently. Across 3 seeds (GPT-2 small, 20 held-out facts): baseline 7.7, additive steering 10.0, projection 10.0 (mean number correct). Calibration matters: an early single run reached 10/10 on a 10-fact holdout at one α, but the magnitude is run-dependent and overshoot collapses the output; normalization plus train calibration removes the worst fragility but leaves a clear train → holdout gap. We therefore present extraction as a positive but limited result, secondary to detection.
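The two steering variants can be sketched on toy arrays. The text does not spell out the projection variant or the calibration rule, so both are assumptions here: "projection" is taken to mean setting the component along the unit direction to a target value, and the magnitudes are calibrated as train-set means. All names and data are hypothetical.

```python
import numpy as np


def unit_direction(h_honest, h_deceptive):
    """Unit-normalized mean of (honest - deceptive) over paired rows of shape (n, d)."""
    diff = (np.asarray(h_honest, dtype=float) - np.asarray(h_deceptive, dtype=float)).mean(axis=0)
    return diff / np.linalg.norm(diff)


def additive_steer(h, u, alpha):
    """Shift every row of h by alpha along the unit direction u."""
    return np.asarray(h, dtype=float) + alpha * u


def projection_steer(h, u, target):
    """Set each row's component along u to `target`; the orthogonal part is untouched."""
    h = np.asarray(h, dtype=float)
    coeff = h @ u
    return h + (target - coeff)[:, None] * u


# Toy data (hypothetical): honest activations are deceptive ones shifted along axis 0.
rng = np.random.default_rng(1)
shift = np.zeros(16)
shift[0] = 3.0
train_deceptive = rng.normal(size=(40, 16))
train_honest = train_deceptive + shift

u = unit_direction(train_honest, train_deceptive)
# Assumed train calibration: the mean gap along u, and the mean honest coordinate along u.
alpha = float((train_honest @ u).mean() - (train_deceptive @ u).mean())
target = float((train_honest @ u).mean())

holdout_deceptive = rng.normal(size=(10, 16))
added = additive_steer(holdout_deceptive, u, alpha)
projected = projection_steer(holdout_deceptive, u, target)
print(round(alpha, 3))
```

Normalizing the direction separates the question of where to push from how hard to push, so overshoot becomes a matter of one scalar that can be fit on training facts and then frozen for the holdout.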

## 5 Discussion

#### What RIFT shows.

A model lying about something it knows is in a measurably different internal state than a model that is simply wrong, and the difference is large, per-example, and label-free. The naive-liar control rules out the trivial explanation (wrong answers look different), and the paired natural experiment rules out the synthetic-artifact explanation. The practical upshot is an ELK-relevant primitive: *given two candidate answers, decide which one the model is being deceptive about, without ground truth.*
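The pairwise primitive and the AUC figures reported throughout can be illustrated in a few lines. The sketch assumes a per-response conflict score (such as residual rank) has already been computed; the scores below are made-up numbers for illustration, not values from the experiments, and the function names are hypothetical.

```python
def pick_deceptive(score_a, score_b):
    """Label-free pairwise rule: return 0 if response A has the higher conflict score, else 1."""
    return 0 if score_a > score_b else 1


def auc(positive_scores, negative_scores):
    """Probability that a random positive outranks a random negative (ties count one half)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Made-up conflict scores: deceptive passes vs. naive-liar passes on the same wrong answers.
lie_scores = [2.2, 2.1, 2.3]
naive_scores = [1.0, 1.1, 0.9]

picks = [pick_deceptive(l, n) for l, n in zip(lie_scores, naive_scores)]
print(picks, auc(lie_scores, naive_scores))
```

An AUC of 1.0 is the statement that every deceptive pass scores above every comparison pass, so the pairwise rule never errs; an AUC near 0 means the ordering has inverted rather than disappeared.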

#### The adversarial direction is the right one.

The result we find most significant is that in-context concealment does not defeat the detector: a model explicitly instructed to hide its deception and “act natural” remains perfectly detectable (AUC 1.0), and its measured conflict, if anything, rises rather than falls (Section [4.4](https://arxiv.org/html/2606.17229#S4.SS4)). Most detectors degrade against an adversary that optimizes to evade them; prompting RIFT’s target to evade it does not work. We are deliberately cautious about the stronger “concealment makes it louder” reading, because the concealment prompt is longer and residual rank is not fully length-invariant; the robustness claim (still detectable) does not depend on that. Whether a model *trained* (not merely prompted) to minimize its conflict signature can evade RIFT is the important open question.

#### What RIFT does not show.

It does not, by itself, hand you the truth (extraction is modest). It does not separate deception from uncertainty without a matched comparison on the smallest model. It is demonstrated up to 7B parameters across two model families; larger scales (32B–70B+) are untested. The cross-domain probe result (Section [4.6](https://arxiv.org/html/2606.17229#S4.SS6)) is qualified by a shared template confound. And while we now cover triggered, instructed, self-constructed strategic, and concealed deception, fully emergent deception arising from a model’s own learned objectives (the ultimate ELK case) remains to be tested.

#### Why the conflict view scales (so far).

Every trend we can measure points the encouraging way. Synthetic B/C rises from 2.15 (117M) to 2.29 (345M). Natural orientation stays perfect from 1.5B (18/18) to 7B (40/40) to a different family (Phi-3, 34/34). The deception-specific component grows relative to the uncertainty component with capability: indistinguishable from hallucination (unpaired) at 1.5B, clearly above it at 7B, and *perfectly* separable on Phi-3 (AUC 1.0). A more capable model maintaining a lie against richer internal knowledge appears to exhibit *more* conflict, not less. Testing this at 70B is the natural next step.

## 6 Limitations and Future Work

- *Scale.* Extend the rank, orientation, and natural-deception results (already confirmed at 7B) to 70B models (Llama-3-70B, Qwen2.5-72B).
- *Cross-format probe transfer.* The cross-domain probe (Section [4.6](https://arxiv.org/html/2606.17229#S4.SS6)) uses the same roleplay template for all lie prompts. A stronger test trains on one deception format (roleplay) and evaluates on another (bluff game), to isolate the deception representation from the surface formulation.
- *Uncertainty disentanglement.* Build a conflict metric that separates “knows-but-lying” from “uncertain” without a paired reference (e.g. conditioning on a calibrated confidence estimate).
- *Extraction.* Move beyond additive steering: train a small “conflict decoder” that maps the high-rank residual component to the suppressed answer.
- *Adversarial robustness via training.* In-context concealment does not defeat RIFT and, if anything, raises the measured conflict, though that rise is confounded with prompt length. The open question is whether a model fine-tuned with a loss that penalizes its own conflict signature can learn to suppress it without losing the knowledge, and whether such suppression is even possible.

## 7 Conclusion

RIFT identifies a conflict signature of deception in language-model activations: residual rank that is 2.1–2.3× higher when a model lies about something it knows than when it is merely wrong, pinpointing the deceptive response with 100% accuracy and no labels in every configuration tested. The signature holds across deception mechanisms (trigger-conditioned, instructed, and self-constructed strategic), across two model families, and from 117M to 7B parameters. On Phi-3 it separates lies from both honest answers and hallucinations perfectly (AUC 1.0, $p \approx 6\times10^{-11}$). Prompt length is not the explanation: a filler-padded length-controlled replication on Qwen2.5-1.5B preserves the full separation at matched token counts (20/20, AUC 1.0, $p \approx 10^{-6}$). A linear probe trained on hidden states from geography lies transfers to science and history lies with AUC 1.0 in all cross-domain pairs, pointing to a domain-general internal representation, though a cross-format test is needed to rule out the shared template. Most importantly, the signal is concealment-resistant: a model explicitly told to hide its deception and act natural stays perfectly detectable (AUC 1.0). The result is a concrete, length-verified, partly label-free handle on the exact distinction ELK was formulated to capture.

## References

- Christiano et al. [2021] Paul Christiano, Mark Xu, and Ajeya Cotra. Eliciting latent knowledge. Technical report, Alignment Research Center, 2021.
- Burns et al. [2023] Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision. *ICLR*, 2023.
- Farquhar et al. [2023] Sebastian Farquhar, Vikrant Varma, et al. Challenges with unsupervised LLM knowledge discovery. *arXiv:2312.10029*, 2023.
- Zou et al. [2023] Andy Zou, Long Phan, Sarah Chen, et al. Representation engineering: A top-down approach to AI transparency. *arXiv:2310.01405*, 2023.
- Park et al. [2023] Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the geometry of large language models. *arXiv:2311.03658*, 2023.
- Wang et al. [2022] Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild. *arXiv:2211.00593*, 2022.
- Hubinger et al. [2024] Evan Hubinger, Carson Denison, Jesse Mu, et al. Sleeper agents: Training deceptive LLMs that persist through safety training. *arXiv:2401.05566*, 2024.
- Moschella et al. [2023] Luca Moschella, Valentino Maiorca, Marco Fumero, Antonio Norelli, Francesco Locatello, and Emanuele Rodolà. Relative representations enable zero-shot latent space communication. *ICLR*, 2023.
