Decoding Hidden Deception in Reasoning LLMs: Activation Explainers for Deception Auditing

arXiv cs.CL 06/17/26, 04:00 AM Papers

deception-detection llm-safety activation-explainer reasoning-llms interpretability alignment

Summary

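Introduces STATEWITNESS, an activation explainer for auditing deception in reasoning LLMs. Across two target models and seven deception datasets it reaches 0.916 mean AUROC, a relative gain of 11.6% over the best black-box text monitor and 25.0% over the best activation-probe baseline, and it returns human-inspectable evidence alongside the score.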

arXiv:2606.17478v1 Announce Type: new Abstract: As LLMs acquire stronger reasoning capabilities, deceptive behavior becomes an increasingly serious safety concern. Existing deception monitors either score visible transcripts or derive scalar probe scores from representation vectors, leaving little inspectable evidence about why a response is suspicious. We introduce STATEWITNESS, an activation explainer for deception auditing. A separate decoder reads a target model's hidden states, then answers natural-language queries or emits structured reports about them. We evaluate STATEWITNESS on two target reasoning LLMs across seven deception datasets. STATEWITNESS reaches 0.916 mean AUROC, a relative gain of 11.6% over the best black-box text monitor and 25.0% over the best activation-probe baseline under the same evaluation protocol. When combined with existing monitors, STATEWITNESS reduces missed deceptive examples in simple threshold ensembles. Beyond scalar detection, the decoder returns query-level answers, schema reports, and token- or sentence-level evidence traces for human inspection. We view this interface as a potential building block for broader interpretability and alignment tools.
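The abstract's metrics can be made concrete with a small sketch. The code below is not the paper's implementation: the helper names (`auroc`, `or_ensemble`, `missed`, `implied_baseline`), the 0.5 thresholds, and all toy scores are hypothetical and chosen only for illustration. It assumes that "relative gain" means (new - baseline) / baseline and that a "simple threshold ensemble" flags an example when any monitor's score crosses its threshold; the abstract does not spell out either definition.

```python
from typing import List, Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise AUROC: chance that a deceptive example (label 1)
    outscores an honest one (label 0); ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def flags(scores: Sequence[float], threshold: float) -> List[bool]:
    """Flag every example whose score reaches the threshold."""
    return [s >= threshold for s in scores]


def or_ensemble(*flag_lists: Sequence[bool]) -> List[bool]:
    """Assumed ensemble rule: flag if ANY monitor flags the example."""
    return [any(column) for column in zip(*flag_lists)]


def missed(labels: Sequence[int], flagged: Sequence[bool]) -> int:
    """Count deceptive examples that were not flagged."""
    return sum(1 for y, f in zip(labels, flagged) if y == 1 and not f)


def implied_baseline(score: float, relative_gain: float) -> float:
    """Baseline implied by a score and a relative gain,
    assuming gain = (score - baseline) / baseline."""
    return score / (1.0 + relative_gain)


# Toy data, invented for illustration only (not results from the paper).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
explainer_scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.4, 0.1, 0.35]
text_scores = [0.6, 0.2, 0.8, 0.4, 0.3, 0.1, 0.45, 0.2]

explainer_flags = flags(explainer_scores, 0.5)
text_flags = flags(text_scores, 0.5)
combined_flags = or_ensemble(explainer_flags, text_flags)

print("toy explainer AUROC:", auroc(labels, explainer_scores))
print("missed (explainer only):", missed(labels, explainer_flags))
print("missed (text only):", missed(labels, text_flags))
print("missed (OR ensemble):", missed(labels, combined_flags))

# Baselines implied by the reported 0.916 mean AUROC under the assumed definition.
print("implied text-monitor AUROC:", round(implied_baseline(0.916, 0.116), 3))
print("implied probe AUROC:", round(implied_baseline(0.916, 0.250), 3))
```

On the toy data each monitor alone misses at least one deceptive example while the OR ensemble misses none, which is the kind of complementarity the abstract describes. An OR rule can also raise the false-positive rate, so thresholds would need to be set with that trade-off in mind.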

Cached at: 06/17/26, 05:40 AM

# Decoding Hidden Deception in Reasoning LLMs: Activation Explainers for Deception Auditing
Source: [https://arxiv.org/abs/2606.17478](https://arxiv.org/abs/2606.17478)
[View PDF](https://arxiv.org/pdf/2606.17478)

## Submission history

From: Kexin Chen \[[view email](https://arxiv.org/show-email/35dc1541/2606.17478)\] **\[v1\]** Tue, 16 Jun 2026 03:41:29 UTC (5,902 KB)

Similar Articles

Hidden Thoughts Are Not Secret: Reasoning Trace Exposure in LLMs

DECOR: Auditing LLM Deception via Information Manipulation Theory

When LLMs Learn to Be Consistently Wrong: A Multi-Model Study of Linear Representations of Synthetic Deception

Learning to Refine Hidden States for Reliable LLM Reasoning

Learning to reason with LLMs
