Decoding Hidden Deception in Reasoning LLMs: Activation Explainers for Deception Auditing
Summary
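Introduces STATEWITNESS, an activation explainer for auditing deception in reasoning LLMs. A separate decoder reads the target model's hidden states and returns human-inspectable evidence. It reaches 0.916 mean AUROC across seven deception datasets, a relative gain of 11.6% over the best black-box text monitor and 25.0% over the best activation-probe baseline.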
# Decoding Hidden Deception in Reasoning LLMs: Activation Explainers for Deception Auditing

Source: [https://arxiv.org/abs/2606.17478](https://arxiv.org/abs/2606.17478) [View PDF](https://arxiv.org/pdf/2606.17478)

> Abstract: As LLMs acquire stronger reasoning capabilities, deceptive behavior becomes an increasingly serious safety concern. Existing deception monitors either score visible transcripts or derive scalar probe scores from representation vectors, leaving little inspectable evidence about why a response is suspicious. We introduce STATEWITNESS, an activation explainer for deception auditing. A separate decoder reads a target model's hidden states, then answers natural-language queries or emits structured reports about them. We evaluate STATEWITNESS on two target reasoning LLMs across seven deception datasets. STATEWITNESS reaches 0.916 mean AUROC, a relative gain of 11.6% over the best black-box text monitor and 25.0% over the best activation-probe baseline under the same evaluation protocol. When combined with existing monitors, STATEWITNESS reduces missed deceptive examples in simple threshold ensembles. Beyond scalar detection, the decoder returns query-level answers, schema reports, and token- or sentence-level evidence traces for human inspection. We view this interface as a potential building block for broader interpretability and alignment tools.

## Submission history

From: Kexin Chen [[view email](https://arxiv.org/show-email/35dc1541/2606.17478)]
**[v1]** Tue, 16 Jun 2026 03:41:29 UTC (5,902 KB)
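
The abstract reports AUROC and "simple threshold ensembles" without spelling either out. The sketch below is one plausible reading, not the paper's protocol. It computes AUROC as a pairwise ranking statistic and combines two monitors with a logical OR over per-monitor thresholds. The scores, labels, thresholds, and function names are all made up for illustration and are not taken from the paper.

```python
from typing import List, Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Chance that a random deceptive example (label 1) outscores a random
    honest one (label 0); ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def flag(scores: Sequence[float], threshold: float) -> List[bool]:
    """Flag an example as suspicious when its score reaches the threshold."""
    return [s >= threshold for s in scores]


def or_ensemble(flags_a: Sequence[bool], flags_b: Sequence[bool]) -> List[bool]:
    """Flag an example if either monitor flags it (assumed ensemble rule)."""
    return [a or b for a, b in zip(flags_a, flags_b)]


def missed(flags: Sequence[bool], labels: Sequence[int]) -> int:
    """Count deceptive examples that were not flagged."""
    return sum(1 for f, y in zip(flags, labels) if y == 1 and not f)


def false_alarms(flags: Sequence[bool], labels: Sequence[int]) -> int:
    """Count honest examples that were flagged."""
    return sum(1 for f, y in zip(flags, labels) if y == 0 and f)


# Synthetic toy data (hypothetical): 1 = deceptive, 0 = honest.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
text_scores = [0.9, 0.2, 0.7, 0.3, 0.1, 0.4, 0.2, 0.6]       # text monitor
explainer_scores = [0.8, 0.9, 0.3, 0.6, 0.2, 0.1, 0.3, 0.2]  # activation explainer

text_flags = flag(text_scores, 0.5)
explainer_flags = flag(explainer_scores, 0.5)
combined_flags = or_ensemble(text_flags, explainer_flags)

if __name__ == "__main__":
    print("AUROC text monitor:", auroc(text_scores, labels))
    print("AUROC explainer:   ", auroc(explainer_scores, labels))
    for name, flags in [("text", text_flags),
                        ("explainer", explainer_flags),
                        ("OR ensemble", combined_flags)]:
        print(name, "missed:", missed(flags, labels),
              "false alarms:", false_alarms(flags, labels))
```

On this toy data the OR rule catches every deceptive example that either monitor catches, so it cannot miss more than the better monitor does. The cost is that it also inherits the false alarms of both monitors, so in practice the thresholds would need to be tuned jointly.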
Similar Articles
Hidden Thoughts Are Not Secret: Reasoning Trace Exposure in LLMs
This paper introduces Reasoning Exposure Prompting (REP), a method that uses shadow-model demonstrations in code-like formats to elicit hidden reasoning traces from LLMs, showing that interface-level trace hiding is insufficient to prevent extraction of useful reasoning signals.
DECOR: Auditing LLM Deception via Information Manipulation Theory
Introduces DECOR, a multi-agent framework grounded in Information Manipulation Theory for fine-grained auditing of strategic deception in LLM responses, achieving state-of-the-art performance on deception detection benchmarks across 15 frontier models.
When LLMs Learn to Be Consistently Wrong: A Multi-Model Study of Linear Representations of Synthetic Deception
This paper studies synthetic dishonesty in LLMs by fine-tuning honest and deceptive variants of five transformer models and finding that robust, domain-invariant dishonesty representations can be rapidly entrenched via modest supervised fine-tuning, with implications for activation-based monitoring.
Learning to Refine Hidden States for Reliable LLM Reasoning
Proposes ReLAR, a reinforcement-guided latent refinement framework that iteratively updates hidden representations in LLMs before decoding, improving reasoning reliability and efficiency compared to chain-of-thought methods.
Learning to reason with LLMs
OpenAI publishes an article exploring reasoning techniques with LLMs through cipher-decoding examples, demonstrating step-by-step problem-solving approaches and pattern recognition in language models.