Decoding Hidden Deception in Reasoning LLMs: Activation Explainers for Deception Auditing

arXiv cs.CL Papers

Summary

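Introduces STATEWITNESS, an activation explainer for auditing deception in reasoning LLMs. Across two target models and seven deception datasets it reaches 0.916 mean AUROC, a relative gain of 11.6% over the best black-box text monitor and 25.0% over the best activation-probe baseline, and it returns human-inspectable evidence alongside the score.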

arXiv:2606.17478v1 Announce Type: new

Abstract: As LLMs acquire stronger reasoning capabilities, deceptive behavior becomes an increasingly serious safety concern. Existing deception monitors either score visible transcripts or derive scalar probe scores from representation vectors, leaving little inspectable evidence about why a response is suspicious. We introduce STATEWITNESS, an activation explainer for deception auditing. A separate decoder reads a target model's hidden states, then answers natural-language queries or emits structured reports about them. We evaluate STATEWITNESS on two target reasoning LLMs across seven deception datasets. STATEWITNESS reaches 0.916 mean AUROC, a relative gain of 11.6% over the best black-box text monitor and 25.0% over the best activation-probe baseline under the same evaluation protocol. When combined with existing monitors, STATEWITNESS reduces missed deceptive examples in simple threshold ensembles. Beyond scalar detection, the decoder returns query-level answers, schema reports, and token- or sentence-level evidence traces for human inspection. We view this interface as a potential building block for broader interpretability and alignment tools.
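
The abstract mentions two evaluation ideas that a small example can make concrete: AUROC as the headline metric, and a "simple threshold ensemble" that combines monitors to reduce missed deceptive examples. The sketch below is illustrative only. The abstract does not specify the ensemble rule, so the OR rule used here is an assumption, and the function names (`auroc`, `or_ensemble`, `missed`), the thresholds, and the toy scores are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pairwise AUROC: chance that a deceptive example (label 1) scores
    higher than an honest one (label 0); ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def or_ensemble(
    scores_a: Sequence[float], thr_a: float,
    scores_b: Sequence[float], thr_b: float,
) -> List[bool]:
    """Assumed ensemble rule: flag an example if EITHER monitor crosses
    its own threshold."""
    return [a >= thr_a or b >= thr_b for a, b in zip(scores_a, scores_b)]


def missed(flags: Sequence[bool], labels: Sequence[int]) -> int:
    """Count deceptive examples (label 1) that were not flagged."""
    return sum(1 for f, y in zip(flags, labels) if y == 1 and not f)


# Toy data (made up for illustration): 1 = deceptive, 0 = honest.
labels = [1, 1, 1, 0, 0, 0]
text_monitor = [0.9, 0.2, 0.7, 0.1, 0.3, 0.4]   # hypothetical transcript scores
explainer = [0.8, 0.9, 0.3, 0.2, 0.1, 0.3]      # hypothetical explainer scores

text_flags = [s >= 0.5 for s in text_monitor]
explainer_flags = [s >= 0.5 for s in explainer]
combined_flags = or_ensemble(text_monitor, 0.5, explainer, 0.5)

print("AUROC text monitor:", round(auroc(text_monitor, labels), 3))
print("AUROC explainer:   ", round(auroc(explainer, labels), 3))
print("missed (text only):     ", missed(text_flags, labels))
print("missed (explainer only):", missed(explainer_flags, labels))
print("missed (OR ensemble):   ", missed(combined_flags, labels))
```

In this toy setup each monitor misses a different deceptive example, so the OR rule catches both. An OR rule can never miss more deceptive examples than either monitor alone at the same thresholds, but it can flag more honest examples, so thresholds would normally be tuned against a false-positive budget.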

# Decoding Hidden Deception in Reasoning LLMs: Activation Explainers for Deception Auditing
Source: [https://arxiv.org/abs/2606.17478](https://arxiv.org/abs/2606.17478)
[View PDF](https://arxiv.org/pdf/2606.17478)


## Submission history

From: Kexin Chen \[[view email](https://arxiv.org/show-email/35dc1541/2606.17478)\] **\[v1\]** Tue, 16 Jun 2026 03:41:29 UTC (5,902 KB)

Similar Articles

Hidden Thoughts Are Not Secret: Reasoning Trace Exposure in LLMs

arXiv cs.AI

This paper introduces Reasoning Exposure Prompting (REP), a method that uses shadow-model demonstrations in code-like formats to elicit hidden reasoning traces from LLMs, showing that interface-level trace hiding is insufficient to prevent extraction of useful reasoning signals.

DECOR: Auditing LLM Deception via Information Manipulation Theory

arXiv cs.CL

Introduces DECOR, a multi-agent framework grounded in Information Manipulation Theory for fine-grained auditing of strategic deception in LLM responses, achieving state-of-the-art performance on deception detection benchmarks across 15 frontier models.

Learning to Refine Hidden States for Reliable LLM Reasoning

arXiv cs.LG

Proposes ReLAR, a reinforcement-guided latent refinement framework that iteratively updates hidden representations in LLMs before decoding, improving reasoning reliability and efficiency compared to chain-of-thought methods.

Learning to reason with LLMs

OpenAI Blog

OpenAI publishes an article exploring reasoning techniques with LLMs through cipher-decoding examples, demonstrating step-by-step problem-solving approaches and pattern recognition in language models.