activation-explainer


Decoding Hidden Deception in Reasoning LLMs: Activation Explainers for Deception Auditing

arXiv cs.CL · 2d ago

Introduces STATEWITNESS, an activation explainer for auditing deception in reasoning LLMs, achieving significant improvements over existing monitors and providing human-inspectable evidence.
