llm-deception

#llm-deception

RogueAI: A Reverse Turing Test for Detecting Licensed AI Deception in Dialogue

arXiv cs.CL · yesterday

This paper introduces RogueAI, a reverse Turing test implemented as an interactive web app in which human players interrogate two LLM agents to identify which one is licensed to deceive within a shared fictional scenario. In a pilot deployment, heuristic detection reached 75.6% accuracy, compared with 56.6% for human players, a gap that highlights the system's potential as a data-collection and teaching tool for AI deception and honesty.


#llm-deception

DECOR: Auditing LLM Deception via Information Manipulation Theory

arXiv cs.CL · 2026-05-20

This paper introduces DECOR, a multi-agent framework grounded in Information Manipulation Theory for fine-grained auditing of strategic deception in LLM responses. It achieves state-of-the-art performance on deception detection benchmarks across 15 frontier models.

