causal-influence-diagrams


The Impossibility of Eliciting Latent Knowledge

arXiv cs.AI · 2026-06-11

This paper uses Causal Influence Diagrams to formally define the problem of eliciting latent knowledge (ELK) from AI systems and proves an impossibility theorem: no feedback-based training strategy that depends only on agent behavior can guarantee that the agent is honest, even with perfect training feedback.
