Tag
This paper formally defines the problem of eliciting latent knowledge (ELK) from AI systems using Causal Influence Diagrams, and proves an impossibility theorem: no feedback-based training strategy that depends only on agent behavior can guarantee an honest agent, even with perfect training feedback.