Tag
This paper formalizes the impossibility of perfect prompt-injection prevention in shared-embedding sequence models. It proves that no in-pipeline mechanism can guarantee Semantic-Faithful Control, because instructions and data share inseparable representations, analogous to code-data confusion in von Neumann architectures.
This paper formally defines the problem of eliciting latent knowledge (ELK) from AI systems using Causal Influence Diagrams, and proves an impossibility theorem: no feedback-based training strategy that depends only on agent behavior can guarantee an honest agent, even with perfect training feedback.