Tag
This paper develops a framework for interpreting AI systems as agents. It draws on the philosophy of radical interpretation and on tools from mechanistic interpretability to address how we can come to trust AI systems by understanding their beliefs, desires, and meanings.