Tag
This paper presents OSCToM, an RL-guided method that generates adversarial data to test nested belief conflicts in LLMs and improves Theory of Mind reasoning on benchmarks such as FANToM.
SafeHarbor is a novel framework for LLM agent safety that uses hierarchical memory and self-evolution to balance safety and utility, achieving state-of-the-art performance on benign and malicious tasks.