Tag
This paper proposes UP-NRPA, an online framework that combines user portraits with nested rollout policy adaptation (NRPA), driven by large language models, to customize dialogue strategies dynamically without offline training. It reports a 100% success rate on multiple dialogue tasks.
This paper introduces Simmer, a benchmark for evaluating latent failures in LLM-generated executable plans, built on a human-curated symbolic world model of the kitchen domain. Experiments show that frontier LLMs produce error-free plans at most 17% of the time, that up to 56% of plans contain latent failures, and that counterfactual foresight simulation significantly reduces these failures.