Introduces Simmer, a benchmark for evaluating latent failures in LLM-generated executable plans, built on a human-curated symbolic world model of the kitchen domain. In experiments, frontier LLMs produce error-free plans at most 17% of the time, and up to 56% of plans contain latent failures. Counterfactual foresight simulation significantly reduces these failures.