latent-failures

SIMMER: Benchmarking Latent Failures in LLM Executable Planning with a World Model

arXiv cs.CL · 5d ago

Introduces SIMMER, a benchmark for evaluating latent failures in LLM-generated executable plans, using a human-curated symbolic world model of the kitchen domain. In experiments, frontier LLMs produce error-free plans at most 17% of the time, and up to 56% of plans contain latent failures; counterfactual foresight simulation significantly reduces these failures.
