Tag
BayesBench evaluates how closely large language models' belief updates match Bayesian reasoning in multi-turn evidence accumulation tasks, finding that while scaling improves latent inference, models struggle to use that understanding for downstream predictions.
This paper introduces BALAR, a training-free Bayesian agentic loop algorithm that enables large language models to actively reason and ask clarifying questions in multi-turn interactions. It demonstrates significant performance improvements over baselines on detective, puzzle, and clinical diagnosis benchmarks.