Tag
BayesBench evaluates how closely large language models' belief updates match Bayesian reasoning in multi-turn evidence accumulation tasks, finding that while scaling improves latent inference, models struggle to use that understanding for downstream predictions.
A study of 25,000 AI scientist trials finds the agents ignore evidence 68% of the time and rarely revise hypotheses, showing popular scaffolding fixes don’t instill true scientific reasoning.