Tag
A critique arguing that training LLMs on human-generated data limits their ability to discover novel solutions through test-time compute, and that true AGI requires models that explore hypothesis spaces more broadly, as AlphaZero does.
This paper introduces MAPLE, a tree search method that aggregates policy and value evaluations from multiple sampled world states, extending AlphaZero to imperfect-information games. Experiments on Phantom Go and Dark Hex show Elo improvements of 291 and 136, respectively, over the PIMC-based (Perfect Information Monte Carlo) AlphaZero baseline.
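
To make the idea of aggregating evaluations across sampled world states concrete, here is a minimal Python sketch. It is not MAPLE's actual procedure: the summary does not say how the evaluations are combined, so uniform averaging is an assumption, and the names `aggregate_evaluations` and `evaluate` are hypothetical stand-ins for a policy/value network call.

```python
from typing import Callable, Dict, Sequence, Tuple, TypeVar

State = TypeVar("State")
Policy = Dict[str, float]  # action label -> probability


def aggregate_evaluations(
    world_states: Sequence[State],
    evaluate: Callable[[State], Tuple[Policy, float]],
) -> Tuple[Policy, float]:
    """Combine per-world (policy, value) evaluations into one estimate.

    ASSUMPTION: every sampled world state gets equal weight. The source
    does not specify the weighting, so this is illustrative only.
    """
    if not world_states:
        raise ValueError("need at least one sampled world state")

    policy_sum: Policy = {}
    value_sum = 0.0
    for state in world_states:
        policy, value = evaluate(state)  # hypothetical network evaluation
        value_sum += value
        for action, prob in policy.items():
            policy_sum[action] = policy_sum.get(action, 0.0) + prob

    n = len(world_states)
    mean_policy = {action: total / n for action, total in policy_sum.items()}

    # Renormalize, since an action may be legal in only some sampled worlds.
    mass = sum(mean_policy.values())
    if mass > 0:
        mean_policy = {action: p / mass for action, p in mean_policy.items()}

    return mean_policy, value_sum / n
```

In a search setting, the returned policy would act as the prior over actions at a node and the returned value as its evaluation. A belief-weighted average over world states would be a natural alternative to the uniform weighting assumed here.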