Tag
This paper investigates the production-evaluation gap in large reasoning models (LRMs), finding that they fail to robustly evaluate reasoning despite near-perfect solution production, due to an answer confirmation bias.