Tag
This paper introduces the Valid-Answer-Invalid-Reasoning (VAIR) benchmark to expose the production-evaluation gap in AI reasoning models, where models can generate correct answers but fail to detect flawed reasoning, revealing answer confirmation bias.