Tag
This paper introduces GSM-SEM, a framework for generating semantically diverse benchmark variants to mitigate memorization in mathematical reasoning evaluations. The authors show that current state-of-the-art LLMs score significantly lower on these variants than on the corresponding static benchmarks.