Tag
Discusses the need for evolving AI evaluation benchmarks through difficulty, quality, and diversity refinement, citing examples like MMLU-Pro, MMLU-Redux, BIG-Bench Extra Hard, RealMath, MathArena, and DatBench.