Tag
FormInv proposes a measurement protocol for evaluating semantic invariance in mathematical reasoning benchmarks, revealing that model rankings reverse across paraphrase families and that standard accuracy metrics conceal large gaps in semantic consistency.
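To make the distinction between accuracy and semantic consistency concrete, here is a minimal sketch on toy data. It is not FormInv's actual protocol or metric. The family names (`formal`, `informal`), the model names, the outcome values, and the definition of consistency used here (a problem counts as consistent if its correctness is the same under every paraphrase family) are all assumptions chosen for illustration.

```python
from typing import Dict, List

# Hypothetical layout: results[model][family] holds one boolean per problem,
# True if the model solved that problem under that paraphrase family.
# Problems are aligned by index across families.
Results = Dict[str, Dict[str, List[bool]]]


def family_accuracy(results: Results, model: str, family: str) -> float:
    """Accuracy of one model on one paraphrase family."""
    outcomes = results[model][family]
    return sum(outcomes) / len(outcomes)


def pooled_accuracy(results: Results, model: str) -> float:
    """Accuracy pooled over all families (the 'standard' single number)."""
    pooled = [o for outcomes in results[model].values() for o in outcomes]
    return sum(pooled) / len(pooled)


def consistency(results: Results, model: str) -> float:
    """Fraction of problems whose outcome is identical under every family.

    This is an assumed, simplified notion of semantic consistency.
    """
    per_problem = list(zip(*results[model].values()))
    return sum(len(set(row)) == 1 for row in per_problem) / len(per_problem)


def ranking(results: Results, family: str) -> List[str]:
    """Models ordered from best to worst on one paraphrase family."""
    return sorted(
        results,
        key=lambda m: family_accuracy(results, m, family),
        reverse=True,
    )


# Made-up outcomes for two hypothetical models on four problems.
toy: Results = {
    "model_a": {
        "formal": [True, True, True, False],
        "informal": [True, False, False, False],
    },
    "model_b": {
        "formal": [True, True, False, False],
        "informal": [True, True, False, False],
    },
}

for name in toy:
    print(name, pooled_accuracy(toy, name), consistency(toy, name))
print("formal ranking:", ranking(toy, "formal"))
print("informal ranking:", ranking(toy, "informal"))
```

In this toy setup both models have the same pooled accuracy, yet one is consistent on every problem and the other on only half, and the ordering of the two models flips between the two families. That is the kind of pattern the summary describes, shown here only on invented numbers.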