Tag
This paper audits the 'Translation Tax' in Chinese multilingual benchmarks, arguing that it is not a single scalar but a set of estimator- and item-dependent validity risks. It introduces a naturalization stress test to quantify how much English-source cues inflate model scores.