predictive-validity


Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents

Hugging Face Daily Papers · 6d ago

This paper argues that aggregate-score leaderboards for LLM agent benchmarks fail to capture deployment-relevant dimensions and exhibit rank instability. It proposes ranking configurations by predictive validity, defined as the correlation between in-sample and out-of-sample rank, and introduces a twelve-tier measurement apparatus along with falsifiable out-of-distribution criteria.
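
As a rough illustration of the idea of correlating in-sample and out-of-sample rank, the sketch below computes a Spearman rank correlation between two score lists for the same set of agent configurations. The choice of Spearman correlation, the function names (`ranks`, `rank_correlation`), and the example scores are all assumptions made for illustration; the summary does not say which correlation statistic or procedure the paper uses.

```python
import math


def ranks(scores):
    """Return 1-based ranks (1 = highest score); tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next score ties with the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all ranks are equal")
    return cov / math.sqrt(var_x * var_y)


def rank_correlation(in_sample_scores, out_of_sample_scores):
    """Spearman correlation: Pearson correlation applied to the two rank vectors."""
    if len(in_sample_scores) != len(out_of_sample_scores):
        raise ValueError("both score lists must cover the same configurations")
    return pearson(ranks(in_sample_scores), ranks(out_of_sample_scores))


# Hypothetical scores for four agent configurations (illustrative only).
in_sample = [0.9, 0.8, 0.7, 0.6]
out_of_sample = [0.5, 0.6, 0.3, 0.2]
print(round(rank_correlation(in_sample, out_of_sample), 3))  # 0.8
```

In this made-up example the top two configurations swap places out of sample, so the correlation drops from 1.0 to 0.8. A value near 1.0 would indicate that the in-sample ordering carries over to held-out tasks.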
