Tag
An open-source LLM benchmark runs 147 coding tasks every 4 hours, reports the median of 5 trials with 95% confidence intervals, and uses CUSUM for change-point detection. Its methodology has sparked discussion.
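To make the two statistical pieces named in this summary concrete, here is a minimal sketch of a per-run median followed by a two-sided tabular CUSUM detector. The function names, the `target`, `slack`, and `threshold` parameters, and the sample scores are all illustrative assumptions; the benchmark's actual implementation and tuning are not described here, and the confidence-interval step is left out.

```python
from statistics import median


def median_of_trials(trial_scores):
    """Collapse the scores of one benchmark run (e.g. 5 trials) into a median."""
    return median(trial_scores)


def cusum_alarms(series, target, slack, threshold):
    """Two-sided tabular CUSUM (illustrative; parameters are assumptions).

    Accumulates deviations from `target` beyond `slack` in each direction
    and records the index whenever either sum exceeds `threshold`.
    Both sums are reset after an alarm.
    """
    upper = lower = 0.0
    alarms = []
    for index, value in enumerate(series):
        upper = max(0.0, upper + (value - target - slack))
        lower = max(0.0, lower + (target - value - slack))
        if upper > threshold or lower > threshold:
            alarms.append(index)
            upper = lower = 0.0
    return alarms


# Hypothetical pass rates (percent): five stable runs, then a 10-point drop.
run_medians = [80, 80, 80, 80, 80, 70, 70, 70, 70, 70]
alarm_indices = cusum_alarms(run_medians, target=80, slack=2, threshold=15)
```

The `slack` term keeps small run-to-run noise from accumulating, so only a sustained shift away from `target` triggers an alarm.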
This paper analyzes the variance of FID scores across different training and sampling seeds, revealing significant reproducibility issues in image generation evaluation. It proposes a new evaluation protocol with error bars and per-cell optimal guidance tuning.
This paper argues that simple averaging in AI benchmarks fails under data sparsity and difficulty heterogeneity, proposing Item Response Theory (IRT) as a robust alternative to recover ground truth rankings.
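The failure mode described in this summary can be illustrated with a small sketch. It assumes a one-parameter (Rasch) IRT model fitted by ridge-regularized gradient ascent, which may differ from the paper's formulation; the response data, the function name `fit_rasch`, and all hyperparameters are hypothetical.

```python
import math


def fit_rasch(responses, n_iter=2000, lr=0.1, ridge=0.1):
    """Fit P(correct) = sigmoid(ability - difficulty) on sparse responses.

    `responses` maps (model, item) -> 0 or 1; unobserved pairs are absent.
    The ridge penalty is an assumption that keeps the estimates finite.
    """
    models = sorted({m for m, _ in responses})
    items = sorted({i for _, i in responses})
    ability = {m: 0.0 for m in models}
    difficulty = {i: 0.0 for i in items}
    for _ in range(n_iter):
        grad_a = {m: -ridge * ability[m] for m in models}
        grad_d = {i: -ridge * difficulty[i] for i in items}
        for (m, i), y in responses.items():
            p = 1.0 / (1.0 + math.exp(-(ability[m] - difficulty[i])))
            grad_a[m] += y - p   # log-likelihood gradient w.r.t. ability
            grad_d[i] -= y - p   # and w.r.t. difficulty (opposite sign)
        for m in models:
            ability[m] += lr * grad_a[m]
        for i in items:
            difficulty[i] += lr * grad_d[i]
    return ability, difficulty


easy = ["e1", "e2", "e3", "e4"]
hard = ["h1", "h2", "h3", "h4"]
responses = {}
# Anchor model C sees every item: passes the easy ones, fails the hard ones.
responses.update({("C", i): 1 for i in easy})
responses.update({("C", i): 0 for i in hard})
# Model A is evaluated only on hard items and passes half of them.
responses.update({("A", "h1"): 1, ("A", "h2"): 1, ("A", "h3"): 0, ("A", "h4"): 0})
# Model B is evaluated only on easy items and passes three of four.
responses.update({("B", "e1"): 1, ("B", "e2"): 1, ("B", "e3"): 1, ("B", "e4"): 0})


def mean_score(model):
    scores = [y for (m, _), y in responses.items() if m == model]
    return sum(scores) / len(scores)


ability, difficulty = fit_rasch(responses)
```

In this toy setup the simple average puts B (0.75) ahead of A (0.5), while the fitted abilities put A ahead of B, because A's passes came on items the anchor model failed.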
The first per-image PCA decomposition of the 24-image Kodak PCD0992 suite reveals deliberate curation, with inter-channel redundancy spanning two orders of magnitude.
This paper analyzes 935 ablation experiments from 161 publications to show that AI architectural evolution follows the same statistical laws as biological evolution, including heavy-tailed fitness effect distributions and punctuated equilibria dynamics. The findings suggest that evolutionary statistical structure is substrate-independent, determined by fitness landscape topology rather than the mechanism of selection.