statistical-analysis

#statistical-analysis

Open-source LLM benchmark runs 147 coding tasks every 4 hours, 5-trial median with 95% CI, and uses CUSUM for change-point detection. Curious what people think of the methodology

Reddit r/AI_Agents · 2026-06-18

An open-source LLM benchmark runs 147 coding tasks every 4 hours, reports the median of 5 trials with a 95% confidence interval, and uses CUSUM for change-point detection. The poster asks for feedback on the methodology.
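
The post does not show how the benchmark computes its statistics, so the following is only a minimal sketch of one common variant: a two-sided tabular CUSUM run over per-run medians. The function name `cusum_alarms`, the `target`, `slack`, and `threshold` parameters, and the scores are all hypothetical and chosen for illustration.

```python
from statistics import median

def cusum_alarms(scores, target, slack, threshold):
    """Two-sided tabular CUSUM. Returns the indices at which an alarm fires.

    target, slack and threshold are assumed tuning values, not the benchmark's.
    """
    upper = lower = 0.0
    alarms = []
    for i, x in enumerate(scores):
        # Accumulate only deviations from the target that exceed the slack.
        upper = max(0.0, upper + (x - target - slack))
        lower = max(0.0, lower + (target - x - slack))
        if upper > threshold or lower > threshold:
            alarms.append(i)
            upper = lower = 0.0  # restart accumulation after an alarm
    return alarms

# Hypothetical data: 12 runs of 5 trials each; the pass rate drops after run 5.
runs = [[0.80, 0.82, 0.81, 0.79, 0.80]] * 6 + [[0.70, 0.72, 0.71, 0.69, 0.70]] * 6
medians = [median(trials) for trials in runs]  # one 5-trial median per run

alarms = cusum_alarms(medians, target=0.80, slack=0.02, threshold=0.15)
print(alarms)
```

In this toy series the first alarm fires two runs after the drop begins, which shows the usual trade-off: a larger `threshold` gives fewer false alarms but slower detection.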

The FID Lottery: Quantifying Hidden Randomness in Generative-Model Evaluation

Hugging Face Daily Papers · 2026-06-18

This paper analyzes the variance of FID scores across different training and sampling seeds, revealing significant reproducibility issues in image generation evaluation. It proposes a new evaluation protocol with error bars and per-cell optimal guidance tuning.

The Scaling Law of Evaluation Failure: Why Simple Averaging Collapses Under Data Sparsity and Item Difficulty Gaps, and How Item Response Theory Recovers Ground Truth Across Domains

arXiv cs.LG · 2026-05-13

This paper argues that simple averaging in AI benchmarks fails under data sparsity and difficulty heterogeneity, proposing Item Response Theory (IRT) as a robust alternative to recover ground truth rankings.
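
To illustrate the failure mode described above, here is a small sketch that assumes the simplest IRT variant, the one-parameter (Rasch) model. It is not the paper's method, and the abilities and item difficulties are hypothetical values picked for illustration. It shows how a stronger model that happens to be evaluated only on hard items can end up with a lower plain average than a weaker model evaluated only on easy ones.

```python
import math

def p_correct(ability, difficulty):
    """Rasch (1PL) model: P(correct) = sigmoid(ability - difficulty)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Hypothetical values, chosen only for illustration.
easy_items = [-2.0, -2.0, -2.0]   # item difficulties seen by the weak model
hard_items = [2.0, 2.0, 2.0]      # item difficulties seen by the strong model
weak_ability, strong_ability = 0.0, 1.0

# Expected plain-average score of each model on the items it was run on.
weak_avg = sum(p_correct(weak_ability, b) for b in easy_items) / len(easy_items)
strong_avg = sum(p_correct(strong_ability, b) for b in hard_items) / len(hard_items)

print(f"weak model on easy items:   {weak_avg:.3f}")    # 0.881
print(f"strong model on hard items: {strong_avg:.3f}")  # 0.269
```

Ranking by these averages puts the weaker model first. An IRT fit estimates item difficulty alongside ability, which is why it can correct for this when the models share enough items to link their scales.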

First per-image PCA decomposition of Kodak suite reveals deliberate curation

Hacker News Top · 2026-04-20

A per-image PCA decomposition of the 24-image Kodak PCD0992 suite, described as the first of its kind, reveals deliberate curation: inter-channel redundancy spans two orders of magnitude.

Universal statistical signatures of evolution in artificial intelligence architectures

Hugging Face Daily Papers · 2026-04-12

This paper analyzes 935 ablation experiments from 161 publications to show that AI architectural evolution follows the same statistical laws as biological evolution, including heavy-tailed fitness effect distributions and punctuated equilibria dynamics. The findings suggest that evolutionary statistical structure is substrate-independent, determined by fitness landscape topology rather than the mechanism of selection.
