Open-source LLM benchmark runs 147 coding tasks every 4 hours, reports a 5-trial median with 95% CI, and uses CUSUM for change-point detection. Curious what people think of the methodology.

Reddit r/AI_Agents Tools

Summary

An open-source LLM benchmark runs 147 coding tasks every 4 hours, reports the median of 5 trials with a 95% confidence interval, and uses CUSUM for change-point detection. The author is asking for feedback on the methodology.
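
To make the two statistical pieces concrete, here is a minimal Python sketch. The post does not say how the confidence interval is computed or how CUSUM is parameterized, so everything here is an assumption: the percentile bootstrap for the interval, the two-sided tabular form of CUSUM, and the names `median_with_ci`, `cusum_alarms`, `target`, `slack`, and `threshold` are all hypothetical and not taken from the benchmark's code.

```python
import random
import statistics


def median_with_ci(trials, n_boot=2000, alpha=0.05, seed=0):
    """Median of the trials plus a percentile-bootstrap (1 - alpha) interval.

    Assumption: the benchmark's actual CI method is not stated in the post.
    """
    rng = random.Random(seed)
    # Resample the trials with replacement and take the median of each resample.
    boots = sorted(
        statistics.median(rng.choices(trials, k=len(trials)))
        for _ in range(n_boot)
    )
    lo = boots[int((alpha / 2) * n_boot)]
    hi = boots[int((1 - alpha / 2) * n_boot) - 1]
    return statistics.median(trials), lo, hi


def cusum_alarms(scores, target, slack, threshold):
    """Two-sided tabular CUSUM over a series of per-run scores.

    target:    expected score when nothing has changed (hypothetical baseline)
    slack:     deviation from target that is ignored as noise
    threshold: cumulative drift that triggers an alarm
    Returns the indices at which an alarm fires.
    """
    hi_sum = lo_sum = 0.0
    alarms = []
    for i, x in enumerate(scores):
        # Accumulate upward and downward drift separately, clipped at zero.
        hi_sum = max(0.0, hi_sum + (x - target - slack))
        lo_sum = max(0.0, lo_sum + (target - x - slack))
        if hi_sum > threshold or lo_sum > threshold:
            alarms.append(i)
            hi_sum = lo_sum = 0.0  # restart accumulation after an alarm
    return alarms


if __name__ == "__main__":
    # Made-up pass rates for illustration only, not results from the benchmark.
    print(median_with_ci([0.78, 0.80, 0.81, 0.79, 0.82]))
    print(cusum_alarms([0.80] * 6 + [0.70] * 6, target=0.80, slack=0.02, threshold=0.15))
```

One thing the sketch makes visible: with only 5 trials, the interval can never be wider than the range between the lowest and highest trial, so the choice of interval method matters a good deal at this sample size.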

Similar Articles

Created an LLM quiz program to check if AIs' performance varies over time

Reddit r/AI_Agents

A developer created LLM Canary, an open-source quiz program that sends randomized tasks to multiple LLMs to track performance over time. After a week of hourly testing across seven models, the results show all models fluctuate throughout the day with no consistent pattern, and no clear evidence of degradation was found.