Open-source LLM benchmark runs 147 coding tasks every 4 hours, 5-trial median with 95% CI, and uses CUSUM for change-point detection. Curious what people think of the methodology
Summary
An open-source LLM benchmark runs 147 coding tasks every 4 hours, reports a 5-trial median with 95% confidence intervals, and uses CUSUM for change-point detection. The author is asking for feedback on the methodology.
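To make the discussion concrete, here is a minimal Python sketch of the two statistical pieces named in the summary: a median over a handful of trials with a 95% interval, and a CUSUM detector over the resulting time series. The post does not say how the benchmark computes its interval or tunes CUSUM, so everything here is an assumption for illustration: the percentile bootstrap, the two-sided tabular form of CUSUM, and the names `median_with_ci`, `cusum_alarms`, `target`, `slack`, and `threshold` are all hypothetical.

```python
import random
import statistics


def median_with_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Return (median, lower, upper) using a percentile bootstrap.

    Assumption: the benchmark's actual CI method is not stated in the post;
    a percentile bootstrap is used here only as one plausible choice.
    """
    rng = random.Random(seed)
    # Resample the trials with replacement and record each resample's median.
    boot_medians = sorted(
        statistics.median(rng.choices(scores, k=len(scores)))
        for _ in range(n_boot)
    )
    lower = boot_medians[int((alpha / 2) * n_boot)]
    upper = boot_medians[int((1 - alpha / 2) * n_boot) - 1]
    return statistics.median(scores), lower, upper


def cusum_alarms(series, target, slack, threshold):
    """Two-sided tabular CUSUM; returns the indices where an alarm fires.

    target    -- expected baseline score (hypothetical parameter)
    slack     -- deviation from target that is tolerated without accumulating
    threshold -- cumulative drift that triggers an alarm
    """
    pos = neg = 0.0
    alarms = []
    for i, x in enumerate(series):
        # Accumulate only the part of the deviation that exceeds the slack.
        pos = max(0.0, pos + (x - target - slack))
        neg = max(0.0, neg + (target - x - slack))
        if pos > threshold or neg > threshold:
            alarms.append(i)
            pos = neg = 0.0  # restart accumulation after an alarm
    return alarms


# Example: five trial scores from one run, then a series of run medians
# that drops from 0.80 to 0.70 halfway through.
trial_scores = [0.78, 0.80, 0.81, 0.79, 0.82]
median, lower, upper = median_with_ci(trial_scores)

run_medians = [0.80] * 10 + [0.70] * 10
alarms = cusum_alarms(run_medians, target=0.80, slack=0.02, threshold=0.15)
```

With only five trials, a bootstrap interval can take very few distinct values, so it is coarse; that limitation is one of the methodology questions the post invites. The CUSUM slack and threshold trade detection delay against false alarms and would need tuning against the benchmark's real run-to-run noise.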
Similar Articles
We use LLMs to analyze every file in your codebase. Everyone told us this was a stupid idea because of cost but it wasnt.
A benchmark study demonstrates that using LLMs to analyze entire codebases is cost-effective, identifying DeepSeek V4 Flash as the optimal default model due to its low cost and comparable accuracy to premium options like Claude Opus.
LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning
Introduces LinAlg-Bench, a diagnostic benchmark evaluating 10 frontier LLMs on structured linear algebra computation across matrix dimensions, revealing that LLM mathematical failure is structurally constrained and transitions from execution errors to computational abandonment at 4x4 scale.
here it is: Benchmark-Yourself app - compete against open source LLMs and get your score - 5 benchmarks available - Add your results to your CV or linkedIn (if you dare)... or just paste them below for community shaming.
A web app that allows users to benchmark their own performance against open source LLMs on five benchmarks, with the option to add results to a CV or LinkedIn.
We benchmarked 18 LLMs on OCR (7k+ calls) — cheaper/old models oftentimes win. Full dataset + framework open-sourced. [R]
A comprehensive benchmark of 18 LLMs on OCR tasks (7k+ calls) reveals that cheaper and older models often match premium accuracy at a fraction of the cost, with full dataset and framework open-sourced.
Created an LLM quiz program to check if AIs' performance varies over time
A developer created LLM Canary, an open-source quiz program that sends randomized tasks to multiple LLMs to track performance over time. After a week of hourly testing across seven models, the results show all models fluctuate throughout the day with no consistent pattern, and no clear evidence of degradation was found.