Analyzes the gap between open weights and closed source LLMs using the Artificial Analysis Intelligence Index and other benchmarks, finding that the gap is shrinking on some metrics but stable on others.
# Prediction: A Frontier Open Source LLM Will Be Released On 3rd December 2026 | Doubleword
Source: [https://blog.doubleword.ai/frontier-os-llm](https://blog.doubleword.ai/frontier-os-llm)
Interactive plot of the Artificial Analysis Intelligence Index for open and closed frontier models.

I have seen a version of the above plot going around Twitter and wanted to dig a bit deeper into it. The plot shows the gap between open weights LLMs and closed source LLMs. We measure this gap by taking the frontier of open weights performance on a benchmark and then looking back to find how long ago the closed source frontier was at that level. It is a measure of how long it took open source models to catch up to the capabilities reached by the closed source frontier. The benchmark here is the Artificial Analysis Intelligence Index, their headline index that tries to assess the overall capabilities of models. In general it correlates quite well with the ‘vibe’ people seem to get from models.
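As a rough illustration of the gap measurement described above, here is a minimal Python sketch. It is one possible reading of the method, not the code behind the plot: the function names (`frontier`, `score_at`, `lag_in_days`) are hypothetical, the release dates and scores are made up, and returning 0 when the closed frontier has not yet reached the open level is an assumption.

```python
from datetime import date

def frontier(releases):
    """Running best score over time, as a date-sorted list of (date, best_so_far)."""
    best = float("-inf")
    out = []
    for day, score in sorted(releases):
        if score > best:
            best = score
            out.append((day, best))
    return out

def score_at(front, when):
    """Frontier score as of `when`, or None if nothing had been released yet."""
    current = None
    for day, score in front:
        if day > when:
            break
        current = score
    return current

def lag_in_days(open_releases, closed_releases, when):
    """Days since the closed frontier first reached the open frontier's level."""
    open_score = score_at(frontier(open_releases), when)
    if open_score is None:
        return None  # no open model exists yet, so the lag is undefined
    for day, score in frontier(closed_releases):
        if day > when:
            break
        if score >= open_score:
            return (when - day).days
    return 0  # assumption: open is at or ahead of the closed frontier

# Made-up (release date, benchmark score) pairs, for illustration only.
closed_models = [(date(2024, 1, 1), 40), (date(2024, 6, 1), 50), (date(2025, 1, 1), 60)]
open_models = [(date(2024, 4, 1), 35), (date(2024, 10, 1), 50)]

print(lag_in_days(open_models, closed_models, date(2024, 12, 1)))
```

Evaluating `lag_in_days` once per month gives a lag-over-time series for a single benchmark.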
You can see that around summer 2024 the gap on this benchmark starts to shrink, and it has been shrinking reliably since then. If you plot a line of best fit and extend it into the future, you find that the gap shrinks to 0 months around **December 3rd 2026**, 6 months or so from the time of writing.
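The extrapolation described above amounts to an ordinary least-squares line and its zero crossing. Below is a minimal sketch under stated assumptions: the helper names `fit_line` and `zero_crossing` are hypothetical, and the lag values are invented to keep the arithmetic easy to follow, so they do not reproduce the December date.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def zero_crossing(slope, intercept):
    """x at which the fitted line reaches 0, or None if the gap is not shrinking."""
    if slope >= 0:
        return None
    return -intercept / slope

# Made-up data: months since the start of the fit window, and lag in months.
months = [0, 1, 2, 3]
lag = [12, 10, 8, 6]

slope, intercept = fit_line(months, lag)
months_until_zero = zero_crossing(slope, intercept)
print(slope, intercept, months_until_zero)
```

Adding `months_until_zero` to the start date of the fit window turns the crossing point into a calendar date. The result depends heavily on which window is fitted, which is part of why a single extrapolated date should be taken lightly.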
Now is probably a good time to liquidate your pension, fly to a remote island somewhere, and live out the remaining 6 months or so of civilization in peace.
…
Except.
This might not be the whole picture. This is a single benchmark, and it doesn’t give a complete picture of the capabilities of LLMs. Kindly, Artificial Analysis gives us access to 18 different benchmarks that they have measured for these models. I have repeated the analysis for all 18 and summarized the results in the plot below:
Interactive boxplot of monthly open frontier lag across Artificial Analysis metrics.

For each of the 18 datasets we have created a similar chart; you can see all 18 at the bottom of the page. For each month we have created a box plot of the gaps across the datasets, and then plotted those box plots over time. We have also calculated the average of the gaps across datasets and fitted a line of best fit to it. That line is almost completely flat, at just under 5 months for the entire period.
What is notable is that a large share of the total improvement has been in the coding benchmark. The coding index has gone from 15 months behind to only a month or two behind. Most other datasets show a moderate increase in their gaps over time.
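To see how one metric can close its gap quickly while the cross-metric average stays flat, here is a small sketch of the averaging step. Everything in it is illustrative: the metric names and lag values are made up (only the coding series loosely echoes the "15 months to a month or two" trend), and `slope` is a hypothetical helper, not the analysis code behind the plots.

```python
from statistics import mean

def slope(ys):
    """Least-squares slope of ys against month indices 0, 1, 2, ..."""
    xs = range(len(ys))
    mean_x, mean_y = mean(xs), mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Made-up monthly lags (in months) for three hypothetical metrics.
lags = {
    "coding": [15, 11, 7, 3],
    "reasoning": [2, 4, 6, 8],
    "knowledge": [1, 3, 5, 7],
}

# Average the lag across metrics for each month, then fit a trend to the average.
monthly_mean = [mean(month) for month in zip(*lags.values())]
mean_trend = slope(monthly_mean)
per_metric_trend = {name: slope(series) for name, series in lags.items()}

print(monthly_mean, mean_trend, per_metric_trend)
```

In this toy data the coding lag falls steeply while the other two rise slowly, so the average does not move at all. Which of those trends you report determines the story you tell.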
So maybe the open source apocalypse won’t happen yet.
What this exercise does suggest is how difficult it is to measure LLM quality. Depending on how you measure it, you would either predict the open source singularity by Christmas, or say that open source LLMs are consistently 5 months behind closed source and that the gap might be growing.
Benchmark plot
Interactive frontier plot for the Artificial Analysis Intelligence Index.