A tool that tracks the ELO history of major AI models from the LMSYS Arena leaderboard, revealing hidden trends like performance degradation and upgrades over time.
Hi HN,

I built a live tracker to visualize the lifecycle and performance changes of flagship AI models.

We've all experienced the phenomenon where a flagship model feels amazing at launch, but weeks later, it suddenly feels a bit off. I wanted to see if this was just a feeling or a measurable reality, so I built a dashboard to track historical ELO ratings from Arena AI.

Instead of a massive spaghetti chart of every single model variant, the logic plots exactly ONE continuous curve per major AI lab. It dynamically tracks their highest-rated flagship model over time, which makes both the sudden generational jumps and the slow performance decays much easier to see. It took quite a few iterations to get the chart to look nice on mobile as well. Optional dark mode included.

However, I have a specific data blind spot that I'm hoping this community might have insights on.

Arena AI largely relies on testing API endpoints. But as we know, consumer chat UIs often layer on heavy system prompts, safety wrappers, or silently switch to heavily quantized models under high load to save compute. API benchmarks don't fully capture this "nerfing" that everyday web users experience.

Does anyone know of any historical ELO or evaluation datasets that specifically scrape or test outputs from the consumer web UIs rather than raw APIs?

I'd love to integrate that data for a more accurate picture of the consumer experience. The project is open-source (repo link in the footer), so I'd appreciate any feedback or pointers to datasets!
# Arena AI Model ELO History
Source: [https://mayerwin.github.io/AI-Arena-History/](https://mayerwin.github.io/AI-Arena-History/)
## Why does this exist?
AI labs frequently update their models post-launch. These updates sometimes introduce "nerfs" such as aggressive censorship, excessive quantization (to save compute costs), or behavioral degradation. This chart exposes these hidden trends.
**Note on Web UIs vs. API:** LMSYS Arena tests model performance via API endpoints (the "raw" model). Consumer chat interfaces (like gemini.google.com or chatgpt.com) often add system prompts, safety filters, and UI-specific wrappers not present in the raw API. Providers may also silently switch to **quantized (lower-precision)** versions of models to save compute during peak load, leading to perceived "nerfing" that the API benchmarks don't fully capture. **PRs are welcome** for data sources representing true web-interface evaluations.
## Where does the data come from?
The data is automatically fetched daily from the official [LM Arena Leaderboard Dataset](https://huggingface.co/datasets/lmarena-ai/leaderboard-dataset) on Hugging Face. The Arena relies on thousands of blind, crowdsourced human evaluations, making it the most robust metric of actual model capability.
## How does the chart logic work?
Each major AI lab has exactly **ONE curve** representing their flagship lineage. At each point in time the curve tracks the lab's **highest-rated** flagship-eligible model on the leaderboard, not just the most recently announced one.
- **Highest-ELO flagship:** If a lab ships a mid-tier model (e.g. Sonnet) while a higher-tier one (e.g. Opus) is still the top performer, the curve stays on Opus.
- **Inference-mode variants collapsed:** Suffixes like `-thinking`, `-reasoning`, and `-high` are the same underlying model in a different mode, so they're merged and the curve doesn't flip-flop between them.
- **New releases:** Shown as marker points with labels, often accompanied by a jump in score.
- **Degradation:** Any downward trend in a model's lifecycle between releases is clearly visible.
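
The rules above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `(date, lab, model, rating)` row shape, the function names, and the sample model names are all assumptions made for the example, and it assumes every input row has already been filtered to flagship-eligible models.

```python
import re
from collections import defaultdict

# Assumed list of inference-mode suffixes, taken from the bullets above.
MODE_SUFFIX = re.compile(r"-(thinking|reasoning|high)$")

def base_model(name):
    """Strip inference-mode suffixes so variants collapse to one model."""
    previous = None
    while previous != name:  # repeat in case several suffixes are stacked
        previous, name = name, MODE_SUFFIX.sub("", name)
    return name

def flagship_curve(rows):
    """Build one curve per lab from (date, lab, model, rating) rows.

    For each lab and date, keep only the highest-rated model.
    Dates are assumed to be ISO strings, so they sort chronologically.
    """
    best = {}  # (lab, date) -> (rating, collapsed model name)
    for date, lab, model, rating in rows:
        key = (lab, date)
        if key not in best or rating > best[key][0]:
            best[key] = (rating, base_model(model))
    curves = defaultdict(list)
    for (lab, date), (rating, model) in sorted(best.items(), key=lambda kv: kv[0][1]):
        curves[lab].append((date, model, rating))
    return dict(curves)

def drops(curve):
    """List (date, model, delta) points where the same model's rating fell."""
    return [
        (d1, m1, r1 - r0)
        for (_, m0, r0), (d1, m1, r1) in zip(curve, curve[1:])
        if m0 == m1 and r1 < r0
    ]

# Hypothetical sample data (made-up names and ratings).
rows = [
    ("2025-01-01", "LabA", "opus-a", 1400),
    ("2025-01-01", "LabA", "sonnet-a", 1380),
    ("2025-02-01", "LabA", "opus-a-thinking", 1395),
    ("2025-02-01", "LabA", "opus-a", 1390),
    ("2025-02-01", "LabA", "sonnet-b", 1385),
]
curve = flagship_curve(rows)["LabA"]
```

With this sample, the curve stays on the collapsed `opus-a` entry for both dates even though a newer mid-tier model appears, and `drops(curve)` reports the 5-point decline between the two snapshots.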