Arena AI Model ELO History

Hacker News Top 05/14/26, 03:19 AM Tools

ai-models elo-history model-evaluation performance-tracking lmsys-arena model-degradation

Summary

A tool that tracks the ELO history of major AI models from the LMSYS Arena leaderboard, revealing hidden trends like performance degradation and upgrades over time.

Hi HN,

I built a live tracker to visualize the lifecycle and performance changes of flagship AI models.

We've all experienced the phenomenon where a flagship model feels amazing at launch, but weeks later, it suddenly feels a bit off. I wanted to see if this was just a feeling or a measurable reality, so I built a dashboard to track historical ELO ratings from Arena AI.

Instead of a massive spaghetti chart of every single model variant, the logic plots exactly ONE continuous curve per major AI lab. It dynamically tracks their highest-rated flagship model over time, which makes both the sudden generational jumps and the slow performance decays much easier to see. It took quite a lot of iterations to get the chart to look nice on mobile as well. Optional dark mode included.

However, I have a specific data blindspot that I'm hoping this community might have insights on.

Arena AI largely relies on testing API endpoints. But as we know, consumer chat UIs often layer on heavy system prompts, safety wrappers, or silently switch to heavily quantized models under high load to save compute. API benchmarks don't fully capture this "nerfing" that everyday web users experience.

Does anyone know of any historical ELO or evaluation datasets that specifically scrape or test outputs from the consumer web UIs rather than raw APIs?

I'd love to integrate that data for a more accurate picture of the consumer experience. The project is open-source (repo link in the footer), so I'd appreciate any feedback or pointers to datasets!

Original Article


Cached at: 05/14/26, 06:21 AM

# Arena AI Model ELO History

Source: [https://mayerwin.github.io/AI-Arena-History/](https://mayerwin.github.io/AI-Arena-History/)

## Why does this exist?

AI labs frequently update their models post-launch. These updates sometimes introduce "nerfs" such as aggressive censorship, excessive quantization (to save compute costs), or behavioral degradation. This chart exposes these hidden trends.

**Note on Web UIs vs. API:** LMSYS Arena tests model performance via API endpoints (the "raw" model). Consumer chat interfaces (like gemini.google.com or chatgpt.com) often add system prompts, safety filters, and UI-specific wrappers not present in the raw API. Providers may also silently switch to **quantized (lower-precision)** versions of models to save compute during peak load, leading to perceived "nerfing" that the API benchmarks don't fully capture.

**PRs are welcome** for data sources representing true web-interface evaluations.

## Where does the data come from?

The data is automatically fetched daily from the official [LM Arena Leaderboard Dataset](https://huggingface.co/datasets/lmarena-ai/leaderboard-dataset) on Hugging Face. The Arena relies on thousands of blind, crowdsourced human evaluations, making it the most robust metric of actual model capability.

## How does the chart logic work?

Each major AI lab has exactly **ONE curve** representing their flagship lineage. At each point in time the curve tracks the lab's **highest-rated** flagship-eligible model on the leaderboard, not just the most recently announced one.

- **Highest-ELO flagship:** If a lab ships a mid-tier model (e.g. Sonnet) while a higher-tier one (e.g. Opus) is still the top performer, the curve stays on Opus.
- **Inference-mode variants collapsed:** Suffixes like `-thinking`, `-reasoning`, and `-high` are the same underlying model in a different mode, so they're merged and the curve doesn't flip-flop between them.
- **New releases:** Shown as marker points with labels, often accompanied by a jump in score.
- **Degradation:** Any downward trend in a model's lifecycle between releases is clearly visible.
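
The chart rules above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `(date, lab, model, rating)` row shape, the function names, and the sample lab and model names are all hypothetical, and it assumes rows have already been filtered to flagship-eligible models. Merged variants take the highest rating among them, which is also an assumption.

```python
import re
from collections import defaultdict

# The three suffixes named on the page; the real list may be longer (assumption).
MODE_SUFFIX = re.compile(r"-(thinking|reasoning|high)$")

def base_model(name):
    """Strip trailing inference-mode suffixes so variants collapse to one model."""
    prev = None
    while prev != name:
        prev, name = name, MODE_SUFFIX.sub("", name)
    return name

def flagship_curves(rows):
    """Build one curve per lab from (date, lab, model, rating) rows.

    Dates are ISO strings. Each curve point is the lab's highest-rated
    model on that date, labelled with its collapsed base name.
    """
    best = {}  # (lab, date) -> (rating, base model name)
    for date, lab, model, rating in rows:
        key = (lab, date)
        if key not in best or rating > best[key][0]:
            best[key] = (rating, base_model(model))
    curves = defaultdict(list)
    for (lab, date), (rating, model) in sorted(best.items(), key=lambda kv: kv[0][1]):
        curves[lab].append((date, model, rating))
    return dict(curves)

def declines(curve):
    """List (date, model, delta) where the same model's rating fell since the previous point."""
    return [
        (d1, m1, r1 - r0)
        for (_, m0, r0), (d1, m1, r1) in zip(curve, curve[1:])
        if m0 == m1 and r1 < r0
    ]

# Made-up sample data, for illustration only.
rows = [
    ("2026-01-01", "lab-a", "big-1", 1400.0),
    ("2026-01-01", "lab-a", "big-1-thinking", 1410.0),
    ("2026-01-01", "lab-a", "mid-1", 1350.0),
    ("2026-01-08", "lab-a", "big-1", 1395.0),
    ("2026-01-08", "lab-a", "mid-2", 1380.0),
]
curves = flagship_curves(rows)
print(curves["lab-a"])
print(declines(curves["lab-a"]))
```

In this sample the curve stays on `big-1` even after the newer mid-tier `mid-2` appears, and `declines` flags the drop for `big-1` between the two dates. A drop that coincides with a change of model is ignored here, since that reflects a handover rather than degradation of a single model.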


Similar Articles

@rohanpaul_ai: Arena just released a real-world agent leaderboard that ranks AI models by how well they complete actual user jobs, not…

Arena, the AI leaderboard everyone uses, is now a $100M business

Rethinking how we measure AI intelligence

EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments

Agent Arena
