When several AI models were priced equally for a week, actual token usage diverged from leaderboard rankings: coding and general chat had different top models, and long-context usage concentrated on two trusted models.
The thing I trust more than benchmarks is boring: what people actually use when price is not pushing them one way. A bunch of models got put at roughly the same per-million-token price for a stretch this month, which removes the variable that usually dominates these conversations. When one model is far cheaper than everything else, you cannot tell whether people use it because it is good or because it is cheap. Remove that variable and the usage graph becomes a much cleaner signal of preference.

What I have been watching is the live usage share, not the arena vote and not the benchmark. A few things stood out in the first week.

The model sitting at the top of the coding arena was not the most used model in actual traffic. It was not even second. The most used model by token volume was one that ranks somewhere in the middle on most public leaderboards. People reached for it more when price was equalized, which is the opposite of what a leaderboard-first mental model would predict.

Long-context usage concentrated on two models almost completely. Once price was flat, the long-context calls collapsed onto a small number of models rather than spreading out. That suggests people had been using whichever model was cheapest per token for long context regardless of quality, and when that incentive disappeared they went back to the two they actually trusted.

The spread between coding and general chat usage was wider than I expected. The top model for coding traffic and the top model for general traffic were different, by a meaningful margin. The "one model to rule them all" framing does not hold up when you look at usage by task type instead of usage in aggregate.

The thing I keep coming back to is that a leaderboard is a snapshot of opinions under a scoring rule someone else wrote. Live usage is just people spending real tokens, which is closer to what they actually pick than a vote. They are measuring different things and they should be expected to disagree.
The disagreement is the interesting part. I have been pulling the numbers from the public consumption page one of the aggregators put up, the kind that publishes live per-model token share instead of just a vote count. I do not care who wins the promo. I am interested in the gap between the two rankings, because when they agree the model is probably genuinely strong, and when they disagree you have found a model that is either overrated or underrated by the benchmark crowd. That is where the interesting picks live.

One nuance I want to flag before someone else does. Usage share is not quality. A model can be heavily used because it is the default in a popular tool, not because anyone chose it. The signal gets cleaner when price is equalized, because the default incentive is weaker, but it is still not a pure quality measure. What it is, is just what people chose when it cost them something. I think that matters more than the debates give it credit for.

The broader pattern I am watching for over the next two weeks is whether the usage ranking stabilizes or keeps drifting. If it stabilizes, the equal-price condition found a real preference order. If it keeps drifting, people are still exploring and the early usage numbers are noise. Either way it is more useful than refreshing an arena that barely moves.

On my own side, the reason I even have per-task usage to compare is that I push everything through one routing layer instead of six direct API keys, zenmux in my case, and it logs per-model token spend without me adding it. The tool is not the point. Having your own usage log is what lets you notice the gap between what leaderboards say and what your traffic actually does.
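
For anyone who wants to try the same comparison on their own traffic, here is a rough sketch of the arithmetic, not what any particular router or aggregator actually does. It assumes you can export your log as (task type, model, tokens) records. The model names, token counts, and leaderboard order below are all made up for illustration, and `token_share` and `rank_gap` are hypothetical helper names.

```python
from collections import defaultdict

# Hypothetical log records: (task_type, model, tokens). All values are invented.
USAGE_LOG = [
    ("coding", "model-a", 1200),
    ("coding", "model-b", 3000),
    ("coding", "model-c", 800),
    ("chat", "model-a", 2500),
    ("chat", "model-b", 900),
    ("chat", "model-c", 1600),
]

# Hypothetical leaderboard order for coding, best first.
CODING_LEADERBOARD = ["model-a", "model-c", "model-b"]


def token_share(log, task):
    """Return {model: fraction of tokens} for one task type."""
    totals = defaultdict(int)
    for task_type, model, tokens in log:
        if task_type == task:
            totals[model] += tokens
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {model: count / grand_total for model, count in totals.items()}


def rank_gap(leaderboard, shares):
    """Return {model: leaderboard rank minus usage rank}.

    Positive means the model is used more than its leaderboard position
    suggests; negative means it is used less.
    """
    # Sort by share descending, breaking ties by name so the order is stable.
    usage_order = sorted(shares, key=lambda m: (-shares[m], m))
    usage_rank = {model: i + 1 for i, model in enumerate(usage_order)}
    return {
        model: (position + 1) - usage_rank[model]
        for position, model in enumerate(leaderboard)
        if model in usage_rank
    }


coding_shares = token_share(USAGE_LOG, "coding")
chat_shares = token_share(USAGE_LOG, "chat")
gaps = rank_gap(CODING_LEADERBOARD, coding_shares)

for model, gap in sorted(gaps.items(), key=lambda item: -item[1]):
    print(f"{model}: coding share {coding_shares[model]:.0%}, rank gap {gap:+d}")
```

With the invented numbers above, the model at the bottom of the leaderboard comes out with the largest positive gap, which is the kind of disagreement worth a second look. Splitting by task type before computing share is the important part, since the aggregate hides the coding versus chat difference.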
A ranking of AI models by real usage, cost, and speed reveals that benchmark champions often trail in actual adoption, with cheaper/faster models like Flash Lite and GPT-5 leading over premium counterparts like Gemini 3.1 Pro.
An analysis of declining token prices for AI models despite new releases like GLM 5.2 and Kimi 2.7, suggesting possible diminishing returns from expensive models.
A discussion of how AI assistant usage is shifting from single-model loyalty to multi-model switching, as reflected in market share data showing ChatGPT below 50% for the first time, with users increasingly bouncing between models based on task.
Analysis of OpenRouter data shows that Chinese AI models have become the most used in Kilo Code's coding agent, accounting for 58% of token usage, challenging the dominance of Claude and GPT due to lower cost and longer context windows.
A developer tested five AI models on tool calling tasks and found that cheaper models perform within 2% of expensive models like Opus, with Tencent's Hunyuan under $1.50 vs Opus's $15, leading to a daily cost reduction from $40 to $9 by routing simpler tasks to cheaper models.