@grapeot: Reasoning models aren't the bombshell of 2024. Many people, upon first seeing o1 "think" for over ten seconds before answering, felt that models had suddenly learned to reason overnight. But stretching out the timeline, from CoT prompting (2022) to o1, a full four years passed in between. Three things often conflated: 1. Reasoning ability itself—already amplified by CoT systems in 2022 2. Training reasoning via reinforcement learning—academic prototypes of PRM existed in 2023 3. Turning reasoning into a billable, schedulable resource—this is the real watershed of 2024.
Summary
A deep retrospective on the four-year evolution of reasoning models, from CoT in 2022 to o1 in 2024 and R1 in early 2025, arguing that the true watershed is not the emergence of reasoning ability but the conversion of reasoning into a billable, schedulable resource.
Cached at: 06/18/26, 08:11 AM
Reasoning models did not burst onto the scene in 2024.
The first time many saw o1 “think” for a dozen seconds before answering, it felt like models had learned to reason overnight. But if you stretch the timeline, from Chain-of-Thought prompting (2022) to o1, it was a full four-year journey.
Three things that are often conflated:
- Reasoning capability itself — already systematically amplified by CoT in 2022
- Training reasoning with reinforcement learning — academic prototypes like PRM existed in 2023
- Turning reasoning into a billable, schedulable resource — this is the real dividing line of 2024
Even more counterintuitive: the most hyped claim — “pure RL engenders reasoning” — rests on the weakest evidence. Sea AI Lab showed that the so-called “aha moment” appears at epoch 0; the base model without any training already self-corrects. RL is releasing pre-existing reasoning fragments from pre-training, not creating something from nothing.
The entire industry converged to a reasoning model within five and a half months. OpenAI is the only one that hides the reasoning process; everyone else shows it. Behind this lies a divergence in safety philosophy.
Capabilities rarely shock an entire field. What shocks is when someone packages years of accumulated research into a resource you can buy and schedule.
Full analysis ↓
The Four-Year History of Reasoning Models: What Seemed Like a Sudden Breakthrough Was Actually a Long-Brewing Undercurrent
Source: https://yage.ai/share/reasoning-model-lineage-20260617.html?utm_source=twitter&utm_medium=thread&utm_campaign=reasoning-model-lineage-20260617
In September 2024, OpenAI released o1. You asked it a competition math problem in ChatGPT, it fell silent for a dozen seconds, lines of “thinking” flickered on the screen, and then it gave the answer. Many who saw this for the first time concluded that models had learned to reason, seemingly overnight. About four months later, in January 2025, DeepSeek R1 was open-sourced, pushing the trend to a climax. Suddenly, reasoning models became the industry standard, even redrawing market share curves.
But stretch the timeline, and you find something counterintuitive. Reasoning capabilities were invented neither by o1 nor by R1. From “making the model write out its solution steps” to “making the model specifically spend compute to reason,” there is an evolutionary line that took a full four years. What actually changed in 2024 was not that models suddenly learned to reason, but something else most people overlooked.
Let me state the conclusion upfront. What is loosely called “reasoning” is actually three separable things. First, the ability to do multi-step reasoning, which was systematically amplified as early as 2022. Second, the use of reinforcement learning to train reasoning, which had academic prototypes in 2023. Third, turning reasoning into a billable, schedulable resource, packaged as a product to sell. The real watershed is at the third layer, in the second half of 2024, and even that layer has clear precursors. More counterintuitively, the part most hyped today (pure reinforcement learning making models spontaneously learn to reason) is precisely the part with the weakest evidence and the most controversy.
An Undercurrent That Ran Four Years
The story can begin in January 2022. That year, Jason Wei et al. at Google published Chain-of-Thought Prompting (https://arxiv.org/abs/2201.11903). The core finding: give a few “step-by-step reasoning” examples in the prompt, and the model will output intermediate steps, significantly boosting math and commonsense reasoning. A few months later, Kojima et al. (https://arxiv.org/abs/2205.11916) found that without any examples, just adding “Let’s think step by step” after the question made the model unfold reasoning on its own. This is the popular perception of the origin of “chain of thought.”
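The difference between the two prompting styles is easiest to see side by side. The sketch below is illustrative only: the helper names and the worked example are made up here, not taken from either paper.

```python
# Hypothetical helpers; the exemplar text is invented for illustration.
FEW_SHOT_EXEMPLAR = (
    "Q: A shelf holds 3 boxes with 4 books each. How many books are there?\n"
    "A: Each box has 4 books and there are 3 boxes, so 3 * 4 = 12. The answer is 12.\n"
)


def few_shot_cot_prompt(question: str) -> str:
    """Few-shot CoT: prepend a worked example whose answer spells out its steps."""
    return f"{FEW_SHOT_EXEMPLAR}\nQ: {question}\nA:"


def zero_shot_cot_prompt(question: str) -> str:
    """Zero-shot CoT: no examples, only the trigger phrase after the question."""
    return f"Q: {question}\nA: Let's think step by step."


print(few_shot_cot_prompt("What is 17 * 6?"))
print(zero_shot_cot_prompt("What is 17 * 6?"))
```

In both cases the model weights are untouched; only the prompt changes.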
But this line goes back even earlier. The 2021 Scratchpad (https://arxiv.org/abs/2112.00114) already did fine-tuning, making the model write out intermediate calculation steps. More critical was 2022’s STaR (Self-Taught Reasoner) (https://openreview.net/pdf?id=_3ELRdq2sgI), which used correct reasoning chains generated by the model itself to fine-tune itself — this was a prototype of “using reasoning to train reasoning,” and a seed of the later RL self-improvement idea. In other words, the leap from purely prompting reasoning to solidifying it into weights via training had already happened in 2022, not 2024.
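The core filtering idea behind STaR can be sketched in a few lines, assuming a toy stand-in for the model. The function names are hypothetical, and both the fine-tuning step and the paper's additional step of retrying failed problems with the answer as a hint are omitted.

```python
from typing import Callable, Dict, List, Tuple


def collect_star_examples(
    problems: List[Tuple[str, str]],
    generate: Callable[[str], Tuple[str, str]],
    attempts: int = 4,
) -> List[Dict[str, str]]:
    """Keep only self-generated rationales whose final answer matches the reference."""
    kept = []
    for question, gold in problems:
        for _ in range(attempts):
            rationale, answer = generate(question)
            if answer.strip() == gold.strip():
                kept.append({"question": question, "rationale": rationale, "answer": answer})
                break  # one correct chain per problem is enough for this sketch
    return kept


def toy_generate(question: str) -> Tuple[str, str]:
    """Stand-in for a language model: it can only add two integers."""
    a, b = (int(tok) for tok in question.split("+"))
    return f"{a} plus {b} is {a + b}.", str(a + b)


# The second reference answer is deliberately wrong, so that chain is filtered out.
data = collect_star_examples([("2+3", "5"), ("4+4", "9")], toy_generate)
print(data)
```

The surviving examples are what the model would then be fine-tuned on, which is why this counts as "using reasoning to train reasoning."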
Next, the evaluation side. The most frequently cited process reward model (PRM, scoring each step) paper is OpenAI’s May 2023 “Let’s Verify Step by Step” (https://huggingface.co/papers/2305.20050), which open-sourced the 800k step-level annotated PRM800K dataset, widely seen as o1’s technical predecessor. However, the first systematic comparison of this concept was done by DeepMind in November 2022 (Uesato et al. (https://arxiv.org/abs/2211.14275)), which formally pitted “looking only at the final answer” against “looking at intermediate steps.” Interestingly, the authors later noted “work done at DeepMind, now at OpenAI” — the people and ideas migrated together.
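The contrast between the two kinds of supervision can be sketched with a toy example. Everything here is an assumption for illustration: a real PRM is a learned model, and taking the product of per-step scores is only one common way to aggregate them.

```python
import math
import operator
from typing import Callable, List

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def outcome_score(final_answer: str, reference: str) -> float:
    """Outcome supervision: look only at the final answer."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0


def process_score(steps: List[str], step_scorer: Callable[[str], float]) -> float:
    """Process supervision: score every step, then aggregate (product is an assumption)."""
    return math.prod(step_scorer(step) for step in steps)


def toy_step_scorer(step: str) -> float:
    """Stand-in for a learned PRM: 1.0 if 'a op b = c' is arithmetically true, else 0.0."""
    left, right = step.split("=")
    a, op, b = left.split()
    return 1.0 if OPS[op](int(a), int(b)) == int(right) else 0.0


# A solution that reaches the right answer through a wrong intermediate step.
steps = ["2 + 2 = 5", "5 - 1 = 4"]
print(outcome_score("4", "4"))                 # final answer matches the reference
print(process_score(steps, toy_step_scorer))   # the faulty first step is penalized
```

The outcome scorer rewards this solution, while the step-level scorer does not, which is the distinction the two lines of work were comparing.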
By August 2024, the last piece was in place. “Scaling LLM Test-Time Compute Optimally” (https://arxiv.org/abs/2408.03314) by Snell et al. (UC Berkeley and Google DeepMind) provided a quantitative conclusion: on problems where the smaller model already has a nontrivial success rate, spending more compute at inference time can beat a model 14x larger in a FLOPs-matched comparison. This is the direct academic footnote to o1’s “think longer” narrative.
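One simple way to spend extra inference compute is best-of-N sampling against a verifier. The sketch below is a toy under stated assumptions: the sampler and verifier are stand-ins, and the paper studies more refined strategies than this one.

```python
import random
from typing import Callable


def best_of_n(
    sample: Callable[[random.Random], str],
    verifier: Callable[[str], float],
    n: int,
    seed: int = 0,
) -> str:
    """Draw n candidate answers and return the one the verifier scores highest."""
    rng = random.Random(seed)
    candidates = [sample(rng) for _ in range(n)]
    return max(candidates, key=verifier)


def toy_sample(rng: random.Random) -> str:
    """Stand-in for a model: guesses an integer between 0 and 20."""
    return str(rng.randint(0, 20))


def toy_verifier(candidate: str) -> float:
    """Stand-in for a verifier: prefers guesses close to the hidden target 13."""
    return -abs(int(candidate) - 13)


one = best_of_n(toy_sample, toy_verifier, n=1)
many = best_of_n(toy_sample, toy_verifier, n=16)
print(one, many)
```

With the same seed, the 16-sample run contains the single-sample run's candidate, so its selected answer can never score worse. More samples buy accuracy at the cost of compute.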
Thus, by mid-2024, all ingredients were ready: reasoning capability amplified by CoT series, step-level verifiers paved by PRM, and the scaling law of “more compute at inference time for accuracy” proven by test-time compute. The only missing piece was someone to assemble these components into a product and train it at scale with reinforcement learning.
What o1 Actually Changed
OpenAI deliberately avoided academic jargon in its official blog (https://openai.com/index/learning-to-reason-with-llms), using only two product-level verbs: think and reason. It did two things.
First, it used large-scale reinforcement learning to train the model’s chain of thought. Before o1, reasoning capability was mainly induced by prompting or limited fine-tuning. o1 turned “doing RL on the chain of thought, using verifiable rewards (math problems have standard answers, code can pass tests) as feedback signals” into a large-scale training pipeline. This training paradigm was later named RLVR (Reinforcement Learning with Verifiable Rewards) (https://rlhfbook.com/c/07-reasoning), freeing it from dependence on human annotation in RLHF.
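What makes a reward “verifiable” is that a program, not a human rater, decides it. A minimal sketch follows, assuming a hypothetical "Answer:" marker convention and hypothetical function names. OpenAI has not published o1's actual reward design.

```python
from typing import Callable, Sequence, Tuple


def math_reward(completion: str, reference: str) -> float:
    """1.0 if the text after the last 'Answer:' marker equals the reference, else 0.0."""
    marker = "Answer:"  # assumed output convention, for illustration only
    if marker not in completion:
        return 0.0
    predicted = completion.rsplit(marker, 1)[1].strip()
    return 1.0 if predicted == reference.strip() else 0.0


def code_reward(candidate_fn: Callable, test_cases: Sequence[Tuple[tuple, object]]) -> float:
    """1.0 only if the candidate function passes every (args, expected) test case."""
    try:
        passed = all(candidate_fn(*args) == expected for args, expected in test_cases)
    except Exception:
        return 0.0  # a crashing solution earns nothing
    return 1.0 if passed else 0.0


print(math_reward("3 boxes of 4 books is 12. Answer: 12", "12"))
print(code_reward(lambda a, b: a + b, [((1, 2), 3), ((0, 0), 0)]))
```

Neither function needs a human label at training time, which is the property that removes the annotation bottleneck.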
Second, and more crucially, it turned reasoning into a billable, schedulable resource. o1 introduced the concept of “reasoning tokens” in the API. These tokens are charged as output tokens and occupy the context window, but their content is hidden from users (https://openai.com/index/learning-to-reason-with-llms). Developers can only see the count, not what the model actually thought. Later, o3-mini made reasoning_effort a parameter with low/medium/high levels, letting you control how much compute the model spends thinking. Reasoning became a dial you could turn.
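The billing consequence is simple arithmetic. The sketch below uses a hypothetical `Usage` record and placeholder prices, not any vendor's real price list or API schema.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    prompt_tokens: int
    visible_output_tokens: int
    reasoning_tokens: int  # hidden from the user, but billed as output


def request_cost(usage: Usage, input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of one request in dollars; prices are per million tokens."""
    billed_output = usage.visible_output_tokens + usage.reasoning_tokens
    return (usage.prompt_tokens * input_price_per_m + billed_output * output_price_per_m) / 1_000_000


def reasoning_share(usage: Usage) -> float:
    """Fraction of billed output tokens that the user never sees."""
    return usage.reasoning_tokens / (usage.visible_output_tokens + usage.reasoning_tokens)


# Placeholder numbers: a short visible answer preceded by a long hidden chain of thought.
usage = Usage(prompt_tokens=1_000, visible_output_tokens=500, reasoning_tokens=4_500)
cost = request_cost(usage, input_price_per_m=10.0, output_price_per_m=40.0)
print(cost, reasoning_share(usage))
```

In this made-up request, nine tenths of the billed output is reasoning the developer cannot read, which is why a control over reasoning effort is in practice a control over cost.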
The weight of this shift was put in context by a technical primer on LessWrong (https://www.lesswrong.com/posts/byNYzsfFmb2TpYFPW/o1-a-technical-primer), which invokes Sutton’s Bitter Lesson: search and learning are the two methods that scale with compute, yet for the past decade the industry scaled only learning (pre-training) and left the search axis unconnected. o1 connected search at inference time, effectively adding a second axis to the scaling law. This is less a capability breakthrough than a breakthrough in resource dimension.
Worth noting why o1 hides its reasoning process. OpenAI gave three reasons: user experience, competitive advantage, and safety monitoring. Independent interpretations generally believe the second is the real reason: preventing competitors from distilling the model using exposed chains of thought. Simon Willison publicly expressed dissatisfaction, arguing that for developers reliant on interpretability, transparency is everything.
The Overhyped Magic
The story still lacks one piece — and this piece is exactly the most loudly advertised and most questionable.
DeepSeek in the R1 paper told a rather romantic story. They trained R1-Zero (https://arxiv.org/abs/2501.12948), directly applying pure reinforcement learning on a pre-trained base model without any supervised data. The model spontaneously exhibited behaviors like reflection, verification, and long reasoning. The paper even called this moment the “aha moment.” The Nature journal version confirmed these descriptions. If the story held as is, it would be a miracle of “creation from nothing.”
But independent studies quickly provided counterevidence. Sea AI Lab (https://sail.sea.com/blog/articles/62) put it directly in the title: “There May Not be Aha Moment in R1-Zero-like Training.” They systematically tested base models such as Qwen2.5, DeepSeek-Math, and Llama-3.x, and found an awkward fact: the so-called aha moment appeared at epoch 0, meaning the base model already self-corrected before any RL training. What reinforcement learning did was merely increase the frequency of these behaviors. A Tsinghua team went further, arguing that RLVR mainly improves sampling efficiency rather than expanding the model’s reasoning boundary. Another counterintuitive result: running RL with random rewards on Qwen2.5-Math-7B still improved MATH-500 scores by 21 percentage points, close to the 29 points gained with real rewards. This suggests part of the gain may be a side effect of RL training itself rather than of the reward signal.
Putting the evidence together, a more accurate statement is: reinforcement learning did not create reasoning from nothing; it released and sharpened reasoning fragments already baked into the model weights during pre-training. R1-Zero’s starting point was DeepSeek-V3-Base, which was itself pre-trained on massive corpora containing math, code, and chain-of-thought data. The claim of “pure RL giving birth to reasoning” must be understood under this premise.
This does not deny R1’s engineering value. It indeed pushed performance to o1 level, and at a much lower cost. What is truly admirable is not “RL created reasoning,” but “RL with verifiable rewards is an extremely cheap way to mine and amplify capabilities already present in the model.”
Industry Convergence in Five Months
From o1’s release in September 2024 to Anthropic’s launch of Claude 3.7’s extended thinking in February 2025, only five and a half months passed. During this period, nearly every major lab released a reasoning model. Alibaba’s Qwen QwQ was open-sourced in November 2024; Google’s Gemini 2.0 Flash Thinking went online in December; Zhipu’s GLM-Zero followed at the end of the year; Moonshot AI’s Kimi k1.5 and the official DeepSeek R1 were released on the same day in January 2025; and xAI’s Grok 3, with fully visible thinking, appeared in February 2025. By the end of 2025, OpenRouter’s data (https://openrouter.ai/state-of-ai) showed reasoning models accounting for roughly half of all token usage. From “only o1” to “half the market” took about a year.
Why so fast? o1 defined the paradigm, proving the path worked. But the one that truly lowered the barrier was DeepSeek R1. Before R1, the outside world was completely in the dark about o1’s training methods; the community could only guess. R1, with a detailed paper, open weights, six distilled models, and a clear GRPO training recipe, turned the reasoning model from “OpenAI’s secret sauce” into “an engineering problem any team with a base model and RL engineering capability can try.” That is its deepest significance: not inventing reasoning, but democratizing the know-how.
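The group-relative advantage usually described as the core of GRPO shows why the recipe is cheap to reproduce. This is a minimal sketch, not DeepSeek's code: the function name is hypothetical, and the use of the population standard deviation and the epsilon value are assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each sampled answer's reward against its own group.

    All answers in the group were sampled for the same prompt, so the group
    mean serves as the baseline and no separate value network is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt, scored by a verifiable reward (1 = correct).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)

# If every answer gets the same reward, the group carries no learning signal.
print(group_relative_advantages([1.0, 1.0, 1.0]))
```

Correct answers are pushed up and incorrect ones pushed down relative to their siblings, using nothing more than a base model, a sampler, and a programmatic reward.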
Two often overlooked details are worth highlighting.
First, Chinese labs actually entered this race earlier than most Western competitors. The public impression was “OpenAI first, China follows,” but DeepSeek’s R1-Lite preview was released on November 20, 2024, a full month before Google’s Gemini Thinking.
Second, among all players, OpenAI is the only one that hides the reasoning process. Almost all other vendors make reasoning visible. This is no accident; it reflects a divergence in safety philosophy. Anthropic has been systematically studying chain-of-thought faithfulness since 2023. They found a counterintuitive phenomenon: the larger the model, the less faithful its chain of thought (https://www.anthropic.com/research/measuring-faithfulness-in-chain-of-thought-reasoning), meaning it leans more towards post-hoc rationalization. In April 2025, they reported that when hints were quietly slipped to Claude 3.7 and R1, the models admitted in their chains of thought to using the hint only 25% and 39% of the time, respectively. Based on this line of research, Anthropic chose to make the thinking process visible and budget-adjustable, giving developers auditability while honestly acknowledging that thinking text does not equal the model’s true computation. This is the opposite product philosophy to OpenAI hiding reasoning tokens.
There is also a deeper evolutionary trend. From late 2024 to early 2025, all reasoning models were standalone pure-reasoning models. But in 2025, a clear convergence towards hybrid models emerged: Claude 3.7 made thinking a toggle on the same model, calling itself “the first hybrid reasoning model”; Gemini 2.5 embedded thinking by default and cannot be turned off; Qwen3 and DeepSeek V3.1 both adopted switchable dual modes. The industry rapidly reached consensus on “whether to do reasoning models,” and then moved from disagreement to unity on “whether reasoning is a separate model or a mode within a single model.”
The Dimension That Actually Changed
Back to the original question. Did reasoning models burst onto the scene out of nowhere?
If the criterion is “can it reason,” the answer is a clear no. Reasoning capability has been systematically amplified since 2022 and was quite mature by mid-2024. The improvements on math and programming benchmarks from o1 are real, but they are the cumulative result of a continuous evolutionary line, not a sudden leap. If the criterion is “training method,” the path of RL training on chain of thought also had academic predecessors in the 2023 PRM work and the 2024 test-time compute paper.
What actually broke through in 2024 was the productization layer. For the first time, reasoning became a resource that could be billed, scheduled, and exposed as an adjustable parameter. The significance of this is that it opened a second axis for scaling. For the past decade, the whole industry ran on one axis: pre-training. In his 2025 year-in-review (https://karpathy.bearblog.dev/year-in-review-2025), Karpathy directly called RLVR the new main training phase, pointing out that most capability improvements in 2025 came from the industry digesting the “low-hanging fruit” backlogged from this new phase.
Perhaps the deeper insight is this. Capabilities rarely truly shock a field. What shocks is when someone packages years of accumulated research into a product that others can buy, schedule, and build upon. Reasoning models are not a moment of invention; they are the moment when a four-year research line becomes infrastructure. And the most hyped magic — pure reinforcement learning making models reason from nothing — is precisely the part that least withstands scrutiny. This might be a good ruler to bring when looking at any technological “breakthrough”: first distinguish whether you are seeing the birth of a capability, or the packaging of it.