@snowboat84: https://x.com/snowboat84/status/2065215177029787705
Summary
This article is the middle part of the AI Engineering Landscape series, detailing core techniques such as inference optimization, model slimming (quantization, distillation, pruning, MoE), and speculative decoding, while reviewing the latest advances from hardware to the engineering stack.
Cached at: 06/12/26, 02:49 AM
The Engineering Landscape of AI (Part 2): Inference, Post-Training, Alignment, and Safety
The first seven chapters of the previous post covered the runtime state of the AI engineering stack: from gaming GPUs in the 1990s to 10,000-GPU clusters in 2026, from wafers and HBM to nuclear power and SMRs, from 4D parallelism to KV cache and PagedAttention. The main thread connecting them is “how hardware is made, how power is supplied, how models are trained, and how inference runs”:
https://x.com/snowboat84/status/2061962883651731602
The inference section of the previous post covered PagedAttention and PD separation, but left out a large chunk of inference optimization engineering. This post picks up from there with more on inference, then moves to the other side of the coin: how the model itself is transformed (post-training, alignment, safety).
I. A Crash Course on Inference Optimization: Continuing from the Previous Post
Let’s first briefly review what “inference” is. Training is the process of feeding data to a model so it learns parameters. It takes months to run once, uses tens of thousands of GPUs, and costs hundreds of millions of dollars. Inference is when the trained model is deployed into service for real users. Each request involves the user inputting a prompt and the model generating a response. In engineering, these are two completely different stacks. Training is about “whether a SOTA model can be trained within a reasonable time,” while inference is about “how many users can be served per second, the cost per token, and the latency.” The product experience and unit economics of an AI company are primarily determined by the inference layer.
Chapters 6 and 7 of the previous post covered the basic structure of inference. Once a user’s prompt comes in, it first goes through the prefill stage, processing the entire input at once. Then it enters the decode stage, generating one token at a time. The KV cache stores previously computed attention states to avoid redundant computation. PagedAttention manages the KV cache like virtual memory in an operating system, using paging to reduce fragmentation. Batching combines multiple concurrent requests together to feed the GPU, improving throughput. PD separation splits prefill and decode onto different hardware configurations, optimizing each separately. These are the foundations of inference.
This chapter continues by discussing several other engineering aspects of production-grade inference, divided into six sections, each focusing on core mechanisms and typical data.
1.1 Model Slimming: Quantization, Distillation, Pruning, MoE
The goal of model slimming is straightforward: make the model itself smaller so that inference is cheaper and faster. The real bottleneck in LLM decoding is memory bandwidth. During decode, for every token generated, the entire set of weights must be moved from HBM to the GPU compute units. For a 70B model stored in FP16, that’s a 140 GB transfer. Halving the weight size roughly halves the single-token latency, and on the same hardware the freed memory fits more concurrent requests, which increases throughput. Models that originally required multiple GPUs might even fit on a single card.
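The bandwidth argument can be turned into a back-of-the-envelope bound. The sketch below is only an illustration: the 3.35 TB/s HBM bandwidth is an assumed figure for a single accelerator, the function name is hypothetical, and KV cache traffic and the question of whether the weights fit on one card are ignored.

```python
# Lower bound on per-token decode latency if decode is purely bandwidth-bound.
# Assumption: each decode step streams all weights from HBM exactly once and
# nothing else (KV cache, activations) is counted.

BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "fp4": 0.5}

def decode_floor_ms(params: float, fmt: str, hbm_bytes_per_s: float) -> float:
    """Minimum milliseconds per generated token for a given weight format."""
    weight_bytes = params * BYTES_PER_PARAM[fmt]
    return weight_bytes / hbm_bytes_per_s * 1e3

if __name__ == "__main__":
    params = 70e9        # 70B-parameter model, as in the text
    bandwidth = 3.35e12  # assumed HBM bandwidth in bytes per second
    for fmt in ("fp16", "fp8", "fp4"):
        gb = params * BYTES_PER_PARAM[fmt] / 1e9
        ms = decode_floor_ms(params, fmt, bandwidth)
        print(f"{fmt}: {gb:.0f} GB of weights, at least {ms:.1f} ms per token")
```

Under these assumptions the FP16 floor is about 42 ms per token, and each halving of the bytes per parameter halves it.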
The four types of techniques below (quantization, distillation, pruning, MoE) share the same goal but happen at different stages. Quantization is typically a one-time conversion after training. Distillation and pruning are independent training processes that require re-running training to produce a smaller model. MoE is an architectural choice, decided during base model training. None of them happen automatically during inference; all are methods to make the model smaller beforehand, benefiting subsequent inference. The industry groups these together by purpose, and this chapter follows that classification.
1.1.1 Quantization
Quantization compresses model weights from the default FP16/BF16 (16-bit floating point) to fewer bits. Common steps include FP8 (halving memory usage), INT4 or FP4 (reducing memory to a quarter).
There are three mainstream post-training quantization algorithms. GPTQ uses second-order Hessian information to quantize columns one by one, adjusting subsequent columns to compensate for errors. Since 2023, it has been the de facto standard in the open-source community. AWQ (Activation-aware Weight Quantization) observes that positions with large activation values are more sensitive to quantization, so it specifically protects those weights’ values, often achieving better accuracy than GPTQ. SmoothQuant uses a mathematical transformation to shift outliers in activation values to the weights, making both sides easier to quantize. The common adversary of all three algorithms is the outlier values in activations; a few extremely large activation values can ruin the entire quantization, and each algorithm essentially fights against them.
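To see why outliers are the common adversary, here is a minimal sketch of plain symmetric absmax quantization (not GPTQ, AWQ, or SmoothQuant themselves). The sample values and helper names are made up for illustration. A single extreme value sets the shared scale, and every ordinary value loses precision as a result.

```python
# Symmetric absmax quantization with one shared scale per tensor.

def quantize_absmax(values, bits=8):
    """Map floats to integers in [-qmax, qmax] using a single scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    if scale == 0.0:
        scale = 1.0  # all-zero input: any scale works
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map integers back to floats."""
    return [x * scale for x in q]

def mean_abs_error(values, bits=8):
    """Average round-trip error of quantizing `values` on their own."""
    q, scale = quantize_absmax(values, bits)
    restored = dequantize(q, scale)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)

if __name__ == "__main__":
    normal = [0.1 * i - 0.8 for i in range(16)]  # ordinary values in [-0.8, 0.7]
    print("INT8 error without outlier:", mean_abs_error(normal))

    # Add one extreme value, then measure the error on the SAME 16 values.
    q, scale = quantize_absmax(normal + [60.0])
    restored = dequantize(q, scale)[:16]
    err = sum(abs(a - b) for a, b in zip(normal, restored)) / 16
    print("INT8 error with one outlier:", err)
```

The three algorithms above can be read as different ways of keeping that shared scale from being dictated by a handful of extreme values.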
The FP4 line entered the mainstream hardware roadmap after Blackwell. Two formats coexist: NVIDIA’s NVFP4 and the open standard MXFP4. The Blackwell architecture natively supports FP4 computation. FP4 can reduce the weights of a 70B model from around 140 GB to around 35 GB, but actual inference also includes overhead from KV cache, activations, runtime, etc. Whether it can run in production on a single card depends on the specific scenario and context length. The cost is a higher risk of accuracy loss. Aggressive low-precision scenarios generally require QAT (Quantization-Aware Training), where the precision loss of quantization is simulated during training so the model adapts. QAT offers better accuracy than PTQ but at a higher cost.
1.1.2 Distillation: The Teacher-Student Paradigm
The setup for distillation is simple: use a large model as a teacher to train a smaller model as a student.
The teacher is usually a powerful, expensive model that already exists. The student is the smaller model to be trained, with fewer parameters, cheaper inference, but capabilities not yet mature. During training, the teacher generates responses to a batch of prompts, and the student learns from the teacher’s outputs. But the student learns far more than just the superficial “token sequence.” The truly critical part is the teacher’s complete probability distribution at each generation position (the next-token logits).
For example, given the context “The weather today is,” the teacher might assign 0.4 probability to “very good,” 0.3 to “nice,” 0.15 to “sunny,” and 0.15 to other words combined. If the student only learns which word was ultimately chosen (“very good”), it loses important information: the teacher knows that “nice” and “sunny” are also reasonable options. The key to distillation is making the student directly fit the teacher’s entire probability distribution. This way, the small model learns not just the teacher’s answer but also the teacher’s “judgment distribution,” which represents some of the teacher’s implicit knowledge.
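The difference between learning the chosen word and fitting the whole distribution can be shown with the numbers from this example. The sketch below is illustrative only: the two student distributions are hypothetical, and real distillation recipes typically add temperature scaling and mix in other loss terms.

```python
import math

def kl_divergence(teacher, student):
    """KL(teacher || student) over a shared vocabulary, in nats."""
    return sum(p * math.log(p / student[tok]) for tok, p in teacher.items() if p > 0)

def hard_label_loss(teacher, student):
    """Cross-entropy against only the teacher's single top choice."""
    top = max(teacher, key=teacher.get)
    return -math.log(student[top])

# Teacher distribution from the text.
teacher = {"very good": 0.40, "nice": 0.30, "sunny": 0.15, "<other>": 0.15}

# Two hypothetical students that put the same mass on the top token.
student_a = {"very good": 0.40, "nice": 0.30, "sunny": 0.15, "<other>": 0.15}
student_b = {"very good": 0.40, "nice": 0.02, "sunny": 0.02, "<other>": 0.56}

if __name__ == "__main__":
    for name, s in (("A", student_a), ("B", student_b)):
        print(name, "hard-label loss:", round(hard_label_loss(teacher, s), 3),
              "KL to teacher:", round(kl_divergence(teacher, s), 3))
```

Both students score identically on the hard label, but only the KL term notices that student B has thrown away the teacher's view that "nice" and "sunny" are reasonable.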
This is why a distilled small model can retain most of the capability of the original model even when its parameter count is cut to a fraction (even a tenth) of the original. It learns the teacher’s judgment space at each position, not just a specific output. This paradigm was proposed by Hinton et al. in 2015 (in the paper “Distilling the Knowledge in a Neural Network”) and became standard industry practice in the LLM era. The industry consensus is that pairs such as GPT-4 and GPT-4o-mini, or Claude 3 Opus and Haiku, follow this path (specific recipes are kept secret by each company).
A distilled small model has two use cases. One is independent deployment as a low-cost service version for scenarios where using GPT-4 is too expensive. The other is as a draft model for speculative decoding, which Section 1.2 will cover.
1.1.3 Pruning and NAS
Pruning aims to remove weights or neurons in the model that are less important, making the model smaller. There are methods to determine “which are not important” and “at what granularity to delete” (based on weight values, gradients, activation frequency; deleting individual weights or entire attention heads and even layers). However, in engineering, pruning alone has limited effect because model weights have strong dependencies; crudely removing some parts can cause the remaining performance to drop.
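As a concrete instance of the simplest combination above (importance judged by weight value, deletion at the granularity of individual weights), here is a minimal sketch of unstructured magnitude pruning on a flat weight list. The function name is hypothetical, and production pipelines usually prune whole structures such as heads or layers instead.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_drop = int(len(weights) * sparsity)
    # Indices sorted by ascending |w|; the first n_drop are pruned.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    dropped = set(order[:n_drop])
    return [0.0 if i in dropped else w for i, w in enumerate(weights)]

if __name__ == "__main__":
    print(magnitude_prune([0.5, -0.01, 2.0, 0.03], sparsity=0.5))
```

Nothing in this procedure accounts for dependencies between weights, which is the limitation the paragraph describes.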
Therefore, industrial practice rarely uses pruning alone. It is common to combine it with distillation and quantization. NVIDIA’s Minitron and Puzzletron projects follow this combo: first prune some structure, then use the original model as a teacher to distill back the lost capabilities, and finally deploy with FP8.
NAS (Neural Architecture Search) takes a different path. Pruning deletes from an existing model; NAS directly searches for a better network structure. Early NAS cost more to search once than to train a model, so it was rarely used in the industry. The new generation of NAS (2024-2026) performs structural pruning-like searches directly on trained models, making the cost controllable, but it is still not as widely used industrially as distillation and quantization.
1.1.4 Mixture of Experts (MoE)
MoE is the most different category. In a standard dense model, every token passes through all parameters. MoE divides parameters into multiple “experts,” and each token activates only a few of them (a typical configuration is top-2 of 8, meaning a router selects 2 of 8 experts per token).
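The routing step can be sketched in a few lines. This is a toy version under stated assumptions: experts are plain scalar functions, the router logits are given instead of computed, and the weights are a softmax over the selected logits only, which is one common variant and not necessarily what any particular model uses.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))

def moe_layer(x, router_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs; the others never run."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))

if __name__ == "__main__":
    # Eight hypothetical experts; expert i multiplies its input by i.
    experts = [lambda x, s=i: x * s for i in range(8)]
    logits = [0.1, 2.0, -1.0, 2.0, 0.0, 0.0, 0.0, 0.0]
    print(route_top_k(logits))              # which 2 of 8 experts fire
    print(moe_layer(3.0, logits, experts))  # only those 2 are evaluated
```

All eight experts still have to be resident in memory even though each token pays the compute cost of only two.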
The effect is that the total number of parameters can be hundreds of billions, but the actual activated computation per token uses only a small fraction. Mixtral 8x7B, DeepSeek V3 (671B total parameters, ~37B activated), and Qwen MoE are representatives of this path.
The engineering difficulties of MoE inference are also clear. All experts must be stored in GPU memory (DeepSeek V3, even with FP8 quantization, requires hundreds of GB of GPU memory, not fitting on a single H100). Expert selection is dynamic (each token passes through a router to decide which experts to activate), and the communication overhead between cards is high (experts may be distributed across different cards). A series of papers from 2025-2026 are targeting the approach of “offloading infrequently used experts to CPU memory and scheduling them on demand,” allowing MoE inference to run on smaller hardware.
1.2 Speculative Decoding: Trading Compute for Latency
The fundamental problem of autoregressive decoding is that it can only output one token at a time. Generating 100 tokens means 100 sequential steps, and in each step the GPU is mostly idle while waiting for the weights and KV cache to be moved from HBM to the compute units. Compute capacity is wasted.
The idea of speculative decoding is elegant. Use a small, fast draft model to first guess N tokens, then use the large model to verify these N tokens in parallel. Tokens are accepted up to the first mismatch; the first rejected token and everything after it are discarded, the large model’s own token is used at that position, and generation continues from there. A key property is losslessness: with greedy decoding the final output is identical to what the large model would generate token by token, and with sampling the acceptance rule is designed so that the output distribution is unchanged. Guessing correctly saves time; guessing wrong just falls back, and the result remains unchanged.
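The draft-then-verify loop can be sketched for the greedy case. Everything here is a toy under stated assumptions: the "models" are hypothetical functions from a token list to the next token, and the verification loop calls the target once per position, whereas a real engine scores all positions in one batched forward pass.

```python
from typing import Callable, List

Model = Callable[[List[int]], int]  # context -> greedy next token

def speculative_step(target: Model, draft: Model, context: List[int], n: int) -> List[int]:
    """One round of greedy speculative decoding; returns the tokens to append."""
    # 1. The draft model proposes n tokens autoregressively (cheap).
    proposed: List[int] = []
    for _ in range(n):
        proposed.append(draft(context + proposed))
    # 2. The target checks each position (one batched pass in a real engine).
    accepted: List[int] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first wrong guess and stop
            return accepted
        accepted.append(tok)
    # 3. Every guess was right: the same pass also yields one bonus token.
    accepted.append(target(context + accepted))
    return accepted

def generate(target: Model, draft: Model, prompt: List[int], max_new: int, n: int = 4) -> List[int]:
    """Generate max_new tokens by repeated speculative steps."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(target, draft, out, n))
    return out[: len(prompt) + max_new]

if __name__ == "__main__":
    target = lambda ctx: (ctx[-1] + 1) % 10                        # counts 0..9 cyclically
    draft = lambda ctx: 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10  # wrong after a 4
    print(generate(target, draft, [0], max_new=12, n=3))
```

Each round appends at least one token that the target itself would have produced, so the output matches plain greedy decoding no matter how bad the draft is; only the speed changes.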
The evolution of this path over three generations is clear. The first generation was the “draft plus target dual model” approach, using a 7B draft and a 70B target. The draft model itself needs to be trained or distilled. The problem is that the draft model’s own forward pass also takes time, limiting the speedup ratio. The second generation, Medusa (2023), skipped the independent draft model by adding several lightweight decoding heads directly to the last layer of the target model. Each head predicts a token for a future position, using tree attention to verify multiple token combinations in parallel. The third generation, the EAGLE series (2024-2026), performs autoregressive prediction on the model’s internal feature layers rather than the token layer, going further than Medusa.
By 2026, speculative decoding has become part of the capability list of mainstream inference engines, increasingly common in high-throughput services. vLLM supports Medusa and EAGLE draft methods by default. TensorRT-LLM provides the broadest support for speculative decoding. Typical speedups in production environments are on the order of 2-4x, but gains vary significantly across scenarios (draft model quality, target model architecture, batch size, sequence length all affect the acceptance rate). Overall, it remains one of the highest ROI single techniques in inference optimization over the past three years. The implementation is complex but the engineering effort is manageable, with significant benefits. MoE scenarios are a natural fit for speculative decoding, as MoE inference inherently has a lot of spare compute (only part of the experts are activated), which speculative decoding can utilize.
1.3 Inference Engine Ecosystem
Packaging these technologies into usable systems is the job of inference engines. In 2026, the three major international engines each have distinct positions.
vLLM is the birthplace of PagedAttention, open-sourced by the UC Berkeley team in 2023. It supports the broadest range of models (almost any open-source model can run), has fast cold start, and is the easiest to use, making it the fastest path to production. Since 2024, vLLM has essentially become the de facto standard for open-source inference engines, suitable for teams needing flexibility and rapid deployment. Many startups and researchers use it.
TensorRT-LLM is NVIDIA’s official data center-grade inference framework. It deeply integrates all hardware features of NVIDIA GPUs. Compiled inference throughput is the highest. On equivalent hardware, it offers a significant advantage over vLLM in high-concurrency scenarios. The trade-off is slower cold start (requires compilation) and support for only NVIDIA GPUs. It is suitable for large-scale production deployments with a single model running long-term. It also has the most comprehensive support for mixed-precision quantization (FP8, FP4, INT4 AWQ, INT8 SmoothQuant) and speculative decoding methods.
SGLang specializes in “shared prefix scenarios.” In chatbot multi-turn conversations, RAG systems, and agent tool calling, each request prompt contains a large amount of repeated prefix (system prompt, historical conversations, etc.). SGLang’s RadixAttention automatically caches and reuses these prefixes. For dialogue and RAG systems, SGLang is a reasonable choice.
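The prefix-reuse idea can be illustrated with a toy token-level trie. This is not SGLang's implementation: the class and method names are hypothetical, each node merely stands in for a cached KV entry, and eviction and the compressed-edge radix structure are left out.

```python
class PrefixCache:
    """Toy trie over token ids; a node represents one cached KV entry."""

    def __init__(self):
        self.root = {}

    def match_and_insert(self, tokens):
        """Return how many leading tokens were already cached, then cache the rest."""
        node, hit, matching = self.root, 0, True
        for tok in tokens:
            if matching and tok in node:
                hit += 1                   # prefill for this token can be skipped
            else:
                matching = False
                node.setdefault(tok, {})   # new entry that must be computed
            node = node[tok]
        return hit

if __name__ == "__main__":
    cache = PrefixCache()
    system_prompt = [1, 2, 3, 4]  # shared by every request
    print(cache.match_and_insert(system_prompt + [10, 11]))      # first request
    print(cache.match_and_insert(system_prompt + [20]))          # reuses the system prompt
    print(cache.match_and_insert(system_prompt + [10, 11, 12]))  # next turn of request 1
```

The returned hit count is the number of tokens whose prefill work can be skipped, which is why multi-turn chat and RAG with a long shared system prompt benefit the most.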
China has its own ecosystem. Shanghai AI Lab’s lmdeploy is one of the most active open-source inference engines in China. Its core trade-off is that it supports fewer international models than vLLM, but it has deeper adaptation to domestic hardware (Ascend, Cambricon, Hygon). This mirrors Section 2.5, “China’s AI Hardware Ecosystem,” in the previous post.
Inference engines are the core, but a production-grade inference service also needs surrounding layers such as prefix caching, API gateways, routing, and multi-GPU tensor parallelism. Kubernetes-native projects such as llm-d and the “AI Factory” architecture (a next-generation data center design that separates storage and computation) are new trends in this layer.
1.4 Inference Scaling: The Cost, Latency, Throughput Triangle
Inference systems have three core metrics: cost, latency, and throughput. They form a triangle where you cannot have all three. Larger batches increase throughput but increase latency for individual requests. Lower latency means fewer users served per unit of compute, increasing cost. Lower costs rely on maximizing batching, which increases latency.
Different application scenarios have different positions in this triangle. Dialogue systems prioritize low latency because users are waiting. Batch processing tasks (e.g., overnight analysis, background summarization) prioritize high throughput and low cost since no one is watching. Agent-type applications are different; the latency of each inference step accumulates into total latency, making single-step latency particularly sensitive.
There is another binary classification: Prefill-bound and Decode-bound. Long-context RAG systems (prefill 50k+ tokens, decode a few hundred tokens) are prefill-bound, bottlenecked by compute. Long generation scenarios (prefill a few hundred tokens, decode thousands of tokens) are decode-bound, bottlenecked by memory bandwidth. The optimization focus for the two types is completely different.
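A crude roofline model makes the prefill-bound versus decode-bound split concrete. The sketch assumes roughly 2 FLOPs per parameter per token, weights read from HBM once per forward pass, and round hardware numbers (1e15 FLOP/s, 3.35e12 bytes/s) chosen purely for illustration; attention FLOPs and KV cache traffic are ignored.

```python
def step_time_s(n_tokens, params, bytes_per_param, peak_flops, hbm_bytes_per_s):
    """Crude roofline for one forward pass that processes n_tokens new tokens.

    Returns (seconds, which resource is the bottleneck).
    """
    compute_s = 2 * params * n_tokens / peak_flops          # ~2 FLOPs per param per token
    memory_s = params * bytes_per_param / hbm_bytes_per_s   # weights streamed once
    bound = "compute" if compute_s > memory_s else "memory"
    return max(compute_s, memory_s), bound

if __name__ == "__main__":
    hw = dict(params=70e9, bytes_per_param=2, peak_flops=1e15, hbm_bytes_per_s=3.35e12)
    print("prefill, 50k tokens:", step_time_s(50_000, **hw))
    print("decode, 1 token:    ", step_time_s(1, **hw))
    print("decode, batch of 64:", step_time_s(64, **hw))
```

In this model a batch of 64 decode tokens costs the same wall-clock time as a single token, which is the mechanism behind the batching trade-off in the triangle above.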
The practical engineering implication is that no inference system can optimize cost, latency, and throughput simultaneously. You must select an application scenario, define the optimization target, and then adjust batch size, batching strategy, and hardware configuration accordingly.
To give readers a cost anchor, the following numbers are public estimates or calculations. Actual values depend on variables like context length, output length, model routing, and cache hit rate, so do not use them as definitive financial data. Multiple public estimates suggest the daily cost for OpenAI’s ChatGPT service is in the millions of dollars. OpenAI has not disclosed specific figures. Based on public API pricing, the marginal cost for a complex query using a model like GPT-4o is on the order of cents. Claude and Gemini are similar. The API price curve has dropped by more than an order of magnitude over the past three years, driven by a combination of engineering optimization, hardware iteration, and competition. For a mid-sized company deploying its own LLM service, the annual total cost (hardware capex + electricity + operations) starts in the tens of millions of dollars. For an AI product with one million daily active users, the annual inference infrastructure cost is commonly in the tens of millions to hundreds of millions of dollars. This is why all major companies have dedicated inference optimization teams of dozens of people. For a company with an annual inference expenditure of $1 billion, saving 5% means $50 million in cash.
A crucial clarification: the various optimizations mentioned earlier (quantization, PagedAttention, continuous batching, speculative decoding, Flash Attention) do not multiply simply. They interact with each other. For example, quantization reduces memory usage, indirectly allowing larger batch sizes, but speculative decoding’s benefits diminish with larger batch sizes because the GPU is less idle. In real production, you need to do composite performance benchmarks; you cannot multiply individual “speedup Nx” figures to get the total speedup.
Connecting back to the electricity line from Chapter 4 is the metric “tokens per watt.” The true efficiency of an inference service is ultimately how many tokens it can generate per watt of electricity. This reflects hardware-level efficiency (compute per unit power of GPU or ASIC), data center-level efficiency (PUE), and inference-level efficiency (all optimizations combined). This metric unifies the electricity discussion from the previous post and the inference optimization discussion in this chapter under the same objective function.
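Strictly speaking, "tokens per watt" means tokens per second per watt, which is the same thing as tokens per joule. A minimal sketch of how the layers combine, with hypothetical input numbers and a hypothetical function name:

```python
def tokens_per_joule(tokens_per_s: float, it_power_w: float, pue: float) -> float:
    """Tokens generated per joule drawn at the facility meter.

    tokens_per_s : serving throughput (inference-level efficiency)
    it_power_w   : power drawn by the servers themselves (hardware-level efficiency)
    pue          : facility power / IT power (data-center-level efficiency)
    """
    return tokens_per_s / (it_power_w * pue)

if __name__ == "__main__":
    # Made-up example: 1000 tokens/s on 700 W of IT load in a PUE 1.25 facility.
    print(tokens_per_joule(1000, 700, 1.25))
```

An improvement at any of the three layers moves the same number, which is why the metric works as a single objective.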
1.5 Test-Time Compute: Inference Shifts from “Running the Model” to “Determining How Strong the Model Is”
The deepest paradigm change in the inference layer from 2024 to 2026 is “test-time compute.”
In the traditional view, a model’s capability is determined during training. After training, inference is just “running this fixed capability.” The new view turns this around: spending more compute during inference can make the same set of weights exhibit stronger capabilities. Letting the model think for a while, generating longer reasoning traces, multiple sampling and aggregation, tree search, self-refinement – these can yield significantly better results on math, programming, and complex reasoning tasks.
First, distinguish two senses that the original Chinese text expresses with a single word. In sections 1.1 to 1.4 the word means “inference”: a model’s forward pass that computes output from input. In this section on test-time compute it means “reasoning”: the model’s internal thinking process. “Spending more compute during inference to become stronger” here means precisely “doing more reasoning computation during the inference phase.”
OpenAI’s o1 (September 2024) was the first to publicly embrace this path. Subsequent models like o3, DeepSeek R1 (January 2025), Anthropic’s extended thinking, GPT-5.5 Pro, and Claude Opus 4.x all have built-in reasoning modes. The main engineering implementation paths are: long reasoning traces (generating thousands to tens of thousands of intermediate thought tokens before the final answer), Best-of-N (sampling multiple different answers and using a verifier model to pick the best), tree search (expanding the reasoning process into a tree and pruning based on a value function), and self-refinement (having the model critique its own answer and then revise it).
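Two of these paths, Best-of-N with a verifier and sample-then-aggregate, are simple enough to sketch. The `sample` and `score` callables are hypothetical stand-ins for a model call and a verifier model, and the example answers are made up.

```python
from collections import Counter

def best_of_n(sample, score, prompt, n):
    """Draw n candidates and return the one the verifier scores highest."""
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=score)

def majority_vote(sample, prompt, n):
    """Sample n final answers and return the most frequent one."""
    answers = [sample(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

if __name__ == "__main__":
    # A fake sampler that replays fixed answers; 42 is the right one.
    replay = iter([41, 42, 42, 7, 42])
    print(majority_vote(lambda p: next(replay), "What is 6 * 7?", n=5))

    replay = iter([41, 7, 42, 40])
    verifier = lambda answer: -abs(answer - 42)  # hypothetical scoring function
    print(best_of_n(lambda p: next(replay), verifier, "What is 6 * 7?", n=4))
```

In both cases the weights are untouched; the only thing that grows is the number of forward passes spent per question.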
The theoretical roots are relatively solid. A single forward pass of a Transformer is a fixed-depth circuit, its expressive power limited by complexity theory. Many seemingly simple logical or arithmetic problems cannot be expressed in a fixed-depth circuit at all. Chain-of-thought breaks this limit: each intermediate token feeds back into the model as input, so generating N intermediate tokens gives the computation up to N additional sequential steps. A series of papers by Merrill and Sabharwal (2023-2024) rigorously proved that a Transformer plus chain-of-thought is strictly more expressive than a single forward pass. The paper “Scaling LLM Test-Time Compute Optimally” by Snell et al. (August 2024, arXiv:2408.03314) provided the empirical side: on problems where a smaller model already has a non-trivial success rate, spending extra test-time compute can let it outperform a 14x larger model in a FLOPs-matched comparison. One month after this paper was published, OpenAI released o1.
Sam Altman stated this directly in his “Three Observations” blog: the intelligence of an AI model is roughly equal to the logarithm of the resources used for training and running; these resources are primarily training compute, data, and inference compute. It seems you can invest arbitrarily more money and get continuous and predictable returns. This places inference-time compute scaling on the same level as training-time compute scaling.
The economic implication is that spending ten times more compute during inference might jump the model’s capability up a level. This elevates inference optimization from “saving money” to “determining capability.” Two scaling laws now coexist: increasing compute during training (larger models, more data) and increasing compute during inference (longer reasoning). Which one has higher ROI is one of the biggest open questions in the AI industry in 2026.
This is the clearest point of convergence between models and engineering. The same set of weights, paired with different test-time compute budgets, can produce products with completely different capabilities.
1.6 Summary: One Set of Weights, Two Products
The techniques above are not mutually exclusive; they can be stacked. A real comparison between naive PyTorch inference and a fully optimized inference engine can show a 5-8x difference in end-to-end cost efficiency. Each layer of stacking has diminishing marginal returns, but combined, they can compress the cost per token to one-fifth to one-eighth of the original.
A deeper judgment is that the gap between optimized and unoptimized software can be larger than the gap between GPU generations. Naive PyTorch inference running on a B200 can be slower than optimized vLLM or TensorRT-LLM running on an H100, and the latter is cheaper as well. For most companies, getting the inference software stack right on existing GPUs pays off more than chasing the latest GPU. This judgment is particularly important for small and medium companies: they don’t need to rush to buy B200s; a well-used H100 cluster can serve a large number of users. The significance for large companies is equally concrete: a company spending $1B annually on inference could save 5%, or $50M, by optimizing. This is why OpenAI, Anthropic, and Google each have dedicated inference optimization teams of dozens of people.
Returning to the section title, “One set of weights, two products” is already very concrete today. OpenAI offers reasoning effort levels of low/medium/high in its API, with price and latency differing by several times, but the underlying weights are the same. Anthropic’s Claude extended thinking is an on/off switch; when turned on, the token consumption of a single response can increase more than tenfold, but SWE-bench Verified and math accuracy increase significantly. DeepSeek R1 is a reasoning post-trained version built on V3-Base, sharing the base architecture and many parameters with the general-purpose V3 version.
On the engineering level, through inference budgets and post-training heads, the same base model can be sliced into “fast product” and “slow product” business models. This layer does not require training a new model to create distinctly different price tiers. It is the fastest-changing part of the LLM business model from 2025 to 2026.
This concludes the runtime state. Next, we enter the transformation state, discussing how a base model that can “continue writing text” is transformed into the ChatGPT, Claude, and Gemini you use today. The core of this line is called post-training.
II. Introduction to Post-Training: From Base Model to Dialogue Model
The “training state” discussed in Chapter 5 of the previous post was precisely pre-training: using massive amounts of text for next-token prediction, resulting in a “base model” that can “continue writing text.” After pre-training, there is another process called post-training, which transforms this base model into the ChatGPT, Claude, and Gemini you use today. The model’s ability to answer questions, follow instructions, reject harmful requests, and match human conversational style all comes from this step.
This chapter discusses the overall motivation and the most basic methods of post-training. The next chapter will expand on the main line of RLHF (Reinforcement Learning from Human Feedback).
2.1 Why Post-Training Is Needed
The training objective of pre-training is “given the preceding tokens, predict the next token.” This is not the same as “becoming a useful assistant.” What a base model does is essentially continuation: given a certain start, it continues with what follows in its training data. If the training data contains garbage text, off-topic passages, or harmful content (which constitutes a significant portion of internet text), the base model will continue accordingly.
The most intuitive comparison is between the base model of GPT-3 and ChatGPT. Ask both “How to make a bomb.” The base model, following the related continuations in its training data, would actually give you steps (because this type of text is not uncommon online). ChatGPT’s GPT-3.5 would refuse and explain why it cannot answer. The underlying models are not vastly different; the gap lies almost entirely in post-training. Another example: given “User question: What’s the temperature in New York today?” the base model would most likely continue with another question or some irrelevant chatter, rather than answering. It fundamentally hasn’t “learned” question-answering; it has only learned text continuation.
The real secret of the ChatGPT moment (November 2022) lies here. The product leap did not only come from a scale upgrade of the base model, but also from OpenAI’s systematic application of the InstructGPT-style SFT+RLHF post-training pipeline to the conversational scenario. There is a famous data point from the InstructGPT paper: the 1.3B parameter InstructGPT received higher human preference scores than the 175B parameter GPT-3. With a hundred times fewer parameters, users preferred it. This event made the industry realize for the first time that the impact of post-training on the final product could outweigh the parameter count itself.
Working on AI products for several years, I have repeatedly felt the importance of this. Given the same open-source base model, two different companies doing their own post-training can achieve an order of magnitude difference in the usability of the final product. “How strong is the model itself” is an attractive topic, but what truly determines the product’s form from an engineering perspective is the post-training line.
2.2 SFT: Supervised Fine-Tuning, From Continuation to Dialogue
The most basic method of post-training is SFT (Supervised Fine-Tuning). The specific approach is to collect a batch of high-quality “prompt + ideal response” pairs and fine-tune the model on this data, teaching it to imitate this response pattern.
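In practice, "fine-tune on prompt + ideal response pairs" usually means next-token training on the concatenated sequence. A common choice, assumed here and not stated in the text, is to compute the loss only on the response tokens by masking the prompt positions. The token ids below are made up, and -100 is the label value that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's CrossEntropyLoss by default

def build_sft_example(prompt_ids, response_ids, eos_id):
    """Concatenate prompt and response; supervise only the response tokens."""
    input_ids = list(prompt_ids) + list(response_ids) + [eos_id]
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids) + [eos_id]
    return {"input_ids": input_ids, "labels": labels}

if __name__ == "__main__":
    # Hypothetical token ids: a 3-token prompt and a 2-token ideal response.
    example = build_sft_example(prompt_ids=[5, 6, 7], response_ids=[8, 9], eos_id=2)
    print(example["input_ids"])
    print(example["labels"])
```

The one-position shift between inputs and targets is left to the training framework. Supervising the end-of-sequence token is what teaches the model to stop after answering instead of continuing the text.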
There is a clear evolution in data scale. The early InstructGPT used on the order of ten thousand SFT examples, written by OpenAI’s own labelers. After Llama 2, the open-source community’s SFT datasets (OpenAssistant, Alpaca, ShareGPT, etc.) scaled to hundreds of thousands. Modern major companies’ SFT data easily reaches hundreds of thousands to millions, composed of a mix of dedicated data labeling teams and synthetic data.
However, data scale is not everything. What truly determines the output quality of SFT is the diversity of the data, consistency of style, and the skill of the labelers. Ten thousand high-quality data points train a better model than one hundred thousand mediocre ones. Therefore, the effort major companies put into SFT is mainly on data filtering and quality control, not simply piling up data.
SFT’s limitations are also clear. The model learns “what an ideal answer looks like,” not “which answer is actually correct.” If labelers prefer long, confident, and structured answers, the model will imitate that style, including long and confident fabrication. This is why many SFT-only open-source models have a recognizable “AI smell” (verbose, bullet-point heavy, overconfident phrasing). They mimic the surface features of the answers but never get the chance to learn “when to admit not knowing.”
To bridge the gap between “looks right” and “is actually right,” SFT is insufficient. Preference-based methods like RLHF or DPO (Direct Preference Optimization) are needed. This is the main line of the next chapter.
2.3 Post-Training Pipeline Overview
Post-training is a pipeline, not a single technique. The complete flow roughly is: start from a pre-trained base model, first do SFT to let the model learn the basic response format, then collect human preference data to train a reward model (or directly use DPO to skip this step), then use RLHF (core algorithm PPO, Proximal Policy Optimization) or DPO to fine-tune the model to maximize reward, and finally run an evaluation to decide if it can be released.
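The two preference-learning objectives in this pipeline are compact enough to write down. The sketch below shows the pairwise (Bradley-Terry) loss commonly used to train a reward model and the DPO loss for a single preference pair. The inputs are scalar rewards and sequence log-probabilities that a real system would obtain from model forward passes, and `beta=0.1` is only an illustrative default.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise loss: push the chosen response's reward above the rejected one's."""
    return -log_sigmoid(r_chosen - r_rejected)

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair; arguments are sequence log-probabilities
    under the policy (pi_*) and the frozen reference model (ref_*)."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)

if __name__ == "__main__":
    print(reward_model_loss(2.0, 0.0))      # reward model already ranks the pair correctly
    print(dpo_loss(-10, -10, -10, -10))     # policy equals reference: loss is log(2)
    print(dpo_loss(-8, -12, -10, -10))      # policy shifted toward the chosen answer
```

DPO skips the separate reward model because the log-probability ratio against the reference plays the role of the reward inside the same pairwise loss.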
There is a clear pattern in the iteration cadence of major companies. Base model training has a long cycle: 3 to 6 months per round (data preparation, training, evaluation). Post-training has a much shorter iteration cycle: every 2 to 4 weeks, a small iteration can be done. Add a new batch of labeled data, tweak the reward model, run PPO, A/B test. This is why the frequency of “feeling the model has changed” is much higher than the frequency of “capability generation upgrade.” The vast majority of perceived differences come from changes in post-training, not from base model upgrades.
Post-training is the true shaping period for a model’s personality, style, and safety behavior. The version degradation discussed in LLM Business Pain Point 4.3 (users complaining that a new version is colder, shorter, or refuses more) almost entirely stems from alignment adjustments in post-training, not the underlying model becoming weaker. Understanding this is important because outsiders easily misinterpret model behavior changes as “the model got dumber,” when in fact it is often post-training recalibrating the trade-off between Helpful and Harmless.
2.4 Post-Training Economics: Labeled Data Is the Real Bottleneck
When pre-training talks about data scarcity, it refers to “high-quality internet text becoming harder to find.” When post-training talks about data scarcity, it refers to something else: “the cost of high-quality human preference labeling is extremely high and cannot be solved by crawling.”
Pre-training data can, in principle, be crawled from the internet. Aggregating all reasonable sources of text yields a training set of trillions of tokens. Post-training data cannot be obtained this way. You need structured data like: “this is a prompt, these are response A and response B, a human labeler thinks A is better because of reason X.” This type of data cannot be crawled from the entire internet; it must be created by organizing people to write, compare, and judge.
The market price for a piece of “expert-level preference annotation” is not low. For a preference pair in a general domain (“Which of these two responses is more helpful to the user?”), the cost might be a few dollars per pair. But for specialized domains like medicine, law, math, or code, the labeler needs a real professional background, and the cost per pair can reach $50 to $200. A complete RLHF iteration for a SOTA model can cost millions to tens of millions of dollars in labeling costs alone.
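The arithmetic behind “millions to tens of millions” is easy to reproduce. The sketch below is only a back-of-the-envelope calculation: the per-pair prices come from the ranges quoted above, while the batch sizes (`500_000` and `100_000` pairs) and the function name `labeling_cost` are hypothetical values chosen for illustration, not figures from any company.

```python
def labeling_cost(num_pairs: int, cost_per_pair: float) -> float:
    """Total labeling spend in dollars for one batch of preference pairs."""
    return num_pairs * cost_per_pair


# Batch sizes are assumptions for illustration; only the price ranges
# ("a few dollars" and "$50 to $200" per pair) come from the text.
general = labeling_cost(num_pairs=500_000, cost_per_pair=3.0)
expert_low = labeling_cost(num_pairs=100_000, cost_per_pair=50.0)
expert_high = labeling_cost(num_pairs=100_000, cost_per_pair=200.0)

print(f"general-domain batch: ${general:,.0f}")
print(f"expert batch:         ${expert_low:,.0f} to ${expert_high:,.0f}")
```

Under these assumed volumes, the expert-domain batch alone lands between $5 million and $20 million, which is why specialist annotation dominates the budget even when it is a minority of the pairs.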
This demand has fostered a whole industry of data labeling companies. Scale AI is the most famous. Founded in 2016 by Alexandr Wang after he dropped out of MIT, it initially focused on data labeling for autonomous driving (Waymo, Cruise) and defense (Pentagon contracts). Only in 2023 did preference labeling for LLM RLHF become its biggest business. The company was valued at $13.8 billion in May 2024. In June 2025, Meta bought a 49% non-voting stake for approximately $14.3 billion, pushing the valuation to around $29 billion, and Alexandr Wang was poached by Meta to lead Superintelligence Labs.
This deal had side effects. OpenAI, Google, etc., concerned about training data flowing to Meta, started shifting to other suppliers in the second half of 2025, diverting Scale AI’s market share in LLM labeling. Capital moves did not cause a qualitative leap in the industry’s data labeling capabilities but rather reshuffled the players. Scale AI’s 2024 revenue was approximately $870 million (according to public estimates like Sacra’s).
Surge AI is another labeling company known for quality. According to reports, its 2024 revenue reached the billion-dollar level (specific financials are not public). OpenAI and Anthropic are both major customers. Together with other companies such as Labelbox, Toloka, and Snorkel AI, it supplies the data behind major companies’ post-training efforts.
It’s worth noting the relationship between major companies and labeling firms. OpenAI largely outsources its labeling pipeline rather than employing labelers directly. Early on, it used Sama (a BPO company operating in Kenya) to employ local workers for harmful content labeling. An exposé published in early 2023 revealed pay of $1-2/hour, which sparked controversy. Later, it mainly used Surge AI and Scale AI, plus some internal contractor teams. Anthropic initially invested more internal resources, emphasizing high-quality labelers with judgment (including people with relevant PhD backgrounds) for preference labeling, making it a relatively rare major company that didn’t outsource this layer entirely.
Data scarcity in post-training is more acute than in pre-training. Pre-training data can be expanded with synthetic data (Chapter 5 will cover this). In post-training, AI feedback can replace some human annotation (the RLAIF path), but the highest-quality preference signals currently still have to come from discerning humans. This is why companies like Scale AI and Surge AI can command valuations in the tens of billions of dollars in the large model era. What they sell is essentially “discerning human time”; the software is just a tool.
Having covered the introduction to post-training, the next chapter details the main line of RLHF, and Chapter 4 covers its simplified path, DPO.
III. RLHF: Reinforcement Learning from Human Feedback
This is the most classic and important line in post-training. This chapter discusses the engineering flow, history, side effects, and real-world costs of RLHF.
3.1 The Three-Step RLHF Process
The core idea of RLHF (Reinforcement Learning from Human Feedback) is to turn “what kind of answers humans like” into an optimizable objective function. It consists of three steps.
Step one is SFT initialization. As discussed in the previous chapter, use a batch of high-quality “prompt plus ideal response” pairs to teach the model the basic response format. This step is just a warm-up; models that go directly to RL without SFT have difficulty converging.
Step two is training a reward model. Collect a large amount of human preference data. Each data point is: “for the same prompt, two different responses, the labeler thinks A is better than B.” Use this data to train a reward model. The input is “prompt plus response,” and the output is a scalar score. This score represents “how good a human labeler would think this response is.”
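The text above does not say which loss is used to fit the reward model. A common choice for pairwise preference data is the Bradley-Terry style loss, `-log(sigmoid(r_chosen - r_rejected))`, and the sketch below assumes that choice. The function name `pairwise_loss` is illustrative, and the two scores stand in for the scalar outputs of a reward model on the preferred and rejected responses.

```python
import math


def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(score_chosen - score_rejected)).

    The loss is small when the reward model already scores the
    human-preferred response higher, and large when it ranks them wrongly.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The labeler preferred response A over response B for the same prompt.
print(pairwise_loss(score_chosen=2.0, score_rejected=0.0))  # agrees with label
print(pairwise_loss(score_chosen=0.0, score_rejected=2.0))  # disagrees
```

Only the difference between the two scores enters the loss, so the reward model learns a relative ranking rather than an absolute scale. That is one reason the score is meaningful for comparing responses to the same prompt but hard to interpret on its own.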
Step three uses the reward model as the reward function for RL and runs PPO (Proximal Policy Optimization, an RL algorithm proposed by OpenAI in 2017) to fine-tune the original model to maximize the reward. During this step, the model generates responses, the reward model scores them, and the PPO algorithm uses this score to update the model’s weights. To prevent the fine-tuned model from drifting too far from the original SFT model (which could lead to gibberish), a KL divergence penalty is added, constraining the new model to not deviate too much from the SFT model.
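To make the KL penalty concrete, here is a minimal sketch of how the reward fed to PPO can be shaped. It assumes a per-response formulation: the reward model score minus `beta` times a sampled KL estimate, computed from the log-probabilities that the current policy and the frozen SFT model assign to the generated tokens. The names `penalized_reward` and `beta`, and the value `0.1`, are assumptions for illustration; real implementations differ in where and how the penalty is applied.

```python
from typing import Sequence


def penalized_reward(
    rm_score: float,
    logprobs_policy: Sequence[float],
    logprobs_sft: Sequence[float],
    beta: float = 0.1,  # assumed penalty strength, tuned in practice
) -> float:
    """Reward model score minus a KL penalty against the SFT model.

    The KL term is estimated from the sampled response itself as the sum
    over tokens of log pi_policy(token) - log pi_sft(token).
    """
    if len(logprobs_policy) != len(logprobs_sft):
        raise ValueError("both models must score the same tokens")
    kl_estimate = sum(p - s for p, s in zip(logprobs_policy, logprobs_sft))
    return rm_score - beta * kl_estimate


# Policy identical to the SFT model: no penalty, reward passes through.
print(penalized_reward(1.0, [-1.0, -2.0], [-1.0, -2.0]))
# Policy has drifted toward these tokens: part of the reward is taken back.
print(penalized_reward(1.0, [-0.5, -1.0], [-1.0, -2.0]))
```

A larger `beta` keeps the model closer to the SFT starting point at the cost of slower reward gains; a smaller one lets the policy chase the reward model harder, including its blind spots.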
The key judgment is that everything depends on the reward model. If the reward model is wrong (systematically biased towards a certain style, missing a certain dimension), the entire RL process will push the model in the wrong direction. This is why major companies often invest more effort in the reward model than in the RL itself.
3.2 From InstructGPT to ChatGPT: The Industrialization of RLHF
RLHF long predates ChatGPT. The basic framework was proposed in the 2017 paper “Deep Reinforcement Learning from Human Preferences” by Christiano, Leike, et al., where it was applied to Atari games. In 2020, OpenAI applied the method to text summarization (Stiennon et al.), showing that summaries produced by RLHF-trained models were preferred by evaluators over the human-written reference summaries.
The true moment of industrialization was the March 2022 InstructGPT paper (Ouyang et al., arXiv:2203.02155). OpenAI systematically applied RLHF to GPT-3 for the first time, demonstrating that this recipe works on general-purpose language models. One data point from that paper, that labelers preferred outputs from the 1.3B InstructGPT over those from the 175B GPT-3, made the industry realize for the first time that post-training could matter more to the final product than parameter count.
The November 2022 ChatGPT was the consumer productization of the InstructGPT recipe. OpenAI used the same RLHF pipeline to adapt GPT-3.5 into a conversational interface and launched it to the public, unexpectedly igniting the consumer LLM market.
The recipe spread rapidly. Anthropic published the Constitutional AI paper in December 2022 (Bai et al., arXiv:2212.08073), extending the RLHF framework to “replace some human feedback with AI feedback” (RLAIF, covered in Chapter 5). When Meta released Llama 2 in July 2023, it disclosed the complete RLHF training recipe (PPO