@AaronWeiHuang: Our new blog looks at how FP4 is moving beyond compression into a practical primitive for training and inference across…


Summary

NVIDIA's blog details how FP4, with the NVFP4 format and Blackwell hardware, has evolved from a compression trick to a practical primitive for training and inference across LLMs and diffusion models, achieving near 16-bit accuracy.


Cached at: 07/01/26, 04:01 AM

Our new blog looks at how FP4 is moving beyond compression into a practical primitive for training and inference across both LLMs and diffusion models:

https://research.nvidia.com/labs/eai/blogs/pushing-intelligence-to-4-bit…

  1. Why Four Bits Is Hard: Only 15 values make scaling critical.
  2. NVFP4: Smaller blocks and finer scales improve accuracy.
  3. LLMs with FP4: W4A4 compute, training, inference, and fused kernels.
  4. Video Generation: End-to-end 4-bit training and real-time inference.
  5. KV Cache Quantization: Compression, K-smoothing, and parallel dequantization.
  6. FP4 Attention: Low-precision attention with dedicated softmax handling.

The takeaway: Together, co-designed formats, kernels, memory systems, and models make FP4 a compelling choice for the future of large models.

@yukangchen_ @WeianMaoX @ShuaiYa68505475 @songhan_mit


Pushing Intelligence to 4-bit

Source: https://research.nvidia.com/labs/eai/blogs/pushing-intelligence-to-4-bit/

Four-bit floating point (FP4) encodes each value in just sixteen bit patterns. Until recently that was usable only for storage; with NVIDIA’s NVFP4 format and Blackwell hardware, it now supports the full lifecycle of large models (training, inference, and even long-video generation) at close to 16-bit accuracy. This post explains how FP4 reaches that point across LLMs, diffusion, and video, and what remains unsolved.

The core difficulty is representational: four bits afford only fifteen distinct values, so accuracy depends almost entirely on how those values are scaled. NVFP4 addresses this with fine-grained, two-level scaling that Blackwell executes natively, and the payoff now extends across the whole stack, from LLM weights and activations to the KV cache, attention, and full diffusion-based video generation.

FP4 has moved from a storage-only compression trick to a primitive for both inference and training. NVFP4 gives Blackwell Tensor Cores a practical 4-bit path for weights, activations, the KV cache, and attention—across language, vision, and video models alike.

In this post

  1. Why four bits is hard
  2. NVFP4: a smarter ruler
  3. LLMs with FP4
  4. Video generation with FP4
  5. KV cache quantization with FP4
  6. FP4 attention

1. Why Four Bits Is Hard

Most people meet quantization through integers: INT8, INT4, weights packed into fewer bits with one scale factor. FP4 is different: still four bits, but with a floating-point shape (sign, exponent, mantissa). The E2M1 flavor has 1 sign bit, 2 exponent bits, and 1 mantissa bit, which yields exactly fifteen distinct values: {0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6}.
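The fifteen values can be checked by decoding all sixteen bit patterns. The sketch below assumes the usual E2M1 layout (exponent bias of 1, one subnormal step); the function name `decode_e2m1` is ours, not part of any library.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code: 1 sign bit, 2 exponent bits, 1 mantissa bit."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 1
    if exp == 0:
        # Subnormal range: no implicit leading one, so the values are 0 and 0.5.
        return sign * 0.5 * man
    # Normal range: implicit leading one, exponent bias of 1 (assumed).
    return sign * (1.0 + 0.5 * man) * 2.0 ** (exp - 1)

codes = [decode_e2m1(c) for c in range(16)]
distinct = sorted(set(codes))  # +0.0 and -0.0 compare equal, so they collapse
print(len(codes), "codes ->", len(distinct), "distinct values")
print(distinct)
```

Sixteen codes give fifteen values because positive and negative zero are the same number.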

Think of FP4 as a tiny ruler with only a handful of marks. A 4-bit code only picks which mark; something else has to decide where the ruler is placed over the real numbers. That “something” is the scale, and it is what the format’s accuracy ultimately depends on, whether the tensor is an LLM’s weights, a diffusion model’s activations, or a KV cache. Fifteen marks is almost nothing: if a tensor mixes tiny values, big outliers, and everything between, one global ruler wastes most of its marks on empty space.

The fix is to stop using one ruler for the whole tensor. MXFP4, the Open Compute Project’s microscaling format, gives every block of 32 values its own power-of-two scale (an E8M0 exponent) [2], so one outlier only distorts its local block. The catch is that a power-of-two dial is coarse: if the ideal scale sits between two powers of two, MXFP4 must round it, and the tiny 4-bit budget pays for the error. NVFP4 sharpens exactly this, as NVIDIA’s own diagram (Figure 1) shows.
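A minimal NumPy sketch of the difference between one global ruler and a power-of-two scale per 32-value block. It is an emulation for intuition, not the OCP reference algorithm: the helper names are ours, the data is synthetic, and the block scale is rounded up to the next power of two so nothing clips (the specification's exact rounding rule may differ).

```python
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def round_to_e2m1(x):
    """Round each value to the nearest E2M1 magnitude, keeping its sign."""
    mag = np.minimum(np.abs(x), 6.0)
    idx = np.abs(mag[..., None] - E2M1_GRID).argmin(axis=-1)
    return np.sign(x) * E2M1_GRID[idx]

def quantize_global(x):
    """One scale for the whole tensor: the largest magnitude maps to 6."""
    scale = np.abs(x).max() / 6.0
    return round_to_e2m1(x / scale) * scale

def quantize_pow2_blocks(x, block=32):
    """MXFP4-style: one power-of-two scale per block of `block` values."""
    blocks = x.reshape(-1, block)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    # Smallest power of two that keeps the block maximum within +-6 (our choice).
    scale = 2.0 ** np.ceil(np.log2(amax / 6.0))
    return (round_to_e2m1(blocks / scale) * scale).reshape(x.shape)

rng = np.random.default_rng(0)
x = rng.normal(size=4096)
x[::512] = 60.0  # a handful of large outliers in otherwise small data

mse_global = np.mean((quantize_global(x) - x) ** 2)
mse_block = np.mean((quantize_pow2_blocks(x) - x) ** 2)
print(f"global scale MSE: {mse_global:.4f}")
print(f"per-block MSE:    {mse_block:.4f}")
```

With the global scale, the outliers stretch the ruler so far that most ordinary values round to zero; with per-block scales, only the blocks that contain an outlier pay that price.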

Figure 1. MXFP4 puts one coarse power-of-two scale on every 32 values; NVFP4 puts a finer, dynamically computed scale on every 16. Smaller blocks and a sharper dial fit awkward distributions more tightly. Source: NVIDIA, “Introducing NVFP4 for Efficient and Accurate Low-Precision Inference,” NVIDIA Technical Blog, 2025 [1].

2. NVFP4: A Smarter Ruler

NVFP4 keeps the recipe (4-bit E2M1 values plus block scaling) but sharpens both knobs. It uses one shared FP8 (E4M3) scale per 16-value block, plus a higher-level FP32 per-tensor scale [1]. Three things distinguish it from MXFP4:

  • Smaller blocks (16 vs 32): one outlier contaminates half as many values.
  • A finer dial (E4M3 with mantissa bits, vs power-of-two E8M0): the ruler lands where the data actually is. NVIDIA reports ~88% lower quantization error than power-of-two scaling [1].
  • Two levels: the FP32 per-tensor scale normalizes the tensor so each block’s E4M3 scale fits cleanly into FP8, with global range up top and local fit per block.
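The three points above can be sketched as a small NumPy emulation. This is a sketch under stated assumptions, not NVIDIA's implementation: `round_to_e4m3` is our own approximation of FP8 E4M3 rounding, the per-tensor scale is chosen so the largest block scale lands on 448 (the E4M3 maximum), and the data is synthetic, so the numbers will not match the ~88% figure quoted above.

```python
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def round_to_e2m1(x):
    mag = np.minimum(np.abs(x), 6.0)
    idx = np.abs(mag[..., None] - E2M1_GRID).argmin(axis=-1)
    return np.sign(x) * E2M1_GRID[idx]

def round_to_e4m3(s):
    """Round non-negative values to the nearest E4M3 value (3 mantissa bits, max 448)."""
    s = np.clip(s, 0.0, 448.0)
    e = np.floor(np.log2(np.maximum(s, 2.0 ** -6)))  # floor at the smallest normal exponent
    step = 2.0 ** (e - 3)
    return np.round(s / step) * step

def quantize_pow2_blocks(x, block=32):
    """MXFP4-style baseline: power-of-two scale per 32 values."""
    blocks = x.reshape(-1, block)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    scale = 2.0 ** np.ceil(np.log2(amax / 6.0))
    return (round_to_e2m1(blocks / scale) * scale).reshape(x.shape)

def quantize_nvfp4_like(x, block=16):
    """NVFP4-style: E4M3 scale per 16 values plus one FP32 per-tensor scale."""
    blocks = x.reshape(-1, block)
    tensor_scale = np.abs(x).max() / (6.0 * 448.0)           # global range (assumed rule)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    block_scale = round_to_e4m3(amax / 6.0 / tensor_scale)   # local fit, stored as FP8
    scale = block_scale * tensor_scale
    safe = np.where(scale > 0, scale, 1.0)                    # avoid dividing by zero
    return (round_to_e2m1(blocks / safe) * scale).reshape(x.shape)

rng = np.random.default_rng(0)
x = rng.normal(size=4096)
x[::512] = 60.0  # synthetic outliers

mse_mx = np.mean((quantize_pow2_blocks(x) - x) ** 2)
mse_nv = np.mean((quantize_nvfp4_like(x) - x) ** 2)
print(f"power-of-two / 32 MSE: {mse_mx:.4f}")
print(f"E4M3 / 16 MSE:         {mse_nv:.4f}")
print("bits per value:", 4 + 8 / 32, "vs", 4 + 8 / 16)
```

The last line is the storage cost of the finer dial: one 8-bit scale amortized over 16 values instead of 32 adds a quarter of a bit per value.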

Figure 2. NVFP4’s two-level structure: 4-bit E2M1 elements, an FP8 (E4M3) scale shared across each 16-value micro-block, and a global FP32 per-tensor scale. Source: NVIDIA, “Introducing NVFP4 for Efficient and Accurate Low-Precision Inference,” NVIDIA Technical Blog, 2025 [1].

| Format | Mental model | Scaling | Tradeoff |
| --- | --- | --- | --- |
| FP4 E2M1 | A 15-mark codebook. | Needs external scaling. | Only 15 values. |
| MXFP4 | A power-of-two dial per 32 values. | E8M0 / 32-value block. | Simple, but coarse. |
| NVFP4 | A finer dial per 16 values + a global trim. | E4M3 / 16-value block + FP32 tensor scale. | Better accuracy; needs Blackwell. ~4.5 bits/value vs MXFP4’s ~4.25. |

This post follows FP4 up the stack, from LLMs to diffusion to video, and into the two places it is hardest: the KV cache and attention. For the long-video parallelism and KV-cache infrastructure that sits alongside this story, see our companion posts on scaling video training with parallelism and KV cache compression.

3. LLMs with FP4

FP4 was first applied at scale to LLM inference, which is bottlenecked by exactly the operations low precision most helps: moving weights from memory, multiplying weights by activations, and storing a KV cache that grows with context. The most demanding target is W4A4 (both weights and activations in four bits) because quantizing both is what accelerates the arithmetic, not merely the storage. That arithmetic is dominated by matrix multiplication, which in W4A4 executes entirely in 4-bit on Blackwell Tensor Cores:

Figure 3. W4A4 NVFP4 inference: both weights and activations are 4-bit, so the dominant matrix multiplies run directly on Blackwell Tensor Cores in low precision, up to a 4× throughput ceiling over BF16, with far less memory traffic [1][5].

Those four-bit GEMMs are not free to feed, though. Every activation has to be quantized to NVFP4 just before the multiply, and if that cast runs as its own kernel it adds an extra round-trip through HBM, an op-overhead tax that can quietly erode the 4× ceiling, especially on smaller, memory-bound layers. The standard fix is kernel fusion: fold the activation quantization into the GEMM’s prologue (and the preceding normalization into the epilogue) so the cast happens inline, in registers, with no extra pass over memory. Production NVFP4 stacks lean on exactly these fused kernels: NVIDIA’s TransformerEngine supplies the fused quantize-and-GEMM path, and the TensorRT Model Optimizer supplies the NVFP4 quantization recipes [9][10].
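A back-of-envelope model of that extra round-trip, counting only HBM bytes for the activation tensor. The traffic model and the tensor shape are our assumptions for illustration; real kernels also move weights, scales, and outputs, and may overlap transfers with compute.

```python
def activation_traffic_bytes(n_elements: int, fused: bool) -> float:
    """Rough HBM bytes moved to feed `n_elements` BF16 activations into an NVFP4 GEMM."""
    bf16 = 2.0                  # bytes per BF16 value
    nvfp4 = (4 + 8 / 16) / 8    # 4-bit value plus one FP8 scale per 16 values
    if fused:
        # The GEMM prologue reads BF16 once and quantizes in registers.
        return n_elements * bf16
    # Standalone cast kernel: read BF16, write NVFP4, then the GEMM reads NVFP4 back.
    return n_elements * (bf16 + nvfp4 + nvfp4)

n = 8192 * 4096  # hypothetical activation tensor: 8192 tokens x 4096 channels
unfused = activation_traffic_bytes(n, fused=False)
fused = activation_traffic_bytes(n, fused=True)
print(f"unfused: {unfused / 2**20:.0f} MiB, fused: {fused / 2**20:.0f} MiB, "
      f"ratio {unfused / fused:.2f}x")
```

Under this simple model the unfused path moves about 1.56× as many activation bytes, which matters most on layers that are already memory-bound.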

Does the accuracy survive? Largely, yes. Post-training quantization of DeepSeek-R1 to NVFP4 stays within about 1% of FP8 across reasoning and knowledge benchmarks (MMLU-Pro 85→84, GPQA 81→80, AIME 2024 actually up 89→91) [1]. And NVFP4 consistently beats MXFP4: in a head-to-head pretraining run, MXFP4 needed 36% more tokens to reach the same loss as NVFP4 [12][3]. This accuracy headroom is what lets the format’s efficiency gains be realized without retraining.

Figure 4. NVFP4 across the LLM lifecycle, relative to FP8 (= 1.0×): about 1.9× faster pretraining than FP8 (Llama-3.1 405B) and up to ~3× the peak inference throughput of FP8 on the same Blackwell hardware, at FP8-level accuracy and ahead of MXFP4 [11][1].

This is no longer a lab result. NVFP4 inference ships in TensorRT-LLM, the quantization recipes ship in the TensorRT Model Optimizer, and NVFP4 checkpoints of frontier models (including DeepSeek-R1) are published for Blackwell deployment [1][11]. Newer designs go further and build FP4 into the architecture itself: DeepSeek-V4, a 1.6-trillion-parameter Mixture-of-Experts model, trains its expert weights and its sparse-attention indexer directly in FP4 with quantization-aware training, keeping the remaining components in FP8. The largest part of the model is therefore stored and computed in four bits by design, rather than quantized after the fact [15]. OpenAI’s open-weight GPT-OSS models make the same choice with a different format: their MoE experts, over 90% of the parameters, are trained quantization-aware in MXFP4, which is what lets the 120B model run on a single 80 GB GPU [16]. The contrast is instructive: both bake 4-bit into the model through QAT, but DeepSeek-V4 uses NVFP4’s finer 16-value blocks while GPT-OSS uses MXFP4’s simpler 32-value blocks, the same tradeoff from Section 2, now decided inside frontier models.

4. Video Generation with FP4

Diffusion and video generation are a more recent target, and the same W4A4 approach applies: a diffusion transformer (DiT) is largely the matrix multiplies of the figure above, repeated across many denoising steps. Quantization works here too: methods like ViDiT-Q push DiTs to W8A8 and W4A8 with negligible visual loss using custom kernels [14]. The aggressive step is to go all the way to W4A4 NVFP4, end to end.

This direction builds on our earlier Blackwell work. In February 2025, our RTX 5090 setup report documented early consumer-GPU access to native FP4, and our SVDQuant + NVFP4 demo then showed FLUX running 4× smaller and 3× faster than BF16 with near-16-bit quality, and better image quality than INT4 [23][24]. LongLive-2.0 extends that inference-first line into an end-to-end long-video system spanning training, W4A4 execution, the KV cache, and attention.

LongLive-2.0 is the clearest example: an autoregressive (AR) long-video model built on Wan2.2-TI2V-5B that runs NVFP4 in both training and inference, to our knowledge the first end-to-end NVFP4 recipe for long video generation [5][8]. The result is a 5B model that generates minute-long, 720p video in real time:

Figure 5. Real NVFP4-generated frames from LongLive-2.0 (a robot planting a seedling): a 5B autoregressive model streams 720p video in real time (45.7 FPS at 1280×720, about 2× real time, and ~14× faster than the 3.3 FPS 50-step base model), each chunk built from the clean history before it. Frames from the LongLive-2.0 paper teaser [5].

Training: one NVFP4 pipeline, two stages

Quality survives the drop to four bits because the model is trained NVFP4-aware rather than quantized after the fact, and that training runs in two NVFP4 stages. First, the bidirectional base model is fine-tuned into a chunk-level AR generator with clean-context teacher forcing: each chunk is denoised conditioned on the clean history before it. To fit the long sequences, Balanced sequence parallelism shards the paired clean and noisy chunks across GPUs so the loss-bearing work stays even, while NVFP4 accelerates the GEMM-heavy DiT; together they cut 64-second AR-training iteration time by up to 2.1× over BF16+SP (plain BF16 without SP runs out of memory).

Second, to reach real-time speed, the model is distilled to a few denoising steps with distribution-matching distillation (DMD), run, again, in four bits. DMD co-locates three networks on each GPU: a generator, a real-score model, and a fake-score model. All three share a frozen W4A4 NVFP4 backbone and train only small LoRA adapters, an idea borrowed from QeRL, which showed that pairing an NVFP4 backbone with LoRA makes even reinforcement learning cheap (its quantization noise can even aid exploration) [21]. An adaptive “4-or-6” scale search picks the lower-error magnitude per block, and because the distillation is single-stage (no ODE initialization or progressive long-tuning) the trained LoRA simply plugs into the 4-step AR model and halves it to 2 steps with no further training. Quantizing the three branches one after another walks DMD peak memory from 70.5 GB to 49.0 GB per GPU (0.69×).
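One way to read the “4-or-6” scale search is sketched below: for each 16-value block, try mapping the block maximum to 6 (the full E2M1 range) or to 4 (a denser set of marks near the top), and keep whichever reconstructs the block with lower squared error. This is our interpretation for illustration, not the LongLive-2.0 code; it also skips the FP8 rounding of the block scale.

```python
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def round_to_e2m1(x):
    mag = np.minimum(np.abs(x), 6.0)
    idx = np.abs(mag[..., None] - E2M1_GRID).argmin(axis=-1)
    return np.sign(x) * E2M1_GRID[idx]

def quantize_blocks(blocks, target):
    """Quantize each row so that its largest magnitude maps to `target`."""
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    scale = np.where(amax > 0, amax / target, 1.0)
    return round_to_e2m1(blocks / scale) * scale

def quantize_4_or_6(x, block=16):
    """Per block, keep the candidate (max -> 4 or max -> 6) with lower squared error."""
    blocks = x.reshape(-1, block)
    cand6 = quantize_blocks(blocks, 6.0)
    cand4 = quantize_blocks(blocks, 4.0)
    err6 = ((cand6 - blocks) ** 2).sum(axis=1, keepdims=True)
    err4 = ((cand4 - blocks) ** 2).sum(axis=1, keepdims=True)
    use4 = err4 < err6
    return np.where(use4, cand4, cand6).reshape(x.shape), float(use4.mean())

rng = np.random.default_rng(0)
x = rng.normal(size=4096)  # synthetic weights

fixed6 = quantize_blocks(x.reshape(-1, 16), 6.0).reshape(x.shape)
adaptive, frac4 = quantize_4_or_6(x)
mse_6 = np.mean((fixed6 - x) ** 2)
mse_4or6 = np.mean((adaptive - x) ** 2)
print(f"always 6: {mse_6:.5f}  4-or-6: {mse_4or6:.5f}  blocks choosing 4: {frac4:.0%}")
```

Because the choice is made per block from two candidates that include the default, the adaptive error can never be worse than always mapping to 6.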

Figure 6. NVFP4 DMD distillation in LongLive-2.0. The generator, real-score, and fake-score models are co-located in a frozen W4A4 NVFP4 setup; only the small green LoRA modules are trainable (the real-score model carries no LoRA). The generator rolls out video chunks (left loop) that both score models evaluate: the DMD loss (real − fake) updates the generator’s LoRA, while a diffusion loss updates the fake-score LoRA, and the trained LoRA later converts the 4-step model to 2 steps. Redrawn from the LongLive-2.0 paper [5].

Inference: W4A4, an NVFP4 KV cache, and overlapped decode

At deployment the generator runs in W4A4 NVFP4, the KV cache is stored in NVFP4, and VAE decoding is overlapped with denoising on a separate GPU so it never extends the critical path. Stacked, these carry the 5B model to 45.7 FPS at 720p. The two sides reinforce each other: the same NVFP4 recipe that makes inference fast is what made the long-video fine-tune affordable to train in the first place.

Figure 7. LongLive-2.0’s measured results. Left (AR training, 64s iteration time in seconds, lower is better): BF16+SP 1372.9, Balanced SP 1196.5, NVFP4 + Balanced SP 639.5. NVFP4 with Balanced sequence parallelism cuts 64s AR-training iteration time up to 2.1× (plain BF16 without SP runs out of memory). Right (inference peak memory in GB, lower is better): BF16 36.4 at 24.8 FPS, +NVFP4 29.7 at 32.0 FPS, +KV cache 19.4 at 29.7 FPS. W4A4 NVFP4 then an NVFP4 KV cache drop inference peak memory from 36.4 GB to 19.4 GB while raising throughput (24.8 → 32.0 FPS) [5].

The NVFP4 KV cache is what lets the model retain a minute of generated history within a fixed memory budget. Because it raises a distinct set of problems, we treat it on its own.

5. KV Cache Quantization with FP4

In any autoregressive model, the keys and values of past tokens are the model’s memory, and that memory grows linearly with length until it dominates everything. For LLMs this is the long-context wall: KVQuant showed that quantizing the cache to ~3 bits preserves accuracy well enough to reach 10-million-token contexts [13]. AR video has the exact same problem, only heavier: each generated chunk becomes history that later chunks attend to, and video tokens are far larger than text tokens. Quant VideoGen, for one, pushes the cache to just 2 bits for autoregressive video diffusion (up to 7× smaller with under 4% latency overhead) via semantic-aware smoothing and progressive residual quantization [22].

**K-smoothing is a shared ingredient, not a point of contrast.** Keys often contain offsets and outliers that waste the limited resolution of a low-bit codebook. Smoothing or centering K before quantization tightens its effective range, so more of those scarce levels represent useful variation. Quant VideoGen uses semantic-aware smoothing; LongLive-2.0 subtracts each key vector’s channel mean before NVFP4 micro-block quantization; and SageAttention3 explicitly inherits K-smoothing from SageAttention in its FP4 attention kernel. The exact recipes differ, but the principle is broadly effective: smooth first, then quantize [22][5][7].
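Why centering K is safe can be shown in a few lines. The sketch below assumes the SageAttention-style variant, where the mean is taken over tokens for each channel; subtracting it shifts every score of a given query by the same constant, which softmax ignores, while the range that the quantizer has to cover shrinks. The shapes and the synthetic per-channel offset are ours.

```python
import numpy as np

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
tokens, dim = 128, 64
offset = rng.normal(scale=4.0, size=dim)      # per-channel bias shared by all keys
K = rng.normal(size=(tokens, dim)) + offset
Q = rng.normal(size=(8, dim))

k_mean = K.mean(axis=0, keepdims=True)        # channel mean over tokens
K_smooth = K - k_mean

P_ref = softmax(Q @ K.T / np.sqrt(dim))
P_smooth = softmax(Q @ K_smooth.T / np.sqrt(dim))

print("attention weights unchanged:", np.allclose(P_ref, P_smooth))
print(f"max |K| before: {np.abs(K).max():.2f}, after: {np.abs(K_smooth).max():.2f}")
```

The smaller range after centering is what lets a 4-bit codebook spend its marks on the variation between keys instead of on a shared offset.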

Quantizing the cache to NVFP4 attacks this directly. NVIDIA reports the NVFP4 KV cache cutting footprint up to ~50% versus FP8 with under 1% accuracy loss, and beating MXFP4 KV cache by ~5% thanks to the finer block scaling [4]. The deeper advantage is hardware: NVFP4 dequantizes along Blackwell’s native FP4→FP8 datapath, whereas generic INT4/INT2 KV caches have no such datapath and must dequantize in software [1].
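The footprint arithmetic is easy to reproduce. The sketch below uses a hypothetical model shape (the layer, head, and token counts are placeholders, not LongLive-2.0's) and counts NVFP4 as 4.5 bits per value, that is, a 4-bit element plus one FP8 scale per 16 values, ignoring the negligible per-tensor FP32 scale.

```python
def kv_cache_gib(tokens, layers, kv_heads, head_dim, bits_per_value):
    """Size of the cached keys and values for `tokens` positions, in GiB."""
    values = 2 * tokens * layers * kv_heads * head_dim  # 2 = keys + values
    return values * bits_per_value / 8 / 2**30

BITS = {"BF16": 16.0, "FP8": 8.0, "NVFP4": 4 + 8 / 16}
shape = dict(tokens=100_000, layers=32, kv_heads=8, head_dim=128)  # hypothetical
sizes = {name: kv_cache_gib(bits_per_value=b, **shape) for name, b in BITS.items()}

for name, gib in sizes.items():
    print(f"{name:6s} {gib:6.2f} GiB")
print(f"NVFP4 vs BF16: {sizes['BF16'] / sizes['NVFP4']:.2f}x smaller")
print(f"NVFP4 vs FP8:  {1 - sizes['NVFP4'] / sizes['FP8']:.0%} smaller")
```

The ratios do not depend on the model shape: 16 / 4.5 is about 3.56× against BF16, and 4.5 / 8 leaves about 44% saved against FP8, in line with the “up to ~50%” figure above once the scale overhead is counted.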

Quant VideoGen and NVFP4 therefore target different points on the quality–memory frontier. This is not an apples-to-apples benchmark, but the design tradeoff is clear: INT2 is capacity-first, using an extremely compact code and progressive residual refinement to maximize compression; NVFP4 is quality-first, using twice the bit width and floating-point dynamic range to preserve more quality headroom when small cache errors compound over a long video. NVFP4 also keeps K and V in the same Blackwell-native format used by SageAttention3, so an NVFP4 KV cache can feed an NVFP4 attention path without introducing an INT2-to-FP4 format boundary [22][7].

And dequantization is the catch. KV is generated autoregressively, so the cache is re-read in full on every single decode step. If it is stored quantized, it must be dequantized again and again in a tight per-step loop: a recurring, bandwidth-bound tax rather than a one-time conversion. A naive implementation does this one cached chunk per kernel launch, and the launch latency stacks up. LongLive-2.0’s answer is a custom parallel dequantization kernel that rebuilds every in-window chunk in a single launch, keeping total dequant overhead under 2% [5][6].
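The serial-versus-single-launch contrast has a simple NumPy analog: one call per chunk against one vectorized call over the whole window. This is only an illustration of the access pattern, not the CUDA kernel; the cache layout (already-decoded E2M1 values plus one scale per 16-value block) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
chunks, blocks_per_chunk, block = 5, 64, 16

# Hypothetical cache layout: decoded E2M1 values and one scale per block.
q = rng.choice([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0],
               size=(chunks, blocks_per_chunk, block))
q *= rng.choice([-1.0, 1.0], size=q.shape)
scales = rng.uniform(0.1, 2.0, size=(chunks, blocks_per_chunk, 1))

def dequant_serial(q, scales):
    """One call per cached chunk, mirroring one kernel launch per chunk."""
    return np.stack([q[c] * scales[c] for c in range(q.shape[0])])

def dequant_batched(q, scales):
    """One call over the whole window, mirroring a single fused launch."""
    return q * scales

serial = dequant_serial(q, scales)
batched = dequant_batched(q, scales)
print("same result:", np.array_equal(serial, batched), "| shape:", batched.shape)
```

Both paths produce identical values; the difference on a GPU is the fixed launch cost paid once instead of once per chunk on every decode step.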

Figure 8. Reconstructing the NVFP4 KV cache before attention (NVFP4 cache ≈ 3.6× smaller). A naive path dequantizes one chunk per kernel launch, so latency stacks up across the sliding window (top: five serial launches, one per chunk c0 to c4); LongLive-2.0’s fused kernel rebuilds every in-window chunk in a single launch (bottom), keeping total dequant overhead under 2% [5].

6. FP4 Attention

With weights, activations, and the KV cache in four bits, the remaining component is attention itself—the two matrix multiplies that turn queries and keys into scores, and scores into a weighted sum of values. Attention has followed the same precision curve as the rest of the model, one format at a time.

FlashAttention-2 set the modern baseline by computing exact attention in FP16/BF16[17]. FlashAttention-3 then added an FP8 path on Hopper—running both matmuls in 8-bit and reaching roughly 1.2 PFLOP/s on an H100[18]. The SageAttention line went lower still, using integers: the original quantized the query–key score matmul to INT8 (keeping the probability×value matmul in FP16) for about 2.1× over FlashAttention-2[19], and SageAttention2 took queries and keys to INT4 with the probabilities and values in FP8 for about 3×[20]. SageAttention3 is the most recent step: both matmuls in NVFP4 on Blackwell.

The precision ladder for attention: fewer bits, more speed.

| Format | Kernel | What is quantized | Hardware | Bits | Speed |
| --- | --- | --- | --- | --- | --- |
| BF16 / FP16 | FlashAttention-2 | Exact softmax, 16-bit baseline | A100 / H100 | 16 | 1× |
| FP8 | FlashAttention-3 | 8-bit QK / PV, FP32 accumulation | Hopper (H100) | 8 | ~1.2 PFLOPs/s |
| INT8 → INT4 | SageAttention / SageAttention2 | INT8/INT4 Q, K; FP8 P, V | RTX 4090 / Hopper | 8 → 4 | ~2.1× → ~3× vs FA2 |
| FP4 (NVFP4) | SageAttention3 | 4-bit float, both matmuls | Blackwell | 4 | ~5× / 1038 TOPS |

Figure 9. Attention has tracked the same precision curve as the rest of the model: FP16 (FlashAttention-2) → FP8 (FlashAttention-3) → INT8/INT4 (SageAttention 1/2) → FP4 (SageAttention3), each step trading representational range for throughput [7].

Pushing attention to four bits is delicate, and one tensor is the reason. Q, K, and V are roughly zero-centered with a wide range, so ordinary block-wise FP4 handles them. The softmax map P is the hard one: after softmax its values live in [0, 1], crammed near zero, so a naive 4-bit scale wastes almost all of its range. SageAttention3 solves this with a two-level trick: first stretch each row of P by a per-token FP32 factor so it fills the representable range, then quantize. Crucially, both matmuls run in NVFP4: P’s values end up in 4-bit (only its block scale is FP8) [7].
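A small NumPy emulation of why the stretch helps. It assumes the per-token factor is the row maximum divided by 448×6, so each row's largest probability lands on the largest value that an E4M3 scale times an E2M1 element can represent; the E4M3 rounding helper and the synthetic logits are ours, and the numbers will not match the paper's 93.3% and 99.5%.

```python
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def round_to_e2m1(x):
    mag = np.minimum(np.abs(x), 6.0)
    idx = np.abs(mag[..., None] - E2M1_GRID).argmin(axis=-1)
    return np.sign(x) * E2M1_GRID[idx]

def round_to_e4m3(s):
    """Approximate FP8 E4M3 rounding for non-negative scales (max 448)."""
    s = np.clip(s, 0.0, 448.0)
    e = np.floor(np.log2(np.maximum(s, 2.0 ** -6)))
    step = 2.0 ** (e - 3)
    return np.round(s / step) * step

def quantize_fp4_fp8scale(x, block=16):
    """Block-wise FP4 with one FP8 (E4M3) scale per 16 values; x is non-negative."""
    rows, cols = x.shape
    blocks = x.reshape(rows, cols // block, block)
    scale = round_to_e4m3(blocks.max(axis=-1, keepdims=True) / 6.0)
    safe = np.where(scale > 0, scale, 1.0)   # a zero scale zeroes the whole block
    return (round_to_e2m1(blocks / safe) * scale).reshape(rows, cols)

def quantize_p_two_level(p, block=16):
    """Stretch each row in FP32 so its maximum maps to 448 * 6, then quantize."""
    row_scale = p.max(axis=-1, keepdims=True) / (448.0 * 6.0)  # assumed factor
    return quantize_fp4_fp8scale(p / row_scale, block) * row_scale

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
P = softmax(rng.normal(scale=3.0, size=(64, 256)))  # synthetic attention rows

def rel_err(a, b):
    return np.linalg.norm(a - b) / np.linalg.norm(b)

err_direct = rel_err(quantize_fp4_fp8scale(P), P)
err_two_level = rel_err(quantize_p_two_level(P), P)
print(f"direct: {err_direct:.4f}  two-level: {err_two_level:.4f}")
```

Without the stretch, most block scales fall into the bottom of the E4M3 range, where they round coarsely or to zero; after the stretch they spread across the range the FP8 scale resolves well.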

Figure 10. SageAttention3 runs both attention matmuls in NVFP4 on Blackwell. Pipeline: Q, K, V are quantized to FP4 (E2M1, 1×16 blocks, FP8 E4M3 scale); S = Q·Kᵀ runs on FP4 Tensor Cores; online softmax produces P in full precision; P is quantized and O = P·V runs on FP4 Tensor Cores. The hard part is the softmax map P, whose [0,1] values are stretched (×1/(448·6) in FP32) to fill the range before 4-bit quantization, lifting cosine similarity from 93.3% (direct) to 99.5% and reaching 1038 TOPS, ~5× FlashAttention2 on an RTX 5090. The figure also lists NVFP4 at 99.52% against MXFP4 at 98.37% [7].

The result is substantial: 1038 TOPS on an RTX 5090, roughly 5× the fastest FlashAttention available there, and about 2.4–3× end-to-end on video diffusion, with the two-level stretch raising P’s cosine similarity from 93.3% to 99.5%, and NVFP4 again ahead of MXFP4 [7]. It is not free: FP4 attention still incurs more accuracy risk than FP4 GEMMs, which is why it remains an active research area rather than a settled one.

Closing: FP4 Is The Future

Historically, 4-bit precision implied a large accuracy penalty. On Blackwell, NVFP4 reduces that penalty to roughly 1% on many tasks while delivering substantial speedups, and the approach now spans the landscape: LLMs (DeepSeek-R1), diffusion and image generation (FLUX), and autoregressive video (LongLive-2.0). The recurring lesson is that FP4 works only when the format, the kernels, the cache, and attention are co-designed—fine-grained scaling, fused kernels, hardware-accelerated dequantization, and the dedicated handling that the softmax map requires.

It is also far from finished. The open problems are where the next round of speed and quality will come from:

  • **Better scales, lower quantization error.** Smarter per-block scale search, rotations / Hadamard transforms, and outlier handling to squeeze more signal out of fifteen marks.
  • **KV-cache dequantization efficiency.** The autoregressive loop-dequant tax is real; faster fused dequant (and storing more of attention in low precision) is wide open.
  • **FP4 attention quality.** The accuracy loss is still higher than for FP4 GEMMs; closing that gap would put the entire model on a 4-bit path.
  • **QAT vs PTQ.** PTQ is cheap but leaves accuracy on the table; quantization-aware (or fully quantized) training recovers it but costs compute. Narrowing that gap, with cheap recipes that reach QAT-level quality, may matter most of all.

Four-bit precision is becoming a practical default for large-model inference, and increasingly for training. Realizing it fully is as much a systems problem—formats, kernels, and memory—as a modeling one.

References

  1. **Introducing NVFP4 for Efficient and Accurate Low-Precision Inference.** Eduardo Alvarez, Omri Almog, Eric Chung, Simon Layton, Dusan Stosic, Ronny Krashinsky, Kyle Aubrey. NVIDIA Technical Blog, 2025.
  2. **OCP Microscaling Formats (MX) Specification.** Open Compute Project, 2023. OCP Specification.
  3. **NVFP4 Trains with Precision of 16-Bit and Speed and Efficiency of 4-Bit.** Kirthi Devleker and Farshad Ghodsian. NVIDIA Technical Blog, 2025.
  4. **Optimizing Inference for Long Context and Large Batch Sizes with NVFP4 KV Cache.** Eduardo Alvarez, Wei-Ming Chen, Huizi Mao. NVIDIA Technical Blog, 2025.
  5. **LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation.** Yukang Chen, Luozhou Wang, Wei Huang, Shuai Yang, Bohan Zhang, Yicheng Xiao, Ruihang Chu, Weian Mao, Qixin Hu, Shaoteng Liu, Yuyang Zhao, Huizi Mao, Ying-Cong Chen, Enze Xie, Xiaojuan Qi, Song Han. arXiv, 2026. arXiv:2605.18739 and project page; code at github.com/NVlabs/LongLive
  6. **LongLive2.0 Documentation.** NVlabs, 2026. Documentation.
  7. **SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-bit Training.** Jintao Zhang, Jia Wei, Pengle Zhang, Xiaoming Xu, Haofeng Huang, Haoxu Wang, Kai Jiang, Jianfei Chen, Jun Zhu. arXiv, 2025. arXiv:2505.11594
  8. **Wan2.2: Open and Advanced Large-Scale Video Generative Models (Wan2.2-TI2V-5B base model).** Wan Team, 2025. Wan2.2 repository.
  9. **NVIDIA TransformerEngine.** NVIDIA, 2024. Library of fused low-precision (FP8/FP4) quantization-and-GEMM kernels for Hopper and Blackwell GPUs. github.com/NVIDIA/TransformerEngine
  10. **NVIDIA TensorRT Model Optimizer.** NVIDIA, 2024. Quantization toolkit with NVFP4 post-training and quantization-aware recipes for efficient deployment. github.com/NVIDIA/TensorRT-Model-Optimizer
  11. **3 Ways NVFP4 Accelerates AI Training and Inference.** NVIDIA Technical Blog, 2025.
  12. **Pretraining LLMs with NVFP4.** NVIDIA. arXiv, 2025. arXiv:2509.25149
  13. **KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization.** Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami. NeurIPS, 2024. arXiv:2401.18079
  14. **ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation.** Tianchen Zhao et al. arXiv, 2024. arXiv:2406.02540
  15. **DeepSeek-V4.** DeepSeek-AI, 2026. A 1.6T-parameter Mixture-of-Experts model that applies FP4 quantization-aware training to its expert weights and sparse-attention indexer (FP8 elsewhere), targeting NVIDIA Blackwell. DeepSeek-V4-Pro model card.
  16. **OpenAI gpt-oss (gpt-oss-120b, gpt-oss-20b).** OpenAI, 2025. Open-weight Mixture-of-Experts models whose MoE expert weights (over 90% of parameters) are quantization-aware-trained in MXFP4, enabling the 120B model to run on a single 80 GB GPU. gpt-oss-120b model card.
  17. **FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning.** Tri Dao. arXiv, 2023. arXiv:2307.08691
  18. **FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-Precision.** Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao. arXiv, 2024. arXiv:2407.08608
  19. **SageAttention: Accurate 8-Bit Attention for Plug-and-Play Inference Acceleration.** Jintao Zhang, Jia Wei, Haofeng Huang, Pengle Zhang, Jun Zhu, Jianfei Chen. arXiv, 2024. arXiv:2410.02367
  20. **SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-Thread INT4 Quantization.** Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, Jianfei Chen. arXiv, 2024. arXiv:2411.10958
  21. **QeRL: Beyond Efficiency: Quantization-enhanced Reinforcement Learning for LLMs.** Huang et al. arXiv, 2025. Pairs an NVFP4-quantized backbone with LoRA for efficient RL training of LLMs. arXiv:2510.11696
  22. **Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization.** Haocheng Xi et al. ICML, 2026. Training-free 2-bit KV-cache quantization for AR video diffusion using semantic-aware smoothing and progressive residual quantization (up to 7× smaller, <4% latency overhead). arXiv:2602.02958
  23. **RTX 5090 Workstation Configuration Journey.** Qinghao Hu, Jiaming Tang, Yujun Lin, Zhuoyang Zhang, Zhekai Zhang, Shang Yang, Song Han. MIT Han Lab Blog, February 2025.
  24. **SVDQuant Meets NVFP4: 4× Smaller and 3× Faster FLUX with 16-bit Quality on NVIDIA Blackwell GPUs.** Muyang Li, Yujun Lin, Zhekai Zhang, Tianle Cai, Xiuyu Li, Junxian Guo, Enze Xie, Chenlin Meng, Jun-Yan Zhu, Song Han. MIT Han Lab Blog, February 2025.
