@PyTorch: Model Optimization and Post-Training Quantization Model quantization is an effective method to reduce VRAM usage and im…
Summary
This post from NVIDIA explains how to use the NVIDIA Model Optimizer library to quantize a CLIP model to FP8 using post-training quantization, reducing VRAM usage and improving inference performance on consumer GPUs.
Cached at: 05/26/26, 06:56 PM
Model Optimization and Post-Training Quantization
Model quantization is an effective method to reduce VRAM usage and improve inference performance on consumer devices. By lowering computational and memory requirements while preserving model quality, quantization helps AI models run more efficiently in resource-constrained environments.
This post walks through how to use NVIDIA Model Optimizer to quantize a CLIP model in FP8 format with the post-training quantization (PTQ) method, including an example workflow exporting a PyTorch checkpoint.
Read the complete blog post:
Model Quantization: Post-Training Quantization Using NVIDIA Model Optimizer
Source: https://developer.nvidia.com/blog/model-quantization-post-training-quantization-using-nvidia-model-optimizer/
Model quantization is an effective method to reduce VRAM usage and improve inference performance on consumer devices such as NVIDIA GeForce RTX GPUs. By lowering computational and memory requirements while preserving model quality, quantization helps AI models run more efficiently in resource-constrained environments.
This post walks through how to use NVIDIA Model Optimizer to quantize a CLIP model in FP8 format with the post-training quantization (PTQ) method. For a general introduction to model quantization, see Model Quantization: Concepts, Methods, and Why It Matters.
What is NVIDIA Model Optimizer?
The NVIDIA Model Optimizer (ModelOpt) library incorporates state-of-the-art model optimization techniques to compress and accelerate AI models. These techniques include quantization, distillation, pruning, speculative decoding, and sparsity. ModelOpt accepts Hugging Face, PyTorch, or ONNX format models as input and provides Python APIs for users to easily combine different optimization techniques to produce optimized checkpoints.
ModelOpt supports highly performant quantization formats such as FP4, FP8, INT8, and INT4, and advanced algorithms including SmoothQuant, AWQ, SVDQuant, and Double Quantization. It supports both PTQ and quantization-aware training (QAT).
What is CLIP?
CLIP (Contrastive Language-Image Pretraining), introduced by OpenAI in 2021, is a foundation vision language model (VLM) that learns a shared embedding space for images and text through contrastive learning on large image-text pairs. Its ability to produce semantically aligned representations has made it a core building block across modern multimodal systems.
The CLIP text encoder is widely reused as a conditioning module for text-to-image (Stable Diffusion, for example) and text-to-video (AnimateDiff, for example) synthesis. Its vision encoder serves as the visual backbone in multimodal LLMs, such as LLaVA, and open-vocabulary perception models, such as OWL-ViT. Successors such as OpenCLIP and SigLIP scale the data and refine the objective but preserve the dual-encoder contrastive paradigm.
Quantization recipe
This post uses the following quantization recipe as a step-by-step guide to quantizing a CLIP model with ModelOpt and to show how the process works.
First, prepare the corresponding models and datasets as shown below:
- Base CLIP model: CLIP-ViT-L-14-laion2B-s32B-b82K
- Calibration dataset for quantization: 10K subset from MS-COCO
- Model accuracy evaluation: three tasks from the CLIP_benchmark
  - cifar100 (zero-shot classification)
  - imagenet1k (zero-shot classification)
  - mscoco_captions (zero-shot retrieval)
How to run PTQ with ModelOpt
The following code sample shows how to run PTQ for the CLIP model in FP8 using ModelOpt:
import torch
from torch.utils.data import DataLoader, Subset
from transformers import CLIPModel, CLIPTokenizer, CLIPImageProcessor
from transformers.models.clip.modeling_clip import CLIPAttention
import modelopt.torch.opt as mto
import modelopt.torch.quantization as mtq
from modelopt.torch.quantization.plugins.diffusion.diffusers import _QuantAttention

# FP8 (E4M3) per-tensor static quantization
FP8_CFG = {
    "quant_cfg": {
        "*weight_quantizer": {"num_bits": (4, 3), "axis": None, "trt_high_precision_dtype": "Half"},
        "*input_quantizer": {"num_bits": (4, 3), "axis": None, "trt_high_precision_dtype": "Half"},
        "*[qkv]_bmm_quantizer": {"num_bits": (4, 3), "axis": None, "trt_high_precision_dtype": "Half"},
        "*bmm2_output_quantizer": {"num_bits": (4, 3), "axis": None, "trt_high_precision_dtype": "Half"},
        "default": {"enable": False},
    },
    "algorithm": "max",
}

mto.enable_huggingface_checkpointing()
mtq.QuantModuleRegistry.register({CLIPAttention: "CLIPAttention"})(_QuantAttention)

model = CLIPModel.from_pretrained(args.model_ckpt, attn_implementation="sdpa").half().eval().cuda()
tokenizer = CLIPTokenizer.from_pretrained(args.model_ckpt)
processor = CLIPImageProcessor.from_pretrained(args.model_ckpt)
calib_set = Subset(CLIP_COCO_dataset(ANN, IMG_DIR, tokenizer, processor), range(8192))
loader = DataLoader(calib_set, batch_size=512, num_workers=4)

# Calibration: 8k MS-COCO image-text pairs
def calibrate(m):
    for img, txt in loader:
        m.get_text_features(input_ids=txt.cuda())
        m.get_image_features(pixel_values=img.cuda())

q_model = mtq.quantize(model, FP8_CFG, forward_loop=calibrate)

# Save quantized modelopt checkpoint
q_model.save_pretrained(ckpt_path)
mtq.print_quant_summary(q_model)
FP8_CFG is just one recipe: W8A8 (FP8 on both weights and activations), per-tensor, static quantization, calibrated with the simple AbsMax algorithm. ModelOpt supports many more dimensions of choice (per-channel / block-wise granularity, dynamic activation quantization, advanced calibration algorithms such as AWQ / GPTQ, and many more).
For the detailed configuration schema, see the ModelOpt quantization guide. The hyperparameters in the quantization configuration can always be tuned as needed, and finding the optimal values usually requires some iteration.
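To make the per-tensor AbsMax idea concrete, the sketch below computes a scale the way a max calibrator conceptually would: take the largest absolute value observed and divide it by the largest finite FP8 E4M3 value (448). This is a plain-Python illustration and not ModelOpt's implementation; the helper names and the sample weights are hypothetical.

```python
# Illustrative only: plain-Python AbsMax scale computation, not ModelOpt code.
E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def absmax_scale(values, qmax=E4M3_MAX):
    """Return the scale that maps the largest |value| onto qmax."""
    amax = max(abs(v) for v in values)
    return amax / qmax


def per_channel_scales(rows, qmax=E4M3_MAX):
    """Return one scale per row (channel) instead of one for the whole tensor."""
    return [absmax_scale(row, qmax) for row in rows]


# Hypothetical 2x3 weight matrix in which the second channel holds an outlier.
weights = [
    [0.5, -1.0, 0.25],
    [8.0, -44.8, 2.0],
]

flat = [v for row in weights for v in row]
tensor_scale = absmax_scale(flat)             # one scale, dominated by the outlier
channel_scales = per_channel_scales(weights)  # finer granularity: one scale per row

print("per-tensor scale:", tensor_scale)
print("per-channel scales:", channel_scales)
```

With a single per-tensor scale, the small-valued first channel is quantized with a step size chosen for the outlier channel. That is one reason finer granularities can help when accuracy drops.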
After mtq.quantize returns, CLIP's Linear layers all carry weight and activation quantizers, but the attention blocks are still untouched. This is because multi-head attention dispatches to torch.nn.functional.scaled_dot_product_attention, a functional API that the ModelOpt module walker cannot intercept on its own.
To bring attention into the quantization scope, register a quantized replacement for CLIPAttention:
mtq.QuantModuleRegistry.register({CLIPAttention: "CLIPAttention"})(_QuantAttention)
Each CLIPAttention instance is now upgraded to _QuantAttention from the ModelOpt diffusers plugin. Inside its forward pass, _QuantAttention transparently intercepts the SDPA call and inserts four quantizers around the fused kernel:
- q_bmm_quantizer, k_bmm_quantizer, v_bmm_quantizer wrap the projected Q / K / V tensors before they enter the kernel
- bmm2_output_quantizer wraps the kernel output (softmax @ V) before it flows into out_proj
This ensures proper quantization throughout the attention mechanism.
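The wildcard keys in FP8_CFG, such as "*[qkv]_bmm_quantizer", select quantizers by name. The sketch below shows how such patterns behave, assuming glob-style matching comparable to Python's `fnmatch`. The module names are hypothetical examples rather than a dump from a real model, and `softmax_quantizer` is included only to illustrate a name that matches none of the patterns.

```python
# Illustrative only: glob-style name matching with Python's fnmatch.
from fnmatch import fnmatch

patterns = [
    "*weight_quantizer",
    "*input_quantizer",
    "*[qkv]_bmm_quantizer",
    "*bmm2_output_quantizer",
]

# Hypothetical quantizer names inside one attention block.
names = [
    "text_model.encoder.layers.0.self_attn.q_proj.weight_quantizer",
    "text_model.encoder.layers.0.self_attn.q_bmm_quantizer",
    "text_model.encoder.layers.0.self_attn.k_bmm_quantizer",
    "text_model.encoder.layers.0.self_attn.v_bmm_quantizer",
    "text_model.encoder.layers.0.self_attn.bmm2_output_quantizer",
    "text_model.encoder.layers.0.self_attn.softmax_quantizer",
]


def matching_patterns(name):
    """Return every configured pattern that matches the given quantizer name."""
    return [p for p in patterns if fnmatch(name, p)]


for name in names:
    hits = matching_patterns(name)
    print(name, "->", hits if hits else "no match, so the 'default' entry applies")
```

Under this assumption, the character class `[qkv]` covers the three BMM input quantizers with a single entry, and any name that matches nothing falls through to "default", which the recipe sets to disabled.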
To restore some accuracy, it is often advised to disable some of the quantizers using mtq.disable_quantizer. This takes a function as input, where the function itself takes a module name as input. By using regex or string matching, you can select the layers to disable. In the following example, the quantizers are disabled in the patch_embedding layer of the CLIP model.
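import re

def filter_func(name):
    # Select every quantizer whose module name contains "patch_embedding"
    pattern = re.compile(
        r".*(patch_embedding).*"
    )
    return pattern.match(name) is not None

# Pass the filter function itself, not Python's built-in `filter`
mtq.disable_quantizer(q_model, filter_func)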
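Before disabling quantizers on a real model, it can help to dry-run the name filter on a few strings and confirm that it selects only what you intend. The sketch below does that in plain Python; the quantizer names are hypothetical, loosely modeled on the Hugging Face CLIP module layout, and `is_patch_embedding` is an illustrative stand-in for the filter function.

```python
import re

# Compile once; matches any name that contains "patch_embedding".
PATCH_EMBEDDING = re.compile(r".*(patch_embedding).*")


def is_patch_embedding(name):
    """Return True when the quantizer name belongs to the patch embedding layer."""
    return PATCH_EMBEDDING.match(name) is not None


# Hypothetical quantizer names, for illustration only.
names = [
    "vision_model.embeddings.patch_embedding.input_quantizer",
    "vision_model.embeddings.patch_embedding.weight_quantizer",
    "vision_model.encoder.layers.0.mlp.fc1.weight_quantizer",
    "text_model.encoder.layers.0.self_attn.q_bmm_quantizer",
]

disabled = [n for n in names if is_patch_embedding(n)]
kept = [n for n in names if not is_patch_embedding(n)]

print("would disable:", disabled)
print("would keep:", kept)
```

A dry run like this catches overly broad patterns early, before an unintended layer silently falls back to full precision.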
CLIP benchmark evaluation
The saved ModelOpt checkpoint can be restored into any downstream evaluation script. For details, reference Restoring ModelOpt Models. The quantized CLIP checkpoint was evaluated on three benchmarks: zero-shot classification (CIFAR-100, ImageNet-1k) and zero-shot retrieval (MS-COCO Captions). The FP16 CLIP model serves as the baseline.
Figure 1. CLIP model quality comparison of the FP16 baseline versus FP8-PTQ quantized models
Based on the evaluation results, the CLIP-FP8 quantized model demonstrates comparable quality to the CLIP-FP16 model. Notably, when quantizers are disabled in the patch embedding layer, the impact of quantization on model quality becomes negligible.
Inside the ModelOpt PTQ flow
After calibration, the model is only "fake quantized": its actual data type has not changed. The inserted quantizers act as observers that simulate the effects of quantization while keeping the model in its original floating-point format.
The fake quantization process works in two key ways:
- Statistics collection: The quantizers collect tensor statistics (minimum and maximum values, for example) as data passes through them. These statistics are used to calculate optimal quantization parameters such as scaling factors.
- Quantization simulation: The quantizers perform a quantize-then-dequantize (QDQ) operation on tensors flowing through the network. This only simulates low-precision computation; real speedups and memory savings come from exporting the model to a deployment framework such as NVIDIA TensorRT.
This simulation is crucial because it enables you to evaluate the model’s accuracy before committing to actual quantization. The quantizers apply the same rounding and precision limitations that would occur in the deployed quantized model with downstream inference frameworks, so you can:
- Measure accuracy impacts before deployment
- Experiment with different quantization configurations
- Identify problematic layers that might need special handling
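The quantize-then-dequantize round trip described above can be sketched in plain Python. The example rounds values onto the FP8 E4M3 grid (3 mantissa bits, smallest normal exponent -6, saturation at ±448) and returns ordinary floats, so the data type never changes. It is a simplified illustration, not ModelOpt's kernel; the function names are hypothetical and NaN/Inf handling is omitted.

```python
import math

E4M3_MAX = 448.0    # largest finite E4M3 value
E4M3_MIN_EXP = -6   # exponent of the smallest normal value (2**-6)
MANTISSA_BITS = 3


def round_to_e4m3(x):
    """Round x to the nearest value on the E4M3 grid, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), E4M3_MAX)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) == e - 1.
    exp = max(math.frexp(mag)[1] - 1, E4M3_MIN_EXP)
    step = 2.0 ** (exp - MANTISSA_BITS)  # spacing between grid points at this exponent
    rounded = min(round(mag / step) * step, E4M3_MAX)
    return math.copysign(rounded, x)


def fake_quantize(values, scale):
    """Scale, round to the E4M3 grid, then rescale; results remain Python floats."""
    return [round_to_e4m3(v / scale) * scale for v in values]


samples = [0.3, -17.0, 1000.0]
simulated = fake_quantize(samples, scale=1.0)
errors = [abs(a - b) for a, b in zip(samples, simulated)]

print("simulated:", simulated)
print("absolute error:", errors)
```

The per-element error is what the evaluation step measures in aggregate: small values lose little, while values beyond the representable range saturate, which is why the calibrated scale matters.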
In general, the ModelOpt PTQ flow follows six stages:
- Prepare: Set quantization config to insert quantizer modules around the model’s weights and/or activations.
- Calibrate: Forward a small batch of representative data through the model so each quantizer can collect statistics (for example, activation amax) and derive its scaling factor.
- Fake quantization: Quantizers now apply a Q → DQ round-trip in floating point, faithfully simulating the precision loss of the target format while the model still runs in FP16/BF16.
- Evaluate: Measure accuracy on a held-out evaluation set and compare against the unquantized baseline.
- Iterate: If the gap is unacceptable, adjust the quantization configuration (granularity, algorithm, quantized layers), disable quantization for sensitive layers, and recalibrate.
- Export and deploy: Once the accuracy is acceptable, the fake quantized weights are compressed into their true low-precision form and exported as a checkpoint for downstream engines. In our case, we export the PyTorch checkpoint to ONNX and run inference with TensorRT. The speedups and memory savings will happen there.
Figure 2. ModelOpt PTQ workflow
QAT recovers quantization-induced quality loss by fine-tuning the model weights with frozen quantizer states. It is more compute-intensive than PTQ but can recover more of the quantized model's quality. For more details, see the ModelOpt examples.
Get started with NVIDIA Model Optimizer
This post introduced NVIDIA Model Optimizer and demonstrated a typical post-training quantization workflow by quantizing the CLIP model to FP8 with a practical code example. The results across three evaluation datasets show that FP8 quantization can preserve model quality while enabling a more efficient deployment path.
Ready to start using ModelOpt with your own models? Follow this workflow: prepare the model and calibration data, set the quantization configuration, calibrate, validate the quantized model against task-specific quality metrics, and save and restore ModelOpt checkpoints.
To explore additional workflows and adapt ModelOpt for your own use cases, see the ModelOpt documentation.