@PyTorch: Model Optimization and Post-Training Quantization Model quantization is an effective method to reduce VRAM usage and im…

X AI KOLs Following 05/26/26, 05:00 PM Tools

model-quantization post-training-quantization nvidia-model-optimizer clip-model fp8 pytorch gpu-optimization

Summary

This post from NVIDIA explains how to use the NVIDIA Model Optimizer library to quantize a CLIP model to FP8 using post-training quantization, reducing VRAM usage and improving inference performance on consumer GPUs.

Model Optimization and Post-Training Quantization Model quantization is an effective method to reduce VRAM usage and improve inference performance on consumer devices. By lowering computational and memory requirements while preserving model quality, quantization helps AI models run more efficiently in resource-constrained environments. This post walks through how to use NVIDIA Model Optimizer to quantize a CLIP model in FP8 format with the post-training quantization (PTQ) method, including an example workflow exporting a PyTorch checkpoint. Read the complete blog post:


Model Quantization: Post-Training Quantization Using NVIDIA Model Optimizer

Source: https://developer.nvidia.com/blog/model-quantization-post-training-quantization-using-nvidia-model-optimizer/

Model quantization is an effective method to reduce VRAM usage and improve inference performance on consumer devices such as NVIDIA GeForce RTX GPUs. By lowering computational and memory requirements while preserving model quality, quantization helps AI models run more efficiently in resource-constrained environments.

This post walks through how to use NVIDIA Model Optimizer to quantize a CLIP model in FP8 format with the post-training quantization (PTQ) method. For a general introduction to model quantization, see Model Quantization: Concepts, Methods, and Why It Matters.

What is NVIDIA Model Optimizer?

The NVIDIA Model Optimizer (ModelOpt) library incorporates state-of-the-art model optimization techniques to compress and accelerate AI models. These techniques include quantization, distillation, pruning, speculative decoding, and sparsity. ModelOpt accepts Hugging Face, PyTorch, or ONNX format models as input and provides Python APIs for users to easily combine different optimization techniques to produce optimized checkpoints.

ModelOpt supports highly performant quantization formats such as FP4, FP8, INT8, and INT4, and advanced algorithms including SmoothQuant, AWQ, SVDQuant, and Double Quantization. It supports both PTQ and quantization-aware training (QAT).

What is CLIP?

CLIP (Contrastive Language-Image Pretraining), introduced by OpenAI in 2021, is a foundation vision language model (VLM) that learns a shared embedding space for images and text through contrastive learning on large-scale image-text pairs. Its ability to produce semantically aligned representations has made it a core building block across modern multimodal systems.

The CLIP text encoder is widely reused as a conditioning module for text-to-image (Stable Diffusion, for example) and text-to-video (AnimateDiff, for example) synthesis. Its vision encoder serves as the visual backbone in multimodal LLMs, such as LLaVA, and open-vocabulary perception models, such as OWL-ViT. Successors such as OpenCLIP and SigLIP scale the data and refine the objective but preserve the dual-encoder contrastive paradigm.

Quantization recipe

This post uses the following quantization recipe as a step-by-step guide to running CLIP model quantization with ModelOpt and to show how the process works.

First, prepare the corresponding models and datasets as shown below:

Base CLIP model: CLIP-ViT-L-14-laion2B-s32B-b82K
Calibration dataset for quantization: 10K subset from MS-COCO
Model accuracy evaluation tasks, three from the CLIP_benchmark:
- cifar100 (zero-shot classification)
- imagenet1k (zero-shot classification)
- mscoco_captions (zero-shot retrieval)

How to run PTQ with ModelOpt

The following code sample shows how to run PTQ for the CLIP model in FP8 using ModelOpt:

import torch
from torch.utils.data import DataLoader, Subset
from transformers import CLIPModel, CLIPTokenizer, CLIPImageProcessor
from transformers.models.clip.modeling_clip import CLIPAttention

import modelopt.torch.opt as mto
import modelopt.torch.quantization as mtq
from modelopt.torch.quantization.plugins.diffusion.diffusers import _QuantAttention

# FP8 (E4M3) per-tensor static quantization
FP8_CFG = {
    "quant_cfg": {
        "*weight_quantizer":      {"num_bits": (4, 3), "axis": None, "trt_high_precision_dtype": "Half"},
        "*input_quantizer":       {"num_bits": (4, 3), "axis": None, "trt_high_precision_dtype": "Half"},
        "*[qkv]_bmm_quantizer":   {"num_bits": (4, 3), "axis": None, "trt_high_precision_dtype": "Half"},
        "*bmm2_output_quantizer": {"num_bits": (4, 3), "axis": None, "trt_high_precision_dtype": "Half"},
        "default": {"enable": False},
    },
    "algorithm": "max",
}

mto.enable_huggingface_checkpointing()
mtq.QuantModuleRegistry.register({CLIPAttention: "CLIPAttention"})(_QuantAttention)

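# args.model_ckpt is the user-supplied path or Hub ID of the base CLIP checkpoint
model = CLIPModel.from_pretrained(args.model_ckpt, attn_implementation="sdpa").half().eval().cuda()

tokenizer = CLIPTokenizer.from_pretrained(args.model_ckpt)
processor = CLIPImageProcessor.from_pretrained(args.model_ckpt)
# CLIP_COCO_dataset, ANN, and IMG_DIR are user-supplied and not defined in this post;
# the dataset is expected to yield (pixel_values, input_ids) pairs
calib_set = Subset(CLIP_COCO_dataset(ANN, IMG_DIR, tokenizer, processor), range(8192))
loader    = DataLoader(calib_set, batch_size=512, num_workers=4)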

# Calibration: 8k MS-COCO image-text pairs
def calibrate(m):
    for img, txt in loader:
        m.get_text_features(input_ids=txt.cuda())
        m.get_image_features(pixel_values=img.cuda())

q_model = mtq.quantize(model, FP8_CFG, forward_loop=calibrate)

# Save quantized modelopt checkpoint
q_model.save_pretrained(ckpt_path)
mtq.print_quant_summary(q_model)

FP8_CFG is just one recipe: W8A8 (FP8 on both weights and activations), per-tensor, static quantization, calibrated with the simple AbsMax algorithm. ModelOpt supports many more dimensions of choice (per-channel / block-wise granularity, dynamic activation quantization, advanced calibration algorithms such as AWQ / GPTQ, and more).
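To make the granularity choice concrete, here is a minimal plain-Python sketch of how AbsMax scales differ between per-tensor and per-channel quantization. It is an illustration only, not ModelOpt code: the function names and the toy weight matrix are hypothetical, and 448 is the largest finite value of the FP8 E4M3 format.

```python
# Illustrative only: plain-Python AbsMax scale computation, not ModelOpt code.
E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def per_tensor_scale(weight):
    """One scale shared by the whole tensor (the granularity used in FP8_CFG)."""
    amax = max(abs(v) for row in weight for v in row)
    return amax / E4M3_MAX


def per_channel_scales(weight):
    """One scale per output row, a finer granularity than per-tensor."""
    return [max(abs(v) for v in row) / E4M3_MAX for row in weight]


# Hypothetical 2x3 weight matrix with very different row magnitudes.
weight = [
    [0.02, -0.01, 0.03],  # small-magnitude channel
    [4.48, -2.00, 1.00],  # large-magnitude channel sets the tensor-wide amax
]

print(per_tensor_scale(weight))
print(per_channel_scales(weight))
```

With a single per-tensor scale, the small-magnitude row is quantized on a grid sized for the large row, which is one reason finer granularities can help when channel ranges differ widely.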

For the detailed configuration schema, see the ModelOpt quantization guide. The hyperparameters in the quantization configuration can be tuned as needed, and finding good values usually requires some iteration.

After mtq.quantize returns, CLIP’s Linear layers all carry weight and activation quantizers, but by default the attention blocks are still untouched. This is because multi-head attention dispatches to torch.nn.functional.scaled_dot_product_attention, a functional API that the ModelOpt module walker cannot intercept on its own.

To bring attention into the quantization scope, register a quantized replacement for CLIPAttention before calling mtq.quantize, as the code sample above does:

mtq.QuantModuleRegistry.register({CLIPAttention: "CLIPAttention"})(_QuantAttention)

Each CLIPAttention instance is now upgraded to _QuantAttention from the ModelOpt diffusers plugin. Inside its forward pass, _QuantAttention transparently intercepts the SDPA call and inserts four quantizers around the fused kernel:

q_bmm_quantizer, k_bmm_quantizer, v_bmm_quantizer wrap the projected Q / K / V tensors before they enter the kernel
bmm2_output_quantizer wraps the kernel output (softmax @ V) before it flows into out_proj

This ensures proper quantization throughout the attention mechanism.
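The placement of the four quantizers can be sketched in plain Python. This is a conceptual illustration under stated assumptions, not the _QuantAttention implementation: `fake_quant` is a hypothetical stand-in that snaps values to a uniform grid rather than to FP8, and the attention is a single unbatched head on nested lists.

```python
import math


def fake_quant(x, step=0.125):
    """Hypothetical stand-in quantizer: snap every value to a uniform grid."""
    return [[round(v / step) * step for v in row] for row in x]


def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def quantized_attention(q, k, v):
    # Role of q_bmm / k_bmm / v_bmm quantizers: applied to the projected inputs.
    q, k, v = fake_quant(q), fake_quant(k), fake_quant(v)
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = [[s * scale for s in row] for row in matmul(q, k_t)]
    out = matmul(softmax_rows(scores), v)
    # Role of bmm2_output quantizer: applied to softmax(QK^T) @ V before out_proj.
    return fake_quant(out)


q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out = quantized_attention(q, k, v)
print(out)
```

The point of the sketch is only where quantization happens: three times on the inputs of the fused kernel and once on its output, with the softmax itself left in higher precision.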

To restore some accuracy, it is often advisable to disable some of the quantizers using mtq.disable_quantizer. This takes a function as input, where the function itself takes a module name as input. By using regex or string matching, you can select the layers to disable. In the following example, the quantizers are disabled in the patch_embedding layer of the CLIP model.

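import re

# Compile once: matches any module name that contains "patch_embedding"
PATTERN = re.compile(r".*(patch_embedding).*")

def filter_func(name):
    # Return True for the quantizers that should be disabled
    return PATTERN.match(name) is not None

# Pass the filter function itself, not Python's built-in `filter`
mtq.disable_quantizer(q_model, filter_func)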
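Before disabling anything, it can help to check which names a filter would actually match. The sketch below runs the same kind of filter over a list of names; the names are illustrative assumptions, and on a real model they would typically come from iterating over `q_model.named_modules()`.

```python
import re

PATCH_EMBEDDING = re.compile(r".*patch_embedding.*")


def filter_func(name):
    """Return True for names that belong to the patch embedding layer."""
    return PATCH_EMBEDDING.match(name) is not None


# Illustrative quantizer names; the real ones depend on the loaded model.
names = [
    "vision_model.embeddings.patch_embedding.input_quantizer",
    "vision_model.embeddings.patch_embedding.weight_quantizer",
    "vision_model.encoder.layers.0.self_attn.q_proj.weight_quantizer",
    "text_model.encoder.layers.0.mlp.fc1.input_quantizer",
]

disabled = [n for n in names if filter_func(n)]
print(disabled)
```

A dry run like this guards against a pattern that is too broad and silently disables most of the model, or too narrow and disables nothing.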

CLIP benchmark evaluation

The saved ModelOpt checkpoint can be restored into any downstream evaluation script. For details, see Restoring ModelOpt Models. The quantized CLIP checkpoint was evaluated on three benchmarks: zero-shot classification (CIFAR-100, ImageNet-1k) and zero-shot retrieval (MS-COCO Captions). The FP16 CLIP model serves as the baseline.

Figure 1. CLIP model quality comparison of the FP16 baseline versus FP8-PTQ quantized models (bar chart across CIFAR100, ImageNet-1k, and MS-COCO metrics)

Based on the evaluation results, the CLIP-FP8 quantized model demonstrates quality comparable to the CLIP-FP16 model. Notably, when quantizers are disabled in the patch embedding layer, the impact of quantization on model quality becomes negligible.

Inside the ModelOpt PTQ flow

It’s important to understand that this stage involves working with “fake quantization” because the actual data type of the model hasn’t changed. Instead, these inserted quantizers act as observers that simulate the effects of quantization while keeping the model in original floating-point format.

The fake quantization process works in two key ways:

Statistics collection: The quantizers collect tensor statistics (minimum and maximum values, for example) as data passes through them. These statistics are used to calculate optimal quantization parameters such as scaling factors.
Quantization simulation: The quantizers perform a quantize-then-dequantize (QDQ) operation on tensors flowing through the network. This only simulates the low-precision computation; real speedups and memory savings come from exporting the model to a deployment framework such as NVIDIA TensorRT.
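The two roles described above, collecting an amax and then applying a QDQ round-trip, can be sketched in plain Python for the FP8 E4M3 format. This is a simplified scalar illustration, not ModelOpt's tensor implementation; the function names are hypothetical, and the rounding assumes round-to-nearest-even with saturation at 448.

```python
import math

E4M3_MAX = 448.0        # largest finite E4M3 value
E4M3_MIN_EXP = -6       # exponent of the smallest normal E4M3 value
E4M3_MANTISSA_BITS = 3  # explicit mantissa bits


def round_to_e4m3(x):
    """Round a float to the nearest value on the E4M3 grid, saturating at the max."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), E4M3_MAX)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so the binade exponent is e - 1.
    exp = max(math.frexp(mag)[1] - 1, E4M3_MIN_EXP)
    step = 2.0 ** (exp - E4M3_MANTISSA_BITS)  # spacing of grid points in this binade
    return math.copysign(round(mag / step) * step, x)


def calibrate_amax(samples):
    """Statistics collection: track the largest absolute value seen."""
    return max(abs(v) for v in samples)


def fake_quantize(x, amax):
    """QDQ: scale into FP8 range, round to the E4M3 grid, scale back to float."""
    scale = amax / E4M3_MAX
    return round_to_e4m3(x / scale) * scale


samples = [0.1, -2.0, 3.5]
amax = calibrate_amax(samples)
print([fake_quantize(v, amax) for v in samples])  # [0.1015625, -2.0, 3.5]
```

The output stays an ordinary Python float, which mirrors the point of fake quantization: the values carry FP8 rounding error, but the data type and memory footprint are unchanged.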

This simulation is crucial because it enables you to evaluate the model’s accuracy before committing to actual quantization. The quantizers apply the same rounding and precision limitations that would occur in the deployed quantized model with downstream inference frameworks, so you can:

Measure accuracy impacts before deployment
Experiment with different quantization configurations
Identify problematic layers that might need special handling

In general, the ModelOpt PTQ flow follows six stages:

Prepare: Set quantization config to insert quantizer modules around the model’s weights and/or activations.
Calibrate: Forward a small batch of representative data through the model so each quantizer can collect statistics (for example, activation amax) and derive its scaling factor.
Fake quantization: Quantizers now apply a Q → DQ round-trip in floating point, faithfully simulating the precision loss of the target format while the model still runs in FP16/BF16.
Evaluate: Measure accuracy on a held-out evaluation set and compare against the unquantized baseline.
Iterate: If the gap is unacceptable, adjust the quantization configuration (granularity, algorithm, quantized layers), disable quantization for sensitive layers, and recalibrate.
Export and deploy: Once the accuracy is acceptable, the fake quantized weights are compressed into their true low-precision form and exported as a checkpoint for downstream engines. In our case, we export the PyTorch checkpoint to ONNX and run inference with TensorRT. The speedups and memory savings will happen there.
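The evaluate-and-iterate part of these stages amounts to a simple search loop. The sketch below is a hypothetical driver, not a ModelOpt API: `quantize_and_eval` stands in for the prepare, calibrate, fake-quantize, and evaluate steps, and the scores in the toy example are placeholders rather than measured results.

```python
def run_ptq_search(candidate_configs, quantize_and_eval, baseline_score, max_drop):
    """Try configs in order; return the first one within `max_drop` of the baseline."""
    history = []
    for name, config in candidate_configs:
        # Stands in for prepare + calibrate + fake quantization + evaluate.
        score = quantize_and_eval(config)
        history.append((name, score))
        if baseline_score - score <= max_drop:
            return name, history  # accuracy acceptable: ready to export
    return None, history  # nothing passed: revisit the recipe


# Toy stand-in so the sketch runs end to end; the scores are placeholders.
toy_scores = {"fp8_all_layers": 0.71, "fp8_skip_patch_embedding": 0.749}


def toy_quantize_and_eval(config):
    return toy_scores[config["label"]]


candidates = [
    ("fp8_all_layers", {"label": "fp8_all_layers"}),
    ("fp8_skip_patch_embedding", {"label": "fp8_skip_patch_embedding"}),
]
chosen, history = run_ptq_search(
    candidates, toy_quantize_and_eval, baseline_score=0.75, max_drop=0.01
)
print(chosen, history)
```

Ordering the candidates from most to least aggressive means the loop stops at the most heavily quantized configuration that still meets the accuracy budget.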

Figure 2. ModelOpt PTQ workflow (Prepare, Calibrate, Fake Quantize, and Evaluate, then export and deploy if the result is acceptable, otherwise iterate back to Prepare)

Quantization-aware training (QAT) recovers quantization-induced quality loss by fine-tuning the model weights with frozen quantizer states. It is more compute-intensive than PTQ but can recover more of the quantized model's quality. For more details, see the ModelOpt examples.

Get started with NVIDIA Model Optimizer

This post introduced NVIDIA Model Optimizer and demonstrated a typical post-training quantization workflow by quantizing the CLIP model to FP8 with a practical code example. The results across three evaluation datasets show that FP8 quantization can preserve model quality while enabling a more efficient deployment path.

Ready to start using ModelOpt with your own models? Follow this workflow: prepare the model and calibration data, set the quantization configuration, calibrate, validate the quantized model against task-specific quality metrics, and save and restore ModelOpt checkpoints.

To explore additional workflows and adapt ModelOpt for your own use cases, see the ModelOpt documentation.
