@tom_doerr: Compresses deep learning models for faster inference https://github.com/NVIDIA/Model-Optimizer…

Summary

NVIDIA Model Optimizer is a library that compresses deep learning models using techniques like quantization, distillation, pruning, and speculative decoding to accelerate inference. It supports Hugging Face, PyTorch, and ONNX models and integrates with NVIDIA inference frameworks.

Compresses deep learning models for faster inference https://t.co/WGZAiskcoo https://t.co/iZS0tSfyFq
Cached at: 05/20/26, 06:27 AM

NVIDIA/Model-Optimizer

Source: https://github.com/NVIDIA/Model-Optimizer

NVIDIA Model Optimizer


NVIDIA Model Optimizer (referred to as Model Optimizer, or ModelOpt) is a library comprising state-of-the-art model optimization techniques including quantization, distillation, pruning, speculative decoding and sparsity to accelerate models.

[Input] Model Optimizer currently accepts Hugging Face, PyTorch, or ONNX models as input.

[Optimize] Model Optimizer provides Python APIs that let users compose the above model optimization techniques and export an optimized quantized checkpoint. Model Optimizer is also integrated with NVIDIA Megatron-Bridge, Megatron-LM, and Hugging Face Accelerate for inference optimization techniques that require training.
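
As a rough sketch of what calling these Python APIs can look like for post-training quantization, the snippet below wraps `mtq.quantize` from `modelopt.torch.quantization` together with a calibration loop. The helper name `quantize_with_modelopt`, the `calib_batches` argument, and the choice of `mtq.INT8_DEFAULT_CFG` are illustrative assumptions that do not come from this README; consult the documentation for the config and export call that match your model.

```python
from typing import Any, Iterable


def quantize_with_modelopt(model: Any, calib_batches: Iterable[Any]) -> Any:
    """Hypothetical helper: post-training quantization of a PyTorch model."""
    # Imported lazily so this file can be loaded without ModelOpt installed.
    import modelopt.torch.quantization as mtq

    def forward_loop(m: Any) -> None:
        # Calibration pass: feed representative batches through the model so
        # the inserted quantizers can collect activation statistics.
        for batch in calib_batches:
            m(batch)

    # Assumption: INT8_DEFAULT_CFG is one of the predefined configs; other
    # number formats are selected by passing a different config object.
    return mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)
```

The calibration data is passed in as a plain iterable so the same helper works with a list of tensors or a data loader; a handful of representative batches is usually the starting point.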

[Export for deployment] Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like SGLang, TensorRT-LLM, TensorRT, or vLLM. The unified Hugging Face export API now supports both transformers and diffusers models.

Install

To install stable release packages for Model Optimizer with pip from PyPI:

pip install -U "nvidia-modelopt[all]"

Model Optimizer will download and install additional third-party open source software projects. Review the license terms of these open source projects before use.

To install from source in editable mode with all development dependencies or to use the latest features, run:

# Clone the Model Optimizer repository
git clone git@github.com:NVIDIA/Model-Optimizer.git
cd Model-Optimizer

pip install -e ".[dev]"

You can also directly use NVIDIA container images, which have Model Optimizer pre-installed:

  • nvcr.io/nvidia/pytorch:<version>-py3
  • nvcr.io/nvidia/nemo:<version>
  • nvcr.io/nvidia/tensorrt-llm/release:<version>

Before pulling and using the container images, please review their respective license terms. Make sure to upgrade Model Optimizer to the latest version as described above. Visit our installation guide for finer-grained control over installed dependencies, or for alternative Docker images and environment variables to set up.
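
To see which Model Optimizer version an environment or container actually ships before upgrading, a small check along these lines may help. It uses only the Python standard library; the function name `installed_version` is a hypothetical helper, and `nvidia-modelopt` is the distribution name from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "nvidia-modelopt") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("nvidia-modelopt is not installed")
    else:
        print(f"nvidia-modelopt {found} is installed")
```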

Techniques

Technique | Description | Examples | Docs
Post Training Quantization | Compress model size by 2x-4x, speeding up inference while preserving model quality! | [LLMs] [diffusers] [VLMs] [onnx] [windows] | [docs]
Quantization Aware Training | Refine accuracy even further with a few training steps! | [Hugging Face] | [docs]
Pruning | Reduce your model size and accelerate inference by removing unnecessary weights! | [General] [Megatron-Bridge] |
Distillation | Reduce deployment model size by teaching small models to behave like larger models! | [Megatron-Bridge] [Megatron-LM] [Hugging Face] | [docs]
Speculative Decoding | Train draft modules to predict extra tokens during inference! | [Megatron] [Hugging Face] | [docs]
Sparsity | Efficiently compress your model by storing only its non-zero parameter values and their locations | [PyTorch] | [docs]
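
To make the 2x-4x figure for post-training quantization concrete, here is a minimal pure-Python sketch of symmetric per-tensor 8-bit quantization. It is a generic textbook scheme, not Model Optimizer's implementation, and the function names are hypothetical. Going from 16-bit to 8-bit storage halves the size (2x), and going from 16-bit to 4-bit quarters it (4x), at the cost of a bounded rounding error per weight.

```python
from typing import List, Tuple


def quantize_symmetric_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the representable range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the integers and the scale."""
    return [v * scale for v in q]


weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_symmetric_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))

# Size reduction is the ratio of bits per weight before and after.
ratio_16_to_8 = 16 / 8
ratio_16_to_4 = 16 / 4
```

Real quantizers typically use per-channel or per-block scales and calibrate activations as well as weights, which is why a calibration dataset is normally part of the workflow.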

Pre-Quantized Checkpoints

Resources

Model Support Matrix

Model Type | Support Matrix
LLM Quantization | View Support Matrix
Diffusers Quantization | View Support Matrix
VLM Quantization | View Support Matrix
ONNX Quantization | View Support Matrix
Windows Quantization | View Support Matrix
Quantization Aware Training | View Support Matrix
Pruning | View Support Matrix
Distillation | View Support Matrix
Speculative Decoding | View Support Matrix

Deprecation Policy

Model Optimizer follows a structured approach to managing deprecated features:

  • Communication: Deprecation notices are documented in the Changelog. Deprecated items include source code statements indicating deprecation timing, with runtime warnings issued upon use.
  • Migration Period: Since Model Optimizer is still pre-1.0, we provide a 1-release (~1-month) migration period after deprecation. During this window, deprecated features continue functioning while issuing warnings.
  • Scope: The policy addresses both complete deprecations (entire APIs removed) and partial ones (specific parameters removed while methods remain).
  • Removal: Following the migration period, deprecated elements are removed in alignment with semantic versioning standards, potentially including breaking changes in minor version updates while Model Optimizer remains in 0.x.
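
The behavior described above, where a deprecated feature keeps working during the migration period but warns at runtime, can be sketched with Python's standard `warnings` module. This is a generic pattern and not Model Optimizer's actual code; the decorator `deprecated` and the functions `old_api` and `new_api` are hypothetical names.

```python
import functools
import warnings
from typing import Any, Callable


def deprecated(removed_in: str, use_instead: str) -> Callable:
    """Mark a function as deprecated: it still runs, but warns on every call."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            warnings.warn(
                f"{func.__name__} is deprecated and will be removed in "
                f"{removed_in}; use {use_instead} instead.",
                DeprecationWarning,
                stacklevel=2,  # attribute the warning to the caller's line
            )
            return func(*args, **kwargs)

        return wrapper

    return decorator


def new_api(x: int) -> int:
    return x * 2


@deprecated(removed_in="the next release", use_instead="new_api")
def old_api(x: int) -> int:
    # Delegates to the replacement so both paths behave identically.
    return new_api(x)
```

Running a test suite with `python -W error::DeprecationWarning` turns these warnings into errors, which is one way to find deprecated calls before the migration window closes.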

Contributing

Model Optimizer is now open source! We welcome any feedback, feature requests and PRs. Please read our Contributing guidelines for details on how to contribute to this project.

AI Agents

For AI-assisted development setup, see the agent tooling notes.

Happy optimizing!
