@RayFernando1337: “The selected runtime uses NVFP4 weights for maximum performance. From the original FP8 weights, we performed an in-hou…

X AI KOLs Following 06/23/26, 01:28 AM Tools

quantization fp4 nvidia model-optimization runtime performance

Summary

The post describes a runtime that uses NVFP4 (4-bit floating point) weights for maximum performance. The weights were quantized in-house from the original FP8 weights with NVIDIA ModelOpt. The format's dual scale factors retain high dynamic range and preserve model quality.

“The selected runtime uses NVFP4 weights for maximum performance. From the original FP8 weights, we performed an in-house quantization to NVFP4 using NVIDIA ModelOpt. NVFP4 is a 4-bit floating point data format by NVIDIA that uses dual scale factors to retain high dynamic range and preserve model quality.”
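
To make the "dual scale factors" idea concrete, here is a minimal NumPy sketch of two-level scaling. It is an illustration only: it does not call NVIDIA ModelOpt and is not the quantization the post's author ran. The block size of 16, the E2M1 value grid, and the FP8 E4M3 limit of 448 follow NVIDIA's public description of the format rather than anything stated in the post. The function names are hypothetical, and the per-block scale is kept in float32 instead of being rounded to FP8, which is a simplification.

```python
import numpy as np

# Magnitudes representable by a 4-bit E2M1 float (assumption: per NVIDIA's
# public description of NVFP4, not stated in the post itself).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
FP4_MAX = 6.0      # largest E2M1 magnitude
E4M3_MAX = 448.0   # largest finite FP8 E4M3 magnitude
BLOCK = 16         # elements sharing one block scale (assumed block size)


def quantize_nvfp4_like(w):
    """Hypothetical helper: quantize to an NVFP4-like layout with two scales."""
    w = np.asarray(w, dtype=np.float32)
    assert w.size % BLOCK == 0, "pad the tensor to a multiple of BLOCK first"
    blocks = w.reshape(-1, BLOCK)

    # Scale 1 (per tensor): chosen so that every per-block scale lands
    # inside the range an FP8 E4M3 number can represent.
    amax = float(np.abs(blocks).max())
    tensor_scale = amax / (FP4_MAX * E4M3_MAX) if amax > 0 else 1.0

    # Scale 2 (per block): maps each block's largest magnitude onto FP4_MAX.
    # Real NVFP4 stores this in FP8; it stays float32 here for simplicity.
    block_amax = np.abs(blocks).max(axis=1, keepdims=True)
    block_scale = block_amax / FP4_MAX / tensor_scale
    block_scale = np.where(block_scale == 0, 1.0, block_scale)  # all-zero blocks

    # Snap each scaled value to the nearest E2M1 magnitude, keeping the sign.
    scaled = blocks / (block_scale * tensor_scale)
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    codes = np.sign(scaled) * E2M1_GRID[idx]
    return codes.astype(np.float32), block_scale.astype(np.float32), np.float32(tensor_scale)


def dequantize_nvfp4_like(codes, block_scale, tensor_scale, shape):
    """Hypothetical helper: undo both scales and restore the original shape."""
    return (codes * block_scale * tensor_scale).reshape(shape)
```

Calling `quantize_nvfp4_like` on a weight matrix returns the 4-bit codes (held as floats here), one scale per 16-element block, and one scale for the whole tensor. Comparing `dequantize_nvfp4_like(...)` against the input shows the rounding error each block incurs. The per-tensor scale is what keeps the per-block scales within the narrow FP8 range, so the block scales can spend their precision on local variation.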


Similar Articles

@SpaceTimeViking: I have one version that maintain BF16 Attention layers, and another mixed precision quant with NVFP4 weights and FP8 At…

X AI KOLs Following

A mixed-precision quantization of Google's Gemma-4-12B-it model that uses NVFP4 for MLP weights and FP8 for attention layers, achieving a 25% smaller footprint and faster throughput while maintaining quality.

@zcbenz: nvfp4 vs mxfp4 is not just different choices of block size and scale format, nvfp4 uses an additional tensor-wise scale…

X AI KOLs Timeline

A technical comparison between nvfp4 and mxfp4 formats, highlighting that nvfp4 uses an additional tensor-wise scale factor to overcome fp4's range limit, allowing more precision in block-wise scale factors.

@charles_irl: Low-precision floats are weird. I have been building up my intuition by playing with them outside of inference/training…

X AI KOLs Following

A tweet thread introduces a visualizer for micro-scaling/block quant formats like NVFP4 and MXFP4, explaining how these low-precision floats work and their use in LLM inference to reduce memory bandwidth demands.

@MiaAI_lab: FYI the best Qwen 3.6 35b nvfp4 to run is the @NVIDIAAI nvfp4. Do not use unsloth nvfp4, it performs worse. https://hug…

X AI KOLs Timeline

NVIDIA's NVFP4-quantized version of Qwen 3.6 35B is recommended over the Unsloth variant, which the post says performs worse. The model is available on Hugging Face.

@0xSero: Just added 2 new model compressions: Hy3-FP8 & NVFP4 I recommend trying this model it's very strong and fits on 256gb o…

X AI KOLs Following

0xSero has released new FP8 and NVFP4 quantized versions of the Tencent Hy3-preview model, enabling it to run in 256 GB of VRAM with full context.
