@RayFernando1337: “The selected runtime uses NVFP4 weights for maximum performance. From the original FP8 weights, we performed an in-hou…
Summary
Discusses using NVFP4 4-bit floating point weights for maximum performance, achieved via in-house quantization from FP8 using NVIDIA ModelOpt, highlighting the data format's dual scale factors for high dynamic range.
Similar Articles
@SpaceTimeViking: I have one version that maintain BF16 Attention layers, and another mixed precision quant with NVFP4 weights and FP8 At…
A mixed-precision quantization of Google's Gemma-4-12B-it model using NVFP4 for MLP weights and FP8 for attention layers, achieving 25% smaller footprint and faster throughput while maintaining quality.
@zcbenz: nvfp4 vs mxfp4 is not just different choices of block size and scale format, nvfp4 uses an additional tensor-wise scale…
A technical comparison between nvfp4 and mxfp4 formats, highlighting that nvfp4 uses an additional tensor-wise scale factor to overcome fp4's range limit, allowing more precision in block-wise scale factors.
@charles_irl: Low-precision floats are weird. I have been building up my intuition by playing with them outside of inference/training…
A tweet thread introduces a visualizer for micro-scaling/block quant formats like NVFP4 and MXFP4, explaining how these low-precision floats work and their use in LLM inference to reduce memory bandwidth demands.
@MiaAI_lab: FYI the best Qwen 3.6 35b nvfp4 to run is the @NVIDIAAI nvfp4. Do not use unsloth nvfp4, it performs worse. https://hug…
NVIDIA's nvfp4 quantized version of Qwen 3.6 35B is recommended over the Unsloth variant, offering better performance. The model is available on HuggingFace for use in AI applications.
@0xSero: Just added 2 new model compressions: Hy3-FP8 & NVFP4 I recommend trying this model it's very strong and fits on 256gb o…
0xSero has released new FP8 and NVFP4 quantized versions of the Tencent Hy3-preview model, enabling it to run on 256GB VRAM with full context.