@0xSero: Just added 2 new model compressions: Hy3-FP8 & NVFP4 I recommend trying this model it's very strong and fits on 256gb o…
Summary
0xSero has released FP8 and NVFP4 quantizations of the Tencent Hy3-preview model, reported to fit in 256 GB of VRAM with full context.
Cached at: 05/10/26, 08:23 AM
Just added 2 new model compressions:
Hy3-FP8 & NVFP4
I recommend trying this model it’s very strong and fits on 256gb of vram with full context
https://t.co/UQI63BCFiJ
0xSero/Hy3-preview-NVFP4 · Hugging Face
Source: https://huggingface.co/0xSero/Hy3-preview-NVFP4
Hy3-preview NVFP4A16
This is a checkpoint-only NVFP4A16 quantization of tencent/Hy3-preview, produced with llmcompressor.entrypoints.model_free.model_free_ptq.
- Base model: tencent/Hy3-preview
- Quantization scheme: NVFP4A16
- Ignored modules/patterns: lm_head, model.embed_tokens, re:.*router.gate$, re:.*expert_bias$
- Source snapshot: recorded in QUANTIZATION_MANIFEST.json
- License: inherits Tencent Hy Community License Agreement from the base model; original LICENSE is included.
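To make the ignore list concrete, here is a minimal sketch of how such entries could be applied to module names. It assumes that entries prefixed with re: are regular expressions matched from the start of the name and that all other entries must match exactly; this approximates the convention and is not the llmcompressor implementation. The module names in the loop are hypothetical and were not read from the checkpoint.

```python
import re

# Ignore list copied from the model card; "re:" marks a regular expression.
IGNORE = ["lm_head", "model.embed_tokens", r"re:.*router.gate$", r"re:.*expert_bias$"]


def is_ignored(name: str, patterns=IGNORE) -> bool:
    """Return True if a module name matches any entry in the ignore list."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            # Assumption: regex entries are anchored at the start of the name.
            if re.match(pattern[3:], name):
                return True
        elif name == pattern:
            # Assumption: plain entries require an exact match.
            return True
    return False


# Hypothetical module names, for illustration only.
for name in [
    "lm_head",
    "model.embed_tokens",
    "model.layers.0.mlp.router.gate",
    "model.layers.0.mlp.expert_bias",
    "model.layers.0.mlp.experts.0.down_proj",
]:
    status = "left unquantized" if is_ignored(name) else "quantized"
    print(f"{name}: {status}")
```

Under these assumptions, only the last name in the list falls through to quantization; the other four match an ignore entry.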
Notes
This release quantizes safetensors weights without importing the custom HYV3 model class. Router gates, expert bias tensors, embeddings, and lm_head are preserved unquantized for compatibility/conservatism.
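For a rough sense of what weight-only NVFP4 does to checkpoint size, the sketch below estimates storage per weight. It assumes the commonly described NVFP4 layout of 4-bit values with one 8-bit scale shared by each block of 16 weights, and it ignores the per-tensor global scale, the modules left unquantized, the KV cache, and activations. The parameter count is a placeholder, not the Hy3-preview figure.

```python
def nvfp4_bits_per_weight(block_size: int = 16, scale_bits: int = 8) -> float:
    """Effective bits per weight: 4-bit value plus its share of the block scale."""
    return 4 + scale_bits / block_size


def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Placeholder parameter count for illustration; substitute the real value.
N_PARAMS = 100e9

bpw = nvfp4_bits_per_weight()
print(f"NVFP4: {bpw} bits/weight, about {weight_gb(N_PARAMS, bpw):.2f} GB")
print(f"BF16:  16 bits/weight, about {weight_gb(N_PARAMS, 16):.2f} GB")
```

Treat the result as a lower bound on the weights alone. Whether a given model fits in a VRAM budget with full context also depends on the KV cache and the runtime's overhead.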