@0xSero: GLM-5.1-478B-NVFP4 Running on: - 4x RTX Pro 6000 - Sglang - 370,000 max tokens (1.75x full context) - p10 27.7 | p90 45…
Summary
A quantized (NVFP4) 478B-parameter GLM-5.1 model runs on 4×RTX Pro 6000 GPUs via SGLang with a 370,000-token maximum context, decoding at 27.7 (p10) to 45.6 (p90) tok/s with 1340 tok/s prefill. The author demos it operating Figma.
Cached at: 04/21/26, 04:24 PM
GLM-5.1-478B-NVFP4
Running on:
- 4x RTX Pro 6000
- Sglang
- 370,000 max tokens (1.75x full context)
- p10 27.7 | p90 45.6 tok/s decode (gen)
- 1340 tok/s prefill
I could get 2x decode if I limit to 64k context (100 tok/s)
In this video it operates Figma (:
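As a rough back-of-envelope reading of the figures in the post, the sketch below works out what "1.75x full context" implies for the native window and how long a full 370,000-token prefill would take. It assumes the quoted 1340 tok/s prefill rate stays constant across the whole window, which the post does not claim and which is unlikely to hold exactly at long context; the variable and function names are illustrative only.

```python
# Figures quoted in the post.
MAX_TOKENS = 370_000        # configured maximum context
CONTEXT_MULTIPLE = 1.75     # "1.75x full context"
PREFILL_TOK_S = 1340        # prefill throughput
DECODE_P10_TOK_S = 27.7     # decode throughput, 10th percentile
DECODE_P90_TOK_S = 45.6     # decode throughput, 90th percentile


def implied_native_context(max_tokens: int, multiple: float) -> float:
    """Native context length implied by 'max tokens = multiple x full context'."""
    return max_tokens / multiple


def seconds_for(tokens: int, tok_per_s: float) -> float:
    """Time to process `tokens` at a constant rate (an assumption, see above)."""
    return tokens / tok_per_s


native_context = implied_native_context(MAX_TOKENS, CONTEXT_MULTIPLE)
full_prefill_s = seconds_for(MAX_TOKENS, PREFILL_TOK_S)
decode_1k_slow_s = seconds_for(1000, DECODE_P10_TOK_S)
decode_1k_fast_s = seconds_for(1000, DECODE_P90_TOK_S)

print(f"implied native context: ~{native_context:,.0f} tokens")
print(f"prefill of a full window: ~{full_prefill_s / 60:.1f} min")
print(f"1,000 generated tokens: {decode_1k_fast_s:.0f}-{decode_1k_slow_s:.0f} s")
```

Under these assumptions the implied native window is a little over 211k tokens, and filling the entire 370,000-token window would take several minutes of prefill before the first generated token.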
Similar Articles
@0xSero: Finally GLM-5.1-505B-REAP-NVFP4 45 tokens/s decode 1350 tokens/s prefill 32% prune This was the hardest I ever worked t…
Developer @0xSero achieved high-performance inference on an optimized GLM-5.1-505B variant using NVFP4 quantization and 32% pruning, reaching 45 tokens/s decode and 1350 tokens/s prefill speeds.
GLM 5.2 on consumer hardware
A user tested the Unsloth-quantized GLM-5.2 model on a high-end consumer system with dual RTX 5090 GPUs, achieving 12 tokens per second.
GLM5.2 @7tg on 4x3090 + 192GB on budget motherboard + cpu
Running GLM5.2 at "7tg" (likely 7 tokens/s text generation) on a budget setup using 4x RTX 3090 GPUs and 192GB RAM.
@leopardracer: GEMMA 4 26B ON AN RTX 4060 WITH A 248K TOKEN CONTEXT WINDOW 20 tokens per second and a context window so large you can …
Gemma 4 26B runs on an RTX 4060 with 248K token context at 20 tokens per second using llama.cpp and Q4_K_XL quantization, enabling local processing of entire codebases on consumer hardware.
500k context on 48gb VRAM!! - 21tok/s (coding)
A user reports successful deployment of a quantized Nemotron-3 Super model supporting 500k context and agentic coding on consumer-grade dual Titan RTX hardware.