500k context on 48gb VRAM!! - 21tok/s (coding)

Reddit r/LocalLLaMA Models

Summary

A user reports successful deployment of a quantized Nemotron-3 Super model supporting 500k context and agentic coding on consumer-grade dual Titan RTX hardware.

I found this model hiding in a corner of Hugging Face: [https://huggingface.co/Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-GGUF](https://huggingface.co/Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-GGUF) It looks to be tuned specifically for math, but I thought I'd give it a try since I can't run the full 120B Nemotron Super, and it seems to hold up like a champ in agentic coding for some odd reason. I've been using it to code all my projects for a week now and it's amazing. I wouldn't have dreamed of having 500k tokens of context on my potato dual TITAN RTX setup. If you do happen to try it, drop a comment on your experience with it: where did it break, what use case did you use it for, etc.
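For a rough sense of why 500k tokens on 48 GB is notable, here is a back-of-the-envelope KV cache estimate for standard attention layers. The formula (keys plus values, per layer, per KV head, per token) is generic. Both layer configurations are hypothetical, chosen only for illustration, and are not taken from this model's card or from the post. Weights and activations are ignored.

```python
def kv_cache_gib(n_attn_layers, n_kv_heads, head_dim, ctx_tokens, bytes_per_elem=2.0):
    """Approximate KV cache size in GiB for standard attention layers."""
    # Factor of 2 covers both keys and values; bytes_per_elem=2.0 means fp16.
    total_bytes = 2 * n_attn_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_elem
    return total_bytes / 1024**3


# Dual TITAN RTX at 24 GB each (GB and GiB treated as equal for this rough check).
VRAM_GIB = 2 * 24
CTX = 500_000

# HYPOTHETICAL configs, not the real model's architecture:
# one with attention in every layer, one with only a few small attention layers.
dense_like = kv_cache_gib(n_attn_layers=48, n_kv_heads=8, head_dim=128, ctx_tokens=CTX)
few_attn = kv_cache_gib(n_attn_layers=8, n_kv_heads=2, head_dim=128, ctx_tokens=CTX)

for name, size in [("dense_like", dense_like), ("few_attn", few_attn)]:
    verdict = "fits within" if size < VRAM_GIB else "exceeds"
    print(f"{name}: {size:.1f} GiB KV cache, {verdict} {VRAM_GIB} GiB before weights")
```

Under these assumed numbers, the cache alone overflows the cards when every layer uses attention, while a model with few attention layers, or a quantized KV cache (smaller `bytes_per_elem`), leaves room for the weights.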
