@sudoingX: i was running Ornith new 35b moe on llama.cpp with a Q4 quant, 4 bit, small, fast. it hit ~78 tok/s. then i swapped eng…

X AI KOLs Timeline Models

Summary

A 35B MoE agentic coding model called Ornith runs near lossless at FP8 on a single DGX Spark, achieving 3M token context and ~36 tok/s, with speculative decoding expected to boost speed further.

i was running Ornith new 35b moe on llama.cpp with a Q4 quant, 4 bit, small, fast. it hit ~78 tok/s. then i swapped engines. now i'm running the same MoE at FP8 in vLLM, near lossless, basically full quality, on a single dgx spark. and it's got headroom for over 3 million tokens of context on one box. sit with that, a 35B agentic coding model at near full precision with a 3M token window, on a desk. it's holding ~36 tok/s at that precision, fully usable, and this is the un-optimized baseline, no speculative decoding yet. that's the headline people are sleeping on. local AI quietly got a genuinely good agentic coding model, near lossless, massive context, single box, and almost nobody's clocked it. spec decode is next, that's where the spark cashes its idle compute for speed and this number climbs. if it lands, you've got near full quality 35B coding on a desk at real speed. that's the test.
Original Article
View Cached Full Text

Cached at: 06/28/26, 02:09 PM

i was running Ornith new 35b moe on llama.cpp with a Q4 quant, 4 bit, small, fast. it hit ~78 tok/s. then i swapped engines.

now i’m running the same MoE at FP8 in vLLM, near lossless, basically full quality, on a single dgx spark. and it’s got headroom for over 3 million tokens of context on one box. sit with that, a 35B agentic coding model at near full precision with a 3M token window, on a desk.

it’s holding ~36 tok/s at that precision, fully usable, and this is the un-optimized baseline, no speculative decoding yet.

that’s the headline people are sleeping on. local AI quietly got a genuinely good agentic coding model, near lossless, massive context, single box, and almost nobody’s clocked it.

spec decode is next, that’s where the spark cashes its idle compute for speed and this number climbs. if it lands, you’ve got near full quality 35B coding on a desk at real speed. that’s the test.

Sudo su (@sudoingX): running Ornith on the dgx spark to see what it actually is.

it’s a new agentic coding model from @ornith_ / deepreinforce-ai, the 35B MoE (A3B, ~3B active per token). pulled the Q4_K_M gguf (~20GB), wired it into hermes agent, ~78 tok/s on a single spark with fast prefill, so it

Similar Articles

24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context)

Reddit r/LocalLLaMA

A developer demonstrates running MoE models like Qwen 3.6 35B-A3B and Gemma 4 26B-A4B at 24+ tok/s on an old GTX 1080 (8GB VRAM) with 128k context using llama.cpp with MoE offloading and TurboQuant KV cache quantization, revealing optimization tricks for Gemma's MTP speculative decoding.