@sudoingX: i was running Ornith new 35b moe on llama.cpp with a Q4 quant, 4 bit, small, fast. it hit ~78 tok/s. then i swapped eng…
Summary
Ornith, a 35B MoE agentic coding model, runs near-lossless at FP8 in vLLM on a single DGX Spark at ~36 tok/s, with headroom for over 3M tokens of context. The same model hit ~78 tok/s as a Q4 quant in llama.cpp. Speculative decoding is not enabled yet and is expected to raise the FP8 speed.
Cached at: 06/28/26, 02:09 PM
i was running Ornith new 35b moe on llama.cpp with a Q4 quant, 4 bit, small, fast. it hit ~78 tok/s. then i swapped engines.
now i’m running the same MoE at FP8 in vLLM, near lossless, basically full quality, on a single dgx spark. and it’s got headroom for over 3 million tokens of context on one box. sit with that, a 35B agentic coding model at near full precision with a 3M token window, on a desk.
it’s holding ~36 tok/s at that precision, fully usable, and this is the un-optimized baseline, no speculative decoding yet.
that’s the headline people are sleeping on. local AI quietly got a genuinely good agentic coding model, near lossless, massive context, single box, and almost nobody’s clocked it.
spec decode is next, that’s where the spark cashes its idle compute for speed and this number climbs. if it lands, you’ve got near full quality 35B coding on a desk at real speed. that’s the test.
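To put the numbers above side by side, here is a rough back-of-the-envelope sketch. The 78 and 36 tok/s figures are the ones quoted in the post. The 30-40% speculative-decoding speedup is an assumption borrowed from the llama-server report listed under Similar Articles; nothing here says it carries over to FP8 in vLLM, so treat the projection as illustrative only. The helper name `projected_tok_s` is hypothetical.

```python
# Figures quoted in the post (approximate, single DGX Spark).
Q4_TOK_S = 78.0   # llama.cpp, Q4_K_M GGUF
FP8_TOK_S = 36.0  # vLLM, FP8, no speculative decoding


def projected_tok_s(baseline: float, speedup_pct: float) -> float:
    """Scale a decode rate by a percentage speedup."""
    return baseline * (1.0 + speedup_pct / 100.0)


# How much decode speed the move from Q4 to FP8 costs.
fp8_vs_q4 = FP8_TOK_S / Q4_TOK_S

# ASSUMPTION: 30-40% is taken from a llama-server report on a different
# setup, not measured on vLLM at FP8.
low = projected_tok_s(FP8_TOK_S, 30.0)
high = projected_tok_s(FP8_TOK_S, 40.0)

print(f"FP8 runs at {fp8_vs_q4:.0%} of the Q4 decode rate")
print(f"Projected FP8 with spec decode: {low:.1f} to {high:.1f} tok/s")
```

Under that borrowed assumption the FP8 run would land somewhere around 47 to 50 tok/s, still below the Q4 figure but at near full precision.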
Sudo su (@sudoingX): running Ornith on the dgx spark to see what it actually is.
it’s a new agentic coding model from @ornith_ / deepreinforce-ai, the 35B MoE (A3B, ~3B active per token). pulled the Q4_K_M gguf (~20GB), wired it into hermes agent, ~78 tok/s on a single spark with fast prefill, so it
Similar Articles
@iotcoi: Qwen3.6-27B-FP8 + Dflash + DDTree, 256k context, 10 agents ~200 tokens/sec max decode 136t/s average on a single tiny G…
Quantized 27B Qwen3.6 model achieves 200 tok/s peak (136 avg) with 256k context and 10 agents on a single 49W GB10 GPU using Dflash+DDTree optimizations.
24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context)
A developer demonstrates running MoE models like Qwen 3.6 35B-A3B and Gemma 4 26B-A4B at 24+ tok/s on an old GTX 1080 (8GB VRAM) with 128k context using llama.cpp with MoE offloading and TurboQuant KV cache quantization, revealing optimization tricks for Gemma's MTP speculative decoding.
80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP
A user shares a configuration for achieving over 80 tokens per second with Qwen3.6 35B A3B on a 12GB VRAM GPU using llama.cpp and Multi-Token Prediction (MTP). The post includes benchmark results and specific command-line parameters to optimize performance.
@sudoingX: running Ornith on the dgx spark to see what it actually is. it's a new agentic coding model from @ornith_ / deepreinfor…
Ornith-1.0 is a new family of open-source agentic coding models from deepreinforce-ai, trained with reinforcement learning that jointly optimizes both the solution and the scaffolding. The 35B MoE version achieves state-of-the-art on coding benchmarks and supports efficient single-GPU deployment.
Ornith 35B works reasonably well with Qwen3.6 35B DFlash speculative model
Ornith 35B shows 30-40% token generation speedup when paired with Qwen3.6 35B DFlash speculative model in llama-server, achieving 80% acceptance rate on mixed code and text, though prompt processing suffers.