@sudoingX: i was running Ornith new 35b moe on llama.cpp with a Q4 quant, 4 bit, small, fast. it hit ~78 tok/s. then i swapped eng…

X AI KOLs Timeline 06/27/26, 05:29 PM Models

open-source mixture-of-experts code-generation local-ai llm-inference vllm performance

Summary

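Ornith, a 35B MoE agentic coding model, runs at near-lossless FP8 in vLLM on a single DGX Spark, with headroom for over 3M tokens of context and ~36 tok/s decode. That figure is the un-optimized baseline; speculative decoding is not enabled yet and is expected to raise it. The same model previously ran at ~78 tok/s as a Q4 quant on llama.cpp.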

Original Article

Cached at: 06/28/26, 02:09 PM

i was running Ornith new 35b moe on llama.cpp with a Q4 quant, 4 bit, small, fast. it hit ~78 tok/s. then i swapped engines.

now i’m running the same MoE at FP8 in vLLM, near lossless, basically full quality, on a single dgx spark. and it’s got headroom for over 3 million tokens of context on one box. sit with that, a 35B agentic coding model at near full precision with a 3M token window, on a desk.
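A rough back-of-envelope for why this fits on one box. This is a minimal sketch, assuming 35B total parameters (from the post), one byte per weight at FP8, and 128 GB of unified memory on the Spark (an assumption, not stated in the post). It ignores activations and runtime overhead, so the leftover is an upper bound on KV-cache room, not a measurement. The last line also assumes the 3M figure means total KV-cache capacity in tokens, which the post does not confirm.

```python
def weight_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 10**9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 35e9       # from the post: 35B MoE
UNIFIED_MEMORY_GB = 128   # assumption, not stated in the post
CONTEXT_TOKENS = 3e6      # from the post: "over 3 million tokens"

fp8_gb = weight_gb(TOTAL_PARAMS, 8)        # one byte per weight
leftover_gb = UNIFIED_MEMORY_GB - fp8_gb   # upper bound on KV-cache room

# Bits per weight implied by the ~20 GB Q4_K_M file in the quoted post.
q4_bits = 20e9 * 8 / TOTAL_PARAMS

# Per-token KV budget if all leftover memory held 3M tokens of cache.
kv_kb_per_token = leftover_gb * 1e9 / CONTEXT_TOKENS / 1e3

print(f"FP8 weights: ~{fp8_gb:.0f} GB, leftover: ~{leftover_gb:.0f} GB")
print(f"Q4_K_M implied bits/weight: ~{q4_bits:.2f}")
print(f"implied KV budget: ~{kv_kb_per_token:.0f} KB per token")
```

Under these assumptions the FP8 weights take roughly a quarter of the box, which is consistent with the large context headroom the post reports.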

it’s holding ~36 tok/s at that precision, fully usable, and this is the un-optimized baseline, no speculative decoding yet.
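One way to read the drop from ~78 to ~36 tok/s: single-stream decode is usually limited by how many weight bytes are read per token. A minimal sketch of that ceiling, assuming ~3B active parameters per token (from the quoted post), 273 GB/s memory bandwidth on the Spark (an assumption, not stated in the post), and ~4.57 bits per weight for Q4_K_M (implied by a ~20 GB file over 35B parameters). It ignores KV-cache reads, attention, routing, and engine differences between llama.cpp and vLLM, so treat the results as loose upper bounds.

```python
def decode_ceiling_tok_s(active_params: float,
                         bits_per_weight: float,
                         bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if each token reads every active weight once."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


ACTIVE_PARAMS = 3e9     # from the quoted post: A3B, ~3B active per token
BANDWIDTH_GB_S = 273    # assumption, not stated in the post

fp8_ceiling = decode_ceiling_tok_s(ACTIVE_PARAMS, 8, BANDWIDTH_GB_S)
q4_ceiling = decode_ceiling_tok_s(ACTIVE_PARAMS, 4.57, BANDWIDTH_GB_S)

# Measured speeds reported in the post, compared against each ceiling.
for name, measured, ceiling in (("FP8 / vLLM", 36, fp8_ceiling),
                                ("Q4 / llama.cpp", 78, q4_ceiling)):
    print(f"{name}: {measured} tok/s measured, "
          f"~{ceiling:.0f} tok/s ceiling, {measured / ceiling:.0%} of ceiling")
```

Roughly twice the bytes per weight means roughly half the ceiling, which is about the ratio the post observed between the two setups.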

that’s the headline people are sleeping on. local AI quietly got a genuinely good agentic coding model, near lossless, massive context, single box, and almost nobody’s clocked it.

spec decode is next, that’s where the spark cashes its idle compute for speed and this number climbs. if it lands, you’ve got near full quality 35B coding on a desk at real speed. that’s the test.
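How far the number could climb depends on how often the draft model's guesses are accepted. A minimal sketch using the standard speculative-decoding estimate: with per-token acceptance rate `alpha` and `k` drafted tokens per step, the target model emits `(1 - alpha**(k+1)) / (1 - alpha)` tokens per verification pass on average. The `alpha`, `k`, and `draft_cost_ratio` values below are hypothetical placeholders, not measurements from the post, and the projection assumes the 36 tok/s baseline scales cleanly.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Mean tokens emitted per target pass, assuming independent acceptances."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Estimated wall-clock speedup; draft_cost_ratio = draft pass / target pass."""
    return expected_tokens_per_step(alpha, k) / (1 + k * draft_cost_ratio)


BASELINE_TOK_S = 36  # from the post: FP8 baseline without speculative decoding

# Hypothetical acceptance rates, for illustration only.
for alpha in (0.6, 0.7, 0.8):
    s = speedup(alpha, k=4, draft_cost_ratio=0.1)
    print(f"alpha={alpha}: ~{s:.2f}x, ~{BASELINE_TOK_S * s:.0f} tok/s projected")
```

The gain comes from verifying several drafted tokens in one target pass, which uses compute that sits idle while decode is waiting on memory.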

Sudo su (@sudoingX): running Ornith on the dgx spark to see what it actually is.

it’s a new agentic coding model from @ornith_ / deepreinforce-ai, the 35B MoE (A3B, ~3B active per token). pulled the Q4_K_M gguf (~20GB), wired it into hermes agent, ~78 tok/s on a single spark with fast prefill, so it

Similar Articles

@iotcoi: Qwen3.6-27B-FP8 + Dflash + DDTree, 256k context, 10 agents ~200 tokens/sec max decode 136t/s average on a single tiny G…

24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context)

80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP

@sudoingX: running Ornith on the dgx spark to see what it actually is. it's a new agentic coding model from @ornith_ / deepreinfor…

Ornith 35B works reasonably well with Qwen3.6 35B DFlash speculative model
