Ornith 35B works reasonably well with the Qwen3.6 35B DFlash speculative model
Summary
Ornith 35B shows a 30-40% token-generation speedup when paired with the Qwen3.6 35B DFlash speculative model in llama-server, reaching an 80% acceptance rate on mixed code and text, though prompt-processing speed suffers.
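To make the two figures in the summary concrete, here is a minimal Python sketch of how an acceptance rate and a relative generation speedup can be computed from raw measurements. The function names and all input numbers are hypothetical, chosen only to land in the ranges quoted above; they are not measurements from the original post, and this is not how llama-server itself reports these values.

```python
def acceptance_rate(accepted: int, drafted: int) -> float:
    """Fraction of drafted tokens that the target model accepted."""
    if drafted <= 0:
        raise ValueError("drafted must be positive")
    return accepted / drafted


def speedup_percent(baseline_tps: float, speculative_tps: float) -> float:
    """Relative gain in tokens per second over the non-speculative baseline."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return (speculative_tps / baseline_tps - 1.0) * 100.0


# Hypothetical inputs for illustration only (not taken from the post).
rate = acceptance_rate(accepted=800, drafted=1000)
gain = speedup_percent(baseline_tps=50.0, speculative_tps=67.5)

print(f"acceptance rate: {rate:.0%}")
print(f"generation speedup: {gain:.0f}%")
```

Note that a high acceptance rate does not translate one-to-one into speedup: the time spent drafting and verifying also counts, which is consistent with the summary reporting an 80% acceptance rate alongside a 30-40% gain.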
Similar Articles
@TeksEdge: Been testing Orinth-1.0-35B to see how it stacks up with Qwen3.6-35B over a day's use. Anecdotally, it works as well as…
A user reports that Ornith-1.0-35B matches Qwen3.6-35B in performance but excels at planning and long task execution, while the developer announces the open-source Ornith-1.0 family of LLMs specialized for agentic coding.
z-lab/Qwen3.6-35B-A3B-DFlash
z-lab releases DFlash, a speculative decoding drafter that uses a lightweight block-diffusion model to draft 15–16 tokens in parallel, yielding up to 2.9× speedup for Qwen3.6-35B-A3B inference.
@LottoLabs: This is awesome work Dflash for qwen 3.5/6 series
Charles Frye announces the co-release with Z Lab of six new DFlash speculators for Alibaba Qwen 3.x models, achieving over 1k output tokens per second for Qwen 3.5 122B-A10B on a B200.
@sudoingX: i was running Ornith new 35b moe on llama.cpp with a Q4 quant, 4 bit, small, fast. it hit ~78 tok/s. then i swapped eng…
A 35B MoE agentic coding model called Ornith runs nearly lossless at FP8 on a single DGX Spark, reaching a 3M-token context and ~36 tok/s, with speculative decoding expected to boost speed further.
Running Qwen3.6 35b a3b on 8gb vram and 32gb ram ~190k context
The author shares a high-performance local inference configuration for running Qwen3.6 35B A3B on limited hardware (8GB VRAM, 32GB RAM) using a modified llama.cpp with TurboQuant support, achieving ~37-51 tok/sec with ~190k context.