Ornith 35B works reasonably well with Qwen3.6 35B DFlash speculative model

Reddit r/LocalLLaMA 06/29/26, 08:55 PM Models

speculative-decoding inference-speedup llama-server ornith-35b qwen3.6 dflash

Summary

Ornith 35B shows a 30-40% token generation speedup when paired with the Qwen3.6 35B DFlash speculative model in llama-server, with an 80% acceptance rate at 50k context on mixed code and text, though prompt processing takes a hit.

I saw a solid 30-40% token generation increase from this:

    ./llama-server --no-mmap --port 8080 --host 0.0.0.0 -kvu -ts 75,70 \
      --alias qwen -hf bartowski/deepreinforce-ai_Ornith-1.0-35B-GGUF:Q8_0 -sm layer -c 255000 -cram 0 \
      -ctk f16 -ctv f16 -fa 1 --jinja -t 7 --metrics --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0 \
      --presence_penalty 0.0 --repeat-penalty 1.0 --ctx-checkpoints 4 --checkpoint-min-step 1024 \
      --chat-template-kwargs '{"preserve_thinking": true}' \
      -hfd williamliao/Qwen3.6-35B-A3B-DFlash-GGUF:Q8_0 --spec-draft-n-max 4 --spec-type draft-dflash

I'm not completely sure it's the best DFlash match, but it's good enough (I got a solid 80% acceptance rate at 50k context of JavaScript code mixed in with random Wikipedia tests).

As is common with speculative drafting, you gain speed in token generation but take a solid hit in prompt processing. So this is far from a silver bullet, but it might help some of you.
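For a rough sense of how the 80% acceptance rate, the draft length of 4, and the 30-40% speedup fit together, here is a back-of-the-envelope sketch in Python. It is not taken from the post or from llama.cpp. It assumes the acceptance rate means accepted draft tokens divided by drafted tokens, that every round drafts exactly 4 tokens, and that the target model adds one token of its own per round. The function names are made up for illustration.

```python
def tokens_per_round(acceptance_rate: float, draft_len: int) -> float:
    """Mean tokens emitted per draft-and-verify round.

    Assumptions (not confirmed by the post): acceptance_rate is
    accepted / drafted tokens, each round drafts exactly draft_len
    tokens, and the target model always adds one token per round.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    return 1.0 + acceptance_rate * draft_len


def implied_round_cost(tokens: float, speedup: float) -> float:
    """Cost of one round, in units of one plain decode step.

    If a round yields `tokens` tokens and generation is `speedup`
    times faster overall, a round must cost tokens / speedup steps.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return tokens / speedup


if __name__ == "__main__":
    # Figures reported in the post: 80% acceptance, --spec-draft-n-max 4,
    # and a 30-40% token generation speedup.
    per_round = tokens_per_round(0.8, 4)
    print(f"tokens per round: {per_round:.2f}")
    for speedup in (1.3, 1.4):
        cost = implied_round_cost(per_round, speedup)
        print(f"speedup {speedup:.1f}x -> round cost {cost:.2f} decode steps")
```

Under these assumptions a round yields about 4.2 tokens, so a 1.3-1.4x speedup would imply that each draft-and-verify round costs roughly 3.0 to 3.2 plain decode steps. That only covers generation; the prompt processing hit mentioned above is separate and is not modeled here.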


Similar Articles

@TeksEdge: Been testing Orinth-1.0-35B to see how it stacks up with Qwen3.6-35B over a day's use. Anecdotally, it works as well as…

X AI KOLs Timeline

A user reports that Ornith-1.0-35B matches Qwen3.6-35B in performance but excels at planning and long task execution, while the developer announces the open-source Ornith-1.0 family of LLMs specialized for agentic coding.

z-lab/Qwen3.6-35B-A3B-DFlash

Hugging Face Models Trending

z-lab releases DFlash, a speculative decoding drafter that uses a lightweight block-diffusion model to draft 15–16 tokens in parallel, yielding up to 2.9× speedup for Qwen3.6-35B-A3B inference.

@LottoLabs: This is awesome work Dflash for qwen 3.5/6 series

X AI KOLs Timeline

Charles Frye announces the co-release with Z Lab of six new DFlash speculators for Alibaba Qwen 3.x models, achieving over 1k output tokens per second for Qwen 3.5 122B-A10B on a B200.

@sudoingX: i was running Ornith new 35b moe on llama.cpp with a Q4 quant, 4 bit, small, fast. it hit ~78 tok/s. then i swapped eng…

X AI KOLs Timeline

A 35B MoE agentic coding model called Ornith runs near lossless at FP8 on a single DGX Spark, achieving 3M token context and ~36 tok/s, with speculative decoding expected to boost speed further.

Running Qwen3.6 35b a3b on 8gb vram and 32gb ram ~190k context

Reddit r/LocalLLaMA

The author shares a high-performance local inference configuration for running Qwen3.6 35B A3B on limited hardware (8GB VRAM, 32GB RAM) using a modified llama.cpp with TurboQuant support, achieving ~37-51 tok/sec with ~190k context.
