Ornith 35B works reasonably well with Qwen3.6 35B DFlash speculative model

Reddit r/LocalLLaMA Models

Summary

Ornith 35B shows a 30-40% token generation speedup when paired with the Qwen3.6 35B DFlash speculative model in llama-server, with an 80% draft acceptance rate at 50k context on mixed code and text, though prompt processing slows down.

I saw a solid 30-40% token generation increase from this:

    ./llama-server --no-mmap --port 8080 --host 0.0.0.0 -kvu -ts 75,70 \
      --alias qwen -hf bartowski/deepreinforce-ai_Ornith-1.0-35B-GGUF:Q8_0 -sm layer -c 255000 -cram 0 \
      -ctk f16 -ctv f16 -fa 1 --jinja -t 7 --metrics --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0 \
      --presence_penalty 0.0 --repeat-penalty 1.0 --ctx-checkpoints 4 --checkpoint-min-step 1024 \
      --chat-template-kwargs '{"preserve_thinking": true}' \
      -hfd williamliao/Qwen3.6-35B-A3B-DFlash-GGUF:Q8_0 --spec-draft-n-max 4 --spec-type draft-dflash

I'm not completely sure it's the best DFlash match, but it's good enough: I got a solid 80% acceptance rate at 50k context of JavaScript code mixed in with random Wikipedia texts.

As is common with speculative drafting, you gain speed in token generation but take a solid hit in prompt processing. So this is far from a silver bullet, but it might help some of you.
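For anyone wanting to sanity-check numbers like these, here is a rough back-of-the-envelope sketch in Python. It assumes the reported acceptance rate means accepted draft tokens divided by drafted tokens, that every step drafts exactly `--spec-draft-n-max` tokens, and that the target model adds one token per verification pass. The function names are made up for illustration and are not part of llama.cpp.

```python
def tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Average tokens emitted per verification pass.

    Assumption: accept_rate = accepted / drafted, and each pass drafts
    exactly draft_len tokens, then the target model adds one token itself.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    return 1.0 + accept_rate * draft_len


def implied_step_cost(tokens_each_step: float, observed_speedup: float) -> float:
    """Cost of one draft+verify pass relative to one plain decode step.

    If a pass yields tokens_each_step tokens but overall throughput only
    rises by observed_speedup, each pass must cost this many plain steps.
    """
    if observed_speedup <= 0:
        raise ValueError("observed_speedup must be positive")
    return tokens_each_step / observed_speedup


if __name__ == "__main__":
    # Figures from the post: 80% acceptance, draft length 4, 1.3x to 1.4x speedup.
    per_step = tokens_per_step(0.8, 4)
    print(f"tokens per pass: {per_step:.2f}")
    for speedup in (1.3, 1.4):
        cost = implied_step_cost(per_step, speedup)
        print(f"speedup {speedup:.1f}x -> pass costs about {cost:.2f} plain steps")
```

Under these assumptions, each pass yields about 4.2 tokens, so a 30-40% net gain would imply that a draft+verify pass costs roughly three plain decode steps. That is only an estimate from the reported figures, not a measurement.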
