Orthrus-Qwen3-8B: up to 7.8× tokens/forward on Qwen3-8B, frozen backbone, provably identical output distribution
Summary
Orthrus is a method that injects a trainable diffusion attention module into a frozen autoregressive transformer, achieving up to 7.8× tokens per forward pass and ~6× wall-clock speedup on MATH-500. Its output distribution is provably identical to that of the base Qwen3-8B model. The approach requires minimal additional parameters and training, and it avoids the time-to-first-token (TTFT) penalty of external drafters.
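To make the "identical output" claim concrete, here is a minimal sketch of generic draft-and-verify decoding in the greedy case. It is not Orthrus's actual mechanism, which the summary does not spell out. The functions `base_next` and `draft_block` are hypothetical toy stand-ins for the frozen backbone and a parallel drafter. Because every emitted token is the one the base model itself would pick, the output matches plain autoregressive decoding while using fewer base forward passes.

```python
from typing import List, Tuple


def base_next(prefix: List[int]) -> int:
    """Toy stand-in (hypothetical) for the frozen model's greedy next token."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_block(prefix: List[int], k: int) -> List[int]:
    """Toy stand-in (hypothetical) for a parallel drafter that is sometimes wrong."""
    out, ctx = [], list(prefix)
    for _ in range(k):
        tok = base_next(ctx)
        if len(ctx) % 5 == 4:  # deliberately inject an error at some positions
            tok = (tok + 1) % 50
        out.append(tok)
        ctx.append(tok)
    return out


def generate_ar(prompt: List[int], n: int) -> List[int]:
    """Baseline: one token per base forward pass."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(base_next(seq))
    return seq[len(prompt):]


def generate_verified(prompt: List[int], n: int, k: int) -> Tuple[List[int], int]:
    """Draft k tokens, then keep only what the base model agrees with."""
    seq, forwards = list(prompt), 0
    while len(seq) - len(prompt) < n:
        draft = draft_block(seq, k)
        forwards += 1  # a real model scores all k draft positions in one pass
        ctx = list(seq)
        for tok in draft:
            target = base_next(ctx)
            ctx.append(target)  # always emit the base model's own choice
            if tok != target:
                break  # first mismatch: keep the correction, drop the rest
        else:
            ctx.append(base_next(ctx))  # all accepted: one bonus token
        seq = ctx
    return seq[len(prompt):len(prompt) + n], forwards


if __name__ == "__main__":
    out, forwards = generate_verified([1, 2, 3], 32, 8)
    print(out == generate_ar([1, 2, 3], 32), f"{32 / forwards:.2f} tokens/forward")
```

This sketch covers only greedy decoding. With sampling, matching the base distribution exactly requires a probabilistic accept/reject rule rather than an equality check.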
Similar Articles
Orthrus-Qwen3: up to 7.8× tokens/forward on Qwen3, identical output distribution
Orthrus is a dual-architecture framework that combines autoregressive LLM fidelity with diffusion model speed, delivering up to 7.8x speedup on Qwen3 models while guaranteeing identical output distribution.
Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion
Orthrus is a dual-architecture framework that combines autoregressive LLMs with diffusion models for fast parallel token generation while maintaining exact inference fidelity via shared KV caches and consensus mechanisms, achieving up to 7.8x speedup.
Qwen 3.6 27B AR -> Diffusion - local training on 5090
The author details attempts to locally train a Qwen 3.6 27B autoregressive-to-diffusion model on an NVIDIA 5090 GPU using QLoRA and modifications from open-dllm and d3LLM, running into VRAM constraints and hardware issues while exploring one-shot diffusion techniques.
Wow! Qwen 3.6:35b-a3b on a 3090... pretty amazing.
A user shares impressive results running a quantized Qwen 3.6:35b-a3b model on a used RTX 3090, achieving 160 tokens per second output after fitting the model into VRAM, and demonstrates vision capabilities with a 75-second video processing time.
Qwen3.6 27B Pure Quant: 40 tok/s on 16 GB VRAM
A quantized version of Qwen3.6 27B using a pure Q4_K_M method fits entirely in 16 GB VRAM, achieving up to 40 tok/s token generation speed with MTP, and significantly reducing model size compared to other GGUF variants.