attention-module

Orthrus-Qwen3-8B: up to 7.8× tokens/forward on Qwen3-8B, frozen backbone, provably identical output distribution

Reddit r/LocalLLaMA ↗ · 2026-05-15

Introduces Orthrus, a method that injects a trainable diffusion attention module into a frozen autoregressive transformer to achieve up to 7.8× tokens per forward pass and ~6× wall-clock speedup on MATH-500, with provably identical output distribution to the base Qwen3-8B model. The approach requires minimal additional parameters and training, and avoids the TTFT penalty of external drafters.
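The summary does not say how Orthrus verifies the tokens proposed by its diffusion module. As a rough illustration of how a draft-and-verify scheme can keep the output distribution identical to the base model, here is a minimal sketch of the standard speculative sampling acceptance rule. It is an assumption that Orthrus uses something of this kind; the names `verify_token` and `output_distribution`, and the toy distributions `p` (base model) and `q` (drafter), are hypothetical.

```python
import random


def verify_token(p, q, drafted, rng):
    """Accept or replace one drafted token (standard speculative sampling).

    p: base-model probabilities over the vocabulary (assumed given).
    q: drafter probabilities over the same vocabulary (assumed given).
    drafted: index of the token that was sampled from q.
    """
    # Accept the drafted token with probability min(1, p/q).
    if rng.random() < min(1.0, p[drafted] / q[drafted]):
        return drafted
    # On rejection, resample from the normalized residual max(p - q, 0).
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual, k=1)[0]


def output_distribution(p, q):
    """Exact distribution of the token returned by verify_token."""
    # Mass accepted at x is q[x] * min(1, p[x]/q[x]) = min(p[x], q[x]).
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]
    rejected_mass = 1.0 - sum(accepted)
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    total = sum(residual)
    if total == 0.0:  # p == q, so nothing is ever rejected
        return accepted
    return [a + rejected_mass * r / total for a, r in zip(accepted, residual)]


p = [0.5, 0.3, 0.2]  # toy base-model distribution
q = [0.2, 0.2, 0.6]  # toy drafter distribution
result = output_distribution(p, q)
token = verify_token(p, q, drafted=2, rng=random.Random(0))
```

In this toy case `result` matches `p` up to floating-point error: however poor the drafter is, the accepted-or-resampled token follows the base model's distribution, and only the acceptance rate (and therefore the speedup) suffers.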


