Tag
llama.cpp releases version b9495 with optimizations for Qwen3.6/3.5-MTP (Multi-Token Prediction) and asks users to share their benchmark results along with the full commands used.
Domino is a speculative decoding framework that decouples causal dependency modeling from autoregressive drafting, using a parallel backbone and lightweight causal refinement head to achieve up to 5.49× end-to-end speedup on Qwen3 models.
A user reports successfully running Qwen3 8B locally on an older GTX 1070 GPU, demonstrating that modern LLMs can run on decade-old hardware with decent performance.
ETCHR is a novel image editing approach that decouples visual reasoning from image generation, using a two-stage training process (Reasoning Imitation and Reasoning Enhancement) to improve multimodal language model performance across five visual reasoning tasks. It achieves consistent gains of 4-5% Pass@1 on models like Qwen3-VL-8B, Gemini-3.1-Flash-Lite, and Kimi K2.5.
Mega-ASR is a 1.7B parameter robust ASR model under Apache 2.0, designed for noisy, reverberant, and overlapping speech, with an audio quality router to handle clean vs degraded audio.
MiroThinker-1.7 is an open-weight deep research agent built on Qwen3 MoE, with a mini version (30B total, 3B active) designed for consumer hardware; the team shares benchmarks and seeks feedback on local deployment.
Orthrus is a dual-architecture framework that combines autoregressive LLM fidelity with diffusion model speed, delivering up to a 7.8× speedup on Qwen3 models while guaranteeing an identical output distribution.
Introduces Orthrus, a method that injects a trainable diffusion attention module into a frozen autoregressive transformer to achieve up to 7.8× as many tokens per forward pass and a ~6× wall-clock speedup on MATH-500, with an output distribution provably identical to that of the base Qwen3-8B model. The approach requires minimal additional parameters and training, and avoids the time-to-first-token (TTFT) penalty of external drafters.
Red Hat AI released a DFlash speculator model for Qwen3-8B, achieving 82.2% first-token acceptance on math reasoning tasks. The model was trained using the Speculators library and vLLM to optimize inference speed.
MemReranker is a reasoning-aware reranking model family (0.6B/4B) designed for agent memory retrieval, addressing limitations in semantic similarity by incorporating LLM knowledge distillation for better temporal and causal reasoning.
A developer shows how to run Qwen3 TTS locally in real time, with streaming, quantization, word-level alignment, and custom voice fine-tuning, for an expressive open-source TTS pipeline.
DFlash v0.1.4 ships custom Metal verify kernels for quantized Qwen3 hybrid models, with a significant reduction in peak memory and a 2.2× throughput improvement at long context on M5 Max GPUs.