qwen3

llama.cpp - Qwen3.6/3.5-MTP - Share your benchmarks t/s

Reddit r/LocalLLaMA · yesterday

llama.cpp release b9495 adds optimizations for Qwen3.6/3.5-MTP (Multi-Token Prediction); the post asks users to share their benchmark results in tokens per second, together with the full command used.

Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding

Hugging Face Daily Papers · 2026-05-28

Domino is a speculative decoding framework that decouples causal dependency modeling from autoregressive drafting. It uses a parallel backbone and a lightweight causal refinement head to achieve up to 5.49× end-to-end speedup on Qwen3 models.

@tunguz: After seeing these tweets, I decided to try it out on my own old Ubuntu computer with RTX 1070 GPU (the one that I just…

X AI KOLs Following · 2026-05-26

A user reports running Qwen3 8B locally on an old Ubuntu machine with a GTX 1070 GPU (written as "RTX 1070" in the tweet), suggesting that modern LLMs can run on roughly decade-old hardware with usable performance.

ETCHR: Editing To Clarify and Harness Reasoning

Hugging Face Daily Papers · 2026-05-22

ETCHR is a novel image editing approach that decouples visual reasoning from image generation, using a two-stage training process (Reasoning Imitation and Reasoning Enhancement) to improve multimodal language model performance across five visual reasoning tasks. It achieves consistent gains of 4-5% Pass@1 on models like Qwen3-VL-8B, Gemini-3.1-Flash-Lite, and Kimi K2.5.

@AdinaYakup: Mega-ASR https://huggingface.co/zhifeixie/Mega-ASR… 1.7B Apache 2.0 Built for Noise/Reverb/Clipping/Overlapping speaker…

X AI KOLs Following · 2026-05-21

Mega-ASR is a 1.7B parameter robust ASR model under Apache 2.0, designed for noisy, reverberant, and overlapping speech, with an audio quality router to handle clean vs degraded audio.

MiroThinker-1.7, an open-weight deep research agent (Qwen3 MoE base) — mini is 30B/3B active, curious what tok/s people get on consumer hardware

Reddit r/LocalLLaMA · 2026-05-17

MiroThinker-1.7 is an open-weight deep research agent built on Qwen3 MoE, with a mini version (30B total, 3B active) designed for consumer hardware; the team shares benchmarks and seeks feedback on local deployment.

Orthrus-Qwen3: up to 7.8× tokens/forward on Qwen3, identical output distribution

Hacker News Top · 2026-05-15

Orthrus is a dual-architecture framework that combines autoregressive LLM fidelity with diffusion model speed, delivering up to 7.8× tokens per forward pass on Qwen3 models while guaranteeing an identical output distribution.

Orthrus-Qwen3-8B: up to 7.8× tokens/forward on Qwen3-8B, frozen backbone, provably identical output distribution

Reddit r/LocalLLaMA · 2026-05-15

Introduces Orthrus, a method that injects a trainable diffusion attention module into a frozen autoregressive transformer to achieve up to 7.8× tokens per forward pass and ~6× wall-clock speedup on MATH-500, with an output distribution provably identical to that of the base Qwen3-8B model. The approach requires minimal additional parameters and training, and avoids the time-to-first-token (TTFT) penalty of external drafters.

@RedHat_AI: Qwen3-8B now has a DFlash speculator! 82.2% first-token acceptance on math reasoning. 3.74 avg tokens accepted per step…

X AI KOLs Following · 2026-05-10

Red Hat AI released a DFlash speculator model for Qwen3-8B, reporting 82.2% first-token acceptance and an average of 3.74 tokens accepted per step on math reasoning tasks. The model was trained using the Speculators library and vLLM to optimize inference speed.

MemReranker: Reasoning-Aware Reranking for Agent Memory Retrieval

arXiv cs.CL · 2026-05-08

MemReranker is a reasoning-aware reranking model family (0.6B/4B) designed for agent memory retrieval, addressing limitations in semantic similarity by incorporating LLM knowledge distillation for better temporal and causal reasoning.

Qwen3 TTS is seriously underrated - I got it running locally in real-time and it's one of the most expressive open TTS models I've tried

Reddit r/LocalLLaMA · 2026-04-22

Developer shows how to run Qwen3 TTS locally in real-time with streaming, quantization, word-level alignment, and custom voice fine-tuning for an expressive open-source TTS pipeline.

@bstnxbt: DFlash v0.1.4: custom Metal verify kernels for quantized Qwen3 hybrid models, plus significant peak memory reduction a…

X AI KOLs Following · 2026-04-18

DFlash v0.1.4 adds custom Metal verify kernels for quantized Qwen3 hybrid models, along with a significant peak memory reduction and a 2.2× throughput improvement at long context on M5 Max GPUs.
