speedup

#speedup

SpecLA: Efficient Speculative Decoding for Linear-Attention Models

arXiv cs.CL · 5d ago

SpecLA proposes a speculative decoding runtime tailored for stateful linear-attention models, achieving up to 1.70x end-to-end speedup over autoregressive decoding on an NVIDIA H100 with a GDN-1.3B target.

New set of FP4 attention kernels for B300, achieving up to 1.69x speedup over FA4

Reddit r/LocalLLaMA · 2026-07-14

The FastVideo team releases new FP4 attention kernels for B300, achieving up to 1.69x speedup over FlashAttention 4.

2.5x faster Qwen3.6 NVFP4 Unsloth quants

Reddit r/LocalLLaMA · 2026-07-10

Unsloth releases Qwen3.6 models quantized in the NVFP4 format, achieving 2.5x faster inference.

@AYi_AInotes: Two Hong Kong students achieved a 5x speedup on Karpathy's automated research framework. They didn't switch to a stronger model, add more compute, or even change much code. They just added another loop on top of the original loop. This might be the most useful paper of the year for ordinary Agent developers, bar none. Here's the breakdown:

X AI KOLs Timeline · 2026-07-10

Two Hong Kong students achieved a 5x speedup on Karpathy's automated research framework by wrapping its original loop in a second, outer loop, without a stronger model, more compute, or major code changes. The poster calls it possibly the most useful paper of the year for ordinary agent developers.

@dejavucoder: my latest blog post "auto-research with codex: how I achieved a 212x faster kernel over baseline with codex in GPU Mode…

X AI KOLs Timeline · 2026-07-08

Blog post by Sankalp detailing how he used Codex to achieve a 232x faster GPU kernel for QR decomposition in GPU Mode's contest, outlining his auto-research methodology.

@charles_irl: If you're interested in speculative decoding, take some time to grok this chart! And read the article from @haoailab.ht…

X AI KOLs Timeline · 2026-07-07

A roofline model from the LLM Engineer's Almanac estimates speedups from speculative decoding for different draft lengths across models and hardware, with a note that it may underestimate benefits when overhead is significant.

Fable 5 sits at the top of KernelBench. Jack Clark calls it “the start of a RSI loop”

Reddit r/singularity · 2026-07-06

Fable 5 achieves the top rank on KernelBench-Mega by writing a highly efficient CUDA megakernel with an 18.71x speedup, signaling progress toward recursive self-improvement (RSI) in AI R&D.

I asked Codex to optimize DeepSeek V4 Flash 8-bit MLX on oMLX. Got ~1.6x prefill and ~3x decode speedup.

Reddit r/LocalLLaMA · 2026-07-05

The author used Codex to optimize DeepSeek V4 Flash 8-bit MLX on oMLX, achieving approximately 1.6x prefill and 3x decode speedup.

@ollama: Gemma 4 is now nearly 90% faster on Apple Silicon with Ollama using MLX! The speedup comes from improved multi-token pr…

X AI KOLs Following · 2026-07-01

Ollama announces that Gemma 4 is now nearly 90% faster on Apple Silicon using MLX, thanks to improved multi-token prediction enabled by default, with automatic tuning to avoid slowdown.

@DeRonin_: DeepSeek just dropped a 5-page paper + free GitHub repo that makes any LLM respond 80% faster it's called speculative d…

X AI KOLs Following · 2026-06-27

DeepSeek released a paper and an MIT-licensed open-source implementation of speculative decoding (DSpark) that speeds up LLM responses by up to 80% by pairing a small 'guess' model with a large 'check' model, gaining speed without sacrificing accuracy.

[Research] JetSpec: Speculative Decoding with Parallel Tree Drafting Enables up to 9.64x Lossless LLM Inference Speedup with more than 1000TPS

Reddit r/LocalLLaMA · 2026-06-25

JetSpec introduces parallel tree drafting for speculative decoding, achieving up to 9.64x lossless end-to-end speedup on LLM inference, with throughput of more than 1000 TPS on a single B200 GPU.

I'm eager for a 15x speedup on my strix halo

Reddit r/LocalLLaMA · 2026-06-23

Nvidia claims a 15x speedup in text generation using a diffusion model, generating entire blocks at once.

GLM 5.2 on Mac Studio Speedup PR

Reddit r/LocalLLaMA · 2026-06-23

GLM 5.2 delivers major performance gains on Mac Studio with 512GB RAM, achieving prefill speeds above 100 t/s at high context lengths and enabling 4-bit quantization for contexts over 100k tokens, as detailed in a pull request by the oMLX creator.

@_avichawla: Researchers made KMeans 200x faster. And the new technique also beats approaches like cuML and FAISS. Flash-KMeans is a…

X AI KOLs Timeline · 2026-06-16

Flash-KMeans is an IO-aware implementation of exact KMeans that redesigns the algorithm around modern GPU bottlenecks, achieving 33x speedup over cuML and 200x over FAISS by eliminating redundant memory reads and writes.

@AnimaAnandkumar: This is something I have been emphasizing since we started our work on Neural Operators. We very quickly went from simp…

X AI KOLs Following · 2026-06-10

Anima Anandkumar highlights that neural operators, despite being associated with simple benchmarks, have achieved massive speedups (from 10,000x to a million-fold) on hard real-world problems such as high-resolution AI weather modeling (FourCastNet) and nuclear fusion turbulence. She references a new paper showing that learned solvers become more cost-effective as PDE tasks get harder.

Accelerating NeurASP with vectorization and caching

arXiv cs.AI · 2026-06-10

This paper accelerates the NeurASP neurosymbolic AI framework through vectorization, batch processing, and caching, achieving speedups of multiple orders of magnitude on larger tasks.

Using Gemma 4 E4B with the LiteRT engine - ~2.4x speedup over Q4 GGUF in text generation, image processing roughly the same

Reddit r/LocalLLaMA · 2026-06-02

A developer benchmarks Gemma 4 E4B on Google's LiteRT engine against a Q4 GGUF quant, finding a ~2.4x speedup in text generation due to multi-token prediction (MTP), but only 1.1x in image captioning. The post provides a Python wrapper for an OpenAI-compatible endpoint, with limitations such as deterministic output and a single-session engine.

@atomic_chat_hq: MTP speedup Qwen by 2.5x in Atomic Chat Dense vs MoE models on 2x RTX 5090 Qwen3.6 27B: 51 → 117 tps +137% Qwen3.6 35B-…

X AI KOLs Timeline · 2026-05-20

Atomic Chat's MTP technique speeds up Qwen dense models by 2.5x and MoE models by 25% on 2x RTX 5090 with zero accuracy loss and ~1 GB extra VRAM, using speculative decoding to draft and verify multiple tokens in one pass.

Dual GPU llama.cpp speedup

Reddit r/LocalLLaMA · 2026-05-17

A fork of llama.cpp fixes the --split-mode tensor issue with quantized KV caches, achieving up to 40% speed improvement on dual GPU setups without quality loss.

@NousResearch: Today we release Lighthouse Attention, a selection-based hierarchical attention for long-context pre-training that deli…

X AI KOLs Following · 2026-05-15

NousResearch releases Lighthouse Attention, a selection-based hierarchical attention that achieves 1.4-1.7x wall-clock speedup at 98K context and ~17x faster forward/backward pass than standard attention at 512K context on a single B200, validated on 530M-parameter Llama-3 models across 50B tokens.
