The new course 'Transformers in Practice' from deeplearning.ai and AMD teaches a practical understanding of transformer-based LLMs, covering text generation, attention mechanisms, and inference optimization techniques such as quantization and KV caching.
This article explains how to implement asynchronous continuous batching for LLM inference, overlapping CPU batch preparation with GPU computation to maximize utilization and reduce idle time.
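As a rough illustration of the overlap the article describes, here is a minimal Python sketch using two threads and bounded queues; the batch limit, queue sizes, and the `run_forward` stand-in are assumptions for illustration, not the article's implementation.

```python
# Minimal sketch of CPU/GPU overlap for continuous batching: one thread builds
# batches on the CPU while another keeps the "GPU" fed, so neither side idles.
import queue
import threading
import time

request_q = queue.Queue()          # incoming prompts from clients
batch_q = queue.Queue(maxsize=2)   # prepared batches; bounded for backpressure

def cpu_batcher(max_batch=8):
    """CPU side: drain pending requests into a batch while the GPU is busy."""
    while True:
        batch = [request_q.get()]              # block until one request arrives
        while len(batch) < max_batch:
            try:
                batch.append(request_q.get_nowait())
            except queue.Empty:
                break
        batch_q.put(batch)                     # hand the prepared batch to the GPU thread

def gpu_worker(run_forward):
    """GPU side: a batch is always ready, so the device never idles on batching."""
    while True:
        run_forward(batch_q.get())             # one decode step per prepared batch

threading.Thread(target=cpu_batcher, daemon=True).start()
threading.Thread(target=gpu_worker, args=(lambda b: print("decode", b),), daemon=True).start()

for i in range(20):                            # simulate request arrivals
    request_q.put(f"prompt-{i}")
time.sleep(0.5)                                # give the workers time to drain
```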
This paper introduces Ada-MK, an adaptive MegaKernel optimization method that uses automated DAG-based search to eliminate runtime branching and reduce shared memory usage for LLM inference. Integrated with TensorRT-LLM, it demonstrates significant throughput gains on NVIDIA Ada GPUs, running up to 23.6% faster than vanilla TensorRT-LLM in commercial advertising systems.
vLLM v0.21.0rc1 is a pre-release update for the high-performance LLM inference and serving library, featuring optimizations for throughput, quantization, and hardware support.
The author demonstrates how to reduce RTX 4090 power consumption by up to 40% while running quantized Qwen models via llama.cpp, without sacrificing inference speed. By capping GPU power limits through nvidia-smi and adjusting llama-server parameters, users can significantly reduce heat and noise and extend hardware lifespan.
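For reference, a minimal Python sketch of the power-capping step: nvidia-smi's `-i` (GPU index), `-pl` (set power limit in watts), and `-q -d POWER` (query power state) flags are real, but the 270 W target is an illustrative value rather than the author's exact setting.

```python
# Sketch: cap GPU power via nvidia-smi from Python. Setting a power limit
# normally requires root/administrator privileges; 270 W is illustrative only.
import subprocess

def set_power_limit(gpu_index: int, watts: int) -> None:
    subprocess.run(["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)], check=True)

def show_power_state(gpu_index: int) -> None:
    subprocess.run(["nvidia-smi", "-i", str(gpu_index), "-q", "-d", "POWER"], check=True)

set_power_limit(0, 270)   # cap GPU 0 to 270 W
show_power_state(0)       # verify current draw and the new limit
```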
A developer toolkit providing configurations, wheels, and benchmarks for running large language models with NVFP4 precision on NVIDIA Blackwell GPUs using TensorRT-LLM.
A technical discussion validates TurboQuant performance data on NVIDIA H100 GPUs with FP8 Tensor Cores and promises further insights from non-H100 testing.
ExLlamaV3 has released a series of major updates, including Gemma 4 support, improved caching efficiency, and the new DFlash technology for significantly faster inference across various model categories.
The article details a customized quantized version of DeepSeek-V4-Flash with MTP self-speculation enabled, achieving significant speedups on dual RTX PRO 6000 Max-Q GPUs using a patched vLLM setup.
0xSero has released new FP8 and NVFP4 quantized versions of the Tencent Hy3-preview model, enabling it to run in 256 GB of VRAM with full context.
BeeLlama.cpp is a performance-focused fork of llama.cpp that introduces DFlash speculative decoding and TurboQuant KV-cache compression, enabling high-speed local inference of large models like Qwen 3.6 27B on consumer hardware.
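For background on the draft-and-verify idea behind speculative decoding features like DFlash, here is a generic Python sketch; it is not BeeLlama.cpp's implementation, whose internals the post does not describe, and `draft_model`/`target_model` are hypothetical greedy next-token callables.

```python
# Generic draft-and-verify speculative decoding sketch, for background only.
# draft_model and target_model are assumed callables: context -> token id.
def speculative_step(prefix, draft_model, target_model, k=4):
    """Draft k tokens cheaply, then keep the longest prefix the target agrees with."""
    ctx = list(prefix)
    drafted = []
    for _ in range(k):
        tok = draft_model(ctx)                 # cheap draft model proposes a token
        drafted.append(tok)
        ctx.append(tok)

    # Verification: a real implementation scores all k drafted positions with a
    # single batched target forward pass; a plain loop is used here for clarity.
    accepted = []
    ctx = list(prefix)
    for tok in drafted:
        target_tok = target_model(ctx)
        if target_tok == tok:                  # target agrees: keep the drafted token
            accepted.append(tok)
            ctx.append(tok)
        else:                                  # disagreement: take the target's token, stop
            accepted.append(target_tok)
            break
    return accepted
```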
A user shares a configuration for achieving over 80 tokens per second with Qwen3.6 35B A3B on a 12 GB VRAM GPU using llama.cpp and Multi-Token Prediction (MTP). The post includes benchmark results and specific command-line parameters to optimize performance.
TRL v1.4 is released, featuring chunked NLL (negative log-likelihood) loss for SFT to reduce VRAM usage, as well as first-class integration with OpenReward for GRPO.
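As background on the chunking idea, here is a minimal PyTorch sketch that computes NLL over sequence slices so the full [batch, seq, vocab] logits tensor never materializes at once; this shows the general technique, not TRL's exact implementation, and `lm_head`, the chunk size, and pre-shifted labels are assumptions.

```python
# Chunked NLL sketch: project to vocabulary logits one sequence slice at a
# time, so peak memory scales with the chunk size rather than the full length.
import torch
import torch.nn.functional as F

def chunked_nll(hidden, lm_head, labels, chunk=512):
    """hidden: [B, T, D] final hidden states; labels: [B, T] ids, -100 = ignore."""
    total = hidden.new_zeros(())
    count = 0
    for s in range(0, hidden.size(1), chunk):
        logits = lm_head(hidden[:, s:s + chunk])   # only [B, <=chunk, V] lives at once
        tgt = labels[:, s:s + chunk]
        total = total + F.cross_entropy(
            logits.flatten(0, 1), tgt.flatten(),
            reduction="sum", ignore_index=-100,
        )
        count += int((tgt != -100).sum())          # tokens that contribute to the loss
    return total / max(count, 1)                   # mean over non-ignored tokens
```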
This paper introduces TwELL and Hybrid sparse formats with custom CUDA kernels to efficiently leverage unstructured sparsity in LLMs, achieving over 20% faster training and inference on H100 GPUs while reducing energy and memory usage.
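For background on ELL-style layouts referenced by names like TwELL, here is a textbook ELLPACK packing and mat-vec sketch in NumPy; it shows only the classic padded fixed-width format, not the paper's TwELL or Hybrid variants or its CUDA kernels, and the matrices are illustrative.

```python
# Classic ELLPACK (ELL) sparse layout: each row stores a fixed number of
# nonzero values plus their column indices, padded with zeros to the widest row.
import numpy as np

def to_ell(dense):
    """Pack a sparse matrix into ELL: fixed-width per-row values + column indices."""
    width = int((dense != 0).sum(axis=1).max())        # pad every row to the widest
    vals = np.zeros((dense.shape[0], width), dense.dtype)
    cols = np.zeros((dense.shape[0], width), np.int64)
    for r in range(dense.shape[0]):
        idx = np.nonzero(dense[r])[0]
        vals[r, :len(idx)] = dense[r, idx]
        cols[r, :len(idx)] = idx
    return vals, cols

def ell_matvec(vals, cols, x):
    # Padding stores zero values, so a branch-free gather-multiply-sum is correct.
    return (vals * x[cols]).sum(axis=1)

a = np.array([[0.0, 2.0, 0.0], [1.0, 0.0, 3.0]])
vals, cols = to_ell(a)
print(ell_matvec(vals, cols, np.array([1.0, 1.0, 1.0])))   # [2. 4.]
```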
vLLM v0.20.0 is released, an open-source library for high-throughput LLM inference and serving, featuring PagedAttention and support for various hardware architectures.
Researchers from MIT and IBM have developed a rapid tool that estimates AI power consumption in seconds, significantly faster than traditional emulation methods, to help optimize data center energy efficiency.
DeepSeek open-sourced DeepEP V2 and TileKernels, new GPU kernel libraries aimed at accelerating AI workloads.
vLLM v0.20.0rc1 is released with major throughput, quantization, speculative decoding, and multi-hardware support enhancements for scalable LLM serving.
A 31B-parameter model runs locally on a laptop via the Hermes agent at 15 tok/s, using 22.8 GB of VRAM and 94 W of power, highlighting fully autonomous, private AI inference without cloud dependencies.
NVIDIA and Google collaborate to optimize Gemma 4 models for local deployment across RTX GPUs, DGX Spark, and Jetson devices, enabling efficient on-device agentic AI with support for reasoning, coding, multimodal capabilities, and 35+ languages.