The new course 'Transformers in Practice' from deeplearning.ai and AMD teaches a practical understanding of transformer-based LLMs, covering text generation, attention mechanisms, and inference optimization techniques such as quantization and KV caching.
This article explains how to implement asynchronous continuous batching for LLM inference, overlapping CPU batch preparation with GPU computation to maximize utilization and reduce idle time.
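As a rough illustration of the overlap the article describes, here is a minimal Python sketch using two threads and bounded queues; the batch limit, queue sizes, and the `run_forward` stand-in are assumptions for illustration, not the article's implementation.

```python
# Minimal sketch of CPU/GPU overlap for continuous batching: one thread builds
# batches on the CPU while another keeps the "GPU" fed, so neither side idles.
import queue
import threading
import time

request_q = queue.Queue()          # incoming prompts from clients
batch_q = queue.Queue(maxsize=2)   # prepared batches; bounded for backpressure

def cpu_batcher(max_batch=8):
    """CPU side: drain pending requests into a batch while the GPU is busy."""
    while True:
        batch = [request_q.get()]              # block until one request arrives
        while len(batch) < max_batch:
            try:
                batch.append(request_q.get_nowait())
            except queue.Empty:
                break
        batch_q.put(batch)                     # hand the prepared batch to the GPU thread

def gpu_worker(run_forward):
    """GPU side: a batch is always ready, so the device never idles on batching."""
    while True:
        run_forward(batch_q.get())             # one decode step per prepared batch

threading.Thread(target=cpu_batcher, daemon=True).start()
threading.Thread(target=gpu_worker, args=(lambda b: print("decode", b),), daemon=True).start()

for i in range(20):                            # simulate request arrivals
    request_q.put(f"prompt-{i}")
time.sleep(0.5)                                # give the workers time to drain
```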
This paper introduces Ada-MK, an adaptive MegaKernel optimization method that uses automated DAG-based search to eliminate runtime branching and reduce shared memory usage for LLM inference. Integrated with TensorRT-LLM, it demonstrates significant throughput gains on NVIDIA Ada GPUs, running up to 23.6% faster than vanilla TensorRT-LLM in commercial advertising systems.
vLLM v0.21.0rc1 is a pre-release update for the high-performance LLM inference and serving library, featuring optimizations for throughput, quantization, and hardware support.
The author demonstrates how to reduce RTX 4090 power consumption by up to 40% while running quantized Qwen models via llama.cpp, without sacrificing inference speed. By capping GPU power limits through nvidia-smi and adjusting llama-server parameters, users can significantly reduce heat and noise and extend hardware lifespan.
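For reference, a minimal Python sketch of the power-capping step: nvidia-smi's `-i` (GPU index), `-pl` (set power limit in watts), and `-q -d POWER` (query power state) flags are real, but the 270 W target is an illustrative value rather than the author's exact setting.

```python
# Sketch: cap GPU power via nvidia-smi from Python. Setting a power limit
# normally requires root/administrator privileges; 270 W is illustrative only.
import subprocess

def set_power_limit(gpu_index: int, watts: int) -> None:
    subprocess.run(["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)], check=True)

def show_power_state(gpu_index: int) -> None:
    subprocess.run(["nvidia-smi", "-i", str(gpu_index), "-q", "-d", "POWER"], check=True)

set_power_limit(0, 270)   # cap GPU 0 to 270 W
show_power_state(0)       # verify current draw and the new limit
```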
A developer toolkit providing configurations, wheels, and benchmarks for running large language models with NVFP4 precision on NVIDIA Blackwell GPUs using TensorRT-LLM.
A technical discussion validates TurboQuant performance data on NVIDIA H100 GPUs with FP8 Tensor Cores and promises further insights from non-H100 testing.
ExLlamaV3 has released a series of major updates, including Gemma 4 support, improved caching efficiency, and the new DFlash technology for significantly faster inference across various model categories.
The article details a customized quantized version of DeepSeek-V4-Flash with MTP self-speculation enabled, achieving significant speedups on dual RTX PRO 6000 Max-Q GPUs using a patched vLLM setup.
0xSero has released new FP8 and NVFP4 quantized versions of the Tencent Hy3-preview model, enabling it to run in 256 GB of VRAM with full context.
BeeLlama.cpp is a performance-focused fork of llama.cpp that introduces DFlash speculative decoding and TurboQuant KV-cache compression, enabling high-speed local inference of large models like Qwen 3.6 27B on consumer hardware.
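For background on the draft-and-verify idea behind speculative decoding features like DFlash, here is a generic Python sketch; it is not BeeLlama.cpp's implementation, whose internals the post does not describe, and `draft_model`/`target_model` are hypothetical greedy next-token callables.

```python
# Generic draft-and-verify speculative decoding sketch, for background only.
# draft_model and target_model are assumed callables: context -> token id.
def speculative_step(prefix, draft_model, target_model, k=4):
    """Draft k tokens cheaply, then keep the longest prefix the target agrees with."""
    ctx = list(prefix)
    drafted = []
    for _ in range(k):
        tok = draft_model(ctx)                 # cheap draft model proposes a token
        drafted.append(tok)
        ctx.append(tok)

    # Verification: a real implementation scores all k drafted positions with a
    # single batched target forward pass; a plain loop is used here for clarity.
    accepted = []
    ctx = list(prefix)
    for tok in drafted:
        target_tok = target_model(ctx)
        if target_tok == tok:                  # target agrees: keep the drafted token
            accepted.append(tok)
            ctx.append(tok)
        else:                                  # disagreement: take the target's token, stop
            accepted.append(target_tok)
            break
    return accepted
```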
A user shares a configuration for achieving over 80 tokens per second with Qwen3.6 35B A3B on a 12 GB VRAM GPU using llama.cpp and Multi-Token Prediction (MTP). The post includes benchmark results and specific command-line parameters to optimize performance.
TRL v1.4 is released, featuring chunked NLL (negative log-likelihood) loss for SFT to reduce VRAM usage, as well as first-class integration with OpenReward for GRPO.
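As background on the chunking idea, here is a minimal PyTorch sketch that computes NLL over sequence slices so the full [batch, seq, vocab] logits tensor never materializes at once; this shows the general technique, not TRL's exact implementation, and `lm_head`, the chunk size, and pre-shifted labels are assumptions.

```python
# Chunked NLL sketch: project to vocabulary logits one sequence slice at a
# time, so peak memory scales with the chunk size rather than the full length.
import torch
import torch.nn.functional as F

def chunked_nll(hidden, lm_head, labels, chunk=512):
    """hidden: [B, T, D] final hidden states; labels: [B, T] ids, -100 = ignore."""
    total = hidden.new_zeros(())
    count = 0
    for s in range(0, hidden.size(1), chunk):
        logits = lm_head(hidden[:, s:s + chunk])   # only [B, <=chunk, V] lives at once
        tgt = labels[:, s:s + chunk]
        total = total + F.cross_entropy(
            logits.flatten(0, 1), tgt.flatten(),
            reduction="sum", ignore_index=-100,
        )
        count += int((tgt != -100).sum())          # tokens that contribute to the loss
    return total / max(count, 1)                   # mean over non-ignored tokens
```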
This paper introduces TwELL and Hybrid sparse formats with custom CUDA kernels to efficiently leverage unstructured sparsity in LLMs, achieving over 20% faster training and inference on H100 GPUs while reducing energy and memory usage.
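For background on ELL-style layouts referenced by names like TwELL, here is a textbook ELLPACK packing and mat-vec sketch in NumPy; it shows only the classic padded fixed-width format, not the paper's TwELL or Hybrid variants or its CUDA kernels, and the matrices are illustrative.

```python
# Classic ELLPACK (ELL) sparse layout: each row stores a fixed number of
# nonzero values plus their column indices, padded with zeros to the widest row.
import numpy as np

def to_ell(dense):
    """Pack a sparse matrix into ELL: fixed-width per-row values + column indices."""
    width = int((dense != 0).sum(axis=1).max())        # pad every row to the widest
    vals = np.zeros((dense.shape[0], width), dense.dtype)
    cols = np.zeros((dense.shape[0], width), np.int64)
    for r in range(dense.shape[0]):
        idx = np.nonzero(dense[r])[0]
        vals[r, :len(idx)] = dense[r, idx]
        cols[r, :len(idx)] = idx
    return vals, cols

def ell_matvec(vals, cols, x):
    # Padding stores zero values, so a branch-free gather-multiply-sum is correct.
    return (vals * x[cols]).sum(axis=1)

a = np.array([[0.0, 2.0, 0.0], [1.0, 0.0, 3.0]])
vals, cols = to_ell(a)
print(ell_matvec(vals, cols, np.array([1.0, 1.0, 1.0])))   # [2. 4.]
```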
vLLM v0.20.0 is released, an open-source library for high-throughput LLM inference and serving, featuring PagedAttention and support for various hardware architectures.
Researchers from MIT and IBM have developed a rapid tool that estimates AI power consumption in seconds, significantly faster than traditional emulation methods, to help optimize data center energy efficiency.
DeepSeek open-sourced DeepEP V2 and TileKernels, new GPU kernel libraries aimed at accelerating AI workloads.
vLLM v0.20.0rc1 is released with major throughput, quantization, speculative decoding, and multi-hardware support enhancements for scalable LLM serving.
A 31B-parameter model runs locally on a laptop via the Hermes agent at 15 tok/s, using 22.8 GB of VRAM and 94 W of power, highlighting fully autonomous, private AI inference without cloud dependencies.
NVIDIA and Google collaborate to optimize Gemma 4 models for local deployment across RTX GPUs, DGX Spark, and Jetson devices, enabling efficient on-device agentic AI with support for reasoning, coding, multimodal capabilities, and 35+ languages.