Tag
This paper analyzes the trade-off between mixed batching and exclusive batching for LLM inference, showing that the optimal choice depends on GPU memory bandwidth. It proposes a threshold-based hybrid scheduler that dynamically switches between the two methods, achieving up to 41.9% higher throughput on bandwidth-constrained GPUs.
NVIDIA's upcoming RTX Spark GPU is reported to feature up to 600 GB/s memory bandwidth, double that of the DGX Spark, using 128 GB of LPDDR5X RAM.
A leak reveals details about NVIDIA's upcoming N1X and N1 processors, including 16-channel DDR5 memory support with bandwidth exceeding 500 GB/s.
Kog AI launches a tech preview of the Kog Inference Engine, achieving 3,000 tokens/s per request on standard datacenter GPUs by co-designing model architecture, runtime, and low-level GPU code, targeting latency-critical AI agent workflows.
This paper investigates the performance gap in batch-1 LLM decode for physical AI systems, finding that faster memory bandwidth does not proportionally reduce latency due to launch overheads, and that quantization efficiency varies significantly across hardware.
Chamath explains the two key phases of AI compute: prefill, which is compute-bound and favors parallel GPUs like NVIDIA's, and decode, which is memory-bandwidth-bound and depends on scanning previously generated tokens.
Explains why LLM inference is increasingly memory-bandwidth-bound as the KV cache scales with context length and concurrent users, and how vLLM and its PagedAttention technique improve memory utilization.
A comprehensive blog post explaining how to optimize deep learning performance by understanding three key components: compute, memory bandwidth, and overhead, using first principles to identify the performance regime and focus on effective optimizations.
Introduces CODA, a GPU kernel abstraction that expresses Transformer operations as GEMM-plus-epilogue programs to reduce data movement, covering nearly all non-attention computation in a Transformer block.
The author ran 55 inference benchmarks across Strix Halo, RTX 3090, and RTX 5070 with multiple backends, finding that memory bandwidth dominates decode speed, that the RTX 5070 beats the 3090 on small models, and that reasoning models appear ~5x slower because of hidden reasoning content.
Thinky identifies human-to-AI bandwidth as a growing bottleneck akin to memory bandwidth issues in ML accelerators, proposing solutions to address this limitation.
The article breaks down memory bandwidth as the critical metric for local AI hardware performance, comparing current GPUs and unified memory systems from NVIDIA, Apple, AMD, Intel, and others across different performance tiers.
This lecture introduces GPU architecture as a flexible evolution of the SIMD (vector/array) processor. It covers data parallelism, memory banking and bank conflicts, and the history of SIMD instructions (such as MMX), with emphasis on how GPUs exploit data parallelism and cope with serial bottlenecks.
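Several of the summaries above rely on the same back-of-envelope reasoning: batch-1 decode has to read every model weight once per generated token, so memory bandwidth divided by weight size gives a rough ceiling on tokens per second, and the KV cache grows linearly with context length and concurrent requests. The Python sketch below illustrates both estimates. The function names and all model dimensions are hypothetical, chosen only for illustration. The ceiling ignores KV cache reads and kernel launch overheads, which is one reason measured latency does not scale proportionally with bandwidth.

```python
def decode_tokens_per_second_ceiling(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Rough upper bound on batch-1 decode speed.

    Assumes each generated token requires reading all weights from memory once
    and that nothing else (KV cache reads, launch overhead) costs time.
    """
    return bandwidth_gb_s / weights_gb


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_len: int, batch: int, bytes_per_elem: int = 2) -> int:
    """Size of the KV cache in bytes.

    The leading factor of 2 accounts for storing one key tensor and one
    value tensor per layer; bytes_per_elem=2 corresponds to 16-bit values.
    """
    return 2 * layers * kv_heads * head_dim * context_len * batch * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical example: 600 GB/s of bandwidth and 8 GB of weights.
    print(decode_tokens_per_second_ceiling(600, 8))

    # Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128,
    # an 8192-token context, one concurrent request, 16-bit cache entries.
    print(kv_cache_bytes(32, 8, 128, 8192, 1) / 2**30, "GiB")
```

Doubling either the context length or the number of concurrent requests doubles the cache size in this model, which is the pressure that paged KV cache allocation is meant to relieve.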