memory-bandwidth

#memory-bandwidth

Threshold-Based Exclusive Batching for LLM Inference

arXiv cs.AI · 2d ago

This paper analyzes the trade-off between mixed batching and exclusive batching for LLM inference, showing that the optimal choice depends on GPU memory bandwidth. It proposes a threshold-based hybrid scheduler that dynamically switches between the two methods, achieving up to 41.9% higher throughput on bandwidth-constrained GPUs.

RTX Spark will have up to 600GB/s of memory bandwidth.

Reddit r/LocalLLaMA · 2d ago

NVIDIA's upcoming RTX Spark GPU is reported to offer up to 600 GB/s of memory bandwidth, roughly double that of the DGX Spark, with 128 GB of LPDDR5X RAM.

We might have a winner with the upcoming N1X

Reddit r/LocalLLaMA · 3d ago

A leak reveals details about Nvidia's upcoming N1X and N1 processors, including 16-channel DDR5 memory support with bandwidth exceeding 500 GB/s.

Real-time LLM Inference on Standard GPUs: 3k tokens/s per request

Hacker News Top · 6d ago

Kog AI launches a tech preview of the Kog Inference Engine, achieving 3,000 tokens/s per request on standard datacenter GPUs by co-designing model architecture, runtime, and low-level GPU code, targeting latency-critical AI agent workflows.

Memory-Bound but Not Bandwidth-Limited: The Physical AI Inference Gap in Batch-1 LLM Decode

Hugging Face Daily Papers · 2026-05-28

This paper investigates the performance gap in batch-1 LLM decode for physical AI systems. It finds that higher memory bandwidth does not reduce latency proportionally because of launch overheads, and that quantization efficiency varies significantly across hardware.

@rohanpaul_ai: Chamath on all important “prefill” and “decode.” in AI compute. Prefill is compute-bound; massive parallel GPUs win, so…

X AI KOLs Following · 2026-05-24

Chamath explains the two key phases of AI compute: prefill, which is compute-bound and favors parallel GPUs like Nvidia's, and decode, which is memory-bandwidth bound and depends on scanning previously generated tokens.
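
As a rough illustration of why decode is memory-bandwidth bound, the back-of-envelope sketch below estimates an upper bound on batch-1 decode speed. It assumes every generated token requires streaming all model weights from memory once, and it ignores KV-cache reads and launch overheads. The function name and the example figures (an 8B-parameter model at 2 bytes per parameter on 600 GB/s of bandwidth) are illustrative assumptions, not numbers from the post.

```python
def decode_tokens_per_second_upper_bound(param_count, bytes_per_param, bandwidth_bytes_per_s):
    """Upper bound on batch-1 decode speed.

    Assumption: each generated token reads every weight exactly once,
    so time per token >= weight_bytes / memory_bandwidth.
    """
    weight_bytes = param_count * bytes_per_param
    return bandwidth_bytes_per_s / weight_bytes


if __name__ == "__main__":
    # Hypothetical example: 8e9 parameters, 2 bytes each, 600e9 bytes/s.
    bound = decode_tokens_per_second_upper_bound(8e9, 2, 600e9)
    print(f"{bound:.1f} tokens/s")  # prints "37.5 tokens/s"
```

Real decode throughput would sit below this bound; the estimate only shows how the ceiling scales with bandwidth and model size.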

Memory

Reddit r/artificial · 2026-05-24

Explains why LLM inference is increasingly memory-bandwidth bound due to the KV cache scaling with context length and concurrent users, and how systems like vLLM and PagedAttention improve memory utilization.
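
To make the scaling concrete, here is a minimal sketch of the standard KV cache size estimate for a decoder-only Transformer. The function name and the example model shape (32 layers, 8 KV heads, head dimension 128, 16-bit elements) are hypothetical and not taken from the post.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """Approximate KV cache size in bytes.

    The leading factor of 2 accounts for storing one key tensor and one
    value tensor per layer. Size grows linearly with both context length
    (seq_len) and the number of concurrent sequences (batch_size).
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical model shape, 4096-token context.
    one_user = kv_cache_bytes(32, 8, 128, seq_len=4096, batch_size=1)
    sixteen_users = kv_cache_bytes(32, 8, 128, seq_len=4096, batch_size=16)
    print(one_user / 2**20, "MiB")       # 512.0 MiB
    print(sixteen_users / 2**30, "GiB")  # 8.0 GiB
```

Because every decode step has to read this cache, its size adds directly to the bytes moved per generated token.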

Making Deep Learning Go Brrrr from First Principles

Hacker News Top · 2026-05-23

A comprehensive blog post explaining how to optimize deep learning performance by understanding three key components: compute, memory bandwidth, and overhead, using first principles to identify the performance regime and focus on effective optimizations.
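
One common way to identify the regime is a roofline-style comparison of an operation's arithmetic intensity (FLOPs per byte moved) against the hardware's ratio of peak compute to memory bandwidth. The sketch below is a simplified illustration with hypothetical hardware numbers, not code from the blog post; it distinguishes only compute from memory bandwidth and does not model overhead.

```python
def regime(flops, bytes_moved, peak_flops_per_s, bandwidth_bytes_per_s):
    """Classify an operation as compute-bound or memory-bandwidth-bound."""
    intensity = flops / bytes_moved                   # FLOPs per byte for this op
    ridge = peak_flops_per_s / bandwidth_bytes_per_s  # FLOPs per byte the hardware can sustain
    return "compute-bound" if intensity >= ridge else "memory-bandwidth-bound"


if __name__ == "__main__":
    # Hypothetical accelerator: 100 TFLOP/s peak, 1 TB/s memory bandwidth.
    peak, bw = 100e12, 1e12

    # Elementwise op on n fp32 values: 1 FLOP per element, 4 bytes read + 4 written.
    n = 1_000_000
    print(regime(n, 8 * n, peak, bw))  # memory-bandwidth-bound

    # Square fp16 matmul: 2*m^3 FLOPs, three m x m matrices at 2 bytes per element.
    m = 4096
    print(regime(2 * m**3, 3 * m * m * 2, peak, bw))  # compute-bound
```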

CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs

Hacker News Top · 2026-05-22

Introduces CODA, a GPU kernel abstraction that expresses Transformer operations as GEMM-plus-epilogue programs to reduce data movement, covering nearly all non-attention computation in a Transformer block.

Ran the same models across Strix Halo, RTX 3090, and RTX 5070 because I wanted my own numbers

Reddit r/LocalLLaMA · 2026-05-16

The author performed 55 inference benchmark runs across Strix Halo, RTX 3090, and RTX 5070 with multiple backends. The results show that memory bandwidth dominates decode speed, that the RTX 5070 beats the 3090 on small models, and that reasoning models appear ~5x slower because of hidden reasoning content.

@cHHillee: In modern ML accelerators, FLOPS have absolutely exploded. Often though, the bottleneck is not FLOPS but memory bandwid…

X AI KOLs Following · 2026-05-11

Thinky identifies human-to-AI bandwidth as a growing bottleneck akin to memory bandwidth issues in ML accelerators, proposing solutions to address this limitation.

Memory Bandwidth for Local AI Hardware (2026 Edition)

X AI KOLs · 2026-05-25

The article breaks down memory bandwidth as the critical metric for local AI hardware performance, comparing current GPUs and unified memory systems from NVIDIA, Apple, AMD, Intel, and others across different performance tiers.

https://www.youtube.com/watch?v=aE0onltJlOo

YouTube AI Channels · 2026-05-21

This lecture traces the evolution of GPU architecture as a SIMD (vector/array) processor. It covers data parallelism, memory bank grouping, bank conflicts, serial bottlenecks, and the history of SIMD instructions (such as MMX), with emphasis on how GPUs exploit data parallelism and cope with serial bottlenecks.
