memory-bandwidth

#memory-bandwidth

@Alacritic_Super: The biggest bottleneck in LLM inference isn't arithmetic but it's moving data. A single multiply-accumulate operation i…

X AI KOLs Timeline ↗ · 2026-07-15 Cached

An educational thread explaining that the main bottleneck in LLM inference is data movement, not computation, and highlighting techniques like quantization, KV cache optimization, and FlashAttention to reduce memory traffic.

Unified Memory, Explained: Why Mini PCs Can Run 70B Models a Big GPU Can't

Hacker News Top ↗ · 2026-07-10 Cached

Explains how unified memory lets mini PCs run 70B-parameter models that exceed the VRAM capacity of high-end GPUs, though at slower speeds because of their lower memory bandwidth.

@Alacritic_Super: If you are serious about LLM inference, study FlashAttention. It's one of the most important optimizations behind moder…

X AI KOLs Timeline ↗ · 2026-07-08 Cached

A tweet recommending that anyone working on LLM inference study FlashAttention, highlighting its role in reducing GPU memory traffic and speeding up attention, with links to the GitHub repository and the papers for FlashAttention, FlashAttention-2, and FlashAttention-3.

@smolix: Here's part 1 (of 5) of my short course on efficient LLM inference that I taught at Columbia University. Slides are hea…

X AI KOLs Timeline ↗ · 2026-07-02 Cached

Part 1 of a 5-part short course on efficient LLM inference taught at Columbia University. Covers hardware bottlenecks, GPU memory bandwidth limits, and techniques like model compression and KV cache optimization to reduce inference cost.

@RayFernando1337: https://x.com/RayFernando1337/status/2070621713952579990

X AI KOLs Following ↗ · 2026-06-26 Cached

A detailed analysis of whether to run AI models locally or via API, covering hardware options such as the RTX 5090, RTX PRO 6000, and DGX Spark, with emphasis on memory capacity vs. bandwidth trade-offs, cost considerations, and privacy needs.

@TheAhmadOsman: Local AI hardware = capacity × bandwidth × software stack - Capacity tells you what fits - Bandwidth tells you how hard…

X AI KOLs Following ↗ · 2026-06-21 Cached

A detailed comparison of local AI hardware in terms of memory capacity, bandwidth, and software stack, covering GPUs, Apple Silicon, AMD, Intel, Tenstorrent, and others, with a focus on what bottlenecks matter for AI inference.

Threshold-Based Exclusive Batching for LLM Inference

arXiv cs.AI ↗ · 2026-06-02 Cached

This paper analyzes the trade-off between mixed batching and exclusive batching for LLM inference, showing that the optimal choice depends on GPU memory bandwidth. It proposes a threshold-based hybrid scheduler that dynamically switches between the two methods, achieving up to 41.9% higher throughput on bandwidth-constrained GPUs.

RTX Spark will have up to 600GB/s of memory bandwidth.

Reddit r/LocalLLaMA ↗ · 2026-06-01

NVIDIA's upcoming RTX Spark GPU is reported to feature up to 600 GB/s of memory bandwidth, double that of the DGX Spark, using 128 GB of LPDDR5X RAM.

We might have a winner with the upcoming N1X

Reddit r/LocalLLaMA ↗ · 2026-05-31

A leak reveals details about Nvidia's upcoming N1X and N1 processors, including 16-channel DDR5 memory support with bandwidth exceeding 500 GB/s.

Real-time LLM Inference on Standard GPUs: 3k tokens/s per request

Hacker News Top ↗ · 2026-05-29 Cached

Kog AI launches a tech preview of the Kog Inference Engine, achieving 3,000 tokens/s per request on standard datacenter GPUs by co-designing model architecture, runtime, and low-level GPU code, targeting latency-critical AI agent workflows.

Memory-Bound but Not Bandwidth-Limited: The Physical AI Inference Gap in Batch-1 LLM Decode

Hugging Face Daily Papers ↗ · 2026-05-28 Cached

This paper investigates the performance gap in batch-1 LLM decode for physical AI systems, finding that, because of launch overheads, higher memory bandwidth does not reduce latency proportionally, and that quantization efficiency varies significantly across hardware.

@rohanpaul_ai: Chamath on all important “prefill” and “decode.” in AI compute. Prefill is compute-bound; massive parallel GPUs win, so…

X AI KOLs Following ↗ · 2026-05-24 Cached

Chamath explains the two key phases of AI compute: prefill, which is compute-bound and favors parallel GPUs like Nvidia's, and decode, which is memory-bandwidth bound and depends on scanning previously generated tokens.

Memory

Reddit r/artificial ↗ · 2026-05-24

Explains why LLM inference is increasingly memory-bandwidth bound due to the KV cache scaling with context length and concurrent users, and how systems like vLLM and PagedAttention improve memory utilization.

Making Deep Learning Go Brrrr from First Principles

Hacker News Top ↗ · 2026-05-23 Cached

A comprehensive blog post explaining how to optimize deep learning performance by understanding three key components: compute, memory bandwidth, and overhead, using first principles to identify the performance regime and focus on effective optimizations.

CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs

Hacker News Top ↗ · 2026-05-22 Cached

Introduces CODA, a GPU kernel abstraction that expresses Transformer operations as GEMM-plus-epilogue programs to reduce data movement, covering nearly all non-attention computation in a Transformer block.

Ran the same models across Strix Halo, RTX 3090, and RTX 5070 because I wanted my own numbers

Reddit r/LocalLLaMA ↗ · 2026-05-16

The author ran 55 inference benchmarks across Strix Halo, RTX 3090, and RTX 5070 with multiple backends, finding that memory bandwidth dominates decode speed, that the RTX 5070 beats the 3090 on small models, and that reasoning models appear ~5x slower because of hidden reasoning content.

@cHHillee: In modern ML accelerators, FLOPS have absolutely exploded. Often though, the bottleneck is not FLOPS but memory bandwid…

X AI KOLs Following ↗ · 2026-05-11 Cached

Thinky identifies human-to-AI bandwidth as a growing bottleneck akin to memory bandwidth issues in ML accelerators, proposing solutions to address this limitation.
