This paper argues that using FP8 tensor cores with Ozaki Scheme II can replace native FP64 hardware for high-performance scientific computing on AI-optimized GPUs like NVIDIA's B300, achieving full double-precision accuracy at much higher throughput. The authors present a Tensor-Memory Equilibrium model and show that emulated FP64 performance can exceed native FP64 by orders of magnitude across all workloads.
A tweet shares a structured reference of 450 papers on GPU optimization spanning 14 years, noting that while some techniques evolve, the mental models remain useful. It also references a lecture on GPU architectures by Onur Mutlu.
Q.ANT has opened a US facility in Austin and appointed Bruno Spruth as CTO. The announcement highlights production of photonic GPUs at the Leibniz Supercomputing Centre, which the company claims deliver significant performance and energy-efficiency gains over traditional transistor-based chips.
Modal Labs has released an open-source, interlinked GPU glossary that consolidates fragmented NVIDIA documentation, CUDA details, and compiler flags into a single navigable resource for engineers optimizing LLM training and inference.
This lecture introduces GPU architecture as an evolution of the SIMD (vector/array) processor. It covers data parallelism, memory banking and bank conflicts, and the history of SIMD instruction sets (such as MMX), emphasizing how GPUs exploit data parallelism and cope with serial bottlenecks.
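The bank conflicts mentioned in the lecture summary can be illustrated with a minimal sketch. It is not taken from the lecture. It assumes a memory split into 32 banks with low-order interleaving (word address `a` lives in bank `a % 32`), and it assumes that lanes reading the same word do not conflict. The function name `bank_conflict_degree` and the bank count are hypothetical choices for illustration.

```python
from collections import Counter


def bank_conflict_degree(addresses, num_banks=32):
    """Return the largest number of distinct word addresses hitting one bank.

    A result of 1 means every bank serves at most one word (no conflict);
    a result of k means some bank must serve k words one after another.
    """
    # Assumption: low-order interleaving, so word address a maps to bank a % num_banks.
    # Duplicate addresses are collapsed first (assumed to be served by one access).
    per_bank = Counter(a % num_banks for a in set(addresses))
    return max(per_bank.values(), default=0)


# 32 parallel lanes, each reading one word at a fixed stride.
stride_1 = [lane * 1 for lane in range(32)]    # consecutive words
stride_2 = [lane * 2 for lane in range(32)]    # every other word
stride_32 = [lane * 32 for lane in range(32)]  # one word per "row"

for name, addrs in [("stride 1", stride_1), ("stride 2", stride_2), ("stride 32", stride_32)]:
    print(name, "->", bank_conflict_degree(addrs))
```

Under these assumptions, a stride of 1 spreads the accesses over all banks, a stride of 2 puts two words on each even-numbered bank, and a stride equal to the bank count sends every access to the same bank, which serializes them.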