low-precision

#low-precision

@ZhihuFrontier: GPU programming changed because Tensor Cores became too fast to feed. Zhihu contributor THU-PACMAN实验室 (THU-PACMAN Lab) shared a sharp bre…

X AI KOLs Timeline · 2d ago

A detailed analysis of how NVIDIA GPU programming evolved from Volta to Blackwell, highlighting the shift from synchronous thread models to asynchronous dataflow and the challenges of feeding Tensor Cores. The article discusses new hardware features like TMA, TMEM, and tcgen05 MMA, and shows how modern kernels like FlashAttention-3 and FlashMLA exploit these changes for higher utilization.

dMX: Differentiable Mixed-Precision Assignment for Low-Precision Floating-Point Formats

arXiv cs.LG · 2026-06-04

dMX is a differentiable mixed-precision quantization framework that learns optimal floating-point bit-width assignments per layer for LLMs, targeting the MXFP family of formats defined by the OCP standard. It uses continuous optimization with temperature-based annealing and a budget-aware regularization term, consistently outperforming KL-divergence heuristics on Llama, Qwen3, and SmolLM2 models.
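
To make the idea of a temperature-annealed, budget-regularized assignment concrete, here is a minimal sketch. It is not the dMX implementation: the candidate widths, the function names, and the squared-hinge form of the penalty are all assumptions chosen for illustration. It only shows how a softmax over per-layer logits can relax a discrete bit-width choice, and how lowering the temperature pushes that relaxation toward a hard choice.

```python
import math

# Assumed candidate element widths (MXFP4 / MXFP6 / MXFP8); illustrative only.
CANDIDATE_BITS = [4, 6, 8]

def soft_assignment(logits, temperature):
    """Softmax over candidate formats; lower temperature gives a sharper choice."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def expected_bits(logits, temperature):
    """Expected bit-width of one layer under the soft assignment."""
    probs = soft_assignment(logits, temperature)
    return sum(p * b for p, b in zip(probs, CANDIDATE_BITS))

def budget_penalty(layer_logits, temperature, target_avg_bits):
    """Hypothetical regularizer: penalize only when the average exceeds the budget."""
    avg = sum(expected_bits(l, temperature) for l in layer_logits) / len(layer_logits)
    return max(0.0, avg - target_avg_bits) ** 2
```

In a training loop, a term like `budget_penalty` would be added to the task loss while `temperature` is annealed toward zero, so that each layer ends up committed to a single format.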

Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs

arXiv cs.CL · 2026-05-21

Mix-Quant proposes a phase-aware quantization framework for agentic LLMs, using NVFP4 quantization for the prefilling stage to accelerate computation while preserving BF16 precision for decoding to maintain accuracy. The method achieves up to 3x speedup in prefilling with minimal performance degradation on agentic benchmarks.
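
As a rough illustration of what 4-bit block-scaled quantization does to a tensor, the sketch below rounds each block of values to the FP4 (E2M1) magnitude grid after dividing by a per-block scale. It is a simplification and not the Mix-Quant code: the scale is kept as a plain Python float (the real NVFP4 format stores block scales in a low-precision format), the block size of 16 is a default argument, and the function names are hypothetical.

```python
import math

# Non-negative magnitudes representable in FP4 E2M1.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def fake_quant_block(block):
    """Quantize then dequantize one block, using a full-precision scale (assumption)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return list(block)
    scale = amax / 6.0  # map the largest magnitude onto the top of the grid
    out = []
    for x in block:
        # Nearest grid point; ties go to the smaller magnitude, not round-to-even.
        mag = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(mag * scale, x))
    return out

def fake_quant_fp4(values, block_size=16):
    """Apply block-wise fake quantization to a flat list of floats."""
    out = []
    for i in range(0, len(values), block_size):
        out.extend(fake_quant_block(values[i:i + block_size]))
    return out
```

Under the phase-aware scheme described above, a routine like this would apply only to prefill computation, with decoding left in BF16.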

@charles_irl: another page for the @modal LLMEng Almanac: an explorer for low-precision floats, from bf16 to fp4 https://modal.com/ll…

X AI KOLs Following · 2026-05-18

A page from Modal's LLM Engineer's Almanac that provides an interactive explorer for understanding low-precision floating-point formats like bf16 and fp4.
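
For readers who want to poke at these formats without the interactive page, a small decoder is enough to enumerate them. The sketch below is unrelated to Modal's explorer; `decode_minifloat` is a hypothetical helper that interprets a sign/exponent/mantissa bit pattern and deliberately ignores inf/NaN encodings, so for bf16 it is only meaningful for finite values.

```python
def decode_minifloat(bits: int, exp_bits: int, man_bits: int, bias: int) -> float:
    """Decode a sign/exponent/mantissa bit pattern; inf/NaN encodings are not handled."""
    sign = -1.0 if (bits >> (exp_bits + man_bits)) & 1 else 1.0
    exp = (bits >> man_bits) & ((1 << exp_bits) - 1)
    man = bits & ((1 << man_bits) - 1)
    frac = man / (1 << man_bits)
    if exp == 0:
        # Subnormal: no implicit leading 1, exponent fixed at 1 - bias.
        return sign * frac * 2.0 ** (1 - bias)
    return sign * (1.0 + frac) * 2.0 ** (exp - bias)

# All 16 FP4 (E2M1, bias 1) code points.
FP4_E2M1_VALUES = [decode_minifloat(b, exp_bits=2, man_bits=1, bias=1) for b in range(16)]

# bf16 has 8 exponent bits, 7 mantissa bits, bias 127.
BF16_ONE = decode_minifloat(0x3F80, exp_bits=8, man_bits=7, bias=127)
```

The first eight entries of `FP4_E2M1_VALUES` are the non-negative values 0, 0.5, 1, 1.5, 2, 3, 4, 6; the remaining eight are their negatives. That gap between 16 code points and bf16's 65,536 is the trade-off such an explorer visualizes.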
