@charles_irl: another page for the @modal LLMEng Almanac: an explorer for low-precision floats, from bf16 to fp4 https://modal.com/ll…

X AI KOLs Following 05/18/26, 03:58 PM Tools

low-precision floats bf16 fp4 quantization llm-engineering modal

Summary

A page from Modal's LLM Engineer's Almanac that provides an interactive explorer for understanding low-precision floating-point formats like bf16 and fp4.

another page for the @modal LLMEng Almanac: an explorer for low-precision floats, from bf16 to fp4 https://t.co/yOgLrOFNOY https://t.co/w2u1ND5AQi

Original Article

Cached at: 05/18/26, 04:34 PM

LLM Engineer’s Almanac - Quant Formats

Source: https://modal.com/llm-almanac/quant-formats/e4::0x38

Value

Bit Pattern

Sign

Exponent

Significand

Raw Hexadecimal Integer Value

Raw Decimal Integer Value

Hexadecimal Form (“%a”)

Evaluation in Base-2

$(-1)^{0} \times 10_2^{\,0111_2 - 0111_2} \times 1.000_2$

Evaluation in Base-10

$1 \times 2^{0} \times 1$

Exact Base-10 Value
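
The evaluation above follows the usual sign/exponent/significand recipe, which can be sketched in a few lines of Python. This is a minimal sketch, not the explorer's own code: the function name `decode_minifloat` is hypothetical, and the layout (1 sign bit, 4 exponent bits, 3 significand bits, exponent bias 7) is an assumption inferred from the `0111` exponent field and the `1.000` significand shown for `0x38`. Special encodings such as NaN or infinity are deliberately ignored.

```python
def decode_minifloat(bits: int, exp_bits: int = 4, man_bits: int = 3, bias: int = 7) -> float:
    """Decode a small float bit pattern into a Python float.

    Assumed layout (not confirmed by the source): [sign | exponent | significand].
    NaN/Inf encodings are not handled; every pattern is treated as a finite number.
    """
    sign = (bits >> (exp_bits + man_bits)) & 0x1
    exponent = (bits >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = bits & ((1 << man_bits) - 1)

    if exponent == 0:
        # Subnormal: no implicit leading 1, exponent fixed at 1 - bias.
        significand = mantissa / (1 << man_bits)
        scale = 2.0 ** (1 - bias)
    else:
        # Normal: implicit leading 1, exponent stored with a bias.
        significand = 1 + mantissa / (1 << man_bits)
        scale = 2.0 ** (exponent - bias)

    return (-1) ** sign * significand * scale


# 0x38 = 0 0111 000: sign 0, exponent 0111 (7), significand bits 000
print(decode_minifloat(0x38))  # 1.0
```

Under these assumptions the pattern `0x38` decodes to (-1)^0 × 2^(7 - 7) × 1.0, matching the base-10 evaluation 1 × 2^0 × 1. Other layouts can be tried by changing `exp_bits`, `man_bits`, and `bias`.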

Similar Articles

@charles_irl: Added a fun lil widget to the LLM Engineer's Almanac -- a "Token Timing Simulator" so you can get a visceral feel for w…

X AI KOLs Following

A token timing simulator widget was added to the LLM Engineer's Almanac, demonstrating the DFlash technique achieving ~1k TPS, to help users viscerally understand benchmark performance numbers.

@charles_irl: more vibe checks available from your friendly local lunatics at r/localllama https://reddit.com/r/LocalLLaMA/s/vqBVXvIT…

X AI KOLs Following

Modal announces day 0 support for Step 3.7 Flash, a 198B parameter MoE model with 256K context and native image/video understanding.

@charles_irl: Added a smol new section to last week's blog post on the technical internals of @modal's fast cold boots. This section …

X AI KOLs Following

Modal explains how it reduces AI inference cold starts by 40x using cloud buffers, a custom filesystem, checkpoint/restore, and CUDA checkpoint/restore, framing cloud buffer management as a linear optimization problem solved with GLOP.

dMX: Differentiable Mixed-Precision Assignment for Low-Precision Floating-Point Formats

arXiv cs.LG

dMX is a differentiable mixed-precision quantization framework that learns optimal floating-point bit-width assignments per layer for LLMs, targeting the MXFP family of formats defined by the OCP standard. It uses continuous optimization with temperature-based annealing and a budget-aware regularization term, consistently outperforming KL-divergence heuristics on Llama, Qwen3, and SmolLM2 models.

MLX 16/8/4/2-bit quants of nvidia/llama-embed-nemotron-8b

Reddit r/LocalLLaMA

The user converted Nvidia's Llama-Embed-Nemotron-8B model to MLX format with fp16, 8-bit, 4-bit, and 2-bit quantizations, enabling in-process embedding loading on Apple Silicon via mlx-embeddings.
