@charles_irl: another page for the @modal LLMEng Almanac: an explorer for low-precision floats, from bf16 to fp4 https://modal.com/ll…
Summary
A page from Modal's LLM Engineer's Almanac that provides an interactive explorer for understanding low-precision floating-point formats like bf16 and fp4.
Cached at: 05/18/26, 04:34 PM
another page for the @modal LLMEng Almanac: an explorer for low-precision floats, from bf16 to fp4
https://t.co/yOgLrOFNOY https://t.co/w2u1ND5AQi
LLM Engineer’s Almanac - Quant Formats
Source: https://modal.com/llm-almanac/quant-formats/e4::0x38
Value
Bit Pattern
Sign
Exponent
Significand
Raw Hexadecimal Integer Value
Raw Decimal Integer Value
Hexadecimal Form (“%a”)
Evaluation in Base-2
$(-1)^{0} \times 10_2^{\,0111_2 - 0111_2} \times 1.000_2$
Evaluation in Base-10
$1 \times 2^{0} \times 1$
Exact Base-10 Value
1
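The evaluation above can be reproduced by hand in a few lines of Python. This is a minimal sketch, not the explorer's own code: it assumes an 8-bit layout with 1 sign bit, 4 exponent bits, 3 significand bits, and an exponent bias of 7, which is what the `0111_2 - 0111_2` term and the three fractional digits of `1.000_2` suggest. The function name `decode_minifloat` and its parameters are hypothetical, and NaN/Inf encodings are deliberately ignored.

```python
def decode_minifloat(bits: int, exp_bits: int = 4, man_bits: int = 3, bias: int = 7) -> float:
    """Decode a sign/exponent/significand bit pattern into a Python float.

    Assumed layout (not confirmed by the source): sign | exponent | significand.
    Special values (NaN/Inf) are NOT handled; every pattern is treated as a number.
    """
    sign = (bits >> (exp_bits + man_bits)) & 1
    exponent = (bits >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = bits & ((1 << man_bits) - 1)

    if exponent == 0:
        # Subnormal: no implicit leading 1, exponent pinned at 1 - bias.
        significand = mantissa / (1 << man_bits)
        scale = 2.0 ** (1 - bias)
    else:
        # Normal: implicit leading 1, exponent is the stored field minus the bias.
        significand = 1 + mantissa / (1 << man_bits)
        scale = 2.0 ** (exponent - bias)

    return (-1) ** sign * significand * scale


value = decode_minifloat(0x38)  # 0x38 = 0 0111 000
print(value)        # 1.0
print(value.hex())  # Python's equivalent of the C "%a" hexadecimal form
```

Passing different `exp_bits`, `man_bits`, and `bias` values lets the same sketch decode other sign/exponent/significand layouts, under the same caveat about special values.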
Similar Articles
@charles_irl: Added a fun lil widget to the LLM Engineer's Almanac -- a "Token Timing Simulator" so you can get a visceral feel for w…
A token timing simulator widget was added to the LLM Engineer's Almanac, demonstrating the DFlash technique achieving ~1k TPS, to help users viscerally understand benchmark performance numbers.
@charles_irl: more vibe checks available from your friendly local lunatics at r/localllama https://reddit.com/r/LocalLLaMA/s/vqBVXvIT…
Modal announces day 0 support for Step 3.7 Flash, a 198B parameter MoE model with 256K context and native image/video understanding.
@charles_irl: Added a smol new section to last week's blog post on the technical internals of @modal's fast cold boots. This section …
Modal explains how it reduces AI inference cold starts by 40x using cloud buffers, a custom filesystem, checkpoint/restore, and CUDA checkpoint/restore, framing cloud buffer management as a linear optimization problem solved with GLOP.
dMX: Differentiable Mixed-Precision Assignment for Low-Precision Floating-Point Formats
dMX is a differentiable mixed-precision quantization framework that learns optimal floating-point bit-width assignments per layer for LLMs, targeting the MXFP family of formats defined by the OCP standard. It uses continuous optimization with temperature-based annealing and a budget-aware regularization term, consistently outperforming KL-divergence heuristics on Llama, Qwen3, and SmolLM2 models.
MLX 16/8/4/2-bit quants of nvidia/llama-embed-nemotron-8b
The user converted Nvidia's Llama-Embed-Nemotron-8B model to MLX format with fp16, 8-bit, 4-bit, and 2-bit quantizations, enabling in-process embedding loading on Apple Silicon via mlx-embeddings.