Tag
This paper introduces InterFLOPBench, a benchmark for evaluating LLMs on detecting floating-point errors in C code, finding that recent models achieve high F1-scores, though performance varies by error type.
A half-day tutorial at ISC High Performance 2026 on using compiler-assisted tools (FPChecker/LLVM) for floating-point error analysis and profiling in C/C++ scientific codes.
Die analysis of the Intel 8087 math coprocessor's fast bit shifter, exploring its architecture and role in floating-point operations.
A blog post explaining a counterintuitive optimization in which floating-point division (DIVSD) runs faster than integer division (IDIVQ) on modern CPUs, with benchmarks and assembly analysis.
This article examines the performance and precision differences between Unity's Mathf and System.MathF in C#, revealing that Unity's Mono runtime often performs double-precision math even for float operations, with implications for performance optimization.
A detailed technical article comparing and benchmarking various Rust decimal crates, covering fixed-point vs floating-point and fixed vs arbitrary precision designs.
This article explores how intermediate floating-point precision in C++ code depends on compiler settings, CPU flags, and architecture, particularly on x87 FPU, and how this affects performance and calculation results.
A detailed reverse-engineering analysis of the 69-bit adder in Intel's 8087 floating-point coprocessor from 1980, explaining its fast carry techniques and role in computing transcendental functions.
This article reverse-engineers the 69-bit adder at the core of Intel's 8087 floating-point coprocessor from 1980, explaining its architecture and carry-chain techniques.
dMX is a differentiable mixed-precision quantization framework that learns optimal floating-point bit-width assignments per layer for LLMs, targeting the MXFP family of formats defined by the OCP standard. It uses continuous optimization with temperature-based annealing and a budget-aware regularization term, consistently outperforming KL-divergence heuristics on Llama, Qwen3, and SmolLM2 models.
The IEEE P3109 draft standard defines a parameterized family of binary floating-point formats and operations tailored for machine learning, with configurable width, precision, signedness, and infinities. It also specifies extensive rounding modes, including stochastic rounding, and a novel scale-invariant approximation measure called kappa-approximation.
Setting temperature to 0 does not guarantee deterministic tool calls in agents due to batched inference causing floating-point reduction order shifts, leading to token flips and different actions under load.
The article compares two methods for normalizing RGB values (dividing by 255 vs 256) and explains the consequences for floating-point conversion and rounding, including uneven bin widths at the extremes.
A detailed reverse-engineering analysis of the microcode inside the Intel 8087 floating-point coprocessor, focusing on the FXCH register exchange instruction and the chip's internal architecture.
Explores the behavior of floor and ceil functions when applied to denormalized floating-point numbers, highlighting differences between CPU and GPU implementations and potential pitfalls.
This paper introduces Digit Entropy Loss (DEL), a novel loss function for numerical learning in large language models that reformulates entropy optimization to improve digit-level prediction accuracy and handle floating-point numbers, consistently outperforming existing methods on mathematical reasoning benchmarks.
A report on the RISC-V architecture's floating-point capabilities and recent updates to them.
FPSan is a Triton compiler pass that enables verification of algebraic equivalence of floating-point programs by replacing floating-point operations with integer operations, relying on Schanuel's conjecture for correctness.
The Engineering Director of Age of Empires analyzes 25 years of technical debt in the series' pathfinding system, tracing classic bugs such as units clipping through walls to legacy code, dynamic map mechanics, and floating-point errors introduced when SIMD instruction sets replaced x87 extended precision.
fc is an open-source lossless compressor for IEEE-754 64-bit double streams, offering superior compression ratios for structured data compared to zstd and fpzip, though with slower encoding speeds.
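
Several entries above turn on the same fact: floating-point addition is not associative, so the order of a reduction can change the result. The sketch below illustrates the effect behind the temperature-0 entry. It is not taken from any of the linked articles; the function names and input values are hypothetical, chosen only to make the difference visible in IEEE-754 double precision.

```python
# Minimal sketch: the same four numbers summed in two different orders.
# Assumption: Python floats are IEEE-754 doubles (true on mainstream platforms).

def sum_left_to_right(values):
    """Sequential reduction: ((v0 + v1) + v2) + v3 ..."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_pairwise(values):
    """Tree reduction: split in half, sum each half, then combine.
    Parallel and batched kernels often reduce in a tree-like order."""
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return sum_pairwise(values[:mid]) + sum_pairwise(values[mid:])


# Hypothetical inputs: large values that cancel, plus small ones that
# are lost or kept depending on when they are added.
values = [1e16, 1.0, -1e16, 1.0]

sequential = sum_left_to_right(values)
tree = sum_pairwise(values)

print(sequential, tree, sequential == tree)
```

Neither order yields the exact sum of 2.0, and the two orders disagree with each other. When a sum like this feeds a logit, a difference of this kind can be enough to change which token has the highest score, even with greedy decoding.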