FP8 is All You Need (Part 1): Debunking Hardware FP64 as the HPC Holy Grail

arXiv cs.AI Papers

Summary

This paper argues that using FP8 tensor cores with Ozaki Scheme II can replace native FP64 hardware for high-performance scientific computing on AI-optimized GPUs like NVIDIA's B300, achieving full double-precision accuracy at much higher throughput. The authors present a Tensor-Memory Equilibrium model and show that emulated FP64 performance can exceed native FP64 by orders of magnitude across all workloads.

arXiv:2606.06510v1 Announce Type: cross Abstract: Conventional HPC dogma holds that native hardware FP64 silicon is the irreducible foundation of scientific computing -- the "holy grail" of double-precision simulation. This paper argues the dogma is wrong: on AI-optimised GPUs of the B300 generation and beyond, abundant FP8 tensor throughput combined with the Chinese Remainder Theorem-based Ozaki Scheme II recovers memory-roof execution at full FP64 accuracy across the canonical HPC kernel spectrum. NVIDIA's Blackwell Ultra (B300) collapses native FP64 to ~1.3 TFLOPS -- a 31x regression from the B200 -- rendering even memory-bound kernels (SpMV, GEMV, stencils) compute-bound. We make four contributions. First, a unified analytic model, the Tensor-Memory Equilibrium (TME) model, augmenting the Roofline with a compute multiplier alpha, a bandwidth multiplier beta, and a reconstruction latency gamma. Second, we identify register-level fusion as the mechanism driving beta -> 1, making emulation essentially free behind the memory wall. Third, we project that Ozaki II vaults emulated FP64 from the ~1 TFLOPS native floor to ~500 TFLOPS (B300) and ~400 TFLOPS (Rubin R200), exceeding even B200's native FP64 ceiling by over an order of magnitude in the compute-bound regime while matching the memory roof in the bandwidth-bound regime. Fourth, against an H100 baseline, Ozaki II matches or exceeds H100 on every workload studied, versus the up-to-50x regression that B300 native FP64 imposes. Combined with a companion FFT analysis (Kulisch fixed-point reconstruction on the surviving INT32 pipe) and FP32+Kahan reductions reported in the companion Part(2) paper, every surveyed kernel class on B300 reaches the memory roof at full FP64. The evidence supports the title's claim: FP8, with Ozaki II and Kulisch escape routes, is all one needs for production HPC; native FP64 silicon is no longer the holy grail it has been taken to be.
Original Article
View Cached Full Text

Cached at: 06/08/26, 09:15 AM

# FP8 is All You Need (Part 1): Debunking Hardware FP64 as the HPC Holy Grail A Tensor–Memory Equilibrium Model and Implementation Strategy for Ozaki Scheme II on Memory-Bound Workloads in the Post-fp64 Era
Source: [https://arxiv.org/html/2606.06510](https://arxiv.org/html/2606.06510)
RIKEN Center for Computational Science \(R\-CCS\) KobeHyogoJapan

\(May 2026\)

###### Abstract

The conventional HPC dogma has long held that native hardwarefp64silicon is the irreducible foundation of scientific computing—the “holy grail” without which credible double\-precision simulation is impossible\. This paper argues, on the basis of a unified analytic model and a kernel\-by\-kernel audit, that the dogma is wrong: on AI\-optimised GPUs of the B300 generation and beyond, abundantfp8tensor throughput combined with the Chinese Remainder Theorem\-based Ozaki Scheme II is sufficient to recover memory\-roof execution at fullfp64accuracy across the entire canonical HPC kernel spectrum\. The divergence of modern GPU architectures toward low\-precision AI capability has created a widening gap for scientific applications requiring IEEE 754 double\-precision \(fp64\) arithmetic\. NVIDIA’s Blackwell Ultra \(B300\) GPU exemplifies this trend, delivering 10–15 PFLOPS of dense/sparse NVFP4 tensor throughput while collapsing nativefp64to roughly 1\.3 TFLOPS—a 31×\\timesregression relative to the B200\[[24](https://arxiv.org/html/2606.06510#bib.bib24),[5](https://arxiv.org/html/2606.06510#bib.bib5)\]\. For the first time in over a decade thefp64tensor throughput on a data\-center class GPU has been collapsed to a small fraction of the chip’sfp32\(TF32\) capability, a regime historically reserved for consumer silicon\. This effectively renders even traditionally memory\-bound kernels such as Sparse Matrix–Vector multiplication \(SpMV\), batched Matrix–Vector multiplication \(GEMV\) and stencil sweeps compute\-bound on the scant remainingfp64units\.

This paper analyses how the recently proposed Chinese Remainder Theorem\-basedOzaki Scheme II\[[31](https://arxiv.org/html/2606.06510#bib.bib31)\]—together with itsfp8\-oriented variant\[[37](https://arxiv.org/html/2606.06510#bib.bib37),[20](https://arxiv.org/html/2606.06510#bib.bib20)\]—can be used as a*practical replacement*for nativefp64on such “fp64\-starved” architectures, with explicit focus on the bandwidth\-limited kernels that prior work on Ozaki\-style emulation has not directly addressed\. We make five contributions\. First, we develop a unified analytic performance model, theTensor–Memory Equilibrium\(TME\) model, that augments the classical Roofline\[[39](https://arxiv.org/html/2606.06510#bib.bib39)\]with three emulation\-specific parameters: the compute multiplierα\\alpha\(number of low\-precision MMAs per high\-precision op\), the bandwidth multiplierβ\\beta\(data inflation due to residue/slice expansion\), and the reconstruction latencyγ\\gamma\. Second, we identify*register\-level fusion*of decomposition and reconstruction as the mechanism that drivesβ→1\\beta\\\!\\to\\\!1for streaming kernels, making the emulation cost essentially free behind the memory wall\. Third, we project, on B300 and the upcoming Rubin R200, that Ozaki II vaults emulatedfp64throughput from the collapsed∼1\\sim\\\!1TFLOPS native floor up to∼500\\sim\\\!500TFLOPS \(B300\) and∼400\\sim\\\!400TFLOPS \(Rubin\)—each exceeding even B200’s nativefp64ceiling by more than an order of magnitude in the compute\-bound regime, while matching the memory roof in the bandwidth\-bound regime\. In other words, the roofline analysis shows that emulation*restores*double\-precision performance to its rightful place across the entire operational\-intensity spectrum—exactly what scientific computing requires\. Fourth, taking the prior\-generation H100 as the HPC\-balanced baseline, we show that Ozaki II matches or exceeds the H100 on every workload on every GPU studied, in stark contrast to the up\-to\-50×50\\timesregression that B300 nativefp64imposes\. Fifth, combining the matrix/stencil/SpMV results of this paper with a companion FFT analysis\[[19](https://arxiv.org/html/2606.06510#bib.bib19)\]—which establishes that B300 admits a software rescue at fullfp64via a Kulisch fixed\-point reconstruction routed onto its surviving INT32 SIMT pipe—and with FP32\+Kahan compensation for BLAS\-1 reductions, the audit of §[7\.1](https://arxiv.org/html/2606.06510#S7.SS1)concludes that*every surveyed HPC kernel class on B300 admits a path to the memory roof at fullfp64accuracy*\. The combined evidence supports the title’s strong claim:*fp8\(with Ozaki II and Kulisch escape routes\) is all that one needs for production HPC*; nativefp64silicon is, on this evidence, no longer the holy grail it has historically been taken to be\. The kernel\-engineering investment required to build the Ozaki II and Kulisch implementations is, with modern AI coding assistants, now tractable on the timescale of months rather than years\. We close with a discussion of when emulation is genuinely competitive, when iterative refinement remains preferable, and what implications NVIDIA’s recent introduction of an*official*“Emulated DGEMM” column in its Rubin specifications\[[28](https://arxiv.org/html/2606.06510#bib.bib28),[16](https://arxiv.org/html/2606.06510#bib.bib16)\], the integration of Ozaki emulation into cuBLAS\[[26](https://arxiv.org/html/2606.06510#bib.bib26)\], and the announced reliance of the U\.S\. Department of Energy’s Genesis Mission on Ozaki emulation\[[42](https://arxiv.org/html/2606.06510#bib.bib42)\]have for the design of future scientific software stacks\.

Keywords:FP8 tensor\-core emulation; Ozaki Scheme II; Tensor–Memory Equilibrium \(TME\) model; memory\-bound HPC kernels; NVIDIA Blackwell Ultra \(B300\); post\-FP64 GPU architecture; AI for Science \(AI4S\)\.

## 1Introduction

The trajectory of high\-performance computing \(HPC\) hardware has bifurcated\. While scientific simulation—ranging from Quantum Chromodynamics and Computational Fluid Dynamics to climate emulation and seismic imaging—continues to rely on IEEE 754 double\-precision \(fp64\) arithmetic for numerical stability and reproducibility, the data\-center GPU market is decisively pivoting toward Artificial Intelligence\. This pivot is characterised by an exponential increase in low\-precision throughput \(fp16,fp8,fp6,fp4\) at the direct expense of nativefp64capability\.

The NVIDIA Blackwell architecture\[[23](https://arxiv.org/html/2606.06510#bib.bib23),[9](https://arxiv.org/html/2606.06510#bib.bib9)\]serves as the canonical case study\. Whereas the B200 retained a respectable∼40\\sim\\\!40TFLOPS of densefp64tensor performance, the Blackwell Ultra B300—based on the same architecture but with tensor cores re\-balanced for NVFP4—reports only∼1\.3\\sim\\\!1\.3TFLOPS sparse /1\.21\.2TFLOPS densefp64per GPU in the official datasheet\[[24](https://arxiv.org/html/2606.06510#bib.bib24)\]\. Independent microbenchmark analyses confirm thefp64regression and document the ascent of NVFP4 to the role of primary tensor format\[[9](https://arxiv.org/html/2606.06510#bib.bib9),[5](https://arxiv.org/html/2606.06510#bib.bib5)\]\. At SC25, Dongarra noted that “the floating\-point capability of the platform is not improved over the previous generation… the 64\-bit performance does not improve”\[[40](https://arxiv.org/html/2606.06510#bib.bib40)\]\.

The successor architecture, NVIDIA’s Vera Rubin platform \(R200 GPU\), sustains the same trajectory\. NVIDIA’s published Rubin specifications report*native*fp64vector performance at∼33\\sim\\\!33TFLOPS—a further regression from B200’s∼40\\sim\\\!40TFLOPS\[[17](https://arxiv.org/html/2606.06510#bib.bib17)\]—and list, for the first time, an explicit “Emulated DGEMM” column at∼200\\sim\\\!200TFLOPS that is achieved*through Ozaki\-style emulation*\[[28](https://arxiv.org/html/2606.06510#bib.bib28),[16](https://arxiv.org/html/2606.06510#bib.bib16)\]\. In parallel, Rubin ships with 22 TB/s HBM4 bandwidth—2\.75×2\.75\\timesover B300’s 8 TB/s—and 4 PFLOPS of densefp8matrix throughput\. NVIDIA has thus committed to emulation as the*official*path tofp64\-equivalent matrix performance on its scientific\-computing flagship, a commitment made concrete in the announced Doudna \(NERSC\) and Blue Lion \(LRZ\) Rubin\-based supercomputers\[[22](https://arxiv.org/html/2606.06510#bib.bib22),[27](https://arxiv.org/html/2606.06510#bib.bib27)\]\.

This regression has two consequences that, until now, have not been treated together\.

#### Consequence 1: Memory\-bound kernels become compute\-bound\.

The classical Roofline ridge point—the operational intensity at which the compute roof and the memory roof intersect—is given byPfp64/BmemP\_\{\\textsc\{fp64\}\}/B\_\{\\text\{mem\}\}, wherePfp64P\_\{\\textsc\{fp64\}\}is the peakfp64throughput andBmemB\_\{\\text\{mem\}\}is the HBM bandwidth\. On B300 this ridge sits at1\.3​TFLOPS/8​TB/s=0\.161\.3\\,\\text\{TFLOPS\}/8\\,\\text\{TB/s\}=0\.16FLOPS/Byte, so low that essentially every dense linear\-algebra kernel narrower than a GEMM falls into the compute\-bound regime\. A 7\-point stencil with operational intensity≈0\.5\\approx 0\.5FLOPS/Byte should run at the memory roof of8⋅0\.5=48\\cdot 0\.5=4TFLOPS, but is instead capped at1\.31\.3TFLOPS by the nativefp64pipe\. In other words,*the bandwidth that the application needs is physically present on the chip but cannot be consumed by the arithmetic units*\.

#### Consequence 2: Low\-precision tensor units are dormant\.

At the same time, the B300 carries 10 PFLOPS of dense NVFP4 throughput \(15 PFLOPS sparse\) and 5 PFLOPS of densefp8\(10 PFLOPS sparse\)\. In a typicalfp64\-only HPC kernel these units are idle, and the silicon area they occupy contributes nothing to the kernel’s time\-to\-solution\. This is the*Dark Silicon*manifestation of the AI–HPC divergence\.

#### The Ozaki Scheme\.

The Ozaki scheme\[[30](https://arxiv.org/html/2606.06510#bib.bib30)\], originally introduced for accurate dot products and matrix multiplication, has emerged over the last decade as the canonical mechanism for reclaiming this dormant throughput\. Recent work has extended the scheme in two directions: the original*Ozaki I*\(mantissa slicing\) has been adapted tofp16,fp8andint8tensor cores\[[21](https://arxiv.org/html/2606.06510#bib.bib21),[29](https://arxiv.org/html/2606.06510#bib.bib29)\], and a fundamentally different*Ozaki II*variant based on the Chinese Remainder Theorem \(CRT\) was proposed by Ozaki, Uchino and Imamura in 2025\[[31](https://arxiv.org/html/2606.06510#bib.bib31)\]\. NVIDIA integrated Ozaki\-style emulation into cuBLAS in October 2025\[[26](https://arxiv.org/html/2606.06510#bib.bib26)\], and the U\.S\. Department of Energy’s Genesis Mission has explicitly identified Ozaki emulation as its fallback path forfp64\-accurate scientific computing on AI\-centric hardware\[[42](https://arxiv.org/html/2606.06510#bib.bib42)\]\. At the same time, AMD’s recent positioning around the MI430X suggests that not every vendor is convinced that emulation alone is sufficient\[[41](https://arxiv.org/html/2606.06510#bib.bib41)\]\.

#### The gap addressed by this paper\.

Despite the rapid maturation of Ozaki I and Ozaki II for dense GEMM, all published performance studies have focused on the compute\-bound regime where the technique most obviously wins\. No published analysis \(to our knowledge, as of May 2026\) systematically asks:*when is Ozaki II profitable for memory\-bound kernels?*This is the operating regime that dominates real scientific simulation codes: stencil sweeps in PDE solvers, SpMV in iterative linear solvers, batched GEMV in time\-stepped reduced\-order models, and similar bandwidth\-limited primitives\. The conventional wisdom—that EFT methods cannot help bandwidth\-limited kernels because they inflate operand counts—deserves a more careful look on hardware where thefp64compute roof has collapsed below the memory roof\.

#### Contributions of this paper\.

1. 1\.We derive theTensor–Memory Equilibrium \(TME\)model \(§[4](https://arxiv.org/html/2606.06510#S4)\), a Roofline extension that exposes the three Ozaki\-II parameters\(α,β,γ\)\(\\alpha,\\beta,\\gamma\)and predicts the crossover arithmetic intensity at which emulation matches and surpasses nativefp64\.
2. 2\.We identifyregister\-level fusionas the structural mechanism that drivesβ→1\\beta\\to 1, and we lay out the kernel\-design discipline required to keep the residue / slice decomposition entirely on\-chip \(§[5](https://arxiv.org/html/2606.06510#S5)\)\.
3. 3\.We project, for both B300 \(Blackwell Ultra\) and the upcoming Rubin R200, the achievable emulatedfp64throughput across the full operational\-intensity spectrum, and show that Ozaki II \+fp8*simultaneously*matches the memory roof in the bandwidth\-bound regime and vastly exceeds the B200 nativefp64ceiling in the compute\-bound regime \(§[3](https://arxiv.org/html/2606.06510#S3), §[6](https://arxiv.org/html/2606.06510#S6)\)\.
4. 4\.Were\-baseline the projections against H100\(the last HPC\-balanced data\-centre NVIDIA GPU\) and show that Ozaki II matches or exceeds H100 on every workload on every architecture studied, in stark contrast to native B300 \(§[6\.1](https://arxiv.org/html/2606.06510#S6.SS1)\); we further compare INT8 and FP8 as emulation substrates \(§[6\.2](https://arxiv.org/html/2606.06510#S6.SS2)\)\.
5. 5\.Weclose the kernel coverageby combining this paper’s matrix/stencil/SpMV results with a companion FFT analysis\[[19](https://arxiv.org/html/2606.06510#bib.bib19)\]and standard FP32\+Kahan compensation for BLAS\-1 reductions \(§[7\.1](https://arxiv.org/html/2606.06510#S7.SS1)\)\. The companion paper establishes a four\-floor codesign rule for spectral workloads—a native FP64 bandwidth\-parity floor that B300 misses by∼10×\\sim\\\!10\\timesand Rubin meets within4%4\\%, a Kulisch INT32 sub\-floor that B300 meets with∼14%\\sim\\\!14\\%margin via a fixed\-point reconstruction routed onto its surviving INT32 SIMT pipe, and anfp8tensor\-core floor ofηfp8≥170​Bmem\\eta\_\{\\textsc\{fp8\}\}\\\!\\geq\\\!170\\,B\_\{\\text\{mem\}\}\(∼1\.36\\sim\\\!1\.36PFLOPS at88TB/s\) that every modern datacentre GPU exceeds by33–4×4\\times\. That thefp8floor is comfortably met everywhere is the quiet asymmetry that makes the FP8\-emulation thesis of this paper viable in the first place: the same FP8 tensor cores that drove the B300 FP64 collapse provide the throughput needed to emulate FP64 GEMMs at memory\-roof speed\. The composite result is that every HPC kernel class surveyed here admits a path to the memory roof on B300 at fullfp64accuracy, conditional on the Kulisch Phase B kernel being engineered\.

We emphasise that we make no claim of having invented the Ozaki scheme, the CRT\-based Ozaki II, thefp8adaptation, or the cuBLAS integration; these are due to the authors cited above \(Ozaki II\[[31](https://arxiv.org/html/2606.06510#bib.bib31)\], thefp8variant\[[37](https://arxiv.org/html/2606.06510#bib.bib37),[20](https://arxiv.org/html/2606.06510#bib.bib20)\], error analysis\[[32](https://arxiv.org/html/2606.06510#bib.bib32)\], the cuBLAS integration\[[26](https://arxiv.org/html/2606.06510#bib.bib26)\], and ADP\-style accuracy guarantees\[[33](https://arxiv.org/html/2606.06510#bib.bib33)\]\)\. The novelty of this paper is thecross\-cutting performance model, theapplication discipline for memory\-bound kernels, thegenerational H100\-baseline analysisof §[6](https://arxiv.org/html/2606.06510#S6), and theend\-to\-end kernel\-coverage auditof §[7\.1](https://arxiv.org/html/2606.06510#S7.SS1)that integrates this paper’s results with the companion FFT analysis\[[19](https://arxiv.org/html/2606.06510#bib.bib19)\]\.

#### The thesis\.

The composite finding—spanning the five contributions above—supports the strong thesis announced in the title of this paper:*across every canonical HPC kernel class surveyed, the combination of Ozaki II emulation, Ozaki–Bailey \+ Kulisch Phase B FFT, and FP32\+Kahan compensation for reductions makes B300 bandwidth\-bound at fullfp64accuracy\.*Put differently:*fp8\(with the right algorithmic scaffolding\) is all you need*for the production HPC workload spectrum; nativefp64silicon, on this evidence, is not the holy grail it has historically been taken to be\. What remains is the kernel\-engineering investment required to build the Ozaki and Kulisch implementations—investment that, as §[7\.4](https://arxiv.org/html/2606.06510#S7.SS4)argues, is now tractable on the timescale of months rather than years through the combination of AI\-assisted coding and the Rubin\-generation deployments scheduled for 2027 and beyond\.

#### Roadmap\.

§[2](https://arxiv.org/html/2606.06510#S2)reviews the two Ozaki schemes and related FP64\-emulation work\. §[3](https://arxiv.org/html/2606.06510#S3)establishes the architectural baselines\. §[4](https://arxiv.org/html/2606.06510#S4)develops the TME performance model\. §[5](https://arxiv.org/html/2606.06510#S5)gives kernel design strategies for memory\-bound workloads\. §[6](https://arxiv.org/html/2606.06510#S6)reports the projections\. §[7](https://arxiv.org/html/2606.06510#S7)discusses limitations and the AMD counter\-argument\. §[8](https://arxiv.org/html/2606.06510#S8)concludes\. Appendices contain a complete error analysis sketch, corrected pseudocode, and Garner’s reconstruction derivation\.

## 2Background and Related Work

We summarise the two Ozaki schemes at the level of detail required for the performance model\. Readers familiar with\[[30](https://arxiv.org/html/2606.06510#bib.bib30),[31](https://arxiv.org/html/2606.06510#bib.bib31)\]can skim this section\.

### 2\.1Error\-Free Transformations and the Ozaki Scheme

Leta,b∈𝔽a,b\\in\\mathbb\{F\}be working\-precision floats\. Veltkamp/Dekker splitting expressesa=ah\+aℓa=a\_\{h\}\+a\_\{\\ell\}withaha\_\{h\}holding the leading bits andaℓa\_\{\\ell\}the trailing bits, all in𝔽\\mathbb\{F\}; the producta​babthen equalsah​bh\+ah​bℓ\+aℓ​bh\+aℓ​bℓa\_\{h\}b\_\{h\}\+a\_\{h\}b\_\{\\ell\}\+a\_\{\\ell\}b\_\{h\}\+a\_\{\\ell\}b\_\{\\ell\}, computed exactly when each factor fits in𝔽\\mathbb\{F\}\. The Ozaki scheme\[[30](https://arxiv.org/html/2606.06510#bib.bib30)\]generalises this idea to dense matrices, replacing two\-word splitting byss\-way slicing along the inner\-product direction\.

### 2\.2Ozaki Scheme I: Mantissa Slicing

GivenA∈𝔽m×kA\\in\\mathbb\{F\}^\{m\\times k\}andB∈𝔽k×nB\\in\\mathbb\{F\}^\{k\\times n\}, Ozaki I forms decompositions

A=∑p=1SAA\(p\),B=∑q=1SBB\(q\),A=\\sum\_\{p=1\}^\{S\_\{A\}\}A^\{\(p\)\},\\qquad B=\\sum\_\{q=1\}^\{S\_\{B\}\}B^\{\(q\)\},\(1\)in which eachA\(p\)A^\{\(p\)\}andB\(q\)B^\{\(q\)\}has a bounded mantissa width compatible with the target tensor format\. The reconstruction is

C≈∑p=1SA∑q=1SBA\(p\)​B\(q\),C\\;\\approx\\;\\sum\_\{p=1\}^\{S\_\{A\}\}\\sum\_\{q=1\}^\{S\_\{B\}\}A^\{\(p\)\}B^\{\(q\)\},\(2\)summed in working precision, at dominant costSA⋅SBS\_\{A\}\\\!\\cdot\\\!S\_\{B\}low\-precision GEMMs\.

#### Substrate\-specific scaling\.

The three candidate tensor\-core substrates—fp16,int8, andfp8—differ in how \([1](https://arxiv.org/html/2606.06510#S2.E1)\)–\([2](https://arxiv.org/html/2606.06510#S2.E2)\) is actually computed\.

- •*fp16tensor cores\.*EachA\(p\)A^\{\(p\)\}andB\(q\)B^\{\(q\)\}is stored directly as anfp16matrix, and the tensor\-core MMA accumulatesA\(p\)​B\(q\)A^\{\(p\)\}B^\{\(q\)\}infp32\. No additional integer scaling is required\.
- •*int8tensor cores\.*Each slice carries a*floating\-point*mantissa range, but theint8engine consumes signed integers\. The decomposition therefore associates each slice with a per\-slice power\-of\-two exponent:A\(p\)=2ep⋅A~\(p\)A^\{\(p\)\}=2^\{e\_\{p\}\}\\\!\\cdot\\\!\\widetilde\{A\}^\{\(p\)\}andB\(q\)=2fq⋅B~\(q\)B^\{\(q\)\}=2^\{f\_\{q\}\}\\\!\\cdot\\\!\\widetilde\{B\}^\{\(q\)\}, whereA~\(p\),B~\(q\)∈ℤ8\\widetilde\{A\}^\{\(p\)\},\\widetilde\{B\}^\{\(q\)\}\\\!\\in\\\!\\mathbb\{Z\}^\{8\}are the signed\-integer mantissa slices fed to theint8engine, and the reconstruction accumulates2ep\+fq⋅\(A~\(p\)​B~\(q\)\)2^\{e\_\{p\}\+f\_\{q\}\}\\\!\\cdot\\\!\(\\widetilde\{A\}^\{\(p\)\}\\widetilde\{B\}^\{\(q\)\}\)infp64\[[21](https://arxiv.org/html/2606.06510#bib.bib21),[29](https://arxiv.org/html/2606.06510#bib.bib29)\]\. The integer products themselves are computed exactly inint32\.
- •*fp8tensor cores\.*fp8\(E4M3\) consumes only a 3\+1\-bit mantissa per element, so the slicing of anfp64mantissa requires more slices than theint8case; in addition, thefp8engine accumulates infp32rather thanint32, and a separate quantisation correction is required to handle the implicit normalisation of thefp8format\. The corresponding scheme is described in detail in\[[20](https://arxiv.org/html/2606.06510#bib.bib20)\]and is the basis of the FP8 variant of Ozaki II that we discuss in §[2\.4](https://arxiv.org/html/2606.06510#S2.SS4)\.

#### Slice counts: accumulator\-bound analysis\.

The required slice count is governed jointly by the input mantissa width and by the accumulator precision\. Forfp64\-accurate output with summation lengthkk, the requirement that the accumulator hold the sum ofkkproducts ofbb\-bit slices exactly is

2​b\+⌈log2⁡k⌉≤wacc,2b\+\\lceil\\log\_\{2\}k\\rceil\\;\\leq\\;w\_\{\\text\{acc\}\},\(3\)wherewaccw\_\{\\text\{acc\}\}is the number of significand bits available in the accumulator \(24 forfp32, 31 forint32signed\)\. Inverting \([3](https://arxiv.org/html/2606.06510#S2.E3)\) gives the maximum safe payloadb⋆=\(wacc−⌈log2⁡k⌉\)/2b^\{\\star\}\\\!=\\\!\(w\_\{\\text\{acc\}\}\\\!\-\\\!\\lceil\\log\_\{2\}k\\rceil\)/2bits per slice\. The slice count needed to cover the 53\-bitfp64mantissa is thenS≈⌈53/b⋆⌉S\\\!\\approx\\\!\\lceil 53/b^\{\\star\}\\rceil\.

Table[1](https://arxiv.org/html/2606.06510#S2.T1)reports the resulting slice counts for representative inner\-dimension lengthskk\. Three observations follow\.

Table 1:Slice count required forfp64\-accurate Ozaki I, derived from \([3](https://arxiv.org/html/2606.06510#S2.E3)\) and the substrate\-specific accumulator width\. The input mantissa width is the engine’s payload\-bit budget \(fp16: 11 bits;int8: 7 signed;fp8E4M3: 4 bits with implicit normalisation\)\. Values are rounded up to the next integer; “—” meansb⋆b^\{\\star\}exceeds the input mantissa, so a single slice suffices in principle\. Empirical values from\[[21](https://arxiv.org/html/2606.06510#bib.bib21),[20](https://arxiv.org/html/2606.06510#bib.bib20)\]are noted in the rightmost column\.Reported empirical values:fp16:S≈3S\\\!\\approx\\\!3–44at moderatekk\[[21](https://arxiv.org/html/2606.06510#bib.bib21)\];int8:S≈7S\\\!\\approx\\\!7–1010atk≤16384k\\\!\\leq\\\!16384\[[21](https://arxiv.org/html/2606.06510#bib.bib21),[29](https://arxiv.org/html/2606.06510#bib.bib29)\];fp8:S≥11S\\\!\\geq\\\!11in the configurations of\[[20](https://arxiv.org/html/2606.06510#bib.bib20)\]\. Empirical values exceed the accumulator\-bound minimum because of input distributional effects \(loss of leading bits when the entries’ magnitudes vary substantially across the inner\-product summation\)\.

1. 1\.Forfp16tensor cores withfp32accumulation, the bound is tight: atk=4096k\\\!=\\\!4096, only about 6 bits per slice can be safely accumulated, even thoughfp16can carry 11\. In other words, at largekk, thefp16/fp32substrate is*accumulator\-bound*rather than input\-bound, and the slice count grows accordingly\.
2. 2\.Forint8cores withint32accumulation, the bound is loose: atk=4096k\\\!=\\\!4096, the safe payload is 9\.5 bits, which exceeds the 7\-bitint8input\. Consequently, the slice count is governed by the*input*mantissa, not by the accumulator, and grows only logarithmically withkk\. At largekk, theint8substrate therefore requires comparable or fewer slices thanfp16, contrary to a naive ordering by input precision\.
3. 3\.Forfp8tensor cores, the small input mantissa \(effectively 4 bits\) combined with thefp32accumulator constraint produces the highest slice count of the three; Mukunoki*et al\.*’s reportedS≥11S\\\!\\geq\\\!11for FP8\[[20](https://arxiv.org/html/2606.06510#bib.bib20)\]is consistent with Table[1](https://arxiv.org/html/2606.06510#S2.T1)\.

#### Quadratic scaling\.

The total number of low\-precision GEMMs scales asΘ​\(S2\)\\Theta\(S^\{2\}\), which is the dominant disadvantage of Ozaki I for bandwidth\-limited problems: even with fused decomposition, the temporal volume of arithmetic grows quadratically in the slice count\. This is the principal motivation for the linearly\-scaling Ozaki II scheme described next\.

### 2\.3Ozaki Scheme II: Modular Arithmetic via the CRT

The Chinese Remainder Theorem\-based Ozaki II scheme, proposed by Ozaki, Uchino and Imamura in 2025\[[31](https://arxiv.org/html/2606.06510#bib.bib31)\], replaces the slice decomposition by a residue decomposition over a set of pairwise\-coprime moduli\. The algorithm consists of three phases\.

#### Phase 1: Integer scaling\.

Diagonal scaling matricesD∈𝔽m×mD\\\!\\in\\\!\\mathbb\{F\}^\{m\\times m\}andE∈𝔽n×nE\\\!\\in\\\!\\mathbb\{F\}^\{n\\times n\}, chosen with power\-of\-two diagonals to be exactly invertible infp64, are used to form

A~=⌊DA⌉∈ℤm×k,B~=⌊BE⌉∈ℤk×n,\\tilde\{A\}=\\lfloor DA\\rceil\\in\\mathbb\{Z\}^\{m\\times k\},\\qquad\\tilde\{B\}=\\lfloor BE\\rceil\\in\\mathbb\{Z\}^\{k\\times n\},\(4\)where the rounding⌊⋅⌉\\lfloor\\,\\cdot\\,\\rceilis to nearest integer\. The integer matricesA~,B~\\tilde\{A\},\\tilde\{B\}have magnitudes bounded by the chosen scaling factor, typically2p−1−12^\{p\-1\}\-1for a representation width ofppbits\.

#### Phase 2: Modular GEMMs\.

Choose pairwise\-coprime modulim1<m2<⋯<mrm\_\{1\}<m\_\{2\}<\\cdots<m\_\{r\}large enough that

M=∏i=1rmi\>2⋅maxi​j⁡\|\(A~​B~\)i​j\|,M=\\prod\_\{i=1\}^\{r\}m\_\{i\}\\;\>\\;2\\cdot\\max\_\{ij\}\\bigl\|\(\\tilde\{A\}\\tilde\{B\}\)\_\{ij\}\\bigr\|,\(5\)which guarantees that the integer product can be uniquely reconstructed from its residues\. For eachi∈\{1,…,r\}i\\in\\\{1,\\dots,r\\\}compute

C\(i\)=\(A~modmi\)​\(B~modmi\)modmi\.C^\{\(i\)\}\\;=\\;\\bigl\(\\tilde\{A\}\\bmod m\_\{i\}\\bigr\)\\;\\bigl\(\\tilde\{B\}\\bmod m\_\{i\}\\bigr\)\\bmod m\_\{i\}\.\(6\)Onint8tensor cores each\(A~modmi\)⋅\(B~modmi\)\(\\tilde\{A\}\\bmod m\_\{i\}\)\\cdot\(\\tilde\{B\}\\bmod m\_\{i\}\)is a standardint8GEMM accumulated inint32\. When usingfp8tensor cores the modular reduction must be carried out in scaled floating point following the technique of\[[37](https://arxiv.org/html/2606.06510#bib.bib37)\]\.

#### Phase 3: CRT reconstruction \(Garner’s algorithm\)\.

We apply Garner’s algorithm\[[6](https://arxiv.org/html/2606.06510#bib.bib6),[12](https://arxiv.org/html/2606.06510#bib.bib12)\]to recover the integer product element\-wise\. WritingCi​j=v1\+v2​m1\+v3​m1​m2\+⋯C\_\{ij\}=v\_\{1\}\+v\_\{2\}m\_\{1\}\+v\_\{3\}m\_\{1\}m\_\{2\}\+\\cdotsin mixed radix, the digitsvkv\_\{k\}are computed iteratively as

v1\\displaystyle v\_\{1\}=Ci​j\(1\),\\displaystyle=C^\{\(1\)\}\_\{ij\},\(7\)vk\\displaystyle v\_\{k\}=\(Ci​j\(k\)−∑j=1k−1vj​∏ℓ=1j−1mℓ\)⋅\(∏ℓ=1k−1mℓ\)−1\(modmk\),k≥2\.\\displaystyle=\\biggl\(C^\{\(k\)\}\_\{ij\}\-\\sum\_\{j=1\}^\{k\-1\}v\_\{j\}\\prod\_\{\\ell=1\}^\{j\-1\}m\_\{\\ell\}\\biggr\)\\cdot\\biggl\(\\prod\_\{\\ell=1\}^\{k\-1\}m\_\{\\ell\}\\biggr\)^\{\-1\}\\pmod\{m\_\{k\}\},\\quad k\\geq 2\.The modular inverses\(∏ℓmℓ\)−1\(modmk\)\\bigl\(\\prod\_\{\\ell\}m\_\{\\ell\}\\bigr\)^\{\-1\}\\pmod\{m\_\{k\}\}are precomputed once\. The recovered integer is finally rescaled back to floating point viaC≈D−1​C~​E−1/\(σA​σB\)C\\approx D^\{\-1\}\\tilde\{C\}E^\{\-1\}/\(\\sigma\_\{A\}\\sigma\_\{B\}\), whereσA,σB\\sigma\_\{A\},\\sigma\_\{B\}are the scalar magnitudes hidden insideD,ED,E\.

#### Linear scaling\.

The cost isrrlow\-precision GEMMs plus theO​\(r2\)O\(r^\{2\}\)element\-wise reconstruction\. Critically,rrscales*linearly*in the required exponent range for FP64\-equivalent dynamic range; published parameter sets user∈\[13,16\]r\\\!\\in\\\!\[13,16\]onint8cores\[[31](https://arxiv.org/html/2606.06510#bib.bib31)\]andr∈\[8,12\]r\\\!\\in\\\!\[8,12\]onfp8cores\[[37](https://arxiv.org/html/2606.06510#bib.bib37)\], with the exact value driven by the desired error guarantee\.

### 2\.4Thefp8Variant of Ozaki II

Uchino, Ozaki and Imamura observed in early 2026 that the original Ozaki II algorithm*cannot be directly adapted tofp8matrix\-multiply\-accumulate units*, because modular reduction is fundamentally an integer operation\[[37](https://arxiv.org/html/2606.06510#bib.bib37)\]\. They introduced a quantisation trick that emulates modular arithmetic overfp8by exploiting the fact that anfp8\(E4M3\) value can represent integers up to±448\\pm 448exactly\. This adaptation is the reason that Ozaki II remains viable on Blackwell Ultra and on the upcoming NVIDIA Rubin GPU, both of which significantly downgradeint8in favour offp8/fp4\[[20](https://arxiv.org/html/2606.06510#bib.bib20),[37](https://arxiv.org/html/2606.06510#bib.bib37)\]\.

This paper assumes the existence of thisfp8adaptation but does not re\-derive it\. Our TME model is parameterised so that the choice of underlying tensor format simply rescales the compute multiplierα\\alphaand the reconstruction latencyγ\\gamma, without altering the model’s structure\.

#### Sensitivity to the moduli count and the \(3r\+1\) FP8 cost structure\.

The performance projections in later sections taker=10r\\\!=\\\!10moduli for thefp8Ozaki II variant\. This is at the optimistic end of the empirical range: Mukunoki*et al\.*\[[20](https://arxiv.org/html/2606.06510#bib.bib20)\]reportr∈\[11,14\]r\\\!\\in\\\!\[11,14\]forfp64\-equivalent accuracy on thefp8\(E4M3\) substrate, withr=12r\\\!=\\\!12recommended\. Atr=11r\\\!=\\\!11the emulation compute ceilingPfp8/rP\_\{\\textsc\{fp8\}\}/rdecreases by≈9%\\approx\\\!9\\%\(e\.g\. from500500to≈455\\approx\\\!455TFLOPS on B300, and from400400to≈364\\approx\\\!364TFLOPS on Rubin\), and atr=12r\\\!=\\\!12by≈17%\\approx\\\!17\\%\. A second refinement clarified by Imamura \(private communication and\[[19](https://arxiv.org/html/2606.06510#bib.bib19)\]\) concerns the precise op count: each fp64\-equivalent residue product on the FP8 substrate expands into\(3​r\+1\)\(3r\+1\)FP8 MMAs rather thanrr, because the Karatsuba structure used internally to emulate signedint9 on FP8 introduces a factor\-of\-three multiplier on the residue planes\. For the INT8 substrate the count is\(s\+1\)\(s\+1\)wheressis the slice count,∼2\.5×\\sim\\\!2\.5\\timescheaper per modulus but requiring more moduli\. Adoptingr=12r\\\!=\\\!12with the\(3​r\+1\)\(3r\+1\)factor would revise the absolute throughput ceilings downward by an additional constant, but does not qualitatively alter any of the speedup conclusions in §[6](https://arxiv.org/html/2606.06510#S6): the parity\-row structure of Table[3](https://arxiv.org/html/2606.06510#S6.T3)and the H100\-relative scaling of Table[4](https://arxiv.org/html/2606.06510#S6.T4)are governed by HBM bandwidth ratios, not by the specific compute\-ceiling constant\. Readers comparing the projections against measured implementation data should adjust the absolute throughput numbers accordingly\.

### 2\.5Error Analysis

Both Ozaki I and Ozaki II provide*provable*error bounds for the emulated product, in contrast to ad\-hoc mixed\-precision schemes\. For Ozaki II withrrmoduli chosen by the criterion \([5](https://arxiv.org/html/2606.06510#S2.E5)\), the worst\-case componentwise relative error is bounded by the working\-precision unit round\-offufp64≈2−53u\_\{\\textsc\{fp64\}\}\\approx 2^\{\-53\}*plus*the floating\-to\-integer rounding introduced in \([4](https://arxiv.org/html/2606.06510#S2.E4)\)\[[32](https://arxiv.org/html/2606.06510#bib.bib32)\]\. In practice, for inputs with bounded condition number, the observed relative error is within22–10​ufp6410\\,u\_\{\\textsc\{fp64\}\}\[[31](https://arxiv.org/html/2606.06510#bib.bib31)\]; this behaviour was confirmed in the recent ADP work of Schwarz*et al\.*\[[33](https://arxiv.org/html/2606.06510#bib.bib33)\], who report less than 10% overhead for componentwise\-accurate DGEMM emulation on Blackwell\.

We refer the reader to\[[32](https://arxiv.org/html/2606.06510#bib.bib32)\]for the full componentwise analysis; the practical implication for this paper is that the*kernels we discuss inherit the same error bound as the underlying emulated GEMM*, modulo the trivial summation rounding incurred at the reconstruction step\.

### 2\.6Other FP64 Emulation Approaches

Several alternative paths exist for emulating high\-precision arithmetic on low\-precision tensor units\. Markidis*et al\.*\[[18](https://arxiv.org/html/2606.06510#bib.bib18)\]exploredfp16tensor cores forfp32\-equivalent accuracy, the natural ancestor of all modern emulation work\. Iterative\-refinement solvers\[[8](https://arxiv.org/html/2606.06510#bib.bib8),[2](https://arxiv.org/html/2606.06510#bib.bib2)\]use low\-precision GEMMs as inner kernels and correct residuals in high precision, achieving high accuracy on solvers but not on bare GEMM\-like primitives\. More recently, MixPert and similar adaptive\-precision frameworks compose low\-precision tensor operations with correction terms\.

Ozaki\-style emulation is distinctive in that it providesprovable componentwiseerror guarantees on the bare matrix product itself, without requiring an outer iteration or a problem\-specific tuning loop\. This property is essential for drop\-in replacement of cuBLAS DGEMM in scientific codes, which is exactly the use case NVIDIA targeted with the October 2025 cuBLAS integration\[[26](https://arxiv.org/html/2606.06510#bib.bib26)\]\.

### 2\.7Tensor Cores for Memory\-Bound Kernels: Prior Art

There is a parallel literature, largely independent of Ozaki, on*making memory\-bound kernels use tensor cores at all*—irrespective of precision considerations\. TCStencil\[[15](https://arxiv.org/html/2606.06510#bib.bib15)\], SPTCStencil\[[7](https://arxiv.org/html/2606.06510#bib.bib7)\]and SparStencil\[[38](https://arxiv.org/html/2606.06510#bib.bib38)\]reformulate stencil sweeps as block\-sparse GEMMs amenable to dense or sparse tensor cores\. These works treat the precision question orthogonally \(typically retainingfp32orfp16\)\. Our contribution sits at the*intersection*of the two threads: we want both \(i\) the tensor\-core mapping that the stencil community has developed, and \(ii\) thefp64\-accurate emulation that the Ozaki community has developed\. The TME model is the analytic tool that lets us reason about both simultaneously\.

## 3Architectural Analysis: The FP64 Cliff

We now establish the architectural baselines used throughout the rest of the paper\. All numbers in Table[2](https://arxiv.org/html/2606.06510#S3.T2)are taken from the official NVIDIA datasheets and corroborated by independent microbenchmarks\[[24](https://arxiv.org/html/2606.06510#bib.bib24),[23](https://arxiv.org/html/2606.06510#bib.bib23),[9](https://arxiv.org/html/2606.06510#bib.bib9),[5](https://arxiv.org/html/2606.06510#bib.bib5)\]\. We report*dense*throughput per single GPU unless explicitly marked as sparse, because real HPC workloads do not generally exhibit the 2:4 structured sparsity required to reach the sparse rates\.

Table 2:Per\-GPU architectural parameters used in the TME model\. Tensor throughput numbers are dense unless noted\. Vectorfp64refers to non\-tensor SIMT pipes\. Sources: NVIDIA Blackwell Ultra datasheet\[[24](https://arxiv.org/html/2606.06510#bib.bib24)\], HGX B200 datasheet\[[23](https://arxiv.org/html/2606.06510#bib.bib23)\], microbenchmark study\[[9](https://arxiv.org/html/2606.06510#bib.bib9)\], NVIDIA Rubin specifications as published\[[28](https://arxiv.org/html/2606.06510#bib.bib28),[17](https://arxiv.org/html/2606.06510#bib.bib17),[16](https://arxiv.org/html/2606.06510#bib.bib16)\]\.\*Rubin specifications list FP64 matrix performance under an explicit “Emulated DGEMM” column rather than as a native tensor\-core rate\[[28](https://arxiv.org/html/2606.06510#bib.bib28),[16](https://arxiv.org/html/2606.06510#bib.bib16)\]\.†\\daggerFP8 to native vector FP64; the FP8 : emulated\-DGEMM ratio is20:120\{:\}1\.

#### Reading the table\.

Four columns—H100, B200, B300, and R200—trace a clear architectural trajectory\. Nativefp64compute is no longer growing with each generation; on B300 it has effectively been removed, and on Rubin NVIDIA has chosen to expose “Emulated DGEMM” as an explicit, first\-class column in the official specifications\[[28](https://arxiv.org/html/2606.06510#bib.bib28)\]\. At the same time, the low\-precision matrix pipes \(fp8, NVFP4\) and the HBM bandwidth have each grown by factors of22–33per generation, with Rubin’s2222TB/s HBM4 marking a2\.75×2\.75\\timesjump over B300\. Any performance model for scientific computing on these architectures must therefore treat emulation*not as an optimisation*but as the*native execution model*forfp64matrix operations\.

#### Why Ozaki II runs throughfp8on these architectures\.

Theint8rate on Blackwell Ultra and Rubin has not scaled withfp8or NVFP4—in fact it has*decreased*relative to Hopper, as silicon area has been redirected toward low\-precision floating\-point matrix units\. The natural path to high emulation performance on these architectures therefore runs throughfp8\[[37](https://arxiv.org/html/2606.06510#bib.bib37),[20](https://arxiv.org/html/2606.06510#bib.bib20)\]\. Onfp8the B300 delivers55PFLOPS dense and1010PFLOPS sparse; withr≈10r\\\!\\approx\\\!10moduli \(a typical Ozaki II FP8 parameter setting\) the*effective upper bound*on emulatedfp64throughput is5,000/10=5005\{,\}000/10=500TFLOPS dense\. On Rubin,44PFLOPS densefp8with the same parameters yields≈400\\approx 400TFLOPS, comparable to NVIDIA’s published∼200\\sim 200TFLOPS “Emulated DGEMM” figure with margin for componentwise\-error guarantees\[[33](https://arxiv.org/html/2606.06510#bib.bib33),[32](https://arxiv.org/html/2606.06510#bib.bib32)\]\.

#### Where the native memory ridge sits\.

Combining the nativefp64compute roof with HBM bandwidth gives ridge points of10\.110\.1FLOPS/B on H100,5\.05\.0FLOPS/B on B200,0\.160\.16FLOPS/B on B300, and1\.51\.5FLOPS/B on Rubin\. These four numbers tell three distinct stories\. On H100 and B200, the ridge sits above the operational intensity of every standard memory\-bound primitive \(SpMV∼0\.2\\sim 0\.2, stencils∼0\.5\\sim 0\.5, batched GEMV∼1\\sim 1–44\), so those kernels are bandwidth\-bound natively and the architecture is balanced for the workload class\. On B300, the ridge has collapsed to0\.160\.16FLOPS/B, below the operational intensity of*every*standard scientific kernel; the native pipe is the bottleneck even for SpMV\. On Rubin, the picture is genuinely mixed: the bandwidth jump to2222TB/s combined with the modest3333TFLOPS vectorfp64pushes the ridge to1\.51\.5FLOPS/B, which puts SpMV \(∼0\.2\\sim 0\.2\) and stencils \(∼0\.5\\sim 0\.5\) back into the memory\-bound regime where nativefp64vector is once again usable, while batched GEMV and dense GEMM sit at or above the ridge and benefit from the Emulated DGEMM path\. In other words, Rubin partially restores the native\-fp64\-viable regime for the lowest\-OI kernels, and Ozaki II picks up the rest\.

## 4The Tensor–Memory Equilibrium Model

We now develop the analytic performance model used in the rest of the paper\. Let a kernel performWWfp64\-equivalent floating\-point operations onQQbytes of memory traffic, so that its operational intensity isOI=W/Q\\mathrm\{OI\}=W/Q\. We adopt OI in preference to “operational intensity” to avoid notational collision with AI throughout this paper\.

### 4\.1Native FP64

The Roofline model\[[39](https://arxiv.org/html/2606.06510#bib.bib39)\]predicts the native execution time as

Tnat=max⁡\(WPfp64,QBmem\)\+Lmem,T\_\{\\text\{nat\}\}\\;=\\;\\max\\\!\\left\(\\frac\{W\}\{P\_\{\\textsc\{fp64\}\}\},\\frac\{Q\}\{B\_\{\\text\{mem\}\}\}\\right\)\\;\+\\;L\_\{\\text\{mem\}\},\(8\)wherePfp64P\_\{\\textsc\{fp64\}\}is the peakfp64throughput,BmemB\_\{\\text\{mem\}\}the HBM bandwidth andLmemL\_\{\\text\{mem\}\}the cold\-start latency \(assumed amortised for the workload sizes considered\)\. The kernel is memory\-bound whenI<Pfp64/BmemI<P\_\{\\textsc\{fp64\}\}/B\_\{\\text\{mem\}\}, compute\-bound otherwise\.

### 4\.2Emulated Execution

Replacing nativefp64by Ozaki II introduces three overhead parameters\.

###### Definition 1\(Emulation parameters\)\.

For a fixed Ozaki II configuration on a given hardware target,

α\\alpha=\\;=\\;number of low\-precision tensor\-core MMAs required perfp64fused multiply–add\. For Ozaki II this is essentiallyrr\(the number of moduli\), modulo constant overhead from index arithmetic\.

β\\beta≥1\\;\\geq 1\\;is the*bandwidth multiplier*: the ratio of bytes moved by the emulated kernel to the bytes that the equivalent nativefp64kernel would move\.β=1\\beta=1for fully fused on\-chip decomposition \(residues live in registers and never touch global memory\);β=r\\beta=rin the unfused regime \(each residue plane is materialised in HBM\)\.

γ\\gamma≥0\\;\\geq 0\\;is the per\-output reconstruction latency, in seconds per output element, capturing the cost of Garner’s algorithm\.

The emulated time is then

Temu=max⁡\(α​WPlow,β​QBmem\)\+γ​nout,T\_\{\\text\{emu\}\}\\;=\\;\\max\\\!\\left\(\\frac\{\\alpha\\,W\}\{P\_\{\\text\{low\}\}\},\\frac\{\\beta\\,Q\}\{B\_\{\\text\{mem\}\}\}\\right\)\\;\+\\;\\gamma\\,n\_\{\\text\{out\}\},\(9\)wherePlowP\_\{\\text\{low\}\}is the relevant low\-precision tensor throughput \(Pfp8P\_\{\\textsc\{fp8\}\},Pint8P\_\{\\textsc\{int8\}\}, etc\.\) andnoutn\_\{\\text\{out\}\}is the number of output elements\.

### 4\.3Crossover Analysis

Letρ=Plow/Pfp64\\rho\\;=\\;P\_\{\\text\{low\}\}/P\_\{\\textsc\{fp64\}\}be the precision ratio \(e\.g\.,ρ≈3800\\rho\\approx 3800forfp8on B300\)\. Definer=αr=\\alphato emphasise the dependence on the moduli count\.

#### Case A: Native compute\-bound, emulation memory\-bound\.

This is the regime that the B300 collapses into for nearly every standard kernel\. Equations \([8](https://arxiv.org/html/2606.06510#S4.E8)\) and \([9](https://arxiv.org/html/2606.06510#S4.E9)\) give

Tnat=WPfp64,Temu=β​QBmem\+γ​nout,T\_\{\\text\{nat\}\}=\\frac\{W\}\{P\_\{\\textsc\{fp64\}\}\},\\qquad T\_\{\\text\{emu\}\}=\\frac\{\\beta Q\}\{B\_\{\\text\{mem\}\}\}\+\\gamma n\_\{\\text\{out\}\},\(10\)and the emulation is profitable when

WPfp64\>β​QBmem\+γ​nout\.\\frac\{W\}\{P\_\{\\textsc\{fp64\}\}\}\\;\>\\;\\frac\{\\beta Q\}\{B\_\{\\text\{mem\}\}\}\+\\gamma n\_\{\\text\{out\}\}\.\(11\)Ignoringγ\\gammamomentarily, and assumingβ=1\\beta=1\(the fused case\), the speedup of emulation over native is

TnatTemu=W/Pfp64Q/Bmem=I⋅BmemPfp64,\\frac\{T\_\{\\text\{nat\}\}\}\{T\_\{\\text\{emu\}\}\}\\;=\\;\\frac\{W/P\_\{\\textsc\{fp64\}\}\}\{Q/B\_\{\\text\{mem\}\}\}\\;=\\;\\frac\{I\\cdot B\_\{\\text\{mem\}\}\}\{P\_\{\\textsc\{fp64\}\}\},\(12\)that is, the speedup equals the kernel’s distance into the compute\-bound region of the native roofline\. On the B300 withI=0\.5I=0\.5\(a 7\-point stencil\),Bmem=8B\_\{\\text\{mem\}\}=8TB/s,Pfp64=1\.3P\_\{\\textsc\{fp64\}\}=1\.3TFLOPS, this gives a speed\-up of0\.5⋅8/1\.3≈3\.1×0\.5\\cdot 8/1\.3\\approx 3\.1\\times\.

#### Case B: Both regimes memory\-bound \(smallII\)\.

When the native kernel is already memory\-bound, emulation cannot improve time\-to\-solution beyondTnatT\_\{\\text\{nat\}\}in the best case \(β=1\\beta=1\)\. More precisely,Temu/Tnat→βT\_\{\\text\{emu\}\}/T\_\{\\text\{nat\}\}\\to\\beta\. Hence fused decomposition \(β=1\\beta=1\) yields parity, and unfused decomposition is strictly worse\. This gives the important practical guidance:for genuinely memory\-bound kernels in the small\-IIregime, register\-level fusion is not just an optimisation; it is a correctness requirement for emulation to be viable at all\.

#### Case C: Both regimes compute\-bound \(dense GEMM\)\.

Here the speedup isρ/α\\rho/\\alpha, and the bandwidth roof is irrelevant\. On the B300 withfp8cores andr≈10r\\approx 10, this gives a throughput ceiling of5,000/10≈5005\{,\}000/10\\approx 500TFLOPS, which is≈380×\\approx 380\\timesover B300 nativefp64\(1\.31\.3TFLOPS\)\. This is the regime in which Ozaki II for dense GEMM has been most extensively studied\[[37](https://arxiv.org/html/2606.06510#bib.bib37),[20](https://arxiv.org/html/2606.06510#bib.bib20),[33](https://arxiv.org/html/2606.06510#bib.bib33)\]\.

### 4\.4The TME Picture

Figure[1](https://arxiv.org/html/2606.06510#S4.F1)draws the TME roofline for both B300 and Rubin\. The nativefp64roof on B300 is flat at1\.31\.3TFLOPS, and on Rubin at∼33\\sim 33TFLOPS for vectorfp64\. Ozaki II with fused decomposition follows the memory roof up to its own compute ceiling:≈500\\approx 500TFLOPS on B300 \(densefp8at55PFLOPS withr=10r=10\) and≈400\\approx 400TFLOPS on Rubin \(densefp8at44PFLOPS withr=10r=10\)\. We note in passing that Rubin’s dense\-fp8throughput is actually slightly lower than B300’s—NVIDIA has reallocated silicon area fromfp8to NVFP4/fp6on Rubin, so the dense\-fp8Ozaki II ceiling is essentially flat between B300 and Rubin\. What changes generationally is the*memory roof*, which jumps from88TB/s to2222TB/s, lifting the achieved throughput across the entire bandwidth\-bound region\. NVIDIA’s conservative published Emulated\-DGEMM figure for Rubin \(∼200\\sim 200TFLOPS\) corresponds to a parameter set with more moduli or with margin reserved for componentwise\-error guarantees\[[33](https://arxiv.org/html/2606.06510#bib.bib33),[32](https://arxiv.org/html/2606.06510#bib.bib32)\]\.

Two observations follow directly from the figure and are the central quantitative message of this paper\.

#### Memory\-bound regime: parity with the memory roof\.

In the bandwidth\-bound region \(left of the ridge point\) the Ozaki II/fp8curve overlaps the memory roof exactly\. This is precisely what scientific computing requires:*whatever bandwidth the architecture provides, the emulated path can consume*\. On B300 this lifts memory\-bound kernels from the collapsed1\.31\.3TFLOPS native floor up to the8​I8ITFLOPS memory roof—a gain that ranges from1\.2×1\.2\\timesat SpMV\-grade intensities \(I≈0\.2I\\\!\\approx\\\!0\.2\) to9\.2×9\.2\\timesat batched\-GEMV intensities \(I≈1\.5I\\\!\\approx\\\!1\.5\)\. On Rubin, the same emulated curve follows a22​I22ITFLOPS memory roof,2\.75×2\.75\\timeshigher in absolute throughput than B300 across the entire memory\-bound region\. This bandwidth\-driven gain is the dominant Rubin advantage and is felt by*every*HPC kernel withI≲Pfp8/\(r⋅Bmem\)≈18I\\lesssim P\_\{\\textsc\{fp8\}\}/\(r\\cdot B\_\{\\text\{mem\}\}\)\\approx 18FLOPS/B\.

#### Compute\-bound regime: surpassing B200 by an order of magnitude\.

In the compute\-bound region \(right of the ridge point\) the Ozaki II/fp8curve saturates at∼500\\sim 500TFLOPS on B300 and∼400\\sim 400TFLOPS on Rubin\. For reference, B200’s nativefp64ceiling sits at4040TFLOPS—so the emulated path on B300 exceeds B200 by≈12×\\approx 12\\timesand on Rubin by≈10×\\approx 10\\times, in the regime where nativefp64on B200 was previously the gold standard\. In other words, the TME roofline shows that emulation does not merely*compensate*for thefp64collapse; it carries the entire operational\-intensity spectrum substantially above the best nativefp64performance ever offered on the prior generation\. This is the quantitative argument for treating Ozaki II as the new baseline rather than as a fallback\.

10−110^\{\-1\}10010^\{0\}10110^\{1\}10210^\{2\}10010^\{0\}10110^\{1\}10210^\{2\}10310^\{3\}SpMVStencilbGEMVGEMMOperational intensityOI\\mathrm\{OI\}\(FLOPs / Byte\)Achieved throughput \(TFLOPS\)Rubin mem\. roof \(22 TB/s\)B300 mem\. roof \(8 TB/s\)B200 native FP64 \(40\)Rubin native FP64 vec\. \(33\)B300 native FP64 \(1\.3\)B300 Ozaki II/FP8 \(r=10r\{=\}10\)Rubin Ozaki II/FP8 \(r=10r\{=\}10\)Figure 1:TME roofline projection for B300 and Rubin\. Each platform has its own memory roof \(black for B300, dotted violet for Rubin\) and its own nativefp64ceiling \(red for B300, dashed violet for Rubin vector; B200 in dashed orange for reference\)\. The Ozaki II/fp8curves \(blue for B300, teal for Rubin\) trace the memory roof in the bandwidth\-bound regime and saturate at the emulation compute ceiling in the compute\-bound regime\. The Ozaki II/FP8 ceiling is≈500\\approx 500TFLOPS on B300 and≈400\\approx 400TFLOPS on Rubin \(each =Pfp8/rP\_\{\\textsc\{fp8\}\}/rwithr=10r=10\); NVIDIA’s conservative published “Emulated DGEMM” figure for Rubin is∼200\\sim 200TFLOPS\[[28](https://arxiv.org/html/2606.06510#bib.bib28),[16](https://arxiv.org/html/2606.06510#bib.bib16)\]\. Across the entire intensity spectrum, the emulated curves dominate the B200 nativefp64reference line\.

## 5Implementation Strategies for Memory\-Bound Kernels

The crossover analysis of §[4](https://arxiv.org/html/2606.06510#S4)establishes that the*kernel design discipline*—and specifically the value ofβ\\beta—is what determines whether Ozaki II is profitable on memory\-bound workloads\. We now describe how to driveβ→1\\beta\\to 1for the three canonical bandwidth\-limited primitives\.

### 5\.1Register\-Level Fusion: the Common Pattern

The unifying technique is to perform the decomposition \(\([4](https://arxiv.org/html/2606.06510#S2.E4)\) and the residue reductionmodmi\\bmod\\,m\_\{i\}\)*after*the working\-precision operand has arrived in registers, and to perform the reconstruction \(\([7](https://arxiv.org/html/2606.06510#S2.E7)\)\) on the accumulator before the result is stored\. This keeps the residue planes off the HBM bus\. In CUDA terms:

1. 1\.Each thread block reads a tile ofAA\(and possiblyBB\) infp64from HBM into shared memory\.
2. 2\.Within the tile, each warp computes therrresidue planes in registers, using a small precomputed table of moduli stored inconstantmemory\.
3. 3\.The warp issuesrrlow\-precisionwmma::mma\_sync\(or equivalenttcgen05\) instructions, each operating on theiith residue plane\.
4. 4\.After the tile loop, the warp reconstructs the integer accumulator via Garner’s algorithm \([7](https://arxiv.org/html/2606.06510#S2.E7)\), rescales byD−1​E−1D^\{\-1\}E^\{\-1\}, and stores onefp64result\.

Because the residue planes never reach HBM, the bandwidth multiplierβ\\betaequals 1 \(modulo a negligible constant for the moduli table and the precomputed Garner inverses\)\. The compute multiplierα\\alphaequalsrr, the number of moduli\. The reconstruction latencyγ\\gammaisO​\(r2\)O\(r^\{2\}\)per output element, but since each output element is the result of an inner\-product reduction of lengthkk, the per\-FMA overhead isO​\(r2/k\)O\(r^\{2\}/k\)and vanishes fork≫r2∼100k\\gg r^\{2\}\\sim 100\.

### 5\.2Strategy 1: Batched GEMV

Standard GEMV \(y=A​xy=Ax\) cannot use tensor cores at all because the operands have a one\-dimensional axis\. Batched GEMV,Y=A​XY=AXwithX∈𝔽n×BX\\in\\mathbb\{F\}^\{n\\times B\}for batch sizeBB, exposes a smallBBthat maps onto the 16\- or 32\-wide tensor\-corenn\-dimension\. Algorithm[1](https://arxiv.org/html/2606.06510#alg1)sketches the fused Ozaki\-II GEMV kernel\.

Algorithm 1Fused Ozaki\-II batched GEMV \(pseudocode\)\.1:Input:matrix

A∈𝔽M×NA\\in\\mathbb\{F\}^\{M\\times N\}\(fp64\), batch

X∈𝔽N×BX\\in\\mathbb\{F\}^\{N\\times B\}\(fp64\), moduli table

\{mi\}i=1r\\\{m\_\{i\}\\\}\_\{i=1\}^\{r\}, scaling diagonals

D,ED,E\.

2:Output:

Y=A​X∈𝔽M×BY=AX\\in\\mathbb\{F\}^\{M\\times B\}\.

3:Allocate registers

\{afrag\(i\)\}i=1r\\\{a^\{\(i\)\}\_\{\\text\{frag\}\}\\\}\_\{i=1\}^\{r\},

\{xfrag\(i\)\}i=1r\\\{x^\{\(i\)\}\_\{\\text\{frag\}\}\\\}\_\{i=1\}^\{r\},

\{Cacc\(i\)\}i=1r\\\{C^\{\(i\)\}\_\{\\text\{acc\}\}\\\}\_\{i=1\}^\{r\}\.

4:Zero all accumulators

Cacc\(i\)C^\{\(i\)\}\_\{\\text\{acc\}\}\.

5:fortile

k=0,…,N/Tk−1k=0,\\dots,N/T\_\{k\}\-1do

6:Cooperatively load

AtileA\_\{\\text\{tile\}\}and

XtileX\_\{\\text\{tile\}\}infp64into shared memory\.

7:for

i=1,…,ri=1,\\dots,rdo

8:

afrag\(i\)←residue\(⌊DAtile⌉,mi\)a^\{\(i\)\}\_\{\\text\{frag\}\}\\leftarrow\\mathrm\{residue\}\\bigl\(\\lfloor D\\,A\_\{\\text\{tile\}\}\\rceil,\\;m\_\{i\}\\bigr\)⊳\\trianglerightin registers

9:

xfrag\(i\)←residue\(⌊XtileE⌉,mi\)x^\{\(i\)\}\_\{\\text\{frag\}\}\\leftarrow\\mathrm\{residue\}\\bigl\(\\lfloor X\_\{\\text\{tile\}\}\\,E\\rceil,\\;m\_\{i\}\\bigr\)
10:for

i=1,…,ri=1,\\dots,rdo

11:

Cacc\(i\)←mma\_sync​\(afrag\(i\),xfrag\(i\),Cacc\(i\)\)C^\{\(i\)\}\_\{\\text\{acc\}\}\\leftarrow\\texttt\{mma\\\_sync\}\(a^\{\(i\)\}\_\{\\text\{frag\}\},x^\{\(i\)\}\_\{\\text\{frag\}\},C^\{\(i\)\}\_\{\\text\{acc\}\}\)⊳\\trianglerightINT8 or FP8 tensor core

12:

C~←GarnerReconstruct​\(C\(1\),…,C\(r\);\{mi\}\)\\tilde\{C\}\\leftarrow\\text\{GarnerReconstruct\}\(C^\{\(1\)\},\\dots,C^\{\(r\)\};\\\{m\_\{i\}\\\}\)
13:

Y←D−1​C~​E−1Y\\leftarrow D^\{\-1\}\\tilde\{C\}E^\{\-1\}⊳\\trianglerightrescale tofp64

14:Store

YY\.

#### Key analysis\.

The matrixAAis read once per output row \(cached behaviour assumed\), and the input batchXXis read⌈M/Tm⌉\\lceil M/T\_\{m\}\\rceiltimes\. For typicalB≈8B\\approx 8and largeM,NM,N, the operational intensity is≈B/2=4\\approx B/2=4FLOPS/Byte; withβ=1\\beta=1the emulated kernel hits the memory roof at3232TFLOPS on the B300, versus1\.31\.3TFLOPS nativefp64\. This≈25×\\approx 25\\timesspeedup is realised only when the register pressure ofrrresidue planes is tolerable; in practice we expectBBto be limited to about44–88before register spilling forcesβ\>1\\beta\>1\.

### 5\.3Strategy 2: Stencil via im2col

A 7\-point stencil computes, at every grid point\(i,j,k\)\(i,j,k\)of a 3\-D arrayuu,

vi​j​k=c0​ui​j​k\+c1​\(ui±1,j,k\+ui,j±1,k\+ui,j,k±1\),v\_\{ijk\}\\;=\\;c\_\{0\}u\_\{ijk\}\+c\_\{1\}\\\!\\left\(u\_\{i\\pm 1,j,k\}\+u\_\{i,j\\pm 1,k\}\+u\_\{i,j,k\\pm 1\}\\right\),\(13\)or, more generally, a fixed\-coefficient linear combination of a small neighbourhood\. At fp64 the operational intensity isOI≈0\.5\\mathrm\{OI\}\\\!\\approx\\\!0\.5FLOPS/Byte \(7 multiply\-adds per output across∼28\\sim\\\!28bytes of HBM traffic when neighbour reads are reused via the L1/shared\-memory cache\)\.

#### Mapping to GEMM via im2col\.

The standard tensor\-core mapping\[[15](https://arxiv.org/html/2606.06510#bib.bib15)\]flattens the stencil into a GEMM by treating the coefficient vectorc∈ℝ7c\\\!\\in\\\!\\mathbb\{R\}^\{7\}as a row matrix and the neighbourhood values as columns of a7×Ntile7\\\!\\times\\\!N\_\{\\text\{tile\}\}matrixUim2colU\_\{\\text\{im2col\}\}for a tile ofNtileN\_\{\\text\{tile\}\}output points:

vtile=c⋅Uim2col,Uim2col∈ℝ7×Ntile\.v\_\{\\text\{tile\}\}\\;=\\;c\\cdot U\_\{\\text\{im2col\}\}\\,,\\qquad U\_\{\\text\{im2col\}\}\\\!\\in\\\!\\mathbb\{R\}^\{7\\times N\_\{\\text\{tile\}\}\}\.\(14\)This produces a tall\-and\-skinny matrix product whose inner dimension is the stencil width\. The coefficient vectorccis constant across the entire kernel and can therefore be pre\-decomposed once into itsrrresidue planes\{c\(i\)=residue\(⌊sc⌉,mi\)\}i=1r\\\{c^\{\(i\)\}=\\mathrm\{residue\}\(\\lfloor sc\\rceil,\\,m\_\{i\}\)\\\}\_\{i=1\}^\{r\}and held in constant memory; the grid values must be residue\-decomposed on each tile\.

#### Fused Ozaki\-II 7\-point stencil\.

Algorithm[2](https://arxiv.org/html/2606.06510#alg2)gives the fused kernel\. The key structural property is that ther×7r\\\!\\times\\\!7pre\-decomposed coefficients fit in constant memory once for the entire kernel run; on each tile the kernel readsNtile\+2N\_\{\\text\{tile\}\}\\\!\+\\\!2planes from HBM \(for the three\-axis halo\), residue\-decomposes the values intorrplanes in registers, invokesrrtensor\-core MMAs to compute the residue productsc\(i\)⋅Uim2col\(i\)c^\{\(i\)\}\\cdot U^\{\(i\)\}\_\{\\text\{im2col\}\}, and reconstructs one fp64 output per grid point via Garner\.

Algorithm 2Fused Ozaki\-II 7\-point stencil \(pseudocode\)\. The pre\-decomposed coefficient table\{c\(i\)\}\\\{c^\{\(i\)\}\\\}lives in constant memory; all residue planes are register\-resident, givingβ→1\\beta\\\!\\to\\\!1\.1:Input:grid

u∈𝔽Nx×Ny×Nzu\\\!\\in\\\!\\mathbb\{F\}^\{N\_\{x\}\\times N\_\{y\}\\times N\_\{z\}\}\(fp64\), coefficient vector

c∈𝔽7c\\\!\\in\\\!\\mathbb\{F\}^\{7\}\(fp64\), moduli table

\{mi\}i=1r\\\{m\_\{i\}\\\}\_\{i=1\}^\{r\}, scaling factor

ss\.

2:Output:stencil image

vv\.

3:Precompute \(once\):

c\(i\)←residue\(⌊sc⌉,mi\)c^\{\(i\)\}\\\!\\leftarrow\\\!\\mathrm\{residue\}\(\\lfloor sc\\rceil,\\,m\_\{i\}\)for

i=1,…,ri\\\!=\\\!1,\\dots,r; store

\{c\(i\)\}\\\{c^\{\(i\)\}\\\}in constant memory\.

4:fortile

ttpartitioning the griddo

5:Cooperatively load

uhalou\_\{\\text\{halo\}\}for tile

tt\(including

±1\\pm 1neighbourhood\) into shared memory\.

6:foreach output point

ppin the tile \(one warp per warp\-tile\)do

7:Assemble neighbourhood vector

Up=\[up,up±x^,up±y^,up±z^\]∈𝔽7U\_\{p\}\\\!=\\\!\[u\_\{p\},u\_\{p\\pm\\hat\{x\}\},u\_\{p\\pm\\hat\{y\}\},u\_\{p\\pm\\hat\{z\}\}\]\\\!\\in\\\!\\mathbb\{F\}^\{7\}from shared memory\.

8:for

i=1,…,ri=1,\\dots,rdo

9:

Up\(i\)←residue\(⌊sUp⌉,mi\)U\_\{p\}^\{\(i\)\}\\\!\\leftarrow\\\!\\mathrm\{residue\}\(\\lfloor sU\_\{p\}\\rceil,\\,m\_\{i\}\)⊳\\trianglerightin registers

10:

Cp\(i\)←mma\_sync​\(c\(i\),Up\(i\),0\)C\_\{p\}^\{\(i\)\}\\\!\\leftarrow\\\!\\texttt\{mma\\\_sync\}\(c^\{\(i\)\},U\_\{p\}^\{\(i\)\},0\)⊳\\triangleright1×7×Ntile1\\\!\\times\\\!7\\\!\\times\\\!N\_\{\\text\{tile\}\}tensor\-core MMA

11:

vp←GarnerReconstruct​\(Cp\(1\),…,Cp\(r\)\)/s2v\_\{p\}\\\!\\leftarrow\\\!\\mathrm\{GarnerReconstruct\}\(C\_\{p\}^\{\(1\)\},\\dots,C\_\{p\}^\{\(r\)\}\)\\,/\\,s^\{2\}
12:Store

vv\.

#### Key analysis\.

Each grid point is read once \(cached, with the±1\\pm\\,1neighbours hit in shared memory\) and written once\. The dominant HBM traffic is therefore≈16\\approx\\\!16B per output \(one fp64 read \+ one fp64 write\)\. Compute per output isrrtensor\-core MMAs over the 7\-element inner product—a total ofr⋅7r\\\!\\cdot\\\!7fp8 multiply\-adds per output, which on B300’s55PFLOPSfp8tensor ceiling is∼10−12\\sim\\\!10^\{\-12\}s per output and therefore far below the22ns/output that the memory roof affords\. Provided the register pressure ofrrresidue planes is tolerable and the coefficient constant\-memory traffic does not saturate the constant cache, theβ=1\\beta\\\!=\\\!1discipline is realised and the kernel runs at the memory roof of≈4\\approx\\\!4TFLOPS on B300, against≈0\.65\\approx\\\!0\.65TFLOPS under native fp64\. This is the3\.1×3\.1\\timesspeedup reported in Table[3](https://arxiv.org/html/2606.06510#S6.T3)\.

#### Sparsity exploitation \(future work\)\.

The im2col matrix is structurally sparse: at most 7 non\-zeros per row for a 7\-point stencil\. This invites the use of NVIDIA’s 2:4 sparse tensor cores; SPTCStencil and SparStencil exploit this directly forfp32/fp16\[[7](https://arxiv.org/html/2606.06510#bib.bib7),[38](https://arxiv.org/html/2606.06510#bib.bib38)\]\. Combining the structured\-sparsity tensor cores with the Ozaki\-II residue decomposition is a natural next step but is outside the scope of this paper; we leave it open for follow\-up \(§[7\.4](https://arxiv.org/html/2606.06510#S7.SS4)\)\.

### 5\.4Strategy 3: SpMV via Blocked\-Ellpack

SpMV computesy=A​xy\\\!=\\\!Axfor sparseA∈𝔽M×NA\\\!\\in\\\!\\mathbb\{F\}^\{M\\times N\}and densex∈𝔽Nx\\\!\\in\\\!\\mathbb\{F\}^\{N\}\. It is the most challenging of the three strategies because the underlying access pattern is irregular and the operational intensity is the lowest \(OI∼0\.2\\mathrm\{OI\}\\\!\\sim\\\!0\.2FLOPS/Byte for typical PDE\-discretisation sparsity\)\.

#### Mapping to tensor cores via Blocked\-Ellpack\.

TheBlocked\-ELLformat groups non\-zeros into row\-major blocks of a uniform block\-column width𝑏𝑤\\mathit\{bw\}\(typically 16 or 32, chosen to match the tensor\-corekk\-dimension\)\. Sparse rows that have fewer than𝑏𝑤\\mathit\{bw\}non\-zeros are padded with structural zeros; row permutations may be applied to balance block density\. Each block of𝑏𝑤\\mathit\{bw\}columns becomes a small dense GEMMytile=Ablock​xgathery\_\{\\text\{tile\}\}\\\!=\\\!A\_\{\\text\{block\}\}x\_\{\\text\{gather\}\}, wherexgather∈𝔽𝑏𝑤x\_\{\\text\{gather\}\}\\\!\\in\\\!\\mathbb\{F\}^\{\\mathit\{bw\}\}is a gather of the relevant entries ofxx\. Padding wastes compute, but the wasted compute isfp8/int8compute, which is roughly four orders of magnitude cheaper than wastedfp64compute on B300\. Even a 90%\-wasteful padding scheme remains profitable providedβ\\betastays close to 1\.

#### Fused Ozaki\-II Blocked\-ELL SpMV\.

Algorithm[3](https://arxiv.org/html/2606.06510#alg3)gives the fused kernel\. Each block row issues a small GEMV that maps onto the tensor\-core𝑏𝑤×1\\mathit\{bw\}\\\!\\times\\\!1MMA shape, withrrresidue planes computed and reconstructed in registers\. The dense input vectorxxis shared across all block rows and remains in shared memory; the sparse pattern dictates the gather indices\.

Algorithm 3Fused Ozaki\-II Blocked\-Ellpack SpMV \(pseudocode\)\. Each block row issues one𝑏𝑤×1\\mathit\{bw\}\\\!\\times\\\!1tensor\-core MMA per residue plane;β→1\\beta\\\!\\to\\\!1provided shared\-memory tiling keeps the gatheredxgatherx\_\{\\text\{gather\}\}register\-resident\.1:Input:sparse matrix

AAin Blocked\-ELL layout with block\-column width

𝑏𝑤\\mathit\{bw\}\(data array

AvalA\_\{\\text\{val\}\}, column\-index array

AcolA\_\{\\text\{col\}\}\), dense vector

x∈𝔽Nx\\\!\\in\\\!\\mathbb\{F\}^\{N\}, moduli table

\{mi\}i=1r\\\{m\_\{i\}\\\}\_\{i=1\}^\{r\}, scaling factors

D,ED,E\.

2:Output:

y=A​xy\\\!=\\\!Ax\.

3:Allocate registers

\{Crow\(i\)\}i=1r\\\{C\_\{\\text\{row\}\}^\{\(i\)\}\\\}\_\{i=1\}^\{r\}, zero each\.

4:forblock row

bbof

AAdo

5:Load

Aval​\[b,⋅\]∈𝔽𝑏𝑤A\_\{\\text\{val\}\}\[b,\\cdot\]\\\!\\in\\\!\\mathbb\{F\}^\{\\mathit\{bw\}\}and indices

Acol​\[b,⋅\]A\_\{\\text\{col\}\}\[b,\\cdot\]for this block row\.

6:Gather

xgather∈𝔽𝑏𝑤x\_\{\\text\{gather\}\}\\\!\\in\\\!\\mathbb\{F\}^\{\\mathit\{bw\}\}using

Acol​\[b,⋅\]A\_\{\\text\{col\}\}\[b,\\cdot\]\(shared\-memory access if locality permits\)\.

7:for

i=1,…,ri=1,\\dots,rdo

8:

Afrag\(i\)←residue\(⌊DAval\[b,⋅\]⌉,mi\)A^\{\(i\)\}\_\{\\text\{frag\}\}\\\!\\leftarrow\\\!\\mathrm\{residue\}\\bigl\(\\lfloor D\\,A\_\{\\text\{val\}\}\[b,\\cdot\]\\rceil,\\,m\_\{i\}\\bigr\)
9:

xfrag\(i\)←residue\(⌊xgatherE⌉,mi\)x^\{\(i\)\}\_\{\\text\{frag\}\}\\\!\\leftarrow\\\!\\mathrm\{residue\}\\bigl\(\\lfloor x\_\{\\text\{gather\}\}\\,E\\rceil,\\,m\_\{i\}\\bigr\)
10:

Crow\(i\)←mma\_sync​\(Afrag\(i\),xfrag\(i\),Crow\(i\)\)C\_\{\\text\{row\}\}^\{\(i\)\}\\\!\\leftarrow\\\!\\texttt\{mma\\\_sync\}\(A^\{\(i\)\}\_\{\\text\{frag\}\},x^\{\(i\)\}\_\{\\text\{frag\}\},C\_\{\\text\{row\}\}^\{\(i\)\}\)
11:

y~b←GarnerReconstruct​\(Crow\(1\),…,Crow\(r\)\)\\tilde\{y\}\_\{b\}\\\!\\leftarrow\\\!\\mathrm\{GarnerReconstruct\}\(C\_\{\\text\{row\}\}^\{\(1\)\},\\dots,C\_\{\\text\{row\}\}^\{\(r\)\}\)
12:

yb←D−1​y~b​E−1y\_\{b\}\\\!\\leftarrow\\\!D^\{\-1\}\\tilde\{y\}\_\{b\}E^\{\-1\}
13:Store

yy\.

#### Key analysis\.

Bandwidth is dominated by the streaming read ofAvalA\_\{\\text\{val\}\}plus the gather ofxx; with a moderate shared\-memory tile ofxxthe gather hits cache most of the time, so the effective intensity matches the structural value of∼0\.2\\sim\\\!0\.2FLOPS/Byte\. Atβ=1\\beta\\\!=\\\!1the emulated kernel saturates HBM at≈1\.6\\approx 1\.6TFLOPS on B300 and≈4\.4\\approx 4\.4TFLOPS on Rubin, against≈1\.3\\approx 1\.3TFLOPS native B300 and≈4\.4\\approx 4\.4TFLOPS native Rubin \(the latter is at parity because Rubin’s higher HBM ridge already accommodates the SpMV intensity\)\. The B300 speedup is modest \(∼1\.2×\\sim 1\.2\\times\) precisely because the kernel is so deeply bandwidth\-bound that the native pipe is already a tolerable match; the point of including it is not the speedup magnitude but the*equality*: emulation gives up nothing\.

#### Caveat: padding inflation\.

The Blocked\-ELL mapping inflatesQQby the inverse of the row density\. This is the closest we come in this paper to a regime whereβ\>1\\beta\\\!\>\\\!1is fundamentally unavoidable:β\\betais bounded below by the padding ratio\. For matrices with strongly heterogeneous row lengths, hybrid CSR\-ELL or HYB formats are required to keepβ\\betaacceptable; otherwise the model predicts a regression rather than parity\. See Appendix[D](https://arxiv.org/html/2606.06510#A4)for the precise mapping of padding ratio toβ\\beta\.

## 6Performance Projections

We instantiate the TME model on the architectures of Table[2](https://arxiv.org/html/2606.06510#S3.T2)and compute the projected speedup of Ozaki II/fp8over nativefp64for the four canonical workloads\. We use the standard abbreviation*OI*for operational intensity, measured in FLOPs per byte of HBM traffic; values are taken from the Williams Roofline literature\[[39](https://arxiv.org/html/2606.06510#bib.bib39)\]and the workload\-specific analyses in §[5](https://arxiv.org/html/2606.06510#S5)\. All projections assumeβ=1\\beta=1\(fused decomposition\),r=10r=10moduli for thefp8variant\[[37](https://arxiv.org/html/2606.06510#bib.bib37),[20](https://arxiv.org/html/2606.06510#bib.bib20)\]—an optimistic value relative to Mukunoki*et al\.*’s reportedr≥11r\\\!\\geq\\\!11for FP8, so the projections should be read with the≈9\\approx\\\!9–17%17\\%downward sensitivity discussed in §[2\.4](https://arxiv.org/html/2606.06510#S2.SS4)—andγ\\gammaamortised to zero for inner\-product lengthsk≳100k\\\!\\gtrsim\\\!100\.

Table 3:Projected speedups of Ozaki II/fp8over nativefp64\. All projections use the TME model of §[4](https://arxiv.org/html/2606.06510#S4), the architectural data of Table[2](https://arxiv.org/html/2606.06510#S3.T2), and assume register\-level fusion \(β=1\\beta=1\)\. Memory\-bound speedups are upper\-bounded by the memory roofI⋅BmemI\\cdot B\_\{\\text\{mem\}\}divided by the nativefp64compute roof; compute\-bound speedups are upper\-bounded byPfp8/\(r⋅Pfp64\)P\_\{\\textsc\{fp8\}\}/\(r\\cdot P\_\{\\textsc\{fp64\}\}\)\. For Rubin we report two columns: one against the nativefp64vector roof \(3333TFLOPS\), the other against NVIDIA’s conservative published Emulated\-DGEMM figure of∼200\\sim 200TFLOPS\.“∼1\.0×\\sim 1\.0\\times” indicates parity: emulation neither hurts nor helps because the kernel is already at the memory roof under native execution\. On Rubin, native vector FP64 has been partially restored \(3333vs\. B300’s1\.31\.3TFLOPS\), so memory\-bound kernels withI<22/33≈0\.67I<22/33\\approx 0\.67FLOPs/B remain at parity even before considering the Emulated\-DGEMM path\.

#### What the projections show\.

Table[3](https://arxiv.org/html/2606.06510#S6.T3)carries two simultaneous messages\. On B300, emulation delivers substantial speedups across the entire operational\-intensity spectrum—from∼1\.2×\\sim 1\.2\\timeson bandwidth\-saturating SpMV to∼380×\\sim 380\\timeson dense GEMM—and the unified Ozaki II library can serve all of them\. On Rubin, NVIDIA has already incorporated emulated DGEMM as the official path, exposing it as∼200\\sim 200TFLOPS in published specifications; the TME model is consistent with that figure when a more conservative parameter set \(largerrr, or additional margin for componentwise\-error guarantees\) is chosen, while the maximal dense\-fp8/r=10r\{=\}10ceiling sits at∼400\\sim 400TFLOPS \(§[3](https://arxiv.org/html/2606.06510#S3)\)\. On H100 and B200 the projections show*parity*: emulation neither hurts nor helps because nativefp64is healthy\. This parity property is what makes a unified library implementation feasible—the same Ozaki II code path is safe to ship on all four architectures, with the runtime selecting emulation only where it pays\.

#### The positive reading of the memory\-bound parity rows\.

Several rows in Table[3](https://arxiv.org/html/2606.06510#S6.T3)show1\.0×1\.0\\timeson H100, B200, and the Rubin\-vs\-native column\. This is exactly the behaviour scientific computing wants: in the memory\-bound regime, emulation is*indistinguishable in performance from nativefp64on afp64\-healthy chip*\. Combined with the compute\-bound speedups \(e\.g\.∼10×\\sim 10\\timeson B200,∼380×\\sim 380\\timeson B300,∼12×\\sim 12\\timeson Rubin\), the message is unambiguous: switching the entire numerical stack to Ozaki II costs essentially nothing on memory\-bound workloads and gains order\-of\-magnitude throughput on the rest\.

#### Where emulation does not currently help\.

In two scenarios, emulation still gives no benefit\. First, on afp64\-healthy GPU \(H100, B200, Rubin\-vector\-only\), memory\-bound kernels are already at the memory roof, so there is no headroom\. Second, kernels whose dominant cost is host–device data movement, or whose effective operational intensity is bounded by ELL\-padding inflation pushingβ\\betaabove unity, will see no improvement\. Bringing those into the emulation envelope is a separate problem that we revisit in §[7\.4](https://arxiv.org/html/2606.06510#S7.SS4)\.

### 6\.1Generational Performance: H100 as the Baseline

Table[3](https://arxiv.org/html/2606.06510#S6.T3)reports speedups within each GPU—i\.e\., how much Ozaki II/fp8accelerates each workload relative to that GPU’s own nativefp64performance\. This is the right view for evaluating emulation*on a single chip*, but it is not the right view for evaluating whetherfp64\-emulated execution*regresses*or*progresses*relative to the prior\-generation HPC baseline\. The appropriate baseline for that question is the H100, the last data\-centre NVIDIA GPU whose architecture was balanced for HPC rather than for AI inference\. Table[4](https://arxiv.org/html/2606.06510#S6.T4)therefore reports absolute achievablefp64\-equivalent throughput for the same five workloads, with H100 nativefp64explicitly set as the unit\.

Table 4:Achievablefp64\-equivalent throughput per workload, in TFLOPS, and relative to H100 nativefp64\(last column block, in parentheses\)\. Native throughput uses thefp64tensor path for dense GEMM and thefp64vector path for the memory\-bound primitives\. Ozaki II throughput usesfp8tensor cores withr=10r=10moduli andβ=1\\beta=1; for Rubin, the dense\-GEMM Ozaki entry is the model\-derived upper bound \(Pfp8/r=400P\_\{\\textsc\{fp8\}\}/r=400TFLOPS\), to be compared with NVIDIA’s conservative published Emulated\-DGEMM figure of∼200\\sim 200TFLOPS\.∗For Rubin dense GEMM, NVIDIA’s published Emulated DGEMM specification of∼200\\sim 200TFLOPS corresponds to2\.99×2\.99\\timeson the H100\-relative scale; the figure shown \(400400TFLOPS,5\.97×5\.97\\times\) is the TME\-model upper bound atr=10r=10,β=1\\beta=1\.

#### The key generational claim\.

Three patterns in Table[4](https://arxiv.org/html/2606.06510#S6.T4)support a single thesis:*Ozaki II does not regress performance against the H100 baseline; on the contrary, it restores or improves the prior\-generation scaling on every workload\.*

1. 1\.Native B300 regresses catastrophicallyfor every compute\-sensitive workload:0\.02×0\.02\\timesH100 on dense GEMM,0\.10×0\.10\\timeson batched GEMV \(B=8B\{=\}8\), and0\.26×0\.26\\timeson batched GEMV \(B=2B\{=\}2\)\. Only SpMV, which is so memory\-bound that even the collapsedfp64pipe is fast enough to consume B300’s88TB/s atI=0\.2I\\\!=\\\!0\.2, ends up faster than the H100\. This is the regression that NVIDIA’s silicon\-area redirection imposes on every non\-emulated HPC code running on B300\.
2. 2\.B300 with Ozaki II returns to B200’s scaling: every memory\-bound row of the Ozaki rows on B300 matches B200’s row exactly \(2\.39×2\.39\\timesH100, the HBM\-bandwidth ratio8/3\.358/3\.35\)\. Dense GEMM under emulation reaches7\.46×7\.46\\timesH100, exceeding B200’s native6\.72×6\.72\\times\. In other words, the silicon area redirected fromfp64units tofp8units is fully recovered through Ozaki II—there is no net regression versus B200\.
3. 3\.Rubin with Ozaki II scales as the HBM4 jump alone would predict: the6\.57×6\.57\\timesH100 figure for every memory\-bound row is exactly the bandwidth ratio22/3\.3522/3\.35\. The emulation path simply passes the bandwidth advantage through to the application without bottlenecking it on the compressedfp64pipe\.

Put differently: Ozaki II is not just a compensation mechanism for NVIDIA’sfp64regression; it is the mechanism that*converts*the silicon\-area savings into bandwidth\-scaling and into AI\-grade tensor throughput, both of which the application then sees as fasterfp64\. The H100 baseline view shown in Table[4](https://arxiv.org/html/2606.06510#S6.T4)is, in the author’s opinion, the appropriate framing for procurement, application\-porting, and benchmarking decisions: the question is not “does emulation match B300 nativefp64?”—of course it does, by orders of magnitude—but “does the post\-fp64stack continue the H100→\\toB200→\\toB300→\\toRubin generational improvement that HPC procurement has historically relied on?” Table[4](https://arxiv.org/html/2606.06510#S6.T4)answers*yes*\.

The table covers the four primitives—dense GEMM, batched GEMV, stencils, SpMV—for which Ozaki II is the appropriate emulation path\. The corresponding analysis for the fifth canonical primitive, three\-dimensional FFT, appears in a companion paper\[[19](https://arxiv.org/html/2606.06510#bib.bib19)\], which establishes that B300 admits a memory\-roof\-near path at fullfp64via a Kulisch fixed\-point reconstruction routed onto its surviving INT32 SIMT pipe\. Together with this paper’s results, the integrated kernel coverage is the subject of §[7\.1](https://arxiv.org/html/2606.06510#S7.SS1)\.

#### A note on FP8:FP64 ratios\.

The ratios across the four GPUs are wildly different \(30:1 on H100, 113:1 on B200, 3800:1 on B300, 120:1 native or 20:1 emulated on Rubin\)\. A simple compute\-balance analysis\[[39](https://arxiv.org/html/2606.06510#bib.bib39)\]—equatingPfp8/rP\_\{\\textsc\{fp8\}\}/rtoPfp64/\(1−fG\)P\_\{\\textsc\{fp64\}\}/\(1\\\!\-\\\!f\_\{G\}\)for a workload with GEMM\-fractionfGf\_\{G\}—shows that B300’s 3800:1 ratio is balanced forfG≥0\.997f\_\{G\}\\\!\\geq\\\!0\.997, i\.e\. essentially pure\-AI workloads, while classical HPC mixes \(fG∈\[0\.6,0\.9\]f\_\{G\}\\\!\\in\\\!\[0\.6,0\.9\]\) call for ratios of∼10\\sim\\\!10:1 to∼190\\sim\\\!190:1\. This is the headline tension that the thesis of §[7\.1](https://arxiv.org/html/2606.06510#S7.SS1)resolves: the post\-fp64stack collapses the\(1−fG\)\(1\\\!\-\\\!f\_\{G\}\)residue onto pipes that*are*appropriately provisioned \(Ozaki II for GEMM\-like work, Kulisch INT32 for FFT\-class reductions, FP32\+Kahan for BLAS\-1\), making the native FP8:FP64 ratio operationally irrelevant for HPC\. Of the current generation Rubin’s effective2020:1 \(against the Emulated\-DGEMM figure\) is the closest to a balanced HPC design point; B300’s38003800:1 is, in isolation, the furthest—but in combination with the software stack of this paper, it ceases to matter\.

### 6\.2INT8 versus FP8 as the Emulation Substrate

A related question is which low\-precision tensor format should underlie Ozaki II\. The original Ozaki II formulation\[[31](https://arxiv.org/html/2606.06510#bib.bib31),[36](https://arxiv.org/html/2606.06510#bib.bib36)\]usedint8because integer modular arithmetic is the most natural fit for the Chinese Remainder Theorem decomposition\. However, NVIDIA’s deprioritisation ofint8on Blackwell and Rubin \(Table[2](https://arxiv.org/html/2606.06510#S3.T2)\) has forced a shift tofp8\. Table[5](https://arxiv.org/html/2606.06510#S6.T5)compares the emulation ceilings under both substrates\.

Table 5:Emulated densefp64\-equivalent throughput atr=10r=10,β=1\\beta=1, under the two candidate Ozaki II substrates\. “Emulation gain” is the ratio of the better substrate to the worse, indicating how strongly the choice of substrate matters on each GPU\.The pattern is clean\. On H100 the two substrates yield identical emulation ceilings \(≈198\\approx\\\!198TFLOPS each\) because the tensor\-core rates are equal\. H100 is therefore the only architecture on which the*original*INT8\-based Ozaki II formulation remains competitive with the FP8 reformulation; the Uchino–Ozaki–Imamura FP8 adaptation\[[37](https://arxiv.org/html/2606.06510#bib.bib37)\]is principally motivated by the collapse ofint8throughput on Blackwell forward\. On B300 the shift to FP8 buys a30×30\\timesimprovement; on Rubin,16×16\\times\. This justifies the additional implementation complexity of the FP8 variant \(quantisation\-trick handling of the FP8 mantissa, NaN/Inf behaviour, narrower dynamic range\) on every NVIDIA architecture except H100\.

For applications that can run on H100, the simpler INT8 path remains useful and is recommended where the implementation cost of the FP8 variant cannot be amortised\. This observation is consistent with prior INT8\-only Ozaki implementations\[[21](https://arxiv.org/html/2606.06510#bib.bib21),[36](https://arxiv.org/html/2606.06510#bib.bib36)\]that demonstrate strong performance on H100\-class hardware\.

## 7Discussion

### 7\.1What Remains Compute\-Bound on B300?

Tables[3](https://arxiv.org/html/2606.06510#S6.T3)and[4](https://arxiv.org/html/2606.06510#S6.T4)establish that the four primitive kernels analysed in this paper—dense GEMM, batched GEMV, structured stencils, and SpMV—are either restored to their memory\-roof performance on B300 through Ozaki II, or were already memory\-bound on a healthy native pipe \(in which case emulation costs nothing\)\. The companion analysis on spectral workloads\[[19](https://arxiv.org/html/2606.06510#bib.bib19)\]closes a fifth case, the three\-dimensional FFT, via the Kulisch fixed\-point Phase B route\[[13](https://arxiv.org/html/2606.06510#bib.bib13),[14](https://arxiv.org/html/2606.06510#bib.bib14)\]that runs the per\-output forward\-CRT reduction on B300’s surviving INT32 SIMT pipe; B300 meets the corresponding sub\-floorηint32≥8\.25​Bmem\\eta\_\{\\textsc\{int32\}\}\\geq 8\.25\\,B\_\{\\text\{mem\}\}with∼14%\\sim\\\!14\\%margin, and comfortably exceeds the accompanyingfp8tensor\-core floorηfp8≥170​Bmem\\eta\_\{\\textsc\{fp8\}\}\\geq 170\\,B\_\{\\text\{mem\}\}by∼3\.7×\\sim\\\!3\.7\\times\(full four\-floor codesign derivation in\[[19](https://arxiv.org/html/2606.06510#bib.bib19)\]; summary in §[6\.1](https://arxiv.org/html/2606.06510#S6.SS1)\)\. This leaves the natural question:*are there any other kernels that would still make a typical scientific simulation compute\-bound on B300?*We enumerate the candidates by category\.

#### \(a\) Reductions and dot products\.

BLAS\-1 inner\-product kernels \(ddot,dnrm2, CG residuals, Krylov orthogonalisation\) haveOI∼1/4\\mathrm\{OI\}\\\!\\sim\\\!1/4and typically constitute<5%<\\\!5\\%of CG/GMRES/BiCGStab time\-to\-solution; the dominant cost is the SpMV plus preconditioner, both addressed by Ozaki II\. When present, the dot products themselves can be executed infp32with Kahan compensation\[[11](https://arxiv.org/html/2606.06510#bib.bib11)\]at no penalty for the surrounding iterative solver, because the orthogonality residual is re\-projected at each iteration\. B300’sfp32vector pipe at∼60\\sim\\\!60–8080TFLOPS is well above the BLAS\-1 memory\-roof requirement of0\.25⋅8=20\.25\\cdot 8=2TFLOPS\. Not binding\.

#### \(b\) FFT\.

Handled in detail by the companion paper\[[19](https://arxiv.org/html/2606.06510#bib.bib19)\]and summarised above: B300 fails the native FP64 floor but satisfies the Kulisch INT32 sub\-floor with∼14%\\sim\\\!14\\%margin, projecting to∼18\\sim\\\!18ms wall time for102431024^\{3\}FP64 FFT against a12\.912\.9ms memory roof at full5252\-bit mantissa precision\. The headline caveat is that the Kulisch kernel does not yet exist in any production library and the projection has not been measured; building it is a principal item in §[7\.4](https://arxiv.org/html/2606.06510#S7.SS4)\.

#### \(c\) Triangular solve and LU panel factorisation\.

The inner kernels of dense direct solvers \(dtrsm,dgetrfpanel\) are sequential along one dimension, with operational intensity∼2\\sim\\\!2–55for typical panel widths\. On B300 the panel falls into the compute\-bound regime in nativefp64, but the surroundingdgemmupdate isO​\(n3\)O\(n^\{3\}\)and dominates by a factor proportional to the matrix size; Ozaki II handles the outer update at full speed, leaving the panel as a sub\-percent residual\. This is the*HPL\-pattern*that NVIDIA’s cuBLAS Ozaki integration\[[26](https://arxiv.org/html/2606.06510#bib.bib26)\]and the 2\.3×\\timesB200 HPL speedup\[[25](https://arxiv.org/html/2606.06510#bib.bib25)\]already demonstrate empirically\. Not binding at production scale\.

#### \(d\) Sparse direct solvers\.

MUMPS\-class\[[1](https://arxiv.org/html/2606.06510#bib.bib1)\]multifrontal factorisations decompose into many small dense frontal matrices plus large dense GEMM\-like extend\-add updates\. The GEMM portion is recovered by Ozaki II; the frontal\-matrix portion has the same panel\-residual structure as \(c\) and the same conclusion\. Provided the maximum frontal matrix is≥64×64\\geq\\\!64\\\!\\times\\\!64, B300 is not binding\. Very ill\-conditioned, deeply nested supernodal structures may remain problematic and require empirical investigation\.

#### \(e\) Lattice QCD Dirac inversion\.

The Dirac operator\[[10](https://arxiv.org/html/2606.06510#bib.bib10)\]is a structured stencil with4×44\{\\times\}4or3×33\{\\times\}3complex\-matrix\-valued links,OI≈1\\mathrm\{OI\}\\\!\\approx\\\!1, GEMM\-decomposable\. Production LQCD codes already use mixed precision \(FP32 inner, FP64 polish\) with iterative refinement\[[4](https://arxiv.org/html/2606.06510#bib.bib4)\]; Ozaki II FP32 mode addresses thefp32inner solve at memory\-roof speed, and thefp64polish is BLAS\-2 and falls under \(a\)\.

#### \(f\) Small\-inner\-kernel codes \(latency\-bound\)\.

A small but real class of codes—accelerator beam dynamics over10610^\{6\}turns, geodynamo simulations, long\-time\-integration storm\-resolving climate—show drift that mixed precision cannot fully tame and whose dominant inner kernel may be neither FFT nor GEMM\. These have historically been read as a B300 counter\-example, but on closer inspection the issue is algorithmic and GPU\-agnostic: when the per\-time\-step inner kernel is a few hundred sites, the wall\-clock cost is dominated by kernel\-launch latency on*any*accelerator \(including H100 native\), not by FP64 throughput\. Ozaki II is not appropriate here because the kernel is too small for the crossover to bite, but neither is native FP64 silicon: the binding constraint is per\-step latency, which a future generation would address through persistent kernels or graph\-style execution, not through additional FP64 silicon\. This residue\[[34](https://arxiv.org/html/2606.06510#bib.bib34)\]is an open problem for the field, but it is not evidence that B300 is the wrong target for HPC; the same problem exists on H100, B200, and Rubin\.

#### \(g\) Stochastic methods \(FCI\-QMC, AFQMC\)\.

Population dynamics with extreme dynamic range that resist low\-precision reformulation\[[3](https://arxiv.org/html/2606.06510#bib.bib3)\]\. Active research area; appropriate mitigation is algorithmic rather than architectural\.

#### Verdict\.

Of the seven categories, \(a\), \(b\), \(c\), \(d\), and \(e\) all admit a bandwidth\-bound execution path on B300 at fullfp64accuracy via the combination of Ozaki II \(this paper\), Ozaki–Bailey \+ Kulisch Phase B FFT\[[19](https://arxiv.org/html/2606.06510#bib.bib19)\], and FP32\+Kahan compensation; \(f\) is latency\-bound and GPU\-agnostic; \(g\) is an algorithmic\-research question\.*Every kernel class surveyed in this paper therefore admits a bandwidth\-bound execution path on B300 at fullfp64accuracy*; native FP64 silicon is, on this evidence, not the holy grail it has historically been taken to be—abundantfp8tensor throughput, combined with the right algorithmic scaffolding, is sufficient for the HPC workload spectrum, modulo the engineering effort required to build the Ozaki II and Kulisch kernels\. That effort, as argued in §[7\.4](https://arxiv.org/html/2606.06510#S7.SS4), is tractable on the timescale of months rather than years because modern AI coding assistants are well\-suited to the pattern\-translation work it consists of\. The architectural recommendation that emerges is the four\-floor codesign rule of\[[19](https://arxiv.org/html/2606.06510#bib.bib19)\]: any post\-Rubin GPU intended to serve spectral scientific workloads should provide eitherηfp64\-vec≥1\.56​Bmem\\eta\_\{\\textsc\{fp64\}\\text\{\-vec\}\}\\geq 1\.56\\,B\_\{\\text\{mem\}\}\(the safe native target, met by Rubin within4%4\\%\), or, as an engineered fallback,ηint32\-vec≥8\.25​Bmem\\eta\_\{\\textsc\{int32\}\\text\{\-vec\}\}\\geq 8\.25\\,B\_\{\\text\{mem\}\}\(the Kulisch sub\-floor\)*together with*ηfp8≥170​Bmem\\eta\_\{\\textsc\{fp8\}\}\\geq 170\\,B\_\{\\text\{mem\}\}\(the FP8 tensor\-core floor that Phase A needs\)\. The last is comfortably met by every modern datacentre GPU—a direct consequence of NVIDIA’s deliberate scale\-up of FP8 silicon for AI workloads—and so is not the binding constraint for any current architecture\. Cutting both the FP64 and INT32 pipes simultaneously would close both escape routes; further cuts to FP8 would invalidate the FP8\-is\-enough thesis altogether\. This is the design boundary that the FugakuNEXT and post\-Rubin generations should respect\.

### 7\.2The AMD Counter\-Argument

AMD has taken a different position\. Speaking to HPCwire in March 2026\[[41](https://arxiv.org/html/2606.06510#bib.bib41)\], AMD Fellow Nick Malaya argued that “there is currently no substitution for rawfp64performance” and announced that the forthcoming Instinct MI430X will significantly increase nativefp64throughput rather than rely on emulation\. The view should be taken seriously, but the kernel\-coverage audit of §[7\.1](https://arxiv.org/html/2606.06510#S7.SS1)sharpens the comparison\. When AMD’s position was articulated, the principal counter\-argument to NVIDIA’s emulation\-only path was that real scientific codes contain non\-GEMM hot spots—FFT, sorting, scan, atomics—where Ozaki\-style emulation did not directly apply\. Of these, FFT has since been closed\[[19](https://arxiv.org/html/2606.06510#bib.bib19)\]; what remains is the integer\-dominated kernels \(sort, scan, atomics\), which do not need FP64 silicon at all, and the small\-inner\-kernel latency\-bound codes of §[7\.1](https://arxiv.org/html/2606.06510#S7.SS1)\(f\), which are GPU\-agnostic\. The residual case for AMD’s native\-FP64 strategy is therefore narrower than it was in early 2026: it rests on procurement convenience \(no Ozaki and Kulisch kernels to build\) and on long\-tail codes that the post\-FP64 stack has not yet been engineered for\. Both arguments lose force as AI\-assisted coding \(§[7\.4](https://arxiv.org/html/2606.06510#S7.SS4)\) compresses the implementation timeline\. A portfolio approach—NVIDIA in Ozaki\+Kulisch emulation, AMD in nativefp64, applications mixing where appropriate—remains the practical near\-term picture, but the medium\-term direction tips toward emulation\.

### 7\.3Implications for AI4S and National Strategy

The fact that the U\.S\. Department of Energy’s Genesis Mission has explicitly identified Ozaki emulation as the path tofp64on AI\-centric accelerators\[[42](https://arxiv.org/html/2606.06510#bib.bib42)\]indicates that the technique has moved from research curiosity to strategic infrastructure\. In the author’s view this is a healthy development; it lets science continue to exploit the cost and energy efficiency of AI\-class hardware without abandoning the precision discipline that distinguishes simulation from inference\. It also places a burden on the numerical\-software community to verify that the error bounds hold on*real*application inputs, not just on the well\-conditioned synthetic matrices typical of benchmark suites\. The community should also recognise that emulation fundamentally trades silicon area \(the missingfp64units\) for software complexity \(the Ozaki path through cuBLAS\); a sustained engineering investment is required to keep that trade\-off favourable\. The encouraging development on this front is that the software complexity is highly amenable to modern AI coding assistants \(§[7\.4](https://arxiv.org/html/2606.06510#S7.SS4)\): the Ozaki II and Kulisch kernel discipline is a known algorithmic pattern that needs to be translated onto specific tile shapes and register budgets for each new kernel—exactly the class of work that Claude Code, OpenAI’s Codex, and similar tools handle well\. This makes the silicon\-vs\-software trade\-off increasingly favourable on the timescale of FugakuNEXT and the Rubin\-generation US procurements, strengthening the case that AI4S infrastructure can rely on the post\-fp64stack rather than insisting on native FP64 silicon\.

### 7\.4Future Work

The analytical projections of this paper, taken together with the companion FFT analysis\[[19](https://arxiv.org/html/2606.06510#bib.bib19)\]and the residual\-kernel audit of §[7\.1](https://arxiv.org/html/2606.06510#S7.SS1), leave several substantial directions open\.

#### Sort, scan, and graph kernels\.

The non\-GEMM, non\-FFT hot spots of real scientific codes—sort/scan primitives \(whose arithmetic is integer\-dominated and where Ozaki\-II offers no obvious benefit\), atomic\-heavy graph traversals, and stochastic methods with extreme dynamic range \(§[7\.1](https://arxiv.org/html/2606.06510#S7.SS1)\(g\)\)—are open questions for the TME framework\. A systematic application of the model to these kernels, with the associated\(α,β,γ\)\(\\alpha,\\beta,\\gamma\)parameter extraction, is the natural next step\.

#### Real\-code benchmarking and production maturity of FP8 Ozaki II\.

The projections in Table[3](https://arxiv.org/html/2606.06510#S6.T3)are model\-based; the next step is empirical validation on production hardware running real applications\. Thefp8variant of Ozaki II\[[37](https://arxiv.org/html/2606.06510#bib.bib37)\]is only weeks old at the time of writing; componentwise error bounds have been established\[[32](https://arxiv.org/html/2606.06510#bib.bib32)\], but production maturity \(behaviour on ill\-conditioned inputs, NaN/Inf propagation infp8E4M3\) is still being assessed by the community\. As of mid\-2026 the empirical step requires three pieces of infrastructure that are not all yet in place: \(i\) a B300 or Rubin system in production deployment with the cuBLAS Ozaki integration\[[26](https://arxiv.org/html/2606.06510#bib.bib26)\]enabled, \(ii\) a set of application kernels expressed in terms of thefp64\-accurate GEMM/GEMV/stencil primitives that Ozaki II accelerates, and \(iii\) a measurement methodology that separates the emulation cost from the surrounding kernel\-fusion and data\-movement effects\. The forthcoming Doudna \(NERSC\) and Blue Lion \(LRZ\) Rubin\-based systems\[[22](https://arxiv.org/html/2606.06510#bib.bib22),[27](https://arxiv.org/html/2606.06510#bib.bib27)\]are the obvious early targets\.

#### Register\-fusion implementations realised with AI\-assisted coding\.

The kernel\-engineering work required to driveβ→1\\beta\\to 1for each new primitive is repetitive and error\-prone, but algorithmically shallow: it is mostly the translation of a known algorithmic pattern \(register\-levelrr\-fold decomposition, tensor\-core MMA, Garner reconstruction\) onto the specific tile shape, register budget, and shared\-memory layout of the target kernel\. This is exactly the class of work for which modern AI coding assistants—Anthropic’s Claude Code, OpenAI’s Codex, and similar tools as of mid\-2026—are well\-suited\. In the author’s view, the path from the projections of this paper to a production\-ready Ozaki II library covering SpMV, stencils, FFT fragments, and the long tail of scientific\-computing primitives is realistically a matter of*months of engineering work performed in collaboration with AI coding assistants*, rather than years of manual kernel writing\. This represents an unusual moment in which two ostensibly unrelated AI developments—the architectural pivot of GPUs toward low\-precision tensor cores, and the maturation of AI coding assistants—combine to make the emulation strategy practically realisable on the timescale of the FugakuNEXT, Doudna, and Blue Lion deployments\.

#### Other open problems\.

Three further refinements of the TME framework remain open\. First, the model treatsβ\\betaas an integer\-valued discipline question \(fused or unfused\), but real kernels live on a continuum: register pressure forces partial spilling, multi\-tile algorithms are forced to revisit residues in shared memory, and so on\. A naive Ozaki II implementation withr=10r\\\!=\\\!10residue planes per warp can spill, raisingβ\\betaabove 1 and erasing the projected speedup; a more refined model would parameteriseβ​\(r,Tk,Tm,Tn,regs\)\\beta\(r,T\_\{k\},T\_\{m\},T\_\{n\},\\text\{regs\}\)explicitly and predict the optimal tile shape jointly with the optimal moduli count\. We expect such a model to drop out of careful instrumentation of the cuBLAS reference implementation\[[26](https://arxiv.org/html/2606.06510#bib.bib26)\]and the open\-sourceGEMMul8library\[[35](https://arxiv.org/html/2606.06510#bib.bib35)\]\. Second, the marriage of Ozaki II with structured 2:4 sparse tensor cores: the natural construction \(apply 2:4 sparsity at the residue\-plane level\) may invalidate the modular reduction in \([6](https://arxiv.org/html/2606.06510#S2.E6)\) because the sparsity mask is data\-dependent; a clean solution likely requires reformulating Ozaki II as a structured\-sparse\-friendly decomposition, and is to our knowledge open\. Third, the dramatic speedups projected in this paper occur*only*onfp64\-starved architectures \(B300, Rubin\); on H100 and B200 the architecture is already balanced for memory\-bound kernels\. The narrowing of the emulation benefit as the silicon mix evolves is itself a question the TME framework should be able to answer quantitatively for future hardware generations\.

## 8Conclusion

The post\-fp64era is here\. On B300 the nativefp64pipe has fallen to the point that nearly every standard scientific kernel becomes compute\-bound on physical silicon, while the memory bandwidth that those kernels were designed around remains available—but uncollectable\. On Rubin, NVIDIA has gone a step further and listed “Emulated DGEMM” as a first\-class column in the official specifications\[[28](https://arxiv.org/html/2606.06510#bib.bib28)\], signalling that emulation is now the architectural default forfp64matrix performance\. The Chinese Remainder Theorem\-based Ozaki Scheme II\[[31](https://arxiv.org/html/2606.06510#bib.bib31)\], together with itsfp8adaptation\[[37](https://arxiv.org/html/2606.06510#bib.bib37),[20](https://arxiv.org/html/2606.06510#bib.bib20)\], offers a principled mechanism for converting that dormant bandwidth into useful,fp64\-accurate work\.

This paper has provided the analytical framework—the Tensor–Memory Equilibrium model—to reason about when Ozaki II is profitable for the*memory\-bound*workloads \(stencils, SpMV, batched GEMV\) that dominate real scientific simulation, as distinct from the dense GEMMs that have so far been the focus of Ozaki II evaluation\. The TME roofline shows two simultaneous properties of the emulated path: in the bandwidth\-bound regime it matches the memory roof exactly, giving the application all the bandwidth the architecture can supply; in the compute\-bound regime it vaults far above the nativefp64ceiling of the previous generation, exceeding B200’s4040TFLOPS dense roof by≈12×\\approx 12\\timeson B300 and≈10×\\approx 10\\timeson Rubin\. Of equal importance, when the comparison baseline is shifted to H100 nativefp64\(Table[4](https://arxiv.org/html/2606.06510#S6.T4)\), every workload on every Ozaki II configuration matches or exceeds the H100 baseline—in stark contrast to native B300, which regresses by up to50×50\\timeson compute\-sensitive workloads\. This is precisely the behaviour scientific computing needs: no penalty where the application is memory\-bound, and substantial gain where it is compute\-bound, with no regression relative to the lastfp64\-balanced data\-centre NVIDIA architecture\.

Concrete projections on B300 are dramatic but bounded by the memory roof:∼1\.2×\\sim 1\.2\\timesfor SpMV,∼3×\\sim 3\\timesfor stencils,∼24×\\sim 24\\timesfor batched GEMV with reasonable batch size, and∼380×\\sim 380\\timesfor dense GEMM, all atfp64\-equivalent accuracy\. On Rubin, the published∼200\\sim 200TFLOPS Emulated\-DGEMM figure already exceeds B200’s native ceiling by5×5\\times; the TME model’s dense\-fp8/r=10r\{=\}10upper bound of∼400\\sim 400TFLOPS exceeds B200 by10×10\\times\. The bandwidth\-driven gain \(2\.75×2\.75\\timesover B300 across the memory\-bound regime,6\.6×6\.6\\timesover H100\) is the dominant Rubin generational improvement\.

#### The end\-to\-end claim\.

Taken together with the companion analysis of spectral workloads\[[19](https://arxiv.org/html/2606.06510#bib.bib19)\]and the residual\-kernel audit of §[7\.1](https://arxiv.org/html/2606.06510#S7.SS1), this paper supports a stronger end\-to\-end claim than its individual kernel projections suggest\. Of every kernel class that we have surveyed at production scale—dense GEMM, batched GEMV, stencils, SpMV, FFT, dense and sparse direct\-solver panels, lattice\-QCD inverters, BLAS\-1 reductions and CG inner products—*none genuinely fails to recover the memory roof on B300, provided the Kulisch Phase B kernel is built*\. Matrix and stencil work is recovered by Ozaki II in the main body of this paper; spectral work is recovered by Ozaki–Bailey FFT with Kulisch fixed\-point Phase B routed onto the INT32 SIMT pipe, at∼18\\sim\\\!18ms wall time for102431024^\{3\}FP64 FFT against the12\.912\.9ms memory roof and at full5252\-bit accuracy\[[19](https://arxiv.org/html/2606.06510#bib.bib19)\]; and BLAS\-1 reductions are absorbed by the healthy B300 FP32 pipe with Kahan compensation\. The companion FFT paper establishes the architectural fix as a*four\-floor codesign rule*: a GPU meets memory\-roof FFT parity at fullfp64if and only if at least one of the native FP64 floor \(ηfp64\-vec≥1\.56​Bmem\\eta\_\{\\textsc\{fp64\}\\text\{\-vec\}\}\\geq 1\.56\\,B\_\{\\text\{mem\}\}\), the informational naive\-Ozaki Phase B floor \(2\.06​Bmem2\.06\\,B\_\{\\text\{mem\}\}\), or the Kulisch INT32 sub\-floor \(ηint32\-vec≥8\.25​Bmem\\eta\_\{\\textsc\{int32\}\\text\{\-vec\}\}\\geq 8\.25\\,B\_\{\\text\{mem\}\}\) is met,*together with*the FP8 tensor\-core floor \(ηfp8≥170​Bmem\\eta\_\{\\textsc\{fp8\}\}\\geq 170\\,B\_\{\\text\{mem\}\}, i\.e\.∼1\.36\\sim\\\!1\.36PFLOPS at88TB/s\) that enables the Phase A inner\-product batch\. H100 and B200 satisfy all four; Rubin satisfies the native floor within4%4\\%and exceeds the FP8 floor by∼4×\\sim\\\!4\\times; B300 fails the native and naive\-Ozaki floors but satisfies the Kulisch INT32 sub\-floor with∼14%\\sim\\\!14\\%margin and the FP8 floor with∼3\.7×\\sim\\\!3\.7\\timesmargin\. Crucially, the FP8 floor is comfortably met by every current datacentre GPU—a direct consequence of NVIDIA’s scale\-up of FP8 silicon for AI workloads—and is the quiet asymmetry that makes the “fp8\-is\-enough” thesis viable in the first place\. The practical implication is that the post\-fp64stack—Ozaki II for matrix and stencil work, Ozaki–Bailey \+ Kulisch Phase B for spectral work, FP32\+Kahan for reductions, and the discipline of register\-level fusion to keepβ→1\\beta\\\!\\to\\\!1—is*end\-to\-end sufficient*for the broad majority of scientific simulation codes\. The stronger statement that the analysis here supports is precisely the one announced in the title:*fp8\(with the right algorithmic scaffolding\) is all you need*for the canonical HPC kernel spectrum; nativefp64silicon is not the holy grail it has historically been taken to be, modulo the kernel\-engineering investment required to build the Ozaki II and Kulisch implementations\. That investment, as argued in §[7\.4](https://arxiv.org/html/2606.06510#S7.SS4), is now genuinely tractable on the timescale of months rather than years, because the work consists almost entirely of pattern\-translation—mapping a known algorithmic structure onto specific tile shapes, register budgets, and shared\-memory layouts—of the kind that modern AI coding assistants handle well\. B300’s native FP64 cliff, surprising as it may appear in isolation, is therefore not merely operationally tolerable but*strategically aligned*with the direction in which the HPC software stack is heading\. The safe architectural recommendation is, however, not to rely solely on the software escape route: post\-Rubin designs should satisfy the native floor and treat the Kulisch sub\-floor as a fallback, not as a primary target\.

The author’s view, after this analysis, is that Ozaki\-style emulation should be integrated systematically into the standard HPC libraries \(cuBLAS, cuSPARSE, cuFFT\-with\-care, hipBLAS\) and exposed to applications behind precision\-policy interfaces\. The architecture of the Ozaki II algorithm and its componentwise error guarantee make it well\-suited to this role; what remains is the engineering work of converting the projections of §[6](https://arxiv.org/html/2606.06510#S6)into production reality—work that, as argued in §[7\.4](https://arxiv.org/html/2606.06510#S7.SS4), is now plausibly achievable in months rather than years through the combination of AI\-assisted kernel engineering and the Rubin\-generation deployments scheduled for 2027 and beyond\.

#### Acknowledgements\.

The author thanks Katsuhisa Ozaki, Yuki Uchino, Toshiyuki Imamura, and Daichi Mukunoki for the body of work on which this analysis rests; any errors in interpretation are the author’s\. The author is particularly grateful to Katsuhisa Ozaki for detailed review comments on §[2\.2](https://arxiv.org/html/2606.06510#S2.SS2), which led to the substrate\-specific treatment of integer scaling and to the accumulator\-bound slice\-count analysis in Table[1](https://arxiv.org/html/2606.06510#S2.T1)\. The author also thanks the broader RIKEN R\-CCS team, the Institute of Science Tokyo faculty, and the NVIDIA cuBLAS team for technical discussions that shaped this paper\. This work was undertaken as part of the FugakuNEXT project and related R\-CCS initiatives on AI for Science\.

#### Disclosure of AI\-assisted writing\.

This manuscript was prepared with assistance from large language models, specifically Anthropic’s Claude \(Opus 4\.7\) and Google’s Gemini 3\. The models were used for draft generation, copy editing, literature\-summary cross\-checking, andLaTeXmechanics\. All scientific arguments, architectural interpretations, performance projections, and conclusions were directed, reviewed, and validated by the author, who takes full ownership of and responsibility for the content of this paper, including any errors of fact or judgment\.

## References

- \[1\]Patrick R\. Amestoy, Iain S\. Duff, Jean\-Yves L’Excellent, and Jacko Koster\.A fully asynchronous multifrontal solver using distributed dynamic scheduling\.SIAM Journal on Matrix Analysis and Applications, 23\(1\):15–41, 2001\.
- \[2\]Hartwig Anzt, Jack Dongarra, Goran Flegar, Nicholas J\. Higham, and Enrique S\. Quintana\-Ortí\.Adaptive precision in block\-Jacobi preconditioning for iterative sparse linear system solvers\.Concurrency and Computation: Practice and Experience, 31\(6\):e4460, 2019\.
- \[3\]George H\. Booth, Alex J\.W̃\. Thom, and Ali Alavi\.Fermion Monte Carlo without fixed nodes: A game of life, death, and annihilation in Slater determinant space\.Journal of Chemical Physics, 131:054106, 2009\.
- \[4\]Michael A\. Clark, Ronald Babich, Kipton Barros, Richard C\. Brower, and Claudio Rebbi\.Solving lattice QCD systems of equations using mixed precision solvers on GPUs\.Computer Physics Communications, 181\(9\):1517–1528, 2010\.
- \[5\]Nicolas Dickenmann\.Fifteen years of FP64 segmentation, and why the Blackwell ultra breaks the pattern\.Blog post, 2026\.
- \[6\]Harvey L\. Garner\.The residue number system\.IRE Transactions on Electronic Computers, EC\-8\(2\):140–147, 1959\.
- \[7\]Qiqi Gu, Chenpeng Wu, Heng Shi, and Jianguo Yao\.SPTCStencil: Unleashing sparse tensor cores for stencil computation via strided swap, 2025\.arXiv:2506\.22035\.
- \[8\]Azzam Haidar, Stanimire Tomov, Jack Dongarra, and Nicholas J\. Higham\.Harnessing GPU tensor cores for fast FP16 arithmetic to speed up mixed\-precision iterative refinement solvers\.InSC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 603–613, 2018\.
- \[9\]Aaron Jarmusch and Sunita Chandrasekaran\.Microbenchmarking NVIDIA’s Blackwell architecture: An in\-depth architectural analysis, 2025\.
- \[10\]Bálint Joo, Dhiraj D\. Kalamkar, Karthikeyan Vaidyanathan, Mikhail Smelyanskiy, Karthikeyan Pamnany, Victor W\. Lee, Pradeep Dubey, and William Watson\.Lattice QCD on Intel Xeon Phi coprocessors\.ISC High Performance, 2013\.
- \[11\]William Kahan\.Pracniques: Further remarks on reducing truncation errors\.Communications of the ACM, 8\(1\):40, 1965\.
- \[12\]Donald E\. Knuth\.The Art of Computer Programming, Volume 2: Seminumerical Algorithms\.Addison\-Wesley, 3rd edition, 1997\.
- \[13\]U\. W\. Kulisch\.Mathematical foundation of computer arithmetic\.IEEE Transactions on Computers, C\-26\(7\):610–621, 1977\.
- \[14\]U\. W\. Kulisch and W\. L\. Miranker\.The arithmetic of the digital computer: A new approach\.SIAM Review, 28\(1\):1–40, 1986\.
- \[15\]Xueying Liu et al\.Toward accelerated stencil computation by adapting tensor core unit on GPU\.InProceedings of the 36th ACM International Conference on Supercomputing, 2022\.
- \[16\]Glenn K\. Lockwood\.NVIDIA Rubin: Architecture notes and performance specifications\.Glenn’s Digital Garden, 2026\.[https://www\.glennklockwood\.com/garden/processors/R200](https://www.glennklockwood.com/garden/processors/R200), accessed May 2026\.
- \[17\]Tobias Mann\.Nvidia unpacks Vera Rubin rack system at CES\.The Register, January 2026\.Documents Rubin FP64 vector regression from4545to3333TFLOPS\.
- \[18\]Stefano Markidis, Steven Wei Der Chien, Erwin Laure, Ivy Bo Peng, and Jeffrey S\. Vetter\.NVIDIA tensor core programmability, performance & precision\.In2018 IEEE International Parallel and Distributed Processing Symposium Workshops \(IPDPSW\), pages 522–531, 2018\.
- \[19\]Satoshi Matsuoka\.FP8 is all you need \(part 2\): Efficient Ozaki–Bailey style FFT through tensor\-core Garner reformulation and Kulisch escape route\.Companion manuscript, 2026\.In preparation\.
- \[20\]Daichi Mukunoki\.DGEMM without FP64 arithmetic: Using FP64 emulation and FP8 tensor cores with Ozaki scheme, 2025\.
- \[21\]Daichi Mukunoki, Katsuhisa Ozaki, Takeshi Ogita, and Toshiyuki Imamura\.DGEMM using tensor cores, and its accurate and reproducible versions\.InHigh Performance Computing – ISC High Performance 2020, pages 230–248\. Springer, 2020\.
- \[22\]NERSC\.Doudna: NERSC’s next\-generation supercomputer based on NVIDIA Vera Rubin\.National Energy Research Scientific Computing Center, 2026\.
- \[23\]NVIDIA Corporation\.NVIDIA Blackwell Architecture Technical Brief, 2024\.
- \[24\]NVIDIA Corporation\.NVIDIA Blackwell Ultra GPU Datasheet, 2025\.Individual Blackwell Ultra GPU specifications, October 2025\.
- \[25\]NVIDIA Corporation\.NVIDIA HPC\-Benchmarks 25\.04: Ozaki\-I HPL on blackwell tensor cores\.NVIDIA Developer Documentation, 2025\.
- \[26\]NVIDIA Corporation\.Unlocking tensor core performance with floating\-point emulation in cuBLAS\.NVIDIA Developer Blog, 2025\.
- \[27\]NVIDIA Corporation\.Blue Lion supercomputer will run on NVIDIA Vera Rubin\.NVIDIA Blog, 2026\.
- \[28\]NVIDIA Corporation\.Inside the NVIDIA Vera Rubin platform: Six new chips, one AI supercomputer\.NVIDIA Developer Blog, 2026\.Lists “Emulated DGEMM” as an official column in Rubin specifications\.
- \[29\]Hiroyuki Ootomo, Katsuhisa Ozaki, and Rio Yokota\.DGEMM on integer matrix multiplication unit\.The International Journal of High Performance Computing Applications, 2024\.
- \[30\]Katsuhisa Ozaki, Takeshi Ogita, Shin’ichi Oishi, and Siegfried M\. Rump\.Error\-free transformations of matrix multiplication by using fast routines of matrix multiplication and its applications\.Numerical Algorithms, 59\(1\):95–118, 2012\.
- \[31\]Katsuhisa Ozaki, Yuki Uchino, and Toshiyuki Imamura\.Ozaki Scheme II: A GEMM\-oriented emulation of floating\-point matrix multiplication using an integer modular technique, 2025\.
- \[32\]Katsuhisa Ozaki, Yuki Uchino, and Toshiyuki Imamura\.Error analysis of matrix multiplication emulation using Ozaki\-II scheme, 2026\.Preprint\.
- \[33\]A\. Schwarz, A\. Anders, C\. Brower, H\. Bayraktar, J\. Gunnels, K\. Clark, R\. G\. Xu, S\. Rodriguez, S\. Cayrols, P\. Tabaszewski, et al\.Guaranteed DGEMM accuracy while using reduced precision tensor cores through extensions of the Ozaki scheme\.InProceedings of the Supercomputing Asia and International Conference on High Performance Computing in Asia Pacific Region\. ACM, 2025\.Also available as arXiv:2511\.13778\.
- \[34\]Rick Stevens et al\.The mixed\-precision path to FP64 on AI\-centric accelerators \(MFP64 whitepaper\)\.Technical report, U\.S\. Department of Energy, 2025\.
- \[35\]Yuki Uchino\.GEMMul8: GEMM emulation using INT8/FP8 matrix engines based on the Ozaki Scheme II\.GitHub repository, RIKEN\-RCCS, 2025\.
- \[36\]Yuki Uchino, Katsuhisa Ozaki, and Toshiyuki Imamura\.Performance enhancement of the Ozaki scheme on integer matrix multiplication unit\.The International Journal of High Performance Computing Applications, 2025\.
- \[37\]Yuki Uchino, Katsuhisa Ozaki, and Toshiyuki Imamura\.Double\-precision matrix multiplication emulation via Ozaki\-II scheme with FP8 quantization, 2026\.
- \[38\]Likun Wang et al\.SparStencil: Retargeting sparse tensor cores to scientific stencil computations via structured sparsity transformation\.InProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis \(SC ’25\), 2025\.arXiv:2506\.22969\.
- \[39\]Samuel Williams, Andrew Waterman, and David Patterson\.Roofline: An insightful visual performance model for multicore architectures\.Communications of the ACM, 52\(4\):65–76, 2009\.
- \[40\]Alex Woodie\.NVIDIA says it’s not abandoning 64\-bit computing\.HPCwire, December 2025\.
- \[41\]Alex Woodie\.AMD hints at big FP64 increases in MI430X GPU as Ozaki underwhelms\.HPCwire, March 2026\.
- \[42\]Alex Woodie\.Genesis mission will lean heavily on Ozaki scheme for FP64 capability\.HPCwire, February 2026\.

## Appendix AGarner’s Algorithm: Detailed Derivation

Equation \([7](https://arxiv.org/html/2606.06510#S2.E7)\) is the iterative formulation of Garner’s algorithm\. We expand it here for the caser=3r=3to make the dependence pattern explicit\.

LetC∈ℤC\\in\\mathbb\{Z\}with0≤C<m1​m2​m30\\leq C<m\_\{1\}m\_\{2\}m\_\{3\}, and letC\(i\)=CmodmiC^\{\(i\)\}=C\\bmod m\_\{i\}\. We seek mixed\-radix digitsv1,v2,v3v\_\{1\},v\_\{2\},v\_\{3\}such that

C=v1\+v2​m1\+v3​m1​m2\.C\\;=\\;v\_\{1\}\+v\_\{2\}m\_\{1\}\+v\_\{3\}m\_\{1\}m\_\{2\}\.\(15\)Reducing \([15](https://arxiv.org/html/2606.06510#A1.E15)\) modulom1m\_\{1\},m2m\_\{2\},m3m\_\{3\}in turn:

v1\\displaystyle v\_\{1\}≡C\(1\)\(modm1\),\\displaystyle\\equiv C^\{\(1\)\}\\pmod\{m\_\{1\}\},\(16\)v2\\displaystyle v\_\{2\}≡\(C\(2\)−v1\)​m1−1\(modm2\),\\displaystyle\\equiv\(C^\{\(2\)\}\-v\_\{1\}\)\\,m\_\{1\}^\{\-1\}\\pmod\{m\_\{2\}\},\(17\)v3\\displaystyle v\_\{3\}≡\(\(C\(3\)−v1\)​m1−1−v2\)​m2−1\(modm3\)\.\\displaystyle\\equiv\\bigl\(\(C^\{\(3\)\}\-v\_\{1\}\)\\,m\_\{1\}^\{\-1\}\-v\_\{2\}\\bigr\)\\,m\_\{2\}^\{\-1\}\\pmod\{m\_\{3\}\}\.\(18\)The inversesm1−1\(modm2\)m\_\{1\}^\{\-1\}\\\!\\pmod\{m\_\{2\}\}andm2−1\(modm3\)m\_\{2\}^\{\-1\}\\\!\\pmod\{m\_\{3\}\}are precomputed once and stored in constant memory\. The arithmetic in \([7](https://arxiv.org/html/2606.06510#S2.E7)\) is done in 32\-bit or 64\-bit signed integers; modular reductions are implemented either as branch\-and\-correct or as Barrett reduction\. The total cost isO​\(r2\)O\(r^\{2\}\)small\-integer multiplications per reconstructed output, which forr=10r=10amortises to∼100\\sim 100small multiplications—negligible against an inner\-product reduction of length≳100\\gtrsim 100\.

## Appendix BPseudocode for the Fused Ozaki\-II GEMV Kernel

The pseudocode in Listing[1](https://arxiv.org/html/2606.06510#LST1)makes the structural points of the fused kernel explicit:rr*distinct*fragment objects \(one per modulus\),rr*distinct*accumulators, and a sized constant\-memory moduli table\. The listing is illustrative C\+\+17/CUDA pseudocode; production code would use the newertcgen05/cuteinterfaces on Hopper/Blackwell rather than the legacywmmaAPI\. For brevity the listing is written withint8\_tfragment types as in the original Ozaki II/INT8 formulation; the FP8 variant of Uchino et al\.\[[37](https://arxiv.org/html/2606.06510#bib.bib37)\]that we recommend for B300 and Rubin has the same structural skeleton withint8\_treplaced by the appropriate FP8 fragment type andint32\_taccumulators replaced by FP32 accumulators with the quantisation correction described in\[[37](https://arxiv.org/html/2606.06510#bib.bib37),[20](https://arxiv.org/html/2606.06510#bib.bib20)\]\.

1\#include<mma\.h\>

2usingnamespacenvcuda;

3

4constexprintR=10;

5\_\_constant\_\_int32\_tc\_moduli\[R\];

6

7template<intBATCH\>

8\_\_global\_\_voidozaki\_gemv\_kernel\(constdouble\*\_\_restrict\_\_A,

9constdouble\*\_\_restrict\_\_X,

10double\*\_\_restrict\_\_Y,

11intM,intN\)\{

12

13wmma::fragment<wmma::matrix\_a,16,16,16,

14int8\_t,wmma::row\_major\>a\_frag\[R\];

15wmma::fragment<wmma::matrix\_b,16,16,16,

16int8\_t,wmma::col\_major\>x\_frag\[R\];

17wmma::fragment<wmma::accumulator,16,16,16,

18int32\_t\>c\_acc\[R\];

19

20\#pragmaunroll

21for\(inti=0;i<R;\+\+i\)wmma::fill\_fragment\(c\_acc\[i\],0\);

22

23for\(intk=0;k<N;k\+=16\)\{

24

25doublea\_val=;

26doublex\_val=;

27

28

29int8\_ta\_res\[R\],x\_res\[R\];

30\#pragmaunroll

31for\(inti=0;i<R;\+\+i\)\{

32a\_res\[i\]=compute\_residue\(a\_val,c\_moduli\[i\]\);

33x\_res\[i\]=compute\_residue\(x\_val,c\_moduli\[i\]\);

34\}

35

36

37

38\#pragmaunroll

39for\(inti=0;i<R;\+\+i\)\{

40pack\_into\_frag\(a\_frag\[i\],a\_res\[i\]\);

41pack\_into\_frag\(x\_frag\[i\],x\_res\[i\]\);

42wmma::mma\_sync\(c\_acc\[i\],a\_frag\[i\],x\_frag\[i\],c\_acc\[i\]\);

43\}

44\}

45

46

47int32\_tresult\_buf\[R\];

48\#pragmaunroll

49for\(inti=0;i<R;\+\+i\)

50wmma::store\_matrix\_sync\(&result\_buf\[i\],c\_acc\[i\],16,

51wmma::mem\_row\_major\);

52doubley=garner\_reconstruct\(result\_buf,c\_moduli\);

53Y\[\]=y;

54\}

Listing 1:Illustrative pseudocode for the fused Ozaki\-II batched GEMV kernel; not production\-ready\.
## Appendix CNumerical Conditioning and the Choice of Scaling

The integer scaling in \([4](https://arxiv.org/html/2606.06510#S2.E4)\) is the only place where the emulated kernel can lose accuracy beyond the working\-precision unit round\-off\. The standard practice\[[31](https://arxiv.org/html/2606.06510#bib.bib31),[37](https://arxiv.org/html/2606.06510#bib.bib37)\]is to choose the diagonal scaling matricesD,ED,Ewith power\-of\-two diagonal entries \(so that the rescalingD−1​C~​E−1D^\{\-1\}\\tilde\{C\}E^\{\-1\}is itself error\-free infp64\), with each diagonal entry set so that the largest absolute value in the corresponding row/column ofA~\\tilde\{A\}\(resp\.B~\\tilde\{B\}\) reaches2p−1−12^\{p\-1\}\-1for an integer widthpp\. This maximises the signal\-to\-noise ratio of the integer quantisation\.

For inputs with rows of strongly heterogeneous magnitude, this row\-wise scaling can introduce noticeable error if a single row contains both very large and very small entries\. Adaptive precision schemes such as ADP\[[33](https://arxiv.org/html/2606.06510#bib.bib33)\]adjust the number of modulirr\(or equivalently the*slice*count for Ozaki I\) on a per\-row basis to recover the lost bits\. The TME model can be extended to capture ADP\-style adaptation by allowingα\\alphato vary with the row index; the projections in Table[3](https://arxiv.org/html/2606.06510#S6.T3)use a single globalr=10r=10for simplicity\.

## Appendix DRegister\-Fusion Details for Stencil and SpMV Kernels

This appendix gives the per\-kernel breakdown of HBM traffic and register pressure that determines whetherβ=1\\beta\\\!=\\\!1is realised in the fused kernels of §[5\.3](https://arxiv.org/html/2606.06510#S5.SS3)and §[5\.4](https://arxiv.org/html/2606.06510#S5.SS4)\.

#### Stencilβ\\betaaccounting\.

For the 7\-point fused stencil kernel of Algorithm[2](https://arxiv.org/html/2606.06510#alg2), the per\-output HBM traffic breaks down as:88B for the centre\-point read,≈8\\approx\\\!8B for the amortised neighbour reads \(each neighbour is reused by∼6\\sim\\\!6neighbouring outputs in a shared\-memory tile of∼323\\sim\\\!32^\{3\}points\), and88B for the write\. Total≈24\\approx\\\!24B per output, against∼7×2=14\\sim\\\!7\\\!\\times\\\!2=14fp64\-equivalent ops per output, givingOI≈0\.58\\mathrm\{OI\}\\\!\\approx\\\!0\.58FLOPS/Byte; this is the value used in Table[3](https://arxiv.org/html/2606.06510#S6.T3)\.

The register footprint per warp isrrresidue accumulators \(one per plane\) plus the working tile, totalling≈4​r\\approx\\\!4rINT32 registers per output beyond the working set\. On B300 SMs \(255 registers per thread, 32 threads per warp\), this is comfortable atr≤12r\\\!\\leq\\\!12and the bandwidth multiplierβ\\betaremains within1%1\\%of unity\. At largerrrregister spilling begins; the boundary is implementation\- specific and is the principal engineering risk for the stencil kernel\.

#### SpMV padding andβ\\beta\.

For the Blocked\-ELL Ozaki\-II SpMV kernel of Algorithm[3](https://arxiv.org/html/2606.06510#alg3), the bandwidth multiplier inherits the padding ratio of the underlying Blocked\-ELL format\. Letρpad∈\[1,∞\)\\rho\_\{\\text\{pad\}\}\\in\[1,\\infty\)be the average ratio of “stored” to “actual” non\-zeros, where structural zeros count as stored\. The fused kernel hasβ=ρpad\\beta\\\!=\\\!\\rho\_\{\\text\{pad\}\}\. For a 3\-D 7\-point Laplacian discretisation on a regular grid this isρpad=7/7=1\.0\\rho\_\{\\text\{pad\}\}=7/7=1\.0\(perfect block fit at𝑏𝑤=8\\mathit\{bw\}\\\!=\\\!8\); for a typical finite\-element matrix with row densities in\[6,24\]\[6,24\]at𝑏𝑤=32\\mathit\{bw\}\\\!=\\\!32one obtainsρpad≈2\\rho\_\{\\text\{pad\}\}\\\!\\approx\\\!2, so the kernel’s effective intensity halves and the speedup of Table[3](https://arxiv.org/html/2606.06510#S6.T3)should be read against half the listed intensity column\.

For strongly heterogeneous row lengths \(σ​\(nnz/row\)/μ​\(nnz/row\)\>1\\sigma\(\\text\{nnz/row\}\)/\\mu\(\\text\{nnz/row\}\)\\\!\>\\\!1\), hybrid CSR\-ELL or HYB formats are required: long rows fall back to CSR \(no padding\) and short rows use Blocked\-ELL\. The TME model accommodates this by treatingβ\\betaas a per\-row property in the spirit of the ADP extension in Appendix[C](https://arxiv.org/html/2606.06510#A3)\.

Similar Articles

The eighth-generation TPU: An architecture deep dive

Hacker News Top

Google unveils eighth-generation TPU 8t and TPU 8i, purpose-built for massive pre-training and inference with SparseCore, native FP4, and 9,600-chip superpods to power world models and agentic AI.

Getting peak TOPS on a Ryzen AI 7 350 NPU

Lobsters Hottest

A technical deep-dive into achieving peak TOPS performance on the AMD Ryzen AI 7 350 NPU, comparing it to Xilinx AIE-ML v2 AI engines and explaining the hardware architecture for matrix multiplication workloads.

EMiX: Emulating Beyond Single-FPGA Limits

Hacker News Top

Introduces EMiX, a scalable multi-FPGA framework for emulating multi-core RISC-V architectures beyond single-FPGA resource limits, demonstrated with a 64-core system across eight FPGAs.