Programmable Probabilistic Computer with 1M p-bits

Hacker News Top Papers

Summary

This paper presents a programmable probabilistic computer with one million p-bits by networking FPGAs, achieving Gibbs sampling at over a trillion flips per second for Ising models while introducing a design rule for scaling beyond single-chip limits.

No content available
Original Article
View Cached Full Text

Cached at: 06/28/26, 07:59 PM

# Programmable Probabilistic Computer with 1,000,000 p-bits
Source: [https://arxiv.org/html/2606.25313](https://arxiv.org/html/2606.25313)
Present address: \]Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA

Navid Anjum Aadit[navidanj@stanford\.edu](https://arxiv.org/html/2606.25313v1/mailto:[email protected])\[Department of Electrical & Computer Engineering, University of California, Santa Barbara, Santa Barbara, CA 93106, USAXiuqi ZhangDepartment of Electrical & Computer Engineering, University of California, Santa Barbara, Santa Barbara, CA 93106, USAShuvro ChowdhuryDepartment of Electrical & Computer Engineering, University of California, Santa Barbara, Santa Barbara, CA 93106, USAKevin Callahan\-CorayDepartment of Electrical & Computer Engineering, University of California, Santa Barbara, Santa Barbara, CA 93106, USAKyle LeeDepartment of Electrical & Computer Engineering, University of California, Santa Barbara, Santa Barbara, CA 93106, USASaleh BunaiyanDepartment of Electrical & Computer Engineering, University of California, Santa Barbara, Santa Barbara, CA 93106, USAElectrical Engineering Department, King Fahd University of Petroleum & Minerals, Dhahran 31261, Saudi ArabiaSanjay SeshanDepartment of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USAJason TwiggSiemens Digital Industries Software, USAAndrew SeawrightSiemens Digital Industries Software, USAForrest BrewerDepartment of Electrical & Computer Engineering, University of California, Santa Barbara, Santa Barbara, CA 93106, USATathagata SrimaniDepartment of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USAKerem Y\. Çamsarı[Corresponding author: camsari@ucsb\.edu](https://arxiv.org/html/2606.25313v1/mailto:Corresponding%20author:%[email protected])Department of Electrical & Computer Engineering, University of California, Santa Barbara, Santa Barbara, CA 93106, USA

###### Abstract

Probabilistic computers built from p\-bits have been proposed as hardware accelerators for sampling and optimizing Ising models, but existing systems have been confined to a single chip, capped by its capacity and memory bandwidth\. Here we break this limit by networking FPGAs into a single Ising machine far larger than any one device could hold, realizing a programmable probabilistic computer with one million p\-bits\. The machine performs Gibbs sampling at over a trillion flips per second while keeping every coupling weight in local on\-chip memory\. During execution, devices exchange nothing but 1\-bit boundary states\. This architecture exposes a question fundamental to any distributed sampler: how frequently boundary information must be refreshed for a partitioned machine to behave as an unpartitioned one\. Using 3D Edwards–Anderson spin glasses, we show that the answer is set by a single timing ratio,η=fcomm/fp​\-​bit\\eta=f\_\{\\mathrm\{comm\}\}/f\_\{\\mathrm\{p\\text\{\-\}bit\}\}, of the boundary\-exchange frequency to the local p\-bit update frequency\. Above a topology\-dependent threshold, the distributed machine matches a monolithic GPU reference\. Below it, residual energy still decays as a power law but with a reduced exponent, turning parallelism into a quantifiable throughput–accuracy tradeoff\. A theoretical cluster mean\-field model reproduces the same behavior, showing that this tradeoff is a universal property of partitioned stochastic dynamics\. These results provide a programmable million\-p\-bit platform, demonstrated across spin glasses, Max\-Cut, and Boolean satisfiability, together with a quantitative design rule for scaling probabilistic computers beyond the single\-chip limit\.

## I\. Introduction

The most powerful computing systems today are built from many chips acting as one\. Graphics processing unit \(GPU\) clusters must be networked to train frontier artificial intelligence models\[[1](https://arxiv.org/html/2606.25313#bib.bib1),[2](https://arxiv.org/html/2606.25313#bib.bib2)\], quantum processors that have scaled from tens to thousands of qubits now confront chip\-boundary constraints\[[3](https://arxiv.org/html/2606.25313#bib.bib3),[4](https://arxiv.org/html/2606.25313#bib.bib4),[5](https://arxiv.org/html/2606.25313#bib.bib5)\], and Ising machine spin arrays fill the available resources long before reaching the problem sizes that matter for hard combinatorial optimization\[[6](https://arxiv.org/html/2606.25313#bib.bib6),[7](https://arxiv.org/html/2606.25313#bib.bib7)\]\. Even when a problem fits on one chip, a second limit applies\. Each variable update must read coupling weights from memory, and when those weights spill off\-chip the update rate is capped by memory bandwidth\[[8](https://arxiv.org/html/2606.25313#bib.bib8),[9](https://arxiv.org/html/2606.25313#bib.bib9),[10](https://arxiv.org/html/2606.25313#bib.bib10),[11](https://arxiv.org/html/2606.25313#bib.bib11)\]\. Distributing the computation across multiple devices addresses both limits at once, but introduces a third: variables near chip boundaries are updated using neighbor states that have grown stale since the last exchange\.

Ising machines\[[12](https://arxiv.org/html/2606.25313#bib.bib12),[13](https://arxiv.org/html/2606.25313#bib.bib13),[14](https://arxiv.org/html/2606.25313#bib.bib14),[15](https://arxiv.org/html/2606.25313#bib.bib15)\]are hardware platforms that find low\-energy states of spin Hamiltonians and, more generally, sample from their Boltzmann distributions, serving both combinatorial optimization and probabilistic inference\[[16](https://arxiv.org/html/2606.25313#bib.bib16)\]\. They face both the capacity ceiling and the memory wall\. A canonical benchmark is the three\-dimensional Edwards–Anderson \(EA\) spin glass\[[17](https://arxiv.org/html/2606.25313#bib.bib17)\], whose ground\-state search is NP\-hard\[[18](https://arxiv.org/html/2606.25313#bib.bib18)\]and whose rich physics\[[19](https://arxiv.org/html/2606.25313#bib.bib19),[20](https://arxiv.org/html/2606.25313#bib.bib20),[21](https://arxiv.org/html/2606.25313#bib.bib21),[22](https://arxiv.org/html/2606.25313#bib.bib22)\]has made it a proving ground for optimization platforms for decades\.

![Refer to caption](https://arxiv.org/html/2606.25313v1/Fig1.png)Fig\. 1:Distributed sparse Ising machines \(DSIMs\)\.\(a\)A sparse Ising graph at the million\-p\-bit scale \(fewer nodes rendered for clarity\)\.\(b\)A balanced min\-cut partition maps the graph onto multiple FPGAs, each updating its local subgraph from on\-chip weight memory\.\(c\)Hardware\-aware partitioning: a Potts cost function penalizes placing strongly connected partitions on distant devices, concentrating cut edges at short hop distances \(Supplementary Sec\.[S5](https://arxiv.org/html/2606.25313#S0.SS5)\)\.\(d\)Duplex boundary exchange: cut\-edge weights are duplicated as shadow weights on both sides of each cut, so only 1\-bit boundary states ever cross device boundaries\.\(e\)DSIM\-1: six FPGAs in a nearest\-neighbor chain over FMC links, each running an independent local clock\.\(f\)DSIM\-2: eighteen FPGAs on a Siemens Veloce proFPGA CS platform sharing a single master clock, with denser interconnect\.On this benchmark, quantum annealers have demonstrated coherent quantum\-critical scaling on 3D cubic lattices with up to∼2,700\{\\sim\}\\,2\{,\}700logical spins\[[3](https://arxiv.org/html/2606.25313#bib.bib3)\], and probabilistic computers running replica\-based Monte Carlo algorithms have recently matched and exceeded these scaling exponents on the same instances\[[23](https://arxiv.org/html/2606.25313#bib.bib23)\]\. For studying spin\-glass statistical mechanics rather than optimization, dedicated simulators such as the Janus field\-programmable gate array \(FPGA\) supercomputers\[[24](https://arxiv.org/html/2606.25313#bib.bib24),[25](https://arxiv.org/html/2606.25313#bib.bib25)\]and optimized GPU codes\[[26](https://arxiv.org/html/2606.25313#bib.bib26),[27](https://arxiv.org/html/2606.25313#bib.bib27)\]have reached millions of spins on regular lattices, while coherent Ising machines have reached10510^\{5\}spins\[[6](https://arxiv.org/html/2606.25313#bib.bib6)\], CMOS annealing chips tens of thousands\[[7](https://arxiv.org/html/2606.25313#bib.bib7)\], and simulated bifurcation machines face the same ceiling\[[28](https://arxiv.org/html/2606.25313#bib.bib28),[29](https://arxiv.org/html/2606.25313#bib.bib29)\]; comprehensive reviews appear in Refs\.\[[14](https://arxiv.org/html/2606.25313#bib.bib14),[15](https://arxiv.org/html/2606.25313#bib.bib15)\]\. Multi\-chip versions of these machines, including multi\-FPGA simulated bifurcation\[[30](https://arxiv.org/html/2606.25313#bib.bib30),[31](https://arxiv.org/html/2606.25313#bib.bib31)\]and multi\-chip analog designs\[[32](https://arxiv.org/html/2606.25313#bib.bib32)\], as well as networked dies that emulate a single large chip\[[9](https://arxiv.org/html/2606.25313#bib.bib9),[33](https://arxiv.org/html/2606.25313#bib.bib33)\], all advance in lockstep, exchanging boundary information synchronously so that it remains current by construction; communication bandwidth must therefore grow with the update rate \(Supplementary Sec\.[S2](https://arxiv.org/html/2606.25313#S0.SS2)\)\. What has never been measured is the cost of relaxing this synchrony: how stale boundary information can become before a distributed machine stops behaving like a monolithic one\.

Here we measure that cost and show that a single timing ratio, comparing boundary\-exchange frequency to local update speed, determines the performance of a distributed stochastic machine\. Our vehicle is the probabilistic computer: p\-bits, stochastic units that fluctuate between two states with tunable probability\[[34](https://arxiv.org/html/2606.25313#bib.bib34),[35](https://arxiv.org/html/2606.25313#bib.bib35),[36](https://arxiv.org/html/2606.25313#bib.bib36),[37](https://arxiv.org/html/2606.25313#bib.bib37),[38](https://arxiv.org/html/2606.25313#bib.bib38),[39](https://arxiv.org/html/2606.25313#bib.bib39),[40](https://arxiv.org/html/2606.25313#bib.bib40)\], updated in parallel through graph coloring so that capacity and throughput grow with every added device\[[41](https://arxiv.org/html/2606.25313#bib.bib41),[42](https://arxiv.org/html/2606.25313#bib.bib42)\]\. We build two complementary distributed sparse Ising machines \(DSIMs\), shown in Fig\.[1](https://arxiv.org/html/2606.25313#S1.F1)\. DSIM\-1 breaks lockstep by design: six FPGAs with fully independent local clocks let the freshness of boundary information be tuned at will\. DSIM\-2, an 18\-FPGA commercial prototyping platform with a shared master clock, demonstrates the rule at scale: one million p\-bits sampling at101210^\{12\}flips per second match a monolithic GPU reference, and overclocking beyond timing closure, which lowers the effective ratio, reproduces the predicted speed–accuracy tradeoff\. Above a topology\-dependent threshold of the timing ratio, the distributed machine is indistinguishable from an unpartitioned baseline; below it, residual energy still decays as a power law but with a reduced exponent, a direct tradeoff between speed and solution quality\. The same behavior emerges from cluster mean\-field theory \(CMFT\)\[[43](https://arxiv.org/html/2606.25313#bib.bib43),[44](https://arxiv.org/html/2606.25313#bib.bib44),[45](https://arxiv.org/html/2606.25313#bib.bib45),[46](https://arxiv.org/html/2606.25313#bib.bib46),[47](https://arxiv.org/html/2606.25313#bib.bib47)\], in which clusters run exact local Monte Carlo dynamics and exchange mean\-field boundary averages at a tunable frequency \(Supplementary Sec\.[S3](https://arxiv.org/html/2606.25313#S0.SS3)\): boundary staleness is a property of partitioned stochastic dynamics itself, not of any particular hardware\.

![Refer to caption](https://arxiv.org/html/2606.25313v1/Fig2.png)Fig\. 2:A single timing ratio controls optimization quality \(L3=373L^\{3\}=37^\{3\}, DSIM\-1\)\.\(a\)Neighboring partitions exchange 1\-bit boundary states atfcommf\_\{\\mathrm\{comm\}\}while local p\-bits update atfp​\-​bitf\_\{\\mathrm\{p\\text\{\-\}bit\}\}\.\(b\)Final residual energy per spinρEf\\rho\_\{E\}^\{f\}versusfp​\-​bitf\_\{\\mathrm\{p\\text\{\-\}bit\}\}forfcommf\_\{\\mathrm\{comm\}\}from 1 kHz to 100 MHz, at a fixed budget of10610^\{6\}sweeps per run; the dashed line is the monolithic GPU baseline\.\(c\)The same data replotted againstη=fcomm/fp​\-​bit\\eta=f\_\{\\mathrm\{comm\}\}/f\_\{\\mathrm\{p\\text\{\-\}bit\}\}collapse onto a single curve\. The vertical band nearη≈300\\eta\\approx 300marks the onset of saturation, consistent with Eq\. \([2](https://arxiv.org/html/2606.25313#S3.E2)\)\. Error bars: 95% bootstrap confidence intervals over 10 instances×\\times10 runs\.
## II\. Distributed sparse Ising machines

We start from a sparse undirected Ising graph with weights\{Ji​j\}\\\{J\_\{ij\}\\\}and biases\{hi\}\\\{h\_\{i\}\\\}, where each spin is a p\-bit with outputmi=sgn​\[tanh⁡\(Ii\)\+r\]m\_\{i\}=\\mathrm\{sgn\}\[\\tanh\(I\_\{i\}\)\+r\]; hererris a uniform random number in\(−1,\+1\)\(\-1,\+1\)andIi=β​\(hi\+∑jJi​j​mj\)I\_\{i\}=\\beta\\bigl\(h\_\{i\}\+\\sum\_\{j\}J\_\{ij\}\\,m\_\{j\}\\bigr\)is the local field at inverse temperatureβ\\beta\. At largeNN, the graph is partitioned into subgraphs and each subgraph is mapped to a device, forming a distributed sparse Ising machine \(Fig\.[1](https://arxiv.org/html/2606.25313#S1.F1)\)\.

A balanced min\-cut partition, obtained with standard tools such as METIS\[[48](https://arxiv.org/html/2606.25313#bib.bib48)\]or KaHIP\[[49](https://arxiv.org/html/2606.25313#bib.bib49)\], keeps the number of cut edges small, making it cheap to duplicate the weights on those edges on both sides of the cut\. With these*shadow weights*in place, each partition computes its local fields entirely from on\-chip memory regardless of system size, and the only information that crosses device boundaries during execution is the boundary p\-bit state itself: 1 bit per boundary p\-bit, exchanged in both directions since each side needs the other’s states \(Fig\.[1](https://arxiv.org/html/2606.25313#S1.F1)d\)\.

Physical topology is not all\-to\-all, so some boundary traffic traverses multiple hops, increasing latency and loading shared links\. Standard partitioners minimize cut edges but ignore physical distance, so we introduce a Potts cost function that penalizes placing heavily connected partitions on distant devices, concentrating cut edges at short hop distances \(Fig\.[1](https://arxiv.org/html/2606.25313#S1.F1)c and Supplementary Sec\.[S5](https://arxiv.org/html/2606.25313#S0.SS5)\)\. For any partition, a worst\-case congestion metricCmaxC\_\{\\max\}combined with the coloring schedule bounds the feasible local update clock \(Supplementary Sec\.[S4](https://arxiv.org/html/2606.25313#S0.SS4)\)\.

Each device updates its local p\-bits at a clock ratefp​\-​bitf\_\{\\mathrm\{p\\text\{\-\}bit\}\}, while boundary states are transferred at a separate communication clock ratefcommf\_\{\\mathrm\{comm\}\}\. The key dimensionless parameter is

η=fcommfp​\-​bit\\eta=\\frac\{f\_\{\\mathrm\{comm\}\}\}\{f\_\{\\mathrm\{p\\text\{\-\}bit\}\}\}\(1\)which is large when boundary information is refreshed frequently relative to local updates and small when boundary states grow stale between exchanges\.

We evaluate two platforms \(Fig\.[1](https://arxiv.org/html/2606.25313#S1.F1)e,f\)\. DSIM\-1 is a 6\-FPGA nearest\-neighbor chain with independent local clocks and source\-synchronous duplex links \(Supplementary Sec\.[S6](https://arxiv.org/html/2606.25313#S0.SS6)and Fig\.[S8](https://arxiv.org/html/2606.25313#S0.F8)\), allowingη\\etato be tuned freely\. DSIM\-2 is an 18\-FPGA Siemens Veloce proFPGA CS prototyping platform built on AMD VP1902 FPGAs, the largest FPGAs available\[[50](https://arxiv.org/html/2606.25313#bib.bib50)\], with a shared master clock and denser interconnect \(Supplementary Fig\.[S12](https://arxiv.org/html/2606.25313#S0.F12)\); therefp​\-​bitf\_\{\\mathrm\{p\\text\{\-\}bit\}\}andfcommf\_\{\\mathrm\{comm\}\}cannot be tuned independently, and overclocking means raising the master clock beyond the point at which timing closes\.

![Refer to caption](https://arxiv.org/html/2606.25313v1/Fig3.png)Fig\. 3:Stale boundaries reduce the power\-law exponent identically in hardware and in theory \(L3=373L^\{3\}=37^\{3\}\)\.\(a\)DSIM\-1 residual energyρEf\\rho\_\{E\}^\{f\}versus sweeps; each curve is one\(fp​\-​bit,fcomm\)\(f\_\{\\mathrm\{p\\text\{\-\}bit\}\},f\_\{\\mathrm\{comm\}\}\)combination, spanningη\\etafrom0\(no communication\) to the exact limit \(unpartitioned GPU as monolithic\-reference\)\.\(b\)Fitted exponentκf\\kappa\_\{f\}versusη\\eta, saturating nearκf≈0\.27\\kappa\_\{f\}\\approx 0\.27\. Inset: large\-η\\etaregion; the optimal\-ratio band marks the onset of saturation, consistent with Eq\. \([2](https://arxiv.org/html/2606.25313#S3.E2)\)\.\(c,d\)The same observables from the CMFT model on a GPU, with identical partitioning, instances, andβ\\betaschedule\. The control parameter isSS, the number of internal sweeps between boundary exchanges; largeSScorresponds to smallη\\eta\. Inset in\(d\): small\-SSregion with the optimal\-SSmarker\. Dashed lines in\(b\)and\(d\): exact and no\-communication limits\. Error bars in\(a\)and\(c\): 95% bootstrap confidence intervals over 10 instances×\\times10 runs\.
## III\. Results

All EA experiments below use 10 disorder instances with 10 independent runs each; error bars are 95% bootstrap confidence intervals computed identically across all platforms and timing settings \(Methods\)\. We present DSIM\-1 first, whereη\\etacan be tuned independently, then scale to one million p\-bits on DSIM\-2\.

We begin withL3=373L^\{3\}=37^\{3\}EA spin glasses \(N=50,653N=50\{,\}653\) on DSIM\-1 at a fixed budget of10610^\{6\}Monte Carlo sweeps \(MCS\) per run, where one sweep updates allNNp\-bits once, varyingfp​\-​bitf\_\{\\mathrm\{p\\text\{\-\}bit\}\}over a wide range for multiplefcommf\_\{\\mathrm\{comm\}\}values, with neighboring partitions exchanging 1\-bit boundary states atfcommf\_\{\\mathrm\{comm\}\}\(Fig\.[2](https://arxiv.org/html/2606.25313#S1.F2)a\)\. The final residual energy per spin,ρEf=\(Ef−Eground\)/N\\rho\_\{E\}^\{f\}=\(E^\{f\}\-E\_\{\\mathrm\{ground\}\}\)/N, withEfE^\{f\}the final energy of a run andEgroundE\_\{\\mathrm\{ground\}\}a putative ground energy \(Methods\), depends on both clocks separately \(Fig\.[2](https://arxiv.org/html/2606.25313#S1.F2)b\)\. Replotted againstη\\eta, all curves collapse onto a single trend \(Fig\.[2](https://arxiv.org/html/2606.25313#S1.F2)c\): the optimization quality of a distributed run depends only on the ratio, not on either clock individually\. The collapse saturates onceη\\etaexceeds roughly300300for this system and mapping, quantitatively matching the predictionη≈305\\eta\\approx 305of the conservative bound

fp​\-​bit≤fcomm2​Ncolor​Cmax,f\_\{\\mathrm\{p\\text\{\-\}bit\}\}\\leq\\frac\{f\_\{\\mathrm\{comm\}\}\}\{2\\,N\_\{\\mathrm\{color\}\}\\,C\_\{\\max\}\},\(2\)evaluated with the measured partition parameters, whereNcolorN\_\{\\mathrm\{color\}\}is the number of color groups in the update schedule andCmaxC\_\{\\max\}is a worst\-case congestion metric set by boundary sizes, hop distances, and available pins \(Supplementary Sec\.[S4](https://arxiv.org/html/2606.25313#S0.SS4)\)\.

![Refer to caption](https://arxiv.org/html/2606.25313v1/Fig4.png)Fig\. 4:Time\-to\-target at50,65350\{,\}653p\-bits \(L3=373L^\{3\}=37^\{3\}, DSIM\-1\)\.\(a\)Residual energyρEf\\rho\_\{E\}^\{f\}versus sweeps: GPU \(exact,κf≈0\.2693\\kappa\_\{f\}\\approx 0\.2693\), conservative DSIM\-1 at0\.100\.10MHz \(exact,κf≈0\.2637\\kappa\_\{f\}\\approx 0\.2637\), and overclocked DSIM\-1 at5050MHz \(inexact,κf≈0\.2289\\kappa\_\{f\}\\approx 0\.2289\); “exact” means boundary information is current at every update\. The small GPU–DSIM\-1 gap reflects platform differences in random\-number generation \(on\-chip LFSR\[[16](https://arxiv.org/html/2606.25313#bib.bib16)\]versus Philox\[[51](https://arxiv.org/html/2606.25313#bib.bib51)\]\) and arithmetic precision \(fixed point versus floating point, see Methods\), not staleness\. Data extend to10910^\{9\}sweeps; putative ground energies come from runs up to101010^\{10\}sweeps to prevent artificial bending of the power law \(Methods\)\.\(b\)The same data versus on\-chip annealing time \(one\-time weight\-load of0\.50\.5s excluded\)\. Measured flip rates:5\.1×1095\.1\\times 10^\{9\}/s at0\.100\.10MHz and2\.53×10122\.53\\times 10^\{12\}/s at5050MHz\. The overclocked mode reachesρE⋆=0\.07\\rho\_\{E\}^\{\\star\}=0\.07\(dashed\)410\.5×410\.5\\timesfaster andρE⋆=0\.002\\rho\_\{E\}^\{\\star\}=0\.00262\.07×62\.07\\timesfaster than the conservative mode\. Green dashed: projected 7 nm ASIC at100100MHz \(exact, Supplementary Sec\.[S8](https://arxiv.org/html/2606.25313#S0.SS8)\)\. Error bars: 95% bootstrap confidence intervals over 10 instances×\\times10 runs\.How does the full optimization trajectory change asη\\etavaries? Fig\.[3](https://arxiv.org/html/2606.25313#S2.F3)a shows residual\-energy traces across many timing settings: even when boundary information is stale \(smallη\\eta\), the decay remains close to a power law,ρEf∝ta−κf\\rho\_\{E\}^\{f\}\\propto t\_\{a\}^\{\-\\kappa\_\{f\}\}, withtat\_\{a\}the number of sweeps andκf\\kappa\_\{f\}the decay exponent extracted from log\-log fits\. We compare against an NVIDIA RTX 6000 Ada GPU running the same instances and annealing schedule without partitioning, the monolithic baseline throughout this work\. The exponent rises withη\\etaand saturates onceη\\etareaches the few\-hundred range \(Fig\.[3](https://arxiv.org/html/2606.25313#S2.F3)b\), consistent with Eq\. \([2](https://arxiv.org/html/2606.25313#S3.E2)\)\. The GPU yieldsκf≈0\.2693\\kappa\_\{f\}\\approx 0\.2693; a conservative DSIM\-1 atfp​\-​bit=0\.10f\_\{\\mathrm\{p\\text\{\-\}bit\}\}=0\.10MHz yieldsκf≈0\.2637\\kappa\_\{f\}\\approx 0\.2637, the small gap attributed to two platform differences: the on\-chip linear\-feedback shift register \(LFSR\)\[[16](https://arxiv.org/html/2606.25313#bib.bib16)\]versus the GPU’s Philox generator\[[51](https://arxiv.org/html/2606.25313#bib.bib51)\], and the s\{4\}​\{1\}\\\{4\\\}\\\{1\\\}fixed\-point arithmetic versus floating point \(Methods\)\. On the other hand, an overclocked DSIM\-1 at5050MHz yieldsκf≈0\.2289\\kappa\_\{f\}\\approx 0\.2289\. To isolate the origin of this slope reduction, we physically disconnect the inter\-FPGA links so that each FPGA anneals only its local subgraph: per\-partition energies remain unchanged up to the highest local clocks \(Supplementary Fig\.[S9](https://arxiv.org/html/2606.25313#S0.F9)and Sec\.[S7](https://arxiv.org/html/2606.25313#S0.SS7)\), confirming that local updates are correct at all frequencies tested\. The slope reduction in coupled runs therefore comes from stale boundary information, not from timing violations in local updates\.

![Refer to caption](https://arxiv.org/html/2606.25313v1/Fig5.png)Fig\. 5:Time\-to\-target at one million p\-bits \(L3=1003L^\{3\}=100^\{3\}, DSIM\-2\)\.\(a\)Residual energyρEf\\rho\_\{E\}^\{f\}versus sweeps: GPU \(exact,κf≈0\.2836\\kappa\_\{f\}\\approx 0\.2836\), conservative DSIM\-2 at11MHz \(exact,κf≈0\.2820\\kappa\_\{f\}\\approx 0\.2820\), and overclocked DSIM\-2 at33MHz \(inexact,κf≈0\.2565\\kappa\_\{f\}\\approx 0\.2565\)\. The residual GPU gap reflects platform differences in random\-number generation and arithmetic precision \(Methods\), not staleness\. Timing closes at11MHz; the33MHz point is the same synthesis globally overclocked, and beyond33MHz read/write synchronization fails\. Timing violations strike the longest paths first, the boundary transfers rather than the local update logic, so overclocking lowers the effectiveη\\etaat fixed nominal ratio: the33MHz point probes the stale\-boundary regime of the rule established on DSIM\-1\. Data shown extend to10710^\{7\}sweeps; putative ground energies were obtained from FPGA runs up to10910^\{9\}sweeps \(Methods\)\.\(b\)The same data versus on\-chip annealing time \(one\-time weight\-load of22s excluded\)\. Measured flip rates:101210^\{12\}/s at11MHz and3×10123\\times 10^\{12\}/s at33MHz\. The overclocked mode reachesρE⋆=0\.05\\rho\_\{E\}^\{\\star\}=0\.05\(dashed\)2\.82×2\.82\\timesfaster andρE⋆=0\.01\\rho\_\{E\}^\{\\star\}=0\.012\.23×2\.23\\timesfaster, with crossover nearρE⋆≈0\.004\\rho\_\{E\}^\{\\star\}\\approx 0\.004\. Green dashed: projected ASIC at100100MHz \(exact, Supplementary Sec\.[S8](https://arxiv.org/html/2606.25313#S0.SS8)\)\. Error bars: 95% bootstrap confidence intervals over 10 instances×\\times10 runs\.Why does staleness degrade the exponent in this particular way? To show that the behavior is not an artifact of our hardware, we develop a parallel cluster mean\-field theory, implemented on the same GPU, with identical partitioning, instances, and annealing schedule \(Supplementary Sec\.[S3](https://arxiv.org/html/2606.25313#S0.SS3)\)\. Each cluster runs exact local MCMC dynamics while boundary spins in neighboring clusters are represented by mean\-field averages exchanged only everySSinternal sweeps;SSis the algorithmic analog ofη\\eta, with largeSScorresponding to smallη\\eta\. The CMFT model reproduces the hardware quantitatively \(Fig\.[3](https://arxiv.org/html/2606.25313#S2.F3)c,d\): power\-law decay persists across the full range of the control parameter, with an exponent that saturates toward the exact limit under frequent exchange and degrades smoothly under infrequent exchange\. The fitted exponent curves collapse onto the same functional form once the two control parameters are aligned by a simple rescaling, and the mapping betweenη\\etaandSSis monotonic over the range where mean\-field averages are well converged \(Supplementary Fig\.[S2](https://arxiv.org/html/2606.25313#S0.F2)\)\. The agreement is no accident: both systems perform partitioned Gibbs sampling in which exact local dynamics consume boundary states that are not current, whether the delay comes from finite link speed or from mean fields held fixed between iterations\.

Is the reduced exponent of an overclocked machine worth its higher throughput? The answer depends on the target\. Because the graph\-colored architecture updates allNNp\-bits every clock cycle, the flip rate scales linearly with bothNNandfp​\-​bitf\_\{\\mathrm\{p\\text\{\-\}bit\}\}: forN=50,653N=50\{,\}653we measure5\.1×1095\.1\\times 10^\{9\}flips/s at0\.100\.10MHz and2\.53×10122\.53\\times 10^\{12\}flips/s at5050MHz \(Methods\), with the entire 6\-FPGA platform drawing 150–180 W\. Fig\.[4](https://arxiv.org/html/2606.25313#S3.F4)compares the GPU, conservative, and overclocked modes on identical sweep axes and against wall\-clock annealing time\. The overclocked mode produces far more flips per second, each consuming staler boundary data, so it reaches easy targets faster despite its shallower slope:410\.5×410\.5\\timesfaster atρE⋆=0\.07\\rho\_\{E\}^\{\\star\}=0\.07\. For harder targets the exponent gap erodes the advantage,62\.07×62\.07\\timesatρE⋆=0\.002\\rho\_\{E\}^\{\\star\}=0\.002, and eventually produces a crossover beyond which the conservative mode wins\. The tradeoff is generic to distributed stochastic machines: time\-to\-target is predicted only by throughput and exponent together\. Fig\.[4](https://arxiv.org/html/2606.25313#S3.F4)b also projects a 7 nm application\-specific integrated circuit \(ASIC\) implementing the same partition, which closes timing at 100 MHz with an area of 0\.66 mm2and 248 mW per partition \(Supplementary Sec\.[S8](https://arxiv.org/html/2606.25313#S0.SS8)and Table[S1](https://arxiv.org/html/2606.25313#S0.T1)\); because the local graph structure is size\-independent, the projection assumes the exact\-mode exponent at the higher clock\.

We then implementL3=1003L^\{3\}=100^\{3\}\(N=1,000,000N=1\{,\}000\{,\}000\) on DSIM\-2, partitioning across the Super Logic Regions \(SLRs\) within each FPGA as well as across FPGAs, for 72 sub\-partitions on 18 devices \(Supplementary Sec\.[S10](https://arxiv.org/html/2606.25313#S0.SS10)\)\. The design closes timing atfp​\-​bit=1f\_\{\\mathrm\{p\\text\{\-\}bit\}\}=1MHz, a measured101210^\{12\}flips/s; globally overclocking the same synthesis to33MHz reaches3×10123\\times 10^\{12\}flips/s, and beyond33MHz read/write synchronization across the 18 FPGAs fails from accumulated timing violations\. The platform draws 1\.4–1\.6 kW\. The GPU reference yieldsκf≈0\.2836\\kappa\_\{f\}\\approx 0\.2836and the 1 MHz modeκf≈0\.2820\\kappa\_\{f\}\\approx 0\.2820, within the confidence interval of the GPU curve: a million\-p\-bit distributed machine matches monolithic performance while sustaining a trillion flips per second \(Fig\.[5](https://arxiv.org/html/2606.25313#S3.F5)a\)\. The 3 MHz mode yieldsκf≈0\.2565\\kappa\_\{f\}\\approx 0\.2565, the same tradeoff observed atL3=373L^\{3\}=37^\{3\}: time\-to\-target improves by2\.82×2\.82\\timesatρE⋆=0\.05\\rho\_\{E\}^\{\\star\}=0\.05and2\.23×2\.23\\timesatρE⋆=0\.01\\rho\_\{E\}^\{\\star\}=0\.01, with crossover nearρE⋆≈0\.004\\rho\_\{E\}^\{\\star\}\\approx 0\.004beyond which the conservative mode is preferable \(Fig\.[5](https://arxiv.org/html/2606.25313#S3.F5)b\)\.

Theη\\eta\-governed behavior above is established on cubic lattices, but the architecture applies to any sparse graph that can be colored and partitioned; we close with three capability demonstrations on harder topologies\. The G81 instance from the Gset library\[[52](https://arxiv.org/html/2606.25313#bib.bib52)\]is a20,00020\{,\}000\-node toroidal grid, open since 2000, whose best known cut improved only incrementally until the Cosm algorithm, a dynamical\-systems heuristic running on CPUs, found14,06014\{,\}060\[[53](https://arxiv.org/html/2606.25313#bib.bib53),[54](https://arxiv.org/html/2606.25313#bib.bib54)\], since certified optimal by an exact solver\[[54](https://arxiv.org/html/2606.25313#bib.bib54)\]\. Running adaptive parallel tempering with isoenergetic cluster moves\[[23](https://arxiv.org/html/2606.25313#bib.bib23)\]on DSIM\-1, an algorithm distinct from both Cosm and the simulated annealing used above, the p\-computer independently reaches the same certified optimum of14,06014\{,\}060\(verifiable bit\-string in Supplementary Sec\.[S9](https://arxiv.org/html/2606.25313#S0.SS9)\)\. On the Pegasus P41 and Zephyr Z50 graphs native to current and next\-generation D\-Wave quantum annealers\[[3](https://arxiv.org/html/2606.25313#bib.bib3),[4](https://arxiv.org/html/2606.25313#bib.bib4)\], we solve planted instances with ground states known by construction\[[55](https://arxiv.org/html/2606.25313#bib.bib55),[56](https://arxiv.org/html/2606.25313#bib.bib56)\], of39,04039\{,\}040and80,80080\{,\}800p\-bits respectively, on a subset of DSIM\-2 \(Supplementary Sec\.[S11](https://arxiv.org/html/2606.25313#S0.SS11)\)\. Finally, a random three\-literal satisfiability \(3SAT\) instance near the satisfiability phase transition\[[57](https://arxiv.org/html/2606.25313#bib.bib57)\]\(13,04213\{,\}042variables,55,55855\{,\}558clauses,α≈4\.26\\alpha\\approx 4\.26\), encoded as an invertible Ising circuit\[[41](https://arxiv.org/html/2606.25313#bib.bib41)\]with250,011250\{,\}011p\-bits, reaches55,41655\{,\}416of55,55855\{,\}558clauses satisfied \(99\.74%99\.74\\%\) after10910^\{9\}sweeps on DSIM\-2, closely tracking an optimized GPU baseline \(Supplementary Sec\.[S12](https://arxiv.org/html/2606.25313#S0.SS12)\); prior pairwise Ising SAT demonstrations were far smaller\[[58](https://arxiv.org/html/2606.25313#bib.bib58)\], with recent work encoding clauses through native higher\-order interactions\[[59](https://arxiv.org/html/2606.25313#bib.bib59)\]\. In every case the architecture is unchanged: partitions exchange only 1\-bit boundary states while all coupling weights remain on\-chip\.

## IV\. Conclusion

We have built a programmable probabilistic computer with one million p\-bits by networking FPGAs into a single Ising machine, and shown that the cost of distributing a stochastic computation is governed by one timing ratio,η=fcomm/fp​\-​bit\\eta=f\_\{\\mathrm\{comm\}\}/f\_\{\\mathrm\{p\\text\{\-\}bit\}\}\. Above a topology\-dependent threshold the machine is indistinguishable from a monolithic one; below it, parallelism buys throughput at a quantifiable cost in solution quality\. Because a theoretical cluster mean\-field model reproduces the same behavior, that cost can be predicted in software before any hardware is built\. The underlying reason is that stochastic dynamics tolerate stale information gracefully, much as Hogwild\-style parallel Gibbs sampling survives stale reads across threads\[[60](https://arxiv.org/html/2606.25313#bib.bib60)\]; deterministic multi\-chip architectures, by contrast, must emulate one exact large chip \(Supplementary Sec\.[S13](https://arxiv.org/html/2606.25313#S0.SS13)\)\. The DSIM instead accepts that the computation is distributed and asks how much boundary delay can be tolerated before optimization quality changes appreciably\.η\\etaand the bound of Eq\. \([2](https://arxiv.org/html/2606.25313#S3.E2)\) depend only on the partition, the link budget, and the coloring schedule, so they serve directly as design equations for future implementations, which should favor independent local clocks over a shared master clock\. A representative partition implemented in a 7 nm predictive process \(ASAP7 PDK\[[61](https://arxiv.org/html/2606.25313#bib.bib61)\]\) closes timing at 100 MHz with 0\.66 mm2and 248 mW, requiring boundary exchange in the 6–12 GHz range, within the envelope of standard die\-to\-die interconnects such as UCIe\[[62](https://arxiv.org/html/2606.25313#bib.bib62)\]and BoW\[[63](https://arxiv.org/html/2606.25313#bib.bib63)\], and scales by replication \(Supplementary Sec\.[S8](https://arxiv.org/html/2606.25313#S0.SS8)\)\. The same compute–communication separation extends to three\-dimensional integration\[[33](https://arxiv.org/html/2606.25313#bib.bib33),[64](https://arxiv.org/html/2606.25313#bib.bib64)\]and to nanosecond, sub\-femtojoule stochastic magnetic tunnel junctions\[[40](https://arxiv.org/html/2606.25313#bib.bib40),[16](https://arxiv.org/html/2606.25313#bib.bib16)\], for which theη\\etaframework links device fluctuation rate to inter\-chip bandwidth\. The framework applies whenever a problem’s variables or weights exceed what one device can hold, opening a path to arbitrarily large probabilistic computers built from networks of small ones\.

## V\. Methods

##### Benchmark instances\.

We study 3D Edwards–Anderson spin glasses onL×L×LL\\times L\\times Llattices withJi​j∈\{±1\}J\_\{ij\}\\in\\\{\\pm 1\\\}drawn independently and uniformly at random on nearest\-neighbor edges\[[3](https://arxiv.org/html/2606.25313#bib.bib3),[23](https://arxiv.org/html/2606.25313#bib.bib23)\], with periodic boundaries inzzand open boundaries inxxandyy\. We reportL3=373L^\{3\}=37^\{3\}andL3=1003L^\{3\}=100^\{3\}, with 10 disorder instances at each size\. DSIM\-1 experiments use the Potts partitioning introduced here to align with the chain topology\. For DSIM\-2, graph\-level and SLR\-level partitioning was generated with METIS and then mapped onto the platform through the Veloce proFPGA CS implementation flow\.

##### Update schedule and sweep definition\.

P\-bits are updated by graph coloring withNcolorN\_\{\\mathrm\{color\}\}color groups; one sweep \(one MCS\) updates all groups once\.Ncolor=3N\_\{\\mathrm\{color\}\}=3forL3=373L^\{3\}=37^\{3\},22forL3=1003L^\{3\}=100^\{3\},22for Pegasus P41,66for Zephyr Z50,44for 3SAT, and22for G81\.

##### Annealing schedule and numeric format\.

EA results use simulated annealing withβ=0\.5,1\.0,…,5\.0\\beta=0\.5,1\.0,\\ldots,5\.0; Pegasus, Zephyr, and 3SAT useβ=0\.5,0\.625,…,10\\beta=0\.5,0\.625,\\ldots,10\. These schedules were chosen empirically and applied identically across the FPGA and GPU runs\. GPU references use floating\-point arithmetic\. Hardware uses fixed point: s\{4\}​\{1\}\\\{4\\\}\\\{1\\\}\(signed, 4 integer bits, 1 fractional bit\) for EA, s\{4\}​\{3\}\\\{4\\\}\\\{3\\\}for Pegasus, Zephyr, and 3SAT, and s\{4\}​\{6\}\\\{4\\\}\\\{6\\\}for G81 to accommodate the distinct inverse temperatures of adaptive parallel tempering\. Pseudorandom numbers are generated on\-chip with linear\-feedback shift registers \(LFSRs\)\[[16](https://arxiv.org/html/2606.25313#bib.bib16)\]on both DSIM\-1 and DSIM\-2, while the GPU baselines use the Philox generator\[[51](https://arxiv.org/html/2606.25313#bib.bib51)\]\.

##### GPU baseline\.

All GPU references run on an NVIDIA RTX 6000 Ada with the same instances and annealing schedules, on a single device with no partitioning\. The GPU serves as a monolithic reference for the decay exponentκf\\kappa\_\{f\}, not as a runtime competitor: GPU wall\-clock times vary across generations and implementations and are not reported\. We note that the three million unique weights atL3=1003L^\{3\}=100^\{3\}occupy about 3 MB at 8\-bit precision, within a single modern GPU’s L2 cache \(e\.g\., NVIDIA Blackwell B200\[[65](https://arxiv.org/html/2606.25313#bib.bib65)\]\)\.

##### DSIM\-2 operating points\.

The DSIM\-2 design closes timing atfp​\-​bit=1f\_\{\\mathrm\{p\\text\{\-\}bit\}\}=1MHz\. Globally overclocking the same synthesis succeeds up to33MHz; beyond that, read/write synchronization across the 18 FPGAs fails from accumulated timing violations, which is why Fig\.[5](https://arxiv.org/html/2606.25313#S3.F5)contains two operating points\. Access to the platform was limited, which is why theL3=1003L^\{3\}=100^\{3\}analysis uses shorter sweep budgets thanL3=373L^\{3\}=37^\{3\}\(see below\)\.

##### Putative ground energies\.

Exact ground energies are not known at these sizes\. Following standard practice\[[24](https://arxiv.org/html/2606.25313#bib.bib24),[3](https://arxiv.org/html/2606.25313#bib.bib3)\], for each instance we define a putative ground energy as the minimum energy observed across all platforms and timing settings\. GPU baselines were generated up to10710^\{7\}sweeps at both sizes, while FPGA runs were extended to101010^\{10\}sweeps forL3=373L^\{3\}=37^\{3\}and10910^\{9\}forL3=1003L^\{3\}=100^\{3\}\. This gives the analysis window \(up to10910^\{9\}and10710^\{7\}sweeps, respectively\) at least one order of magnitude of buffer, preventing artificial bending of the power law at late times due to inaccuracies inEgroundE\_\{\\mathrm\{ground\}\}\.

##### Flip\-rate measurement\.

Flip rates are measured on\-chip\[[41](https://arxiv.org/html/2606.25313#bib.bib41)\]\. Each color group carries a counter with a fixed preset \(e\.g\.,50,00050\{,\}000cycles at a125125MHz reference clock\), defining a known time window during which p\-bit flips accumulate\. At the preset, a broadcast disable shared across all FPGAs stops all counters simultaneously; the per\-color counts are aggregated and divided by the elapsed time\.

##### Sweep counting and reported times\.

Runs are controlled by a common counter with a programmable preset at a125125MHz reference clock, providing both the completed MCS count and the wall\-clock annealing time\. On DSIM\-2 the broadcast disable arrives synchronously and readout states are exactly aligned; on DSIM\-1, independent local clocks introduce in principle a small stop\-time uncertainty across partitions, with no measurable effect on optimization quality since updates are slow at highβ\\beta\. Reported annealing times reflect on\-chip computation only: the one\-time weight\-load over PCIe \(about0\.50\.5s forL3=373L^\{3\}=37^\{3\},22s forL3=1003L^\{3\}=100^\{3\}\) and host read/write do not scale with sweep count and are excluded\.

## Acknowledgments

We acknowledge Siemens Digital Industries Software for access to the Veloce proFPGA CS platform, and for the tool access, platform support, and engineering time provided through N\.A\.A\.’s internship\. We thank Subhasish Mitra and Robert Radway for helpful discussions that improved the manuscript\. We acknowledge support from the Office of Naval Research \(ONR\) Young Investigator Program grant, the National Science Foundation \(NSF\) CAREER Award under grant number CCF 2106260, the Army Research Laboratory under grant number W911NF\-24\-1\-0228, the Semiconductor Research Corporation \(SRC\) grant, and the ONR\-MURI grant N000142312708\. Use was made of computational facilities purchased with funds from the National Science Foundation \(CNS\-1725797\) and administered by the Center for Scientific Computing \(CSC\)\. The CSC is supported by the California NanoSystems Institute and the Materials Research Science and Engineering Center \(MRSEC; NSF DMR 2308708\) at UC Santa Barbara\.

## Author contributions

N\.A\.A\. and K\.Y\.Ç\. conceived the project\. N\.A\.A\. designed and built the50,65350\{,\}653\-node distributed p\-computer at UCSB and the million\-node p\-computer at Siemens, led all hardware experiments and data analysis\. N\.A\.A\. and K\.Y\.Ç\. wrote the manuscript\. F\.B\. provided feedback on problems, architecture choices and FPGA implementation\. X\.Z\. designed and carried out the parallel cluster mean\-field theory study\. S\.B\. developed and ran all GPU benchmarks\. K\.C\.\-C\. and K\.L\. developed the Potts model\-based topology\-aware partitioning\. S\.C\. advised on 3D spin\-glass physics, instance generation, and scaling analysis\. S\.S\. performed the architecture\-level study and 7 nm ASIC projection under the supervision of T\.S\. C\.T\., J\.T\., and A\.S\. provided the proFPGA CS platform infrastructure, interconnect, and compile and mapping methodology at Siemens, with A\.S\. managing the Siemens team\. K\.Y\.Ç\. supervised the project, guided the experimental design and interpretation, contributed to the writing, and shaped the discussion and overall direction of the work\. All authors discussed the results and edited the manuscript\.

## Competing interests

The authors declare no competing interests\.

## Data and code availability

Data and code necessary to reproduce the main plots will be made available in a public repository upon publication\. Additional artifacts required to regenerate FPGA bitstreams are available upon reasonable request subject to platform licensing constraints\.

Supplementary Information Programmable Probabilistic Computer with 1,000,000 p\-bits

### S1Overview

This Supplementary Information provides the full details behind the distributed sparse Ising machine \(DSIM\) framework presented in the main text\. The main text identifies two limits that force distribution: the problem can outgrow a single chip’s capacity, and the coupling weights may not fit in on\-chip memory, so that off\-chip memory bandwidth caps the update rate\. The DSIM addresses both by partitioning the graph across devices and using shadow weights to keep all weights on\-chip, so that only the 1\-bit states of boundary probabilistic bits \(p\-bits\) need to cross device boundaries\. The timing ratioη=fcomm/fp​\-​bit\\eta=f\_\{\\mathrm\{comm\}\}/f\_\{\\mathrm\{p\\text\{\-\}bit\}\}then governs the behavior across regimes: saturation toward a monolithic baseline whenη\\etais large, a useful overclocked regime whenη\\etais smaller, and a quantitative correspondence with a theoretical cluster mean\-field model\. The supplementary sections are arranged as follows\. Section[S2](https://arxiv.org/html/2606.25313#S0.SS2)provides additional device\-level and algorithmic background\. Section[S3](https://arxiv.org/html/2606.25313#S0.SS3)gives the full parallel cluster mean\-field theory formulation and explains why it reproduces the hardware behavior in Fig\.[3](https://arxiv.org/html/2606.25313#S2.F3)c,d\. Section[S4](https://arxiv.org/html/2606.25313#S0.SS4)develops the communication\-cost metric and the conservative clocking bound \(Eq\. \([2](https://arxiv.org/html/2606.25313#S3.E2)\)\)\. Section[S5](https://arxiv.org/html/2606.25313#S0.SS5)shows how physical topology affects congestion and how topology\-aware Potts partitioning mitigates it\. Section[S6](https://arxiv.org/html/2606.25313#S0.SS6)describes source\-synchronous boundary transport on DSIM\-1\. Section[S7](https://arxiv.org/html/2606.25313#S0.SS7)presents the disconnected\-links control experiment\. Section[S8](https://arxiv.org/html/2606.25313#S0.SS8)translates the timing rule into a Universal Chiplet Interconnect Express \(UCIe\) feasibility check and reports 7 nm application\-specific integrated circuit \(ASIC\) metrics\. Section[S9](https://arxiv.org/html/2606.25313#S0.SS9)reports the G81 Max\-Cut result\. Section[S10](https://arxiv.org/html/2606.25313#S0.SS10)explains Super Logic Region \(SLR\) driven partitioning on DSIM\-2\. Sections[S11](https://arxiv.org/html/2606.25313#S0.SS11)–[S12](https://arxiv.org/html/2606.25313#S0.SS12)present Pegasus, Zephyr, and 3\-literal satisfiability \(3SAT\) results on DSIM\-2\. Section[S13](https://arxiv.org/html/2606.25313#S0.SS13)compares the DSIM with other multi\-chip architectures\. Throughout, we use the same residual\-energy definition as the main text: for a run returning final energyEfE^\{f\}, the final residual energy per spin is

ρEf=Ef−EgroundN,\\rho\_\{E\}^\{f\}=\\frac\{E^\{f\}\-E\_\{\\mathrm\{ground\}\}\}\{N\},\(S\.1\)whereNNis the total number of p\-bits andEgroundE\_\{\\mathrm\{ground\}\}is a putative ground energy established by aggregating the best energies found across extensive runs\.

### S2Background

The main text introduces p\-bits, the sparse Ising machine architecture, and the broader Ising machine context\. This section adds device\-level and algorithmic details that support the supplementary derivations\.

Since the initial demonstration of stochastic magnetic tunnel junction \(sMTJ\) based p\-bits\[[39](https://arxiv.org/html/2606.25313#bib.bib39)\], the device space has grown rapidly\. Heterogeneous complementary metal\-oxide\-semiconductor \(CMOS\) plus sMTJ prototypes\[[16](https://arxiv.org/html/2606.25313#bib.bib16),[66](https://arxiv.org/html/2606.25313#bib.bib66),[67](https://arxiv.org/html/2606.25313#bib.bib67)\]and on\-chip p\-bit cores using stochastic MTJs paired with 2D transistors\[[68](https://arxiv.org/html/2606.25313#bib.bib68)\]have brought p\-bit hardware closer to monolithic integration\. On the architecture side, the sparse Ising machine \(sIM\) has been extended to support all\-to\-all reconfigurability through multiplexed sparse topologies\[[42](https://arxiv.org/html/2606.25313#bib.bib42)\]and satisfiability solvers\[[69](https://arxiv.org/html/2606.25313#bib.bib69)\], and recent work showed that probabilistic computers \(p\-computers\) can match and exceed quantum\-annealer scaling exponents on hard 3D spin\-glass benchmarks\[[23](https://arxiv.org/html/2606.25313#bib.bib23)\]\.

When a problem exceeds a single device, related multi\-chip Ising efforts have partitioned the computation across networked FPGAs with autonomous synchronization\[[30](https://arxiv.org/html/2606.25313#bib.bib30),[31](https://arxiv.org/html/2606.25313#bib.bib31)\], scaled analog Ising machines to multi\-chip configurations\[[32](https://arxiv.org/html/2606.25313#bib.bib32)\], and extended CMOS annealing chips beyond single\-die limits\[[7](https://arxiv.org/html/2606.25313#bib.bib7)\]\. More generally, studies of networked computing dies have identified inter\-chip bandwidth as the primary multi\-chip scaling constraint\[[9](https://arxiv.org/html/2606.25313#bib.bib9),[33](https://arxiv.org/html/2606.25313#bib.bib33)\]\. The DSIM follows the same broad motivation but differs in how the cost of distribution is characterized: the main text shows that performance depends on how fresh the boundary information is, summarized by the timing ratioη\\eta\.

Cluster mean\-field theory \(CMFT\) provides a direct algorithmic counterpart to that hardware picture\. Yamamoto’s correlated cluster mean\-field theory\[[46](https://arxiv.org/html/2606.25313#bib.bib46)\]improved on earlier formulations\[[70](https://arxiv.org/html/2606.25313#bib.bib70),[43](https://arxiv.org/html/2606.25313#bib.bib43),[44](https://arxiv.org/html/2606.25313#bib.bib44),[45](https://arxiv.org/html/2606.25313#bib.bib45)\]by making the effective boundary fields depend on the cluster’s own configuration\. Our parallel\-CMFT implementation adapts this framework to a distributed optimization setting, as detailed in the next section\.

### S3Parallel cluster mean\-field theory

Parallel cluster mean\-field theory \(parallel\-CMFT\) is the algorithmic model compared with the hardware in Fig\.[3](https://arxiv.org/html/2606.25313#S2.F3)c,d of the main text\. We give the full formulation below and then explain why a graphics processing unit \(GPU\) based CMFT model quantitatively reproduces the optimization dynamics of the DSIM hardware\.

#### S3\.1Formulation

The exact Ising graph is partitioned into disjoint clusters \(also called partitions in the hardware context\), where spins are updated using stochastic Monte Carlo dynamics within each cluster\. A boundary p\-bit is one that has at least one neighbor in a different cluster\. The coupling weightsJi​jJ\_\{ij\}are unchanged, and the only approximation is that boundary p\-bits in neighboring clusters are represented by their mean\-field averages rather than their instantaneous states\[[70](https://arxiv.org/html/2606.25313#bib.bib70),[44](https://arxiv.org/html/2606.25313#bib.bib44),[43](https://arxiv.org/html/2606.25313#bib.bib43),[45](https://arxiv.org/html/2606.25313#bib.bib45),[46](https://arxiv.org/html/2606.25313#bib.bib46),[71](https://arxiv.org/html/2606.25313#bib.bib71)\]\.

![Refer to caption](https://arxiv.org/html/2606.25313v1/supp1.png)Fig\. S1:Parallel cluster mean\-field framework\.\(a\)Original interaction graph\.\(b\)Partition of the graph into two clusters,AAandBB\.\(c\)Iterative update with boundary exchange: each cluster performsSSinternal sweeps before exchanging boundary mean\-field values with its neighbor\. This procedure is repeated for a total ofMMiterations\.To clarify the mechanism, we describe the method using a simplified two\-cluster example,AAandBB, as shown in Fig\.[S1](https://arxiv.org/html/2606.25313#S0.F1)\. ClustersAAandBBare updated in parallel\. During one cluster iteration, each cluster performsSSsweeps, where one sweep is a full Markov chain Monte Carlo \(MCMC\) update over all p\-bits in both clusters\. Within theseSSsweeps, each cluster updates its own p\-bits exactly, but when computing local fields for boundary p\-bits, the states of their neighbors in the other cluster are not available in real time\. Instead, those contributions are approximated using mean\-field values computed at the end of the previous iteration\. Each p\-bit takes a bipolar statemi∈\{−1,\+1\}m\_\{i\}\\in\\\{\-1,\+1\\\}\. For a boundary p\-bit, the exchanged mean\-field is computed as the average over theSSsweeps within one iteration\. Ifmi\(t\)m\_\{i\}^\{\(t\)\}is the state of p\-bitiiat thett\-th Monte Carlo sweep, then the mean\-field value afterSSsweeps is given by

⟨mi⟩=1S​∑t=1Smi\(t\)\.\\langle m\_\{i\}\\rangle=\\frac\{1\}\{S\}\\sum\_\{t=1\}^\{S\}m\_\{i\}^\{\(t\)\}\.
After completing theSSsweeps, the last configuration is retained as the updated state, and the newly computed boundary mean\-fields are exchanged between clusters\. This completes one cluster iteration\. These exchanged mean\-field values remain fixed during the next iteration and are updated only after the subsequent set ofSSsweeps\. Accordingly, the new local field for p\-bitiiin clusterAAis

Iinew=β​\(hi\+∑j∈AJi​j​mj\+∑k∈BJi​k​⟨mk⟩\),I\_\{i\}^\{\\mathrm\{new\}\}=\\beta\\biggl\(h\_\{i\}\+\\sum\_\{j\\in A\}J\_\{ij\}m\_\{j\}\+\\sum\_\{k\\in B\}J\_\{ik\}\\langle m\_\{k\}\\rangle\\biggr\),wherehih\_\{i\}is the original bias of p\-bitii,Ji​jJ\_\{ij\},Ji​kJ\_\{ik\}are the weight strengths between the respective p\-bit pairs, andβ\\betais the inverse temperature\.

#### S3\.2Why CMFT reproduces the hardware behavior

In CMFT, the number of sweeps per iteration,SS, plays the same role asη=fcomm/fp​\-​bit\\eta=f\_\{\\mathrm\{comm\}\}/f\_\{\\mathrm\{p\\text\{\-\}bit\}\}in hardware: both control how often boundary information is refreshed relative to local updates \(main text Fig\.[3](https://arxiv.org/html/2606.25313#S2.F3)a,b\)\. LargeSS\(infrequent exchange\) is analogous to smallη\\eta, andS=1S=1\(exchange after every sweep\) is analogous to largeη\\eta\.

In both systems, each cluster runs exact Gibbs sampling over its local subgraph, and the only approximation is that boundary information from neighboring clusters is not current\. In the hardware the staleness arises from finite communication time, in the CMFT model from holding mean\-field values fixed acrossSSsweeps\. When boundary information is refreshed frequently, the decay exponent saturates to the exact\-graph value \(Fig\.[3](https://arxiv.org/html/2606.25313#S2.F3)b,d\)\. When it is stale, the exponent degrades\.

This is confirmed by the collapse of the fitted exponent curves from the two systems onto the same functional form after a simple rescaling that mapsη\\etatoSS\(Fig\.[S2](https://arxiv.org/html/2606.25313#S0.F2)a\)\. The explicitη\\eta–SSmapping is monotonic where mean\-field averages are well converged \(Fig\.[S2](https://arxiv.org/html/2606.25313#S0.F2)b\)\. The CMFT model is therefore useful as a design\-screening tool: given a proposed partition and problem class, one can varySSto estimate the exponent degradation a physical DSIM would produce at the correspondingη\\eta, without fabricating hardware\.

![Refer to caption](https://arxiv.org/html/2606.25313v1/supp2.png)Fig\. S2:Alignment of decay exponents from hardware and algorithmic control parameters\.\(a\)Slopeκf\\kappa\_\{f\}extracted from the residual\-energy decay as a function of the timing ratioη=fcomm/fp​\-​bit\\eta=f\_\{\\mathrm\{comm\}\}/f\_\{\\mathrm\{p\\text\{\-\}bit\}\}\(red, from DSIM\-1 experiments\) and the number of sweeps per iterationSS\(blue, from parallel\-CMFT\)\. The horizontal axes are related through a least\-squares fit performed in log space, which rescales the sweep axis such that the two datasets collapse onto the same scaling curve\. The collapse indicates that both systems are shaped by the same mechanism: delayed boundary information in partitioned stochastic dynamics\.\(b\)Mapping betweenSSandη\\etaderived from\(a\), where each point pairs the two control parameters that yield the sameκf\\kappa\_\{f\}value\. The mapping is monotonic in the large\-SSregime \(left\) where mean\-field averages are well converged, and becomes noisier at smallSS\(right\) where finite\-sweep fluctuations introduce scatter\.

### S4Communication\-cost metric and conservative clocking bound

We now develop the communication\-cost metricCmaxC\_\{\\max\}that enters the conservative clocking bound of the main text, Eq\. \([2](https://arxiv.org/html/2606.25313#S3.E2)\), and then derive the bound itself\.

A DSIM is defined by two resources that scale very differently\. Local field updates are bandwidth\-limited by on\-chip weight access, while the evaluation of cross\-partition terms in those fields is rate\-limited by how fast boundary*states*can be exchanged\. The DSIM design goal is therefore not to move weights across devices, but to keep every weight local, and to pay only for exchanging 1\-bit boundary p\-bit states\. Once the graph is partitioned intoKKclusters, the system runs the same Ising MCMC dynamics as a monolithic machine, but boundary p\-bit states arrive with some delay rather than being available instantaneously\.

At the hardware level, boundary traffic is carried on a constrained physical interconnect\. For chain\-like wiring, boundary packets that logically connect clustersaaandbbmay need to traverse multiple hops, and the sustainable throughput is limited by the narrowest \(most pin\-limited\) link along the route\. This section defines a scalar cost metric that quantifies the worst\-case boundary\-transport bottleneck\. That number is what ultimately appears inside the conservative clocking bound of the main text\.

#### S4\.1Boundary sizes, hop distance, and pin budget

Consider a sparse, undirected Ising graph with symmetric weightsJi​j=Jj​iJ\_\{ij\}=J\_\{ji\}\. Let the vertex set be partitioned intoKKclusters, and label clusters bya,b∈\{1,…,K\}a,b\\in\\\{1,\\dots,K\\\}\. For each unordered pair\(a,b\)\(a,b\), define

- •ba​bb\_\{ab\}: the number of boundary p\-bit*states*that must be shipped from clusteraato clusterbb\. Operationally,ba​bb\_\{ab\}is the number of p\-bits in clusteraathat participate in cut edges with clusterbb\(equivalently, the number of 1\-bit state values that clusterbbneeds fromaato evaluate cross\-partition terms in its local fields\)\.
- •da​bd\_\{ab\}: the hop distance between the devices hosting clustersaaandbb, defined as the number of physical links that a boundary packet must traverse\. For a linear chain,da​b=\|a−b\|d\_\{ab\}=\|a\-b\|once an ordering has been chosen\.
- •Pa​bP\_\{ab\}: the number of usable data pins on the narrowest link along the route fromaatobb\. When a route crosses multiple links, the bottleneck link determinesPa​bP\_\{ab\}\.

Note thatba​bb\_\{ab\}is a property of the*partition*, whileda​bd\_\{ab\}andPa​bP\_\{ab\}are properties of the*physical mapping and wiring*\. A partitioning tool can minimize the total number of cut edges, but it does not, by itself, guarantee that those edges are aligned with short physical paths\.

#### S4\.2Total communication cost and worst\-path cost

We define two related costs\. The first is a total communication cost:

Ctot=∑a<bba​b​da​bPa​b\.C\_\{\\text\{tot\}\}=\\sum\_\{a<b\}b\_\{ab\}\\,\\frac\{d\_\{ab\}\}\{P\_\{ab\}\}\.\(S\.2\)This quantity is useful for comparing different mappings and partitions because it increases when a large number of boundary bits are forced onto long or narrow paths\. For clocking feasibility, however, the critical object is the worst \(most demanding\) route\. We therefore define a worst\-path cost:

Cmax=maxa<b⁡\(ba​b​da​bPa​b\)\.C\_\{\\max\}=\\max\_\{a<b\}\\left\(b\_\{ab\}\\,\\frac\{d\_\{ab\}\}\{P\_\{ab\}\}\\right\)\.\(S\.3\)
CmaxC\_\{\\max\}identifies the cluster pair that creates the tightest boundary\-transport bottleneck once the design is placed on the physical interconnect\.

#### S4\.3Why permutation matters \(even when the partition is fixed\)

For standard partitioners such as METIS\[[48](https://arxiv.org/html/2606.25313#bib.bib48)\]and KaHIP\[[49](https://arxiv.org/html/2606.25313#bib.bib49)\], the cluster labels are arbitrary\. A partition defines membership, but it does not define an ordering\. On a chain, the physical slot order\[1,2,…,K\]\[1,2,\\dots,K\]is a separate choice, and changing that order changesda​bd\_\{ab\}, and therefore changes bothCtotC\_\{\\text\{tot\}\}andCmaxC\_\{\\max\}\. ForK=6K=6, there are6\!/2=3606\!/2=360distinct orderings up to reversal\. Fig\.[S3](https://arxiv.org/html/2606.25313#S0.F3)a shows that the choice of ordering alone can changeCtotC\_\{\\text\{tot\}\}by more than a factor of two for a representativeL3=373L^\{3\}=37^\{3\}instance on a 6\-node chain, meaning that standard partitioners would require an explicit search over orderings to find the best placement\. The topology\-aware Potts partitioning introduced in Section[S5](https://arxiv.org/html/2606.25313#S0.SS5)avoids this search entirely: because the partition is constructed to align with the chain, the canonical ordering \(or its reverse\) already achieves the minimum cost \(Fig\.[S3](https://arxiv.org/html/2606.25313#S0.F3)b\)\.

![Refer to caption](https://arxiv.org/html/2606.25313v1/supp3.png)Fig\. S3:Permutation sensitivity of communication cost on a 6\-FPGA chain\.Communication costCtotC\_\{\\text\{tot\}\}from Eq\. \([S\.2](https://arxiv.org/html/2606.25313#S0.E2)\) versus permutation index for six clusters of theL3=373L^\{3\}=37^\{3\}Edwards–Anderson \(EA\) spin glass\.\(a\)METIS partitioning: the default ordering is suboptimal and an explicit mapping search is needed\.\(b\)Potts partitioning: the topology\-aware partition is naturally aligned with the chain, so the canonical ordering \(or its reverse\) already achieves the minimum\.
#### S4\.4From cost metric to clocking bound

Equation \([2](https://arxiv.org/html/2606.25313#S3.E2)\) in the main text bounds the local update clock givenCmaxC\_\{\\max\}\. We derive it here\. Consider boundary traffic from clusteraato clusterbb\. At communication clockfcommf\_\{\\mathrm\{comm\}\}, a link withPa​bP\_\{ab\}data pins can deliverPa​bP\_\{ab\}boundary bits per clock cycle\. Shippingba​bb\_\{ab\}boundary states through one link therefore takesba​b/\(Pa​b​fcomm\)b\_\{ab\}/\(P\_\{ab\}f\_\{\\mathrm\{comm\}\}\)seconds\. Becauseba​bb\_\{ab\}typically exceedsPa​bP\_\{ab\}, the boundary bits must be sent in multiple frames that time\-division multiplex the available pins, so the end\-to\-end transfer time scales asba​b​da​b/\(Pa​b​fcomm\)b\_\{ab\}d\_\{ab\}/\(P\_\{ab\}f\_\{\\mathrm\{comm\}\}\)\. We include a factor of two to account for the packing and unpacking overhead at the sender and receiver, giving the transport time for one cluster pair:

τa​b=2​ba​b​da​bPa​b​fcomm\.\\tau\_\{ab\}=\\frac\{2\\,b\_\{ab\}\\,d\_\{ab\}\}\{P\_\{ab\}\\,f\_\{\\mathrm\{comm\}\}\}\.\(S\.4\)

#### S4\.5Including coloring and the clocking bound

Updates use graph coloring\[[41](https://arxiv.org/html/2606.25313#bib.bib41)\]withNcolorN\_\{\\mathrm\{color\}\}independent sets, so that one sweep \(one Monte Carlo sweep, MCS\) updates all colors once\. For boundary states to remain useful, they must be refreshed before the local p\-bits that depend on them are updated again\. To be safe, the communication network needs to handle boundary updates roughly as fast as the system cycles through its color updates\. The worst\-case communication time is therefore

τcomm=Ncolor​maxa<b⁡τa​b=2​Ncolor​Cmaxfcomm,\\tau\_\{\\text\{comm\}\}=N\_\{\\mathrm\{color\}\}\\max\_\{a<b\}\\tau\_\{ab\}=\\frac\{2\\,N\_\{\\mathrm\{color\}\}\\,C\_\{\\max\}\}\{f\_\{\\mathrm\{comm\}\}\},\(S\.5\)whereCmaxC\_\{\\max\}is defined in Eq\. \([S\.3](https://arxiv.org/html/2606.25313#S0.E3)\)\. The local p\-bit update period isτp\-bit=1/fp​\-​bit\\tau\_\{\\text\{p\-bit\}\}=1/f\_\{\\mathrm\{p\\text\{\-\}bit\}\}\. A conservative requirement is that the local update period should not be much shorter than the time needed to refresh the most demanding boundary path,τp\-bit≥τcomm\\tau\_\{\\text\{p\-bit\}\}\\geq\\tau\_\{\\text\{comm\}\}\. Combining with Eq\. \([S\.5](https://arxiv.org/html/2606.25313#S0.E5)\) gives

fp​\-​bit≤fcomm2​Ncolor​Cmax≡fp​\-​bit,max,f\_\{\\mathrm\{p\\text\{\-\}bit\}\}\\leq\\frac\{f\_\{\\mathrm\{comm\}\}\}\{2\\,N\_\{\\mathrm\{color\}\}\\,C\_\{\\max\}\}\\equiv f\_\{\\mathrm\{p\\text\{\-\}bit,max\}\},\(S\.6\)which is Eq\. \([2](https://arxiv.org/html/2606.25313#S3.E2)\) of the main text\.

#### S4\.6Evaluating the bound for DSIM\-1

We now evaluate Eq\. \([S\.6](https://arxiv.org/html/2606.25313#S0.E6)\) for theL3=373L^\{3\}=37^\{3\}system on the 6\-FPGA DSIM\-1 chain with topology\-aware Potts partitioning \(introduced next in Section[S5](https://arxiv.org/html/2606.25313#S0.SS5)\)\. The five inter\-FPGA links have pin countsP=\[54,30,54,26,54\]P=\[54,\\,30,\\,54,\\,26,\\,54\]\(from FPGA 1–2 through FPGA 5–6\)\. For a non\-adjacent pair\(a,b\)\(a,b\), the bottleneck pin count is the minimum over all links along the path fromaatobb\. From the measured partition, the worst\-case cluster pair is\(4,6\)\(4,\\,6\), with boundary sizeb46=660b\_\{46\}=660, hop distanced46=2d\_\{46\}=2, and bottleneck pin countP46=min⁡\(26,54\)=26P\_\{46\}=\\min\(26,\\,54\)=26\. This gives

Cmax=b46​d46P46=660×226≈50\.8\.C\_\{\\max\}=b\_\{46\}\\,\\frac\{d\_\{46\}\}\{P\_\{46\}\}=660\\times\\frac\{2\}\{26\}\\approx 50\.8\.WithNcolor=3N\_\{\\mathrm\{color\}\}=3, the conservative bound predicts a threshold ratio of

ηthreshold=2​Ncolor​Cmax=2×3×50\.8≈305,\\eta\_\{\\mathrm\{threshold\}\}=2\\,N\_\{\\mathrm\{color\}\}\\,C\_\{\\max\}=2\\times 3\\times 50\.8\\approx 305,which is consistent with the empirical onset of saturation nearη≈300\\eta\\approx 300observed in Fig\.[2](https://arxiv.org/html/2606.25313#S1.F2)c of the main text\.

![Refer to caption](https://arxiv.org/html/2606.25313v1/supp4.png)Fig\. S4:Standard versus topology\-aware partitioning\.\(a\)A distance\-blind objective \(KaHIP/METIS\) minimizes cut size without encoding physical distance, producing all\-to\-all inter\-cluster connections\.\(b\)A Potts objective penalizes long\-distance interactions in the cluster order, producing partitions that align naturally with a chain and reduce worst\-path congestion\.

### S5Distance distribution and topology\-aware Potts partitioning

The communication\-cost metric defined above depends on the hop distanceda​bd\_\{ab\}of each cut edge\. A small total cut can still be harmful if a noticeable fraction of boundary interactions must traverse two or three hops, because those multi\-hop routes concentrate traffic on shared intermediate links\.

A direct way to see this is to classify each cut interaction by its distancedd\. For a 6\-cluster chain,d∈\{1,2,3,4,5\}d\\in\\\{1,2,3,4,5\\\}, and the distribution overdddepends on how partitions align with the chain\. When long\-distance traffic is present, it increases both total communication cost and worst\-path cost because the same physical links must carry boundary data for many pairs\. This motivates a partitioning strategy that is aware of the physical chain topology\.

Standard graph partitioners aim to minimize cut size while balancing cluster sizes, which is the right objective if the hardware topology is all\-to\-all or if any cluster can communicate with any other at the same cost\. A chain is neither: physical distance matters, and multi\-hop routes share links\. To produce partitions that are naturally ordered along a chain, we introduce a Potts objective that penalizes interactions between clusters that are far apart in cluster index\. The objective is used only to compute the cluster assignment and does not change the Ising problem being solved by the p\-computer\. Fig\.[S4](https://arxiv.org/html/2606.25313#S0.F4)illustrates the conceptual difference between a distance\-blind cut objective and the topology\-aware Potts objective\.

Letsi∈\{1,…,K\}s\_\{i\}\\in\\\{1,\\dots,K\\\}be the cluster label of nodeii, whereKKis the number of clusters\. We define

HPotts​\(𝐬\)=∑\(i,j\)∈ℰ\|Ji​j\|​κ​\(\|si−sj\|\)\+λ​∑q=1K\(nq−NK\)2,H\_\{\\text\{Potts\}\}\(\\mathbf\{s\}\)=\\sum\_\{\(i,j\)\\in\\mathcal\{E\}\}\|J\_\{ij\}\|\\,\\kappa\(\|s\_\{i\}\-s\_\{j\}\|\)\\;\+\\;\\lambda\\sum\_\{q=1\}^\{K\}\\left\(n\_\{q\}\-\\frac\{N\}\{K\}\\right\)^\{2\},\(S\.7\)whereℰ\\mathcal\{E\}is the set of edges in the graph,κ​\(d\)\\kappa\(d\)is a distance kernel that penalizes cuts between clusters whose indices differ byd=\|si−sj\|d=\|s\_\{i\}\-s\_\{j\}\|,nq=∑i𝟏​\{si=q\}n\_\{q\}=\\sum\_\{i\}\\mathbf\{1\}\\\{s\_\{i\}=q\\\}is the size of clusterqq,NNis the total number of nodes, and the hyperparameterλ\>0\\lambda\>0controls the strength of the balance penalty\. A simple kernel is

κ​\(d\)=\{0,d=0,δnear,d=1,δfar,d≥2,0<δnear<δfar,\\kappa\(d\)=\\begin\{cases\}0,&d=0,\\\\ \\delta\_\{\\text\{near\}\},&d=1,\\\\ \\delta\_\{\\text\{far\}\},&d\\geq 2,\\end\{cases\}\\qquad 0<\\delta\_\{\\text\{near\}\}<\\delta\_\{\\text\{far\}\},\(S\.8\)withδfar/δnear≫1\\delta\_\{\\text\{far\}\}/\\delta\_\{\\text\{near\}\}\\gg 1\. Minimizing Eq\. \([S\.7](https://arxiv.org/html/2606.25313#S0.E7)\) suppresses long\-range cluster interactions and concentrates the cut atd=1d=1\(and sometimesd=2d=2\) in the cluster index order\. Fig\.[S5](https://arxiv.org/html/2606.25313#S0.F5)confirms this: Potts partitioning concentrates over 73% of cut edges atd=1d=1, compared with under 48% for METIS\.

![Refer to caption](https://arxiv.org/html/2606.25313v1/supp5.png)Fig\. S5:Distance distribution of cut edges for six clusters on a chain\.\(a\)METIS partitioning produces cuts at distances up tod=4d=4, with only 47\.4% of cut edges atd=1d=1\.\(b\)Potts partitioning concentrates 73\.1% of cut edges atd=1d=1and limits the maximum distance tod=2d=2\. Concentrating cut interactions at smallddreduces multi\-hop pressure on intermediate links and lowers worst\-path cost\. In both panels, % of cut denotes the fraction of total cut edges at each distance\.Changing the partition objective changes the physical communication pattern, but it does not change the optimization behavior\. Fig\.[S6](https://arxiv.org/html/2606.25313#S0.F6)confirms this: the residual\-energy decay is statistically indistinguishable between METIS and Potts partitions once the DSIM is operated in the appropriateη\\etaregime\. In practice, all DSIM\-1 experiments use Potts partitioning to align with the linear chain topology, while all DSIM\-2 experiments use METIS partitioning\.

![Refer to caption](https://arxiv.org/html/2606.25313v1/supp6.png)Fig\. S6:Solution quality is unchanged under topology\-aware partitioning \(DSIM\-1\)\.Final residual energyρEf\\rho\_\{E\}^\{f\}versus sweeps for theL3=373L^\{3\}=37^\{3\}EA spin glass on DSIM\-1 using METIS partitions and topology\-aware Potts partitions\. The overlap indicates that the Potts objective improves wiring alignment without degrading the optimization result\. All points: 10 instances, 10 runs per instance, error bars are 95% bootstrap confidence intervals\.
### S6Source\-synchronous boundary transport on a 6\-node chain

This section describes the concrete 6\-node chain used in the DSIM\-1 experiments \(Fig\.[1](https://arxiv.org/html/2606.25313#S1.F1)e of the main text\)\. Boundary states are transported using a source\-synchronous style: data and a forwarded clock travel together, so the receiver samples incoming bits relative to the forwarded clock, without requiring a global phase relationship to the local update clock\. This keeps the boundary channel simple and low\-latency at the physical layer, and it matches the DSIM goal of exchanging boundary*states*cheaply\. Fig\.[S7](https://arxiv.org/html/2606.25313#S0.F7)sketches two complementary views\. Panel \(a\) shows the conceptual link, emphasizing that boundary transport is about moving packed 1\-bit states with a forwarded clock\. Panel \(b\) shows the full chain and emphasizes why hop count matters: boundary information between non\-neighboring partitions must traverse multiple intermediate boards, loading the shared links\. Before running any optimization, we verified each link by sending known test patterns and checking that every bit arrived correctly\. Fig\.[S8](https://arxiv.org/html/2606.25313#S0.F8)shows a photograph of the assembled six\-FPGA chain\.

![Refer to caption](https://arxiv.org/html/2606.25313v1/supp7.png)Fig\. S7:Source\-synchronous boundary transport on a nearest\-neighbor chain\.\(a\)A conceptual source\-synchronous link: a sender forwards a clock together with packed boundary bits, and the receiver samples relative to the forwarded clock\.\(b\)The 6\-node chain used in DSIM\-1 measurements\. Non\-neighbor boundary interactions are carried as multi\-hop traffic that traverses intermediate links, which is the physical origin of hop\-dependent congestion captured byda​bd\_\{ab\}andCmaxC\_\{\\max\}\.![Refer to caption](https://arxiv.org/html/2606.25313v1/supp8.png)Fig\. S8:DSIM\-1 hardware\.The six\-FPGA nearest\-neighbor chain used in the DSIM\-1 experiments, with adjacent boards linked over FMC and FMC\+ connectors\.
### S7Disconnected\-links control: isolating the origin of slope loss

A key question raised by the overclocking results \(Fig\.[3](https://arxiv.org/html/2606.25313#S2.F3)of the main text\) is whether the modest slope loss originates from local timing violations at high clock rates or specifically from stale boundary information\. To isolate the origin, we run DSIM\-1 with boundary links disconnected\. In this mode, each FPGA evolves its own subgraph under the same simulated annealing schedule \(β=0\.5,1\.0,…,5\.0\\beta=0\.5,1\.0,\\ldots,5\.0\) and fixed\-point format s\{4\}​\{1\}\\\{4\\\}\\\{1\\\}\(signed, 4 integer bits, 1 fractional bit\)\[[41](https://arxiv.org/html/2606.25313#bib.bib41)\], but without receiving any neighbor boundary states\. If local updates were failing at high clock, this would appear as a degradation in local subgraph energy traces even in the disconnected setting\. Fig\.[S9](https://arxiv.org/html/2606.25313#S0.F9)shows that the per\-subgraph energy traces at0\.100\.10MHz,1515MHz, and5050MHz remain consistent for each of the six partitions\. The overlap across frequencies indicates that local updates produce correct results at all tested clocks, and therefore the slope loss observed in coupled runs is attributable to boundary exchange dynamics, not to local update errors\.

![Refer to caption](https://arxiv.org/html/2606.25313v1/supp9.png)Fig\. S9:Local\-subgraph energies remain stable when boundary links are disconnected \(DSIM\-1\)\.Subgraph energy versus sweeps for each of the six partitions of theL3=373L^\{3\}=37^\{3\}EA spin glass on DSIM\-1, using the same update logic, annealing schedule \(β=0\.5,1\.0,…,5\.0\\beta=0\.5,1\.0,\\ldots,5\.0\), and fixed\-point format s\{4\}​\{1\}\\\{4\\\}\\\{1\\\}as the coupled DSIM\-1 runs, but with boundary links disconnected\. Three local clocks are shown:0\.100\.10MHz,1515MHz, and5050MHz\. The trajectories overlap for each partition over the full sweep range, supporting the interpretation that slope loss in coupled overclocked operation originates from boundary exchange, not from local update failure\. All points: 10 instances, 10 runs per instance, error bars are 95% bootstrap confidence intervals\.
### S8UCIe feasibility and 7 nm ASIC metrics

This section translates the conservative bound into a required link\-rate envelope using Universal Chiplet Interconnect Express \(UCIe\) as a concrete short\-reach die\-to\-die example\[[62](https://arxiv.org/html/2606.25313#bib.bib62)\], complementing the multi\-chip scaling analysis of Ref\.\[[33](https://arxiv.org/html/2606.25313#bib.bib33)\]\. Similar analysis applies to other die\-to\-die interconnects such as Bunch of Wires \(BoW\)\[[63](https://arxiv.org/html/2606.25313#bib.bib63)\], but the point here is to connect the DSIM timing rule to physical interconnect sizing\.

#### S8\.1Bandwidth requirement expressed as a communication clock

Starting from Eq\. \([S\.6](https://arxiv.org/html/2606.25313#S0.E6)\), a conservative requirement for supporting a chosenfp​\-​bitf\_\{\\mathrm\{p\\text\{\-\}bit\}\}is

fcomm≥2​Ncolor​Cmax​fp​\-​bit\.f\_\{\\mathrm\{comm\}\}\\geq 2\\,N\_\{\\mathrm\{color\}\}\\,C\_\{\\max\}\\,f\_\{\\mathrm\{p\\text\{\-\}bit\}\}\.\(S\.9\)To illustrate the scale of the required communication clock, we plug in the values from the worst\-case pair of theL3=373L^\{3\}=37^\{3\}DSIM\-1 partition \(Section[S4](https://arxiv.org/html/2606.25313#S0.SS4)\):Ncolor=3N\_\{\\mathrm\{color\}\}=3, boundary sizeba​b=660b\_\{ab\}=660, hop distanceda​b=2d\_\{ab\}=2, givingCmax=ba​b​da​b/Pa​bC\_\{\\max\}=b\_\{ab\}d\_\{ab\}/P\_\{ab\}\. Substituting into the bound yields

fcomm=2​Ncolor​ba​b​da​bPa​b​fp​\-​bit\.f\_\{\\mathrm\{comm\}\}=\\frac\{2\\,N\_\{\\mathrm\{color\}\}\\,b\_\{ab\}\\,d\_\{ab\}\}\{P\_\{ab\}\}\\,f\_\{\\mathrm\{p\\text\{\-\}bit\}\}\.\(S\.10\)Forfp​\-​bit=100f\_\{\\mathrm\{p\\text\{\-\}bit\}\}=100MHz this becomes

fcomm=792,000Pa​b​MHz\.f\_\{\\mathrm\{comm\}\}=\\frac\{792\{,\}000\}\{P\_\{ab\}\}~\\text\{MHz\}\.\(S\.11\)Evaluating at two standard UCIe module widths\[[62](https://arxiv.org/html/2606.25313#bib.bib62)\]: forPa​b=64P\_\{ab\}=64,fcomm≈12\.4f\_\{\\mathrm\{comm\}\}\\approx 12\.4GHz, and forPa​b=128P\_\{ab\}=128,fcomm≈6\.2f\_\{\\mathrm\{comm\}\}\\approx 6\.2GHz\. Both rates fall within the standard UCIe operating range, confirming that the conservative bound of Eq\. \([S\.6](https://arxiv.org/html/2606.25313#S0.E6)\) can be satisfied with existing die\-to\-die interconnect technology\.

#### S8\.2A local 100 MHz digital core in 7 nm

A link feasibility check is only meaningful if the compute core can match it\. To validate that the local update logic is not the limiting factor, we implemented one representative partition of theL3=373L^\{3\}=37^\{3\}system as a standard\-cell design in the ASAP7 7 nm predictive process design kit \(PDK\)\[[61](https://arxiv.org/html/2606.25313#bib.bib61)\]\. The partition contains 8442 p\-bits using the same fixed\-point format s\{4\}​\{1\}\\\{4\\\}\\\{1\\\}\. The resulting place\-and\-route closed timing at 100 MHz\. Representative metrics are summarized in Table[S1](https://arxiv.org/html/2606.25313#S0.T1), and Fig\.[S10](https://arxiv.org/html/2606.25313#S0.F10)shows the place\-and\-route layout\. The shown physical dimensions are normalized \(1/4 scale each\) to account for how ASAP7 presents itself in the PDK\. These results confirm that, for a DSIM, the dominant scaling constraint is keeping boundary data fresh, not local update speed\.

![Refer to caption](https://arxiv.org/html/2606.25313v1/supp10.png)Fig\. S10:7 nm physical design snapshot for oneL3=373L^\{3\}=37^\{3\}partition\.\(a\)Place\-and\-route view of an 8442 p\-bit partition \(6 neighbors per p\-bit\) using s\{4\}​\{1\}\\\{4\\\}\\\{1\\\}fixed\-point, occupying a815​μ815\\,\\mum×\\times815​μ815\\,\\mum footprint\.\(b\)Zoom of a25​μ25\\,\\mum×\\times37\.5​μ37\.5\\,\\mum window, with the logic of a single p\-bit highlighted in red\. A p\-bit’s standard cells do not form a contiguous rectangular block, as the placement tools distribute them over the surrounding area during optimization\. The design meets a 100 MHz target clock in ASAP7\[[61](https://arxiv.org/html/2606.25313#bib.bib61)\], confirming that the dominant scaling constraint for multi\-chip DSIMs is boundary exchange bandwidth, not local update logic\.Table S1:Representative metrics for one 8442 p\-bit partition in 7 nm CMOS\.

### S9Max\-Cut on the G81 Gset instance with 20,000 spins

This section provides the full experimental details and a reproducible spin configuration for the G81 Max\-Cut benchmark\. To demonstrate the distributed p\-computer on a well\-studied combinatorial optimization benchmark beyond spin glasses, we solve the G81 instance from the Gset library\[[52](https://arxiv.org/html/2606.25313#bib.bib52)\]\. G81 is a toroidal grid graph with20,00020\{,\}000spins and is among the largest and most intensively studied instances in the Max\-Cut literature\. The algorithm used here is adaptive parallel tempering with isoenergetic cluster moves \(APT\+ICM\)\[[23](https://arxiv.org/html/2606.25313#bib.bib23)\], a replica\-based method in which multiple copies of the Ising graph run at different temperatures with periodic replica exchanges, augmented by non\-local isoenergetic cluster moves that flip disagreeing spin clusters between replica pairs without changing their energies\. We employ 40 replicas organized as 10 inverse temperatures×\\times4 ICM replicas per temperature on DSIM\-1 atfp​\-​bit=1f\_\{\\mathrm\{p\\text\{\-\}bit\}\}=1MHz, the conservative synchronous limit that satisfies Eq\. \([2](https://arxiv.org/html/2606.25313#S3.E2)\), withNcolor=2N\_\{\\mathrm\{color\}\}=2and fixed\-point format s\{4\}​\{6\}\\\{4\\\}\\\{6\\\}\(the wider format accommodates the distinct inverse temperatures required by adaptive parallel tempering\), and run for10610^\{6\}Monte Carlo sweeps with one sweep per swap attempt\. Because the G81 graph with its 40 replicas does not fit on the 6\-FPGA chain simultaneously, only one replica runs on\-chip at a time\. The 40 replicas are run sequentially via time\-division multiplexing, and replica swap decisions are managed by a MATLAB host between successive on\-chip runs\. The 10 inverse temperatures areβ=\{2\.00,2\.05,2\.13,2\.22,2\.34,2\.52,2\.75,3\.13,3\.89,5\.61\}\\beta=\\\{2\.00,2\.05,2\.13,2\.22,2\.34,2\.52,2\.75,3\.13,3\.89,5\.61\\\}, obtained from the APT preprocessing procedure\[[72](https://arxiv.org/html/2606.25313#bib.bib72)\]\.

Table S2:Comparison of Max\-Cut results on the G81 Gset instance\(20,00020\{,\}000spins\)\. DSIM\-1 matches the certified\-optimal cut of14,06014\{,\}060, first found by the Cosm algorithm\[[53](https://arxiv.org/html/2606.25313#bib.bib53),[54](https://arxiv.org/html/2606.25313#bib.bib54)\]and proven optimal with an exact solver\[[54](https://arxiv.org/html/2606.25313#bib.bib54)\]\.Table[S2](https://arxiv.org/html/2606.25313#S0.T2)compares the DSIM\-1 result against reported values from other solvers\. The APT\+ICM run on DSIM\-1 finds the maximum cut value of14,06014\{,\}060in14%14\\%of independent trials and the second\-best value14,05814\{,\}058in70%70\\%of trials\. The ground state is degenerate: DSIM\-1 identifies multiple distinct spin configurations that all achieve the14,06014\{,\}060cut value\. Fig\.[S11](https://arxiv.org/html/2606.25313#S0.F11)shows the G81 graph and the distribution of cut values across independent trials\.

![Refer to caption](https://arxiv.org/html/2606.25313v1/supp11.png)Fig\. S11:Max\-Cut on the G81 Gset instance with 20,000 nodes \(DSIM\-1\)\.\(a\)The G81 toroidal grid graph with 20,000 spins and 40 replicas\.\(b\)Distribution of cut values found by APT\+ICM on DSIM\-1 across independent trials\. The system achieves the certified optimal cut of14,06014\{,\}060in14%14\\%of trials and concentrates70%70\\%of trials at14,05814\{,\}058\. The Cosm algorithm reports14,06014\{,\}060in approximately3%3\\%of trials\[[53](https://arxiv.org/html/2606.25313#bib.bib53)\]\(different sweep definition\)\.For verification, we provide the hexadecimal representation of one spin configuration achieving the cut value of14,06014\{,\}060\. Variables11through20,00020\{,\}000are encoded in order\. Expanding the hex string to binary and mapping\{0,1\}→\{−1,\+1\}\\\{0,1\\\}\\to\\\{\-1,\+1\\\}recovers the full spin assignment\.

```
0F2D95FA4FCE926841977C9594C45766672BD108BAB90163DDA3AC738A0DA6A2CA24A4AED00A27357143D175551E3D
BBCCE4FAA4542E6AF69EA75E3A7FA996B68FCD499048C7FF44DA1F592B2CB9D6E71FC22468F20D0B911C380A442CC7
B554E09D1FC3C9B058A0AB60B6803FAEEF51254E00BD4F23DD2B82DF22F153B7E965CEEEB3ECC21E574A578C491352
530EF7ED762BB78B1EC651C5ED1F70D06494F2B487D481DBDE559348FBA86A39BC77494415CEDF4CC42FB2520FE0F1
6D847E9E3D0D38E700F6B5C86FDE9104F92ADBA0AC9EA7B0D24A21F897E6E5D4563E3B9BCED8E07BC329A814B04A83
EF538BDECE3CEE73C867BE7811121B86C138BEEC61868CBAFD908E36140EE9763E7D78C266DFF014B8CC6B63FD0BEA
566A5EE9084BB1176C352E1E30907A7BB78CD4B1876D55C401886E87025454DEEB760506FA780BBB2844EA955D2E83
75C76BD39972AA53344B0B12933B4E6289BDE4E31BA1EADDB9671C7F515452641B6D1B7A90381218FC46D0CF4F5641
261B0C7AAD8AFB1A7CE9DEA1EC2249DDC5C43710F69CDB5D9D74ED9772FAAF66B713F0D7593B283F686C8233CA8E70
B6929D66272EE40AEA488A62A6A669EFED41E6A1B6900E6A1D52B81FFD784B96AF8CDC9B9084A79046B4F8781C5D2E
0DB6B521DCEEF258823A372E681EE014DDEA211E8564B88625928BF1855B647610A32C562C9738648149312714FCB7
2CAB01A17F5E30C551AF22B70C3F8E4177C5B76F3794E07F441F1B3E46A140EB7142FD1D0B3C2CABB4F2D1E05F4D36
455AA60CC6305308592CD1C0D21FACCC03FD2602382FD7848217EB68A46F4FF497A38F812FF0967D0F7F877DB8CBCA
95C579FDCA351C2FA10F124B5B27B7DC9C6B7D888C203B31C88F9B3CD2289DE8AD9C5D8C0BE49585D91E2F5A6BD8F4
19DB38BBF429C70D805BB75312D96D5A739CB5E4CA174B26A17979D6822B1868876E84D5A4521747EB1DFB3F911A5C
308622823515350A29976C9D71BBAE69B3A1622DEE201FE2B7AFA22A2E99315ACB89DD685509BF9460F63F822CC1A8
BABFFBD522827926FB5CE7E5017771E07257C43A97D2F7626D0EDBC5876D2F5AFAD25644570D09D58466DE950D5FAA
1E57E9BE2C288B4128027202C15AA45C007ECB4FA84CDE3271142F1C005A92ADE964D3557624FC8722D2B4FD302466
4F84DF3600B272DD7DC36E1309501A3E9CBCA5781FB4B5AB5658A731B73758FCC7E15031C68CB98DB0C2609699FFDA
54954167965EF055D0395F8B0031FACDB2031EDB3A1347D073C905375C9A449E194CF310433C9A9D1660E75D25FBF5
5D84239DFCE251EB125644D82FF3AFE8A32E50095E0CF74B1ED9932DC7E1B19AF320F05CC7555AAFA83BC4D714BE86
4071FFDA56B834C9EE85ABF9A10006D18651F6C19786716BFCFE99FDC94FECC1E48FE68910C1B1C52983CC24E88668
3BE054BA16E99111D10DFDF9920E25120FE4A6003B8631E0B3575CC89CC9640794A780F8CA473F5927766CD9CEEC97
F2FED8E277BEE121EC24F94DEA744D0F74E5C4300EA91579D695B92E8087588FB2A3271E69B520B7D6D682273EEC4B
C3D420DE71912D17BA9CF5F208EBDB71961E6DF27915C4AAE20B6451D66E7605F3343C273B0DDB379E7FEEE81AF1BC
3D4AAEB7E70F24F5194DC6B19BD358265ED437793B66A46F91A745568D612513F277073CBAB64DC9248615ACBF668D
008E0E77B3A381A69D5394339D5B8762FA6099A2CFC08B21A152AFDD22CE68FCC2028E43542CE940FEB8ABEC5C5FD5
6135D783BD570B83ED7F0F0C6BF5EB5D2A2EF4D144442C97C2957DED615BF5BC2C13285AE8B250EBB5A7ECF475F1FD
0AAA087D24F20DDEA203C1EBA82DDD152BCC7B7917D71E5C675F301715012A60AD9B8C84C483E7BA7CCEF9DDE24D10
A445978D4FF5316DEA4CB5BB90D6D55264C1E6968BD3F51B7A08D13F403A67948EB9AE35115A75F198FC81DC24E411
383D03B23943B2148D7C1BA48A90475425A8C5F862ABDB5507DEBBBFA0002DD10D70EB6DFA8ACB065C00D153A777E0
A71C555C8438E4F649598E0808BE81910E2D32C3AEC5B86441E330156CB0DB2E9F7D7CB3A7937425BA27E10C222857
6CF81F496260EFCCCC53D5FDA2272EAF5C1921FE892C06BA27FF30796BCD14F217CAEB5A755B3D8830A215ACE4FD68
7B51DFF4CEA5F322D21723FF17D5422A0FB2B26B8B6B4CCE9E318E0C1682FC23CD77417A97D61366EFE1521E287595
E0F148E8D0E879A81D7EDBEA8216F7091F843B961F01BA93F1F5593FE0ED73DC66B6E5F113F6773F22183BB1AFA6E8
E798F572E15C38EB73CAE39A119B7EF9756FC9EFE111CF4DE9D7EEC95051468776A93032E7D0BD7493FEF7E4FA1817
16F7CE02D961B1E5A2593DD8E8DC170A319C6F650CE6CD23C393AF11B7B2EE274482FFDA13A1B14FADEF2443E93CD4
21B304D79197B9E3BA5512C3481400013A05AD8A1AD060905BBA1A8915F515B61EBEB98F52555B5F26C0A91F1FCBEE
7C7EBDF2006420228CBC6BB731C0929236C74497806876FDE4554364589E6AF15DE324AB711265F4F5621F3BF072E4
5CEB267F56ADFF390AE7CFFC52CB2FF37AC95974ABCF578238EBE1AF07715955867DD48CF20CFA099F80F1172B0B71
12F7280FF2048C3A88F9FC9AD8D5C2FA1FAC7185A9B694881D8DDE80B1F152A77FD81C029CEC0B3C869C107D01D04E
72DAB9E12B720CA7834894EE02CE67C4EA5AAD8C884BEE75C9DB24AB02005D64126A6CBFB36F9142F409B52B2729F1
B97A0785095F348138B852D73757B86D1C2D62DB10CA1C4004964252BDDBC732A954EB1082AA76EC5FCF23D204C760
1DF3BE8CEDD5C571DD314D67C60282B9595D557AE0A8A1D06CA5418D9639024721091969C06541AF82035F5C0E7672
40E572726AD2071E077CC6178D398FAF0CC0B0279B1ECF9755477D466DC1A9206C3ADD66AB2605EEACCF417DF609A3
59B651F681DC06866FE3DC12A87A72E30664C2B7383348824718631F14C60AC26E8B4BC9FEFC8BAE14AE9CC0577C79
ADB37F036B91714033C614ACA80098F0DCCBF950A08DF4DF20EF7B20BAFD5FA0A3071E0558817E4B40DE4F1BE14406
F1B8D852599E23D8425BC0E398BC31806E7E9C10A299CACDBC2BF44A6BAB2DDCE1BC9F76556DA76795C59C82F6C963
2DAF2F71BEC89EFFD9FFFE7DE2D28B4C33A9E15BEDB64ACAB0AFEE54447C4241B7D78F8CA7204896E6A471C4BFC78E
46CC6C4BA617CBB50C8D20C6B69B1F5D9375B384F0F5574CF087D48602A33EE804EB11C9483101F0B992AD51ED6A5C
2092D79C67CCE89688A9385CA2F50025E8DA76F02D93522BCBCB3857C2D23BE4F4A4A46DEF66C0F308801B7DE27ED6
E943CF526A7DCB73B3634437A168A4398FAA60D2620CC141518BEE671F6AAA33229898676E6169633F8DF4DEEDF6A0
713E449D862D28717CD30A9E5BC91F051A7C8D2DEE1E0FA0593578E07A8A0B89EFE2A73734AC445033091F36DFF460
32FAD75D7080B970BB
```

### S10SLR\-driven partitioning at extreme scale

When scaling to one million p\-bits on DSIM\-2 \(Fig\.[5](https://arxiv.org/html/2606.25313#S3.F5)of the main text\), partitioning across boards is not the only constraint\. On modern large FPGAs, physical resources are segmented into Super Logic Regions \(SLRs\), connected by a limited number of inter\-SLR routing resources \(super long lines\)\. Without an explicit min\-cut at SLR boundaries, too many signals must cross between SLRs, exhausting the available inter\-SLR routing and preventing the design from closing\. For theL3=1003L^\{3\}=100^\{3\}EA benchmark and the large 3SAT benchmark, the implementation therefore introduces a second level of partitioning inside each FPGA\. Rather than treating each FPGA as a single partition, each SLR hosts a sub\-partition with its own local weight memory and update logic\. With 18 FPGAs and 4 SLRs per FPGA, this yields 72 sub\-partitions\. The same shadow\-weight principle is applied at SLR boundaries: if an interaction crosses an SLR boundary, the associated weight is duplicated on both sides so that every SLR can compute its local fields from local memory, while only the required boundary p\-bit states are exchanged across the internal boundary\. Fig\.[S12](https://arxiv.org/html/2606.25313#S0.F12)shows the assembled 18\-FPGA Siemens Veloce proFPGA CS system\.

![Refer to caption](https://arxiv.org/html/2606.25313v1/supp12.png)Fig\. S12:DSIM\-2 hardware\.The 18\-FPGA Siemens Veloce proFPGA CS system used for the million\-p\-bit experiments\.
### S11Irregular planted Ising graphs on Pegasus and Zephyr topologies

The main text reports that the DSIM architecture handles non\-cubic graph topologies without modification\. This section presents the supporting data\.

A planted Ising instance is one in which a ground\-state spin configuration is chosen first and the coupling weights are then constructed so that this configuration is a ground state of known energy\[[55](https://arxiv.org/html/2606.25313#bib.bib55),[56](https://arxiv.org/html/2606.25313#bib.bib56)\]\. Because the ground\-state energy is known by construction, planted instances provide a controlled benchmark for verifying that an optimizer reaches it\.

Fig\.[S13](https://arxiv.org/html/2606.25313#S0.F13)shows representative energy traces for a Pegasus P41 instance \(39,04039\{,\}040p\-bits,Ncolor=2N\_\{\\mathrm\{color\}\}=2\) on 2 FPGAs atfp​\-​bit=5f\_\{\\mathrm\{p\\text\{\-\}bit\}\}=5MHz, and a Zephyr Z50 instance \(80,80080\{,\}800p\-bits,Ncolor=6N\_\{\\mathrm\{color\}\}=6\) on 6 FPGAs atfp​\-​bit≈1\.67f\_\{\\mathrm\{p\\text\{\-\}bit\}\}\\approx 1\.67MHz, both implemented on a subset of DSIM\-2 with METIS partitioning\. Both use fixed\-point format s\{4\}​\{3\}\\\{4\\\}\\\{3\\\}and a simulated annealing scheduleβ=0\.5,0\.625,…,10\\beta=0\.5,0\.625,\\ldots,10\. These traces are presented as single representative runs to demonstrate that the distributed architecture handles non\-cubic topologies\. A full statistical analysis with multiple instances and runs, as performed for the EA benchmarks, was not carried out for these planted instances because the primary goal here is a capability demonstration rather than a scaling measurement\. In both panels, the horizontal axis is the readout index from uniform sampling over a fixed10610^\{6\}\-sweep window, and the dashed lines indicate the planted ground\-state energies\. The simulated annealing energy converges toward the planted ground state over the course of the run, confirming that the distributed p\-computer successfully minimizes planted Ising instances on topologies native to quantum annealers\.

![Refer to caption](https://arxiv.org/html/2606.25313v1/supp13.png)Fig\. S13:Irregular planted Ising graphs on Pegasus and Zephyr topologies \(DSIM\-2\)\.\(a\)Energy trace for a Pegasus P41 instance with39,04039\{,\}040p\-bits \(Ncolor=2N\_\{\\mathrm\{color\}\}=2, 2 FPGAs,fp​\-​bit=5f\_\{\\mathrm\{p\\text\{\-\}bit\}\}=5MHz\)\.\(b\)Energy trace for a Zephyr Z50 instance with80,80080\{,\}800p\-bits \(Ncolor=6N\_\{\\mathrm\{color\}\}=6, 6 FPGAs,fp​\-​bit≈1\.67f\_\{\\mathrm\{p\\text\{\-\}bit\}\}\\approx 1\.67MHz\)\. Both are implemented on a subset of DSIM\-2 with METIS partitioning, using fixed\-point format s\{4\}​\{3\}\\\{4\\\}\\\{3\\\}and a simulated annealing scheduleβ=0\.5,0\.625,…,10\\beta=0\.5,0\.625,\\ldots,10\. The horizontal axis is readout index from uniform sampling over a fixed10610^\{6\}\-sweep window\. Dashed lines indicate planted ground\-state energies\. Single representative run for each topology\.
### S12Large invertible 3SAT instance with 250,011 p\-bits

We now move from structured Pegasus and Zephyr topologies to a highly irregular satisfiability benchmark\. The EA spin\-glass benchmarks in the main text use a regular lattice geometry, but a distributed p\-computer is not limited to that case\. Once weights are stored in local memories and only boundary states are exchanged, the same update logic can operate on highly irregular sparse graphs, provided the graph coloring schedule is chosen to respect the connectivity\. To stress this capability on a hard irregular instance, we consider a random 3\-literal satisfiability \(3SAT\) problem near the satisfiability phase transition\[[57](https://arxiv.org/html/2606.25313#bib.bib57)\]\. Prior pairwise Ising demonstrations of SAT have been limited to much smaller scales\[[58](https://arxiv.org/html/2606.25313#bib.bib58)\]; recent work has employed native higher\-order interactions to encode clauses more compactly\[[59](https://arxiv.org/html/2606.25313#bib.bib59)\]\. The instance parameters are

m=55,558​clauses,n=13,042​variables,m=55\{,\}558\\text\{ clauses\},\\qquad n=13\{,\}042\\text\{ variables\},so the clause\-to\-variable ratio isα=m/n≈4\.26\\alpha=m/n\\approx 4\.26\. The instance was generated using the CNFgen tool\[[75](https://arxiv.org/html/2606.25313#bib.bib75)\]as a uniform random 3SAT formula, where each clause is drawn independently and uniformly from all possible 3\-literal clauses overnnvariables\. Becauseα≈4\.26\\alpha\\approx 4\.26lies at the satisfiability phase transition for random 3SAT \(theoretical thresholdαc≈4\.267\\alpha\_\{c\}\\approx 4\.267\[[57](https://arxiv.org/html/2606.25313#bib.bib57)\]\), satisfiability of this particular instance is not guaranteed\. The goal is to stress the DSIM on a large, hard, irregular graph\. After mapping the problem to an Ising model through an invertible logic circuit and applying copy\-gate sparsification\[[35](https://arxiv.org/html/2606.25313#bib.bib35),[41](https://arxiv.org/html/2606.25313#bib.bib41),[69](https://arxiv.org/html/2606.25313#bib.bib69)\], the resulting sparse graph contains250,011250\{,\}011p\-bits\. This scale is reached because the mapping introduces auxiliary copy p\-bits\[[41](https://arxiv.org/html/2606.25313#bib.bib41)\]that preserve locality and sparsity while keeping the energy function consistent with the original SAT constraints\. Fig\.[S14](https://arxiv.org/html/2606.25313#S0.F14)summarizes the formulation at a systems level\. The first panel sketches the invertible logic network used to represent clauses and variable literals, the second panel shows the resulting sparse graph, and the third panel places the chosen instance on the standard SAT probability curve versusα\\alpha\. Copy conflicts introduced by sparsification are resolved by majority vote when decoding a logical assignment: when multiple p\-bit copies represent the same logical variable, the logical value is taken as the majority across those copies\.

![Refer to caption](https://arxiv.org/html/2606.25313v1/supp14.png)Fig\. S14:Formulation of a large invertible 3SAT instance\.\(a\)Schematic of the invertible logic network used to encode 3SAT in an Ising circuit\.\(b\)Visualization of the resulting sparse Ising graph after copy\-gate sparsification, with 250,011 p\-bits\.\(c\)Schematic SAT probability curve versusα=m/n\\alpha=m/n, with the chosen instance atα≈4\.26\\alpha\\approx 4\.26near the phase transition\[[57](https://arxiv.org/html/2606.25313#bib.bib57)\]\. Copy conflicts are resolved by majority vote over duplicated variables when decoding\.To evaluate progress, we track the number of satisfied clauses versus sweeps under a fixed annealing schedule\. The instance is implemented on DSIM\-2 atfp​\-​bit=0\.5f\_\{\\mathrm\{p\\text\{\-\}bit\}\}=0\.5MHz with METIS partitioning, using a 4\-color schedule \(Ncolor=4N\_\{\\mathrm\{color\}\}=4\) and the fixed\-point format s\{4\}​\{3\}\\\{4\\\}\\\{3\\\}\. The simulated annealing schedule isβ=0\.5,0\.625,…,10\\beta=0\.5,0\.625,\\ldots,10\. For comparison, we run the same schedule on an NVIDIA RTX 6000 Ada GPU\. Fig\.[S15](https://arxiv.org/html/2606.25313#S0.F15)shows satisfied clauses versus sweeps on a semilog scale\. The curves track closely in their asymptotic improvement versus sweeps, indicating that DSIM\-2 reproduces the scaling behavior of an optimized classical baseline even on a highly irregular graph with a larger coloring requirement than the EA lattice\.

At10910^\{9\}sweeps the best single run on the FPGA satisfies55,41655\{,\}416of55,55855\{,\}558clauses \(99\.74%99\.74\\%\) with142142clauses remaining unsatisfied\. The GPU reaches a comparable level \(55,40855\{,\}408best,150150unsatisfied\)\. The instance is not fully solved by either platform under the annealing schedule used, consistent with the extreme difficulty of random 3SAT near the phase transition at this scale\. Nevertheless, the close agreement between the two platforms over more than seven orders of magnitude in sweep count confirms that DSIM\-2 preserves optimization scaling even on highly irregular graphs\.

![Refer to caption](https://arxiv.org/html/2606.25313v1/supp15.png)Fig\. S15:Scaling on a large invertible 3SAT instance with 250,011 p\-bits \(DSIM\-2\)\.Average best satisfied clauses per run versus sweeps for the instance of Fig\.[S14](https://arxiv.org/html/2606.25313#S0.F14), implemented on DSIM\-2 atfp​\-​bit=0\.5f\_\{\\mathrm\{p\\text\{\-\}bit\}\}=0\.5MHz and on an NVIDIA RTX 6000 Ada GPU under the same schedule\. Each point is the mean of the best \(maximum\) satisfied\-clause count found within each of 10 independent runs\. Error bars denote 95% bootstrap confidence intervals across runs\. At10910^\{9\}sweeps the single best run \(DSIM\-2\) satisfies55,41655\{,\}416of55,55855\{,\}558clauses \(99\.74%99\.74\\%\), with142142clauses remaining unsatisfied\.
### S13Architectural comparisons

The main text shows that the DSIM’s stochastic nature allows it to tolerate boundary delay in a way that deterministic architectures cannot\. This section provides the concrete comparisons behind that claim\.

GPU and TPU clusters attack the memory wall with high\-bandwidth memory and high\-bandwidth interconnects\[[76](https://arxiv.org/html/2606.25313#bib.bib76),[77](https://arxiv.org/html/2606.25313#bib.bib77)\]\. That strategy is well matched to dense deterministic workloads, but it depends on substantial communication and memory infrastructure to keep the computation synchronized across devices\. Wafer\-scale systems such as Cerebras\[[78](https://arxiv.org/html/2606.25313#bib.bib78)\]take a different route by minimizing package boundaries and keeping a very large amount of SRAM on one piece of silicon, reducing off\-chip traffic at the cost of an unusually large monolithic implementation\.

Other multi\-chip accelerators keep the modularity of smaller dies while trying to soften the cost of communication\. Simba\[[79](https://arxiv.org/html/2606.25313#bib.bib79)\]is an example of a tiled multi\-chip accelerator, while Illusion and related architectures\[[9](https://arxiv.org/html/2606.25313#bib.bib9),[33](https://arxiv.org/html/2606.25313#bib.bib33)\]treat a network of chips as an approximation to a larger virtual system and identify inter\-chip bandwidth as a central scaling constraint\. CHIMERA\[[80](https://arxiv.org/html/2606.25313#bib.bib80)\]further uses nonvolatile resistive random\-access memory \(RRAM\) to reduce idle energy through temporal gating\. Across these systems, the common challenge is the same: once the model no longer fits on one device, communication and memory movement begin to shape performance\.

The DSIM differs in both what is communicated and how much delay can be tolerated\. By storing all coupling weights on\-chip through shadow weights, the DSIM limits inter\-device traffic to 1\-bit boundary states\. Because the computation is stochastic, delayed boundary information does not break the dynamics but instead introduces a controlled accuracy\-throughput tradeoff quantified by a single parameterη\\eta\. As demonstrated in the main text, this allows the designer to trade a small exponent penalty for higher throughput, making multi\-chip scaling a tunable design knob and opening a path to larger probabilistic computers built from networks of smaller devices\.

## References

- Sze et al\. \[2017\]Vivienne Sze, Yu\-Hsin Chen, Tien\-Ju Yang, and Joel S Emer\.Efficient processing of deep neural networks: A tutorial and survey\.*Proceedings of the IEEE*, 105\(12\):2295–2329, 2017\.
- Boroumand et al\. \[2018\]Amirali Boroumand, Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Daehyun Kim, Aki Kuusela, Allan Knies, Parthasarathy Ranganathan, et al\.Google workloads for consumer devices: Mitigating data movement bottlenecks\.In*Proceedings of the twenty\-third international conference on architectural support for programming languages and operating systems*, pages 316–331, 2018\.
- King et al\. \[2023\]Andrew D\. King, Jack Raymond, Trevor Lanting, Richard Harris, Alex Zucca, Fabio Altomare, Andrew J\. Berkley, Kelly Boothby, Sara Ejtemaee, Colin Enderud, Emile Hoskinson, Shuiyuan Huang, Eric Ladizinsky, Allison J\. R\. MacDonald, Gaelen Marsden, Reza Molavi, Travis Oh, Gabriel Poulin\-Lamarre, Mauricio Reis, Chris Rich, Yuki Sato, Nicholas Tsai, Mark Volkmann, Jed D\. Whittaker, Jason Yao, Anders W\. Sandvik, and Mohammad H\. Amin\.Quantum critical dynamics in a 5,000\-qubit programmable spin glass\.*Nature*, 617\(7959\):61–66, April 2023\.ISSN 1476\-4687\.
- King et al\. \[2025\]Andrew D King, Alberto Nocera, Marek M Rams, Jacek Dziarmaga, Roeland Wiersema, William Bernoudy, Jack Raymond, Nitin Kaushal, Niclas Heinsdorf, Richard Harris, et al\.Beyond\-classical computation in quantum simulation\.*Science*, 388\(6743\):199–204, 2025\.
- Mohseni et al\. \[2024\]Masoud Mohseni, Artur Scherer, K Grace Johnson, Oded Wertheim, Matthew Otten, Namit Anand, Navid Anjum Aadit, Yuri Alexeev, Gilad Ben\-Shach, Kirk M Bresniker, et al\.How to build a quantum supercomputer: Scaling from hundreds to millions of qubits\.*arXiv preprint arXiv:2411\.10406*, 2024\.
- Honjo et al\. \[2021\]Toshimori Honjo, Tomohiro Sonobe, Kensuke Inaba, Takahiro Inagaki, Takuya Ikuta, Yasuhiro Yamada, Takushi Kazama, Koji Enbutsu, Takeshi Umeki, Ryoichi Kasahara, et al\.100,000\-spin coherent Ising machine\.*Science advances*, 7\(40\):eabh0952, 2021\.
- Takemoto et al\. \[2021\]Takashi Takemoto, Kasho Yamamoto, Chihiro Yoshimura, Masato Hayashi, Masafumi Tada, Hiroaki Saito, Mayumi Mashimo, and Masanao Yamaoka\.4\.6 A 144Kb annealing system composed of 9×\\times16Kb annealing processor chips with scalable chip\-to\-chip connections for large\-scale combinatorial optimization problems\.In*2021 IEEE International Solid\-State Circuits Conference \(ISSCC\)*, volume 64, pages 64–66\. IEEE, 2021\.
- Horowitz \[2014\]Mark Horowitz\.1\.1 computing’s energy problem \(and what we can do about it\)\.In*2014 IEEE international solid\-state circuits conference digest of technical papers \(ISSCC\)*, pages 10–14\. IEEE, 2014\.
- Radway et al\. \[2021\]Robert M Radway, Andrew Bartolo, Paul C Jolly, Zainab F Khan, Binh Q Le, Pulkit Tandon, Tony F Wu, Yunfeng Xin, Elisa Vianello, Pascal Vivet, et al\.Illusion of large on\-chip memory by networked computing chips for neural network inference\.*Nature Electronics*, 4\(1\):71–80, 2021\.
- Dayo et al\. \[2025\]Samuel Dayo, Shuhan Liu, Peijing Li, Philip Levis, Subhasish Mitra, Thierry Tambe, David Tennenhouse, and H\-S Philip Wong\.The Future of Memory: Limits and Opportunities\.*arXiv preprint arXiv:2508\.20425*, 2025\.
- Gholami et al\. \[2024\]Amir Gholami, Zhewei Yao, Sehoon Kim, Coleman Hooper, Michael W Mahoney, and Kurt Keutzer\.AI and memory wall\.*IEEE Micro*, 44\(3\):33–39, 2024\.
- Kirkpatrick et al\. \[1983\]Scott Kirkpatrick, C Daniel Gelatt, and Mario P Vecchi\.Optimization by simulated annealing\.*Science*, 220\(4598\):671–680, 1983\.
- Lucas \[2014\]Andrew Lucas\.Ising formulations of many NP problems\.*Frontiers in Physics*, 2:5, 2014\.ISSN 2296\-424X\.
- Mohseni et al\. \[2022\]Naeimeh Mohseni, Peter L McMahon, and Tim Byrnes\.Ising machines as hardware solvers of combinatorial optimization problems\.*Nature Reviews Physics*, 4\(6\):363–379, 2022\.
- Finocchio et al\. \[2024\]Giovanni Finocchio, Jean Anne C Incorvia, Joseph S Friedman, Qu Yang, Anna Giordano, Julie Grollier, Hyunsoo Yang, Florin Ciubotaru, Andrii V Chumak, Azad J Naeemi, et al\.Roadmap for unconventional computing with nanotechnology\.*Nano Futures*, 8\(1\):012001, 2024\.
- Singh et al\. \[2024\]Nihal Sanjay Singh, Keito Kobayashi, Qixuan Cao, Kemal Selcuk, Tianrui Hu, Shaila Niazi, Navid Anjum Aadit, Shun Kanai, Hideo Ohno, Shunsuke Fukami, and Kerem Y Camsari\.CMOS plus stochastic nanomagnets enabling heterogeneous computers for probabilistic inference and learning\.*Nature Communications*, 15\(1\):2685, 2024\.
- Edwards and Anderson \[1975\]Samuel Frederick Edwards and Phil W Anderson\.Theory of spin glasses\.*Journal of Physics F: Metal Physics*, 5\(5\):965, 1975\.
- Barahona \[1982\]Francisco Barahona\.On the computational complexity of Ising spin glass models\.*Journal of Physics A: Mathematical and General*, 15\(10\):3241, 1982\.
- Parisi \[1979\]Giorgio Parisi\.Infinite number of order parameters for spin\-glasses\.*Physical Review Letters*, 43\(23\):1754, 1979\.
- Mézard et al\. \[1987\]Marc Mézard, Giorgio Parisi, and Miguel Angel Virasoro\.*Spin glass theory and beyond: An Introduction to the Replica Method and Its Applications*, volume 9\.World Scientific Publishing Company, 1987\.
- Fisher and Huse \[1986\]Daniel S Fisher and David A Huse\.Ordered phase of short\-range Ising spin\-glasses\.*Physical review letters*, 56\(15\):1601, 1986\.
- Sherrington and Kirkpatrick \[1975\]David Sherrington and Scott Kirkpatrick\.Solvable model of a spin\-glass\.*Physical review letters*, 35\(26\):1792, 1975\.
- Chowdhury et al\. \[2025\]Shuvro Chowdhury, Navid Anjum Aadit, Andrea Grimaldi, Eleonora Raimondo, Atharva Raut, P Aaron Lott, Johan H Mentink, Marek M Rams, Federico Ricci\-Tersenghi, Massimo Chiappini, et al\.Pushing the boundary of quantum advantage in hard combinatorial optimization with probabilistic computers\.*Nature Communications*, 16\(1\):9193, 2025\.
- Belletti et al\. \[2008\]Francesco Belletti, Maria Cotallo, Andrés Cruz, Luis Antonio Fernandez, Antonio Gordillo\-Guerrero, Marco Guidetti, Andrea Maiorano, Filippo Mantovani, Enzo Marinari, Victor Martin\-Mayor, et al\.Janus: An FPGA\-based system for high\-performance scientific computing\.*Computing in Science & Engineering*, 11\(1\):48–58, 2008\.
- Baity\-Jesi et al\. \[2014\]Marco Baity\-Jesi, Rachel A Baños, Andres Cruz, Luis Antonio Fernandez, José Miguel Gil\-Narvión, Antonio Gordillo\-Guerrero, David Iniguez, Andrea Maiorano, Filippo Mantovani, Enzo Marinari, et al\.Janus II: A new generation application\-driven computer for spin\-system simulations\.*Computer Physics Communications*, 185\(2\):550–559, 2014\.
- Lulli et al\. \[2015\]Matteo Lulli, Massimo Bernaschi, and Giorgio Parisi\.Highly optimized simulations on single\- and multi\-GPU systems of the 3D Ising spin glass model\.*Computer Physics Communications*, 196:290–303, 2015\.
- Bernaschi et al\. \[2024\]Massimo Bernaschi, Isidoro González\-Adalid Pemartín, Víctor Martín\-Mayor, and Giorgio Parisi\.The QISG suite: High\-performance codes for studying quantum Ising spin glasses\.*Computer Physics Communications*, 298:109101, 2024\.
- Goto et al\. \[2019\]Hayato Goto, Kosuke Tatsumura, and Alexander R Dixon\.Combinatorial optimization by simulating adiabatic bifurcations in nonlinear Hamiltonian systems\.*Science advances*, 5\(4\):eaav2372, 2019\.
- Goto et al\. \[2021\]Hayato Goto, Kotaro Endo, Masaru Suzuki, Yoshisato Sakai, Taro Kanao, Yohei Hamakawa, Ryo Hidaka, Masaya Yamasaki, and Kosuke Tatsumura\.High\-performance combinatorial optimization based on classical mechanics\.*Science Advances*, 7\(6\):eabe7953, 2021\.
- Tatsumura et al\. \[2021\]Kosuke Tatsumura, Masaya Yamasaki, and Hayato Goto\.Scaling out Ising machines using a multi\-chip architecture for simulated bifurcation\.*Nature Electronics*, 4\(3\):208–217, 2021\.
- Kashimata et al\. \[2024\]Tomoya Kashimata, Masaya Yamasaki, Ryo Hidaka, and Kosuke Tatsumura\.Efficient and scalable architecture for multiple\-chip implementation of simulated bifurcation machines\.*IEEE Access*, 12:36606–36621, 2024\.
- Sharma et al\. \[2022\]Anshujit Sharma, Richard Afoakwa, Zeljko Ignjatovic, and Michael Huang\.Increasing ising machine capacity with multi\-chip architectures\.In*Proceedings of the 49th Annual International Symposium on Computer Architecture*, pages 508–521, 2022\.
- Srimani et al\. \[2024\]Tathagata Srimani, Robert Radway, Masoud Mohseni, Kerem Çamsarı, and Subhasish Mitra\.Next\-generation probabilistic computing hardware with 3D MOSAICs, Illusion scale\-up, and co\-design\.*arXiv preprint arXiv:2409\.11422*, 2024\.
- Camsari et al\. \[2019\]Kerem Y Camsari, Brian M Sutton, and Supriyo Datta\.P\-bits for probabilistic spin logic\.*Applied Physics Reviews*, 6\(1\):011305, 2019\.
- Camsari et al\. \[2017a\]Kerem Yunus Camsari, Rafatul Faria, Brian M Sutton, and Supriyo Datta\.Stochastic p\-bits for invertible logic\.*Physical Review X*, 7\(3\):031014, 2017a\.
- Kaiser and Datta \[2021\]Jan Kaiser and Supriyo Datta\.Probabilistic computing with p\-bits\.*Applied Physics Letters*, 119\(15\):150503, 2021\.
- Camsari et al\. \[2015\]Kerem Yunus Camsari, Samiran Ganguly, and Supriyo Datta\.Modular approach to spintronics\.*Scientific reports*, 5:10571, 2015\.
- Camsari et al\. \[2017b\]Kerem Yunus Camsari, Sayeef Salahuddin, and Supriyo Datta\.Implementing p\-bits with embedded MTJ\.*IEEE Electron Device Letters*, 38\(12\):1767–1770, 2017b\.
- Borders et al\. \[2019\]William A Borders, Ahmed Z Pervaiz, Shunsuke Fukami, Kerem Y Camsari, Hideo Ohno, and Supriyo Datta\.Integer factorization using stochastic magnetic tunnel junctions\.*Nature*, 573\(7774\):390–393, 2019\.
- Kaiser et al\. \[2019\]Jan Kaiser, Avinash Rustagi, Kerem Y Camsari, Jonathan Z Sun, Supriyo Datta, and Pramey Upadhyaya\.Subnanosecond fluctuations in low\-barrier nanomagnets\.*Physical Review Applied*, 12\(5\):054056, 2019\.
- Aadit et al\. \[2022a\]Navid Anjum Aadit, Andrea Grimaldi, Mario Carpentieri, Luke Theogarajan, John M Martinis, Giovanni Finocchio, and Kerem Y Camsari\.Massively parallel probabilistic computing with sparse Ising machines\.*Nature Electronics*, 5\(7\):460–468, 2022a\.
- Nikhar et al\. \[2024\]Srijan Nikhar, Sidharth Kannan, Navid Anjum Aadit, Shuvro Chowdhury, and Kerem Y Camsari\.All\-to\-all reconfigurability with sparse and higher\-order Ising machines\.*Nature Communications*, 15\(1\):8977, 2024\.
- Oguchi \[1951\]Takehiko Oguchi\.Statistics of the Three\-Dimensional Ferromagnet \(III\)\.*Journal of the Physical Society of Japan*, 6\(1\):31–35, 1951\.
- Bethe \[1935\]Hans A Bethe\.Statistical theory of superlattices\.*Proceedings of the Royal Society of London\. Series A\-Mathematical and Physical Sciences*, 150\(871\):552–575, 1935\.
- Pelizzola \[2005\]Alessandro Pelizzola\.Cluster variation method in statistical physics and probabilistic graphical models\.*Journal of Physics A: Mathematical and General*, 38\(33\):R309–R339, 2005\.
- Yamamoto \[2009\]Daisuke Yamamoto\.Correlated cluster mean\-field theory for spin systems\.*Physical Review B*, 79\(14\):144427, 2009\.
- Xing et al\. \[2012\]Eric P\. Xing, Michael I\. Jordan, and Stuart Russell\.A generalized mean field algorithm for variational inference in exponential families\.*arXiv preprint arXiv:1212\.2512*, 2012\.
- Karypis and Kumar \[1998\]George Karypis and Vipin Kumar\.A software package for partitioning unstructured graphs, partitioning meshes, and computing fill\-reducing orderings of sparse matrices\.*University of Minnesota, Department of Computer Science and Engineering, Army HPC Research Center, Minneapolis, MN*, 38:7–1, 1998\.
- Sanders and Schulz \[2013\]Peter Sanders and Christian Schulz\.KaHIP v3\.00–Karlsruhe High Quality Partitioning–User Guide\.*arXiv preprint arXiv:1311\.1714*, 2013\.
- AMD \[2023\]AMD\.AMD Versal Premium VP1902 Adaptive SoC, 2023\.Available at[https://www\.amd\.com/en/products/adaptive\-socs\-and\-fpgas/versal/premium\-series/vp1902\.html](https://www.amd.com/en/products/adaptive-socs-and-fpgas/versal/premium-series/vp1902.html)\.
- Salmon et al\. \[2011\]John K Salmon, Mark A Moraes, Ron O Dror, and David E Shaw\.Parallel random numbers: as easy as 1, 2, 3\.In*Proceedings of 2011 international conference for high performance computing, networking, storage and analysis*, pages 1–12, 2011\.
- Ye et al\. \[2003\]Yinyu Ye et al\.Gset: A benchmark library for graph partitioning and Max\-Cut, 2003\.Available at[https://web\.stanford\.edu/˜yyye/yyye/Gset/](https://web.stanford.edu/~yyye/yyye/Gset/)\.
- Zick \[2025\]Kenneth M Zick\.Performance report of heuristic algorithm that cracked the largest Gset Ising problems \(G81 cut = 14060\)\.*arXiv preprint arXiv:2505\.18508*, 2025\.
- Zick et al\. \[2026\]Kenneth M Zick, Nikhil Shukla, and Alexander Marakov\.Cosm: Collective Switched Motion for Fast and Accurate Sparse Ising Optimization\.*arXiv preprint arXiv:2605\.30355*, 2026\.
- Aadit et al\. \[2022b\]Navid Anjum Aadit, Andrea Grimaldi, Giovanni Finocchio, and Kerem Y\. Camsari\.Physics\-inspired Ising Computing with Ring Oscillator Activated p\-bits\.In*2022 IEEE 22nd International Conference on Nanotechnology \(NANO\)*, pages 393–396, 2022b\.
- Hen \[2019\]Itay Hen\.Equation planting: a tool for benchmarking Ising machines\.*Physical Review Applied*, 12\(1\):011003, 2019\.
- Mézard et al\. \[2002\]Marc Mézard, Giorgio Parisi, and Riccardo Zecchina\.Analytic and algorithmic solution of random satisfiability problems\.*Science*, 297\(5582\):812–815, 2002\.
- Cılasun et al\. \[2024\]Hüsrev Cılasun, Ziqing Zeng, Ramprasath S, Abhimanyu Kumar, Hao Lo, William Cho, William Moy, Chris H Kim, Ulya R Karpuzcu, and Sachin S Sapatnekar\.3SAT on an all\-to\-all\-connected CMOS Ising solver chip\.*Scientific reports*, 14\(1\):10757, 2024\.
- Kim and Sim \[2025\]Jaeyeong Kim and Jae\-Yoon Sim\.An 8K\-Spin Ising Machine IC with Reconfigurable Many\-Body Spin Interactions and Limitless Multichip Extension\.In*2025 Symposium on VLSI Technology and Circuits \(VLSI Technology and Circuits\)*, pages 1–3\. IEEE, 2025\.
- Johnson et al\. \[2013\]Matthew J Johnson, James Saunderson, and Alan Willsky\.Analyzing Hogwild parallel Gaussian Gibbs sampling\.*Advances in neural information processing systems*, 26:2715–2723, 2013\.
- Clark et al\. \[2016\]Lawrence T Clark, Vinay Vashishtha, Lucian Shifren, Aditya Gujja, Saurabh Sinha, Brian Cline, Chandarasekaran Ramamurthy, and Greg Yeric\.ASAP7: A 7\-nm finFET predictive process design kit\.*Microelectronics Journal*, 53:105–115, 2016\.
- Consortium et al\. \[2022\]UCIe Consortium et al\.Universal Chiplet Interconnect Express \(UCIe\) Specification\.*UCIe Consortium, Technical Specification\. July*, 2022\.
- Ardalan et al\. \[2020\]Shahab Ardalan, Halil Cirit, Ramin Farjad, Mark Kuemerle, Ken Poulton, Suresh Subramanian, and Bapiraju Vinnakota\.Bunch of wires: An open die\-to\-die interface\.In*2020 IEEE Symposium on High\-Performance Interconnects \(HOTI\)*, pages 9–16\. IEEE, 2020\.
- Choi et al\. \[2025\]S Choi, A Raut, T Wu, S Dayo, A Bechdolt, G Dutta, S Li, DT Rich, RH Yang, AC Yu, et al\.Foundry Monolithic 3D Unlocks Large Throughput Benefits: 3D Memory with Tucked Sense Amplifiers \+ Logic using Heterogeneous Silicon CMOS \+ Resistive RAM \+ Carbon Nanotube FETs\.In*2025 IEEE International Electron Devices Meeting \(IEDM\)*, pages 1–4\. IEEE, 2025\.
- Tirumala and Wong \[2024\]Ajay Tirumala and Raymond Wong\.NVIDIA Blackwell Platform: Advancing Generative AI and Accelerated Computing\.In*2024 IEEE Hot Chips 36 Symposium \(HCS\)*, pages 1–33\. IEEE Computer Society, 2024\.
- Selcuk et al\. \[2025\]Kemal Selcuk, Navid Anjum Aadit, Corentin Delacour, Jared Quintana Silva, Nihal Sanjay Singh, Haruna Kaneko, Shun Kanai, Yu\-Jui Wu, Yi\-Hsuan Chen, Yu\-Sheng Chen, et al\.DAC\-Free p\-bits: Asynchronous Self\-Coloring and On\-Chip Annealing\.In*2025 IEEE International Electron Devices Meeting \(IEDM\)*, pages 1–4\. IEEE, 2025\.
- Duffee et al\. \[2025\]Christian Duffee, Jordan Athas, Yixin Shao, Noraica Davila Melendez, Eleonora Raimondo, Jordan A Katine, Kerem Y Camsari, Giovanni Finocchio, and Pedram Khalili Amiri\.An integrated\-circuit\-based probabilistic computer that uses voltage\-controlled magnetic tunnel junctions as its entropy source\.*Nature Electronics*, 8\(9\):784–793, 2025\.
- Daniel et al\. \[2024\]John Daniel, Zheng Sun, Xuejian Zhang, Yuanqiu Tan, Neil Dilley, Zhihong Chen, and Joerg Appenzeller\.Experimental demonstration of an on\-chip p\-bit core based on stochastic magnetic tunnel junctions and 2D MoS2 transistors\.*Nature Communications*, 15\(1\):4098, 2024\.
- Grimaldi et al\. \[2022\]Andrea Grimaldi, Luis Sánchez\-Tejerina, Navid Anjum Aadit, Stefano Chiappini, Mario Carpentieri, Kerem Camsari, and Giovanni Finocchio\.Spintronics\-compatible Approach to Solving Maximum\-Satisfiability Problems with Probabilistic Computing, Invertible Logic, and Parallel Tempering\.*Physical Review Applied*, 17:024052, Feb 2022\.
- Weiss \[1907\]Pierre Weiss\.L’hypothèse du champ moléculaire et la propriété ferromagnétique\.*Journal de Physique Théorique et Appliquée*, 6\(1\):661–690, 1907\.
- Zimmer et al\. \[2016\]FM Zimmer, M Schmidt, and Jonas Maziero\.Quantum correlated cluster mean\-field theory applied to the transverse Ising model\.*Physical Review E*, 93\(6\):062116, 2016\.
- Aadit et al\. \[2023\]Navid Anjum Aadit, Masoud Mohseni, and Kerem Y Camsari\.Accelerating Adaptive Parallel Tempering with FPGA\-based p\-bits\.In*2023 VLSI Technology and Circuits Symposium*, pages 1–2\. IEEE, 2023\.
- Shylo and Shylo \[2017\]Volodymyr P Shylo and Oleg V Shylo\.Algorithm portfolios and teams in parallel optimization\.In*Optimization Methods and Applications: In Honor of Ivan V\. Sergienko’s 80th Birthday*, pages 481–493\. Springer, 2017\.
- Shylo et al\. \[2015\]VP Shylo, Fred Glover, and IV Sergienko\.Teams of global equilibrium search algorithms for solving the weighted maximum cut problem in parallel\.*Cybernetics and Systems Analysis*, 51\(1\):16–24, 2015\.
- Lauria et al\. \[2017\]Massimo Lauria, Jan Elffers, Jakob Nordström, and Marc Vinyals\.CNFgen: A generator of crafted benchmarks\.In*International Conference on Theory and Applications of Satisfiability Testing*, pages 464–473\. Springer, 2017\.
- Jouppi et al\. \[2017\]Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al\.In\-datacenter performance analysis of a tensor processing unit\.In*Proceedings of the 44th annual international symposium on computer architecture*, pages 1–12, 2017\.
- NVIDIA Corporation \[2016\]NVIDIA Corporation\.NVIDIA Tesla P100: The Most Advanced Datacenter Accelerator Ever Built\.Technical Report WP\-08019\-001\_v01\.1, NVIDIA Corporation, 2016\.
- Lie \[2023\]Sean Lie\.Cerebras architecture deep dive: First look inside the hardware/software co\-design for deep learning\.*IEEE Micro*, 43\(3\):18–30, 2023\.
- Shao et al\. \[2019\]Yakun Sophia Shao, Jason Clemons, Rangharajan Venkatesan, Brian Zimmer, Matthew Fojtik, Nan Jiang, Ben Keller, Alicia Klinefelter, Nathaniel Pinckney, Priyanka Raina, et al\.Simba: Scaling deep\-learning inference with multi\-chip\-module\-based architecture\.In*Proceedings of the 52nd annual IEEE/ACM international symposium on microarchitecture*, pages 14–27, 2019\.
- Prabhu et al\. \[2022\]Kartik Prabhu, Albert Gural, Zainab F Khan, Robert M Radway, Massimo Giordano, Kalhan Koul, Rohan Doshi, John W Kustin, Timothy Liu, Gregorio B Lopes, et al\.CHIMERA: A 0\.92\-TOPS, 2\.2\-TOPS/W edge AI accelerator with 2\-MByte on\-chip foundry resistive RAM for efficient training and inference\.*IEEE Journal of Solid\-State Circuits*, 57\(4\):1013–1026, 2022\.

Similar Articles

20000 Gates and 20 MIPS [pdf]

Hacker News Top

Historical 1990 paper from Amdahl detailing the design of a 20 MIPS mainframe processor built with 20,000 logic gates.

EMiX: Emulating Beyond Single-FPGA Limits

Hacker News Top

Introduces EMiX, a scalable multi-FPGA framework for emulating multi-core RISC-V architectures beyond single-FPGA resource limits, demonstrated with a 64-core system across eight FPGAs.