Parallelizing Counterfactual Regret Minimization

arXiv cs.AI 05/15/26, 04:00 AM Papers
Summary
This paper presents a parallelization framework for CFR algorithms using linear algebra operations, achieving up to four orders of magnitude speedup on GPU compared to CPU implementations.
arXiv:2605.14277v1 Announce Type: new Abstract: Parallelization has played an instrumental role in the field of artificial intelligence (AI), drastically reducing the time taken to train and evaluate large AI models. In contrast to its impact in the broader field of AI, applying parallelization to computational game solving is relatively unexplored, despite its great potential. In this paper, we parallelize the family of counterfactual regret minimization (CFR) algorithms, which were central to important breakthroughs for solving large imperfect-information games. We present a generalized parallelization framework, reframing CFR as a series of linear algebra operations. Then, existing techniques for parallelizing linear algebra operations can be applied to accelerate CFR. We also describe how our technique can be applied to other tabular members of the CFR family of algorithms, including the state-of-the-art, such as CFR+, discounted CFR, and predictive variants of CFR. Experimentally, we show that our CFR implementation on a GPU is up to four orders of magnitude faster than Google DeepMind OpenSpiel's CFR implementations on a CPU.
Original Article
View Cached Full Text
Cached at: 05/15/26, 06:23 AM
# Parallelizing Counterfactual Regret Minimization
Source: [https://arxiv.org/html/2605.14277](https://arxiv.org/html/2605.14277)
Juho Kim Computer Science Department Carnegie Mellon University juhok@cs\.cmu\.edu &Tuomas Sandholm Computer Science Department, CMU Strategic Machine, Inc\. Strategy Robot, Inc\. Optimized Markets, Inc\. sandholm@cs\.cmu\.edu

###### Abstract

Parallelization has played an instrumental role in the field of artificial intelligence \(AI\), drastically reducing the time taken to train and evaluate large AI models\. In contrast to its impact in the broader field of AI, applying parallelization to computational game solving is relatively unexplored, despite its great potential\. In this paper, we parallelize the family of counterfactual regret minimization \(CFR\) algorithms, which were central to important breakthroughs for solving large imperfect\-information games\. We present a generalized parallelization framework, reframing CFR as a series of linear algebra operations\. Then, existing techniques for parallelizing linear algebra operations can be applied to accelerate CFR\. We also describe how our technique can be applied to other tabular members of the CFR family of algorithms, including the state\-of\-the\-art, such as CFR\+, discounted CFR, and predictive variants of CFR\. Experimentally, we show that our CFR implementation on a GPU is up to four orders of magnitude faster than Google DeepMind OpenSpiel’s CFR implementations on a CPU\.

## 1Introduction

Historically,parallelizationserved as a key enabler in the field of artificial intelligence \(AI\)\[[9](https://arxiv.org/html/2605.14277#bib.bib52),[16](https://arxiv.org/html/2605.14277#bib.bib53),[30](https://arxiv.org/html/2605.14277#bib.bib54)\], dramatically reducing the time taken for researchers to train large AI models on massive datasets by splitting the problem into smaller subproblems to be solved concurrently\. This drastic reduction in the training time also facilitated much faster hypothesis testing compared to when serial execution is used\. In hindsight, the explosive growth in the performance and capability of AI models in a myriad of real\-life applications would have been impossible without parallelization\.

In stark contrast to its broad impact in AI in general, parallelization is relatively unexplored in the area of computational game solving, although the need for parallelism is clearly evidenced by the fact that past landmark works on solving large games, such as Cepheus\[[31](https://arxiv.org/html/2605.14277#bib.bib17)\], Libratus\[[5](https://arxiv.org/html/2605.14277#bib.bib47)\], and Pluribus\[[6](https://arxiv.org/html/2605.14277#bib.bib18)\], all leveraged some form of parallelism\. Due to the sheer size of the games involved, developing these AI agents would have been infeasible without parallelization\. Additionally, there are ample reasons to apply parallelization even for more regular game\-solving tasks\. Indeed, we later show that many games, even those that are commonly used as benchmarks for game\-solving algorithms, provide substantial room for parallelization, using which speedup ofmultiple orders of magnitudecan be achieved\. Thus, leveraging this would allow researchers in the field to obtain much faster turnaround times in testing their theoretical results, hypotheses, or hyperparameter regimes\.

In computational game theory, the family ofcounterfactual regret minimization \(CFR\)algorithms\[[37](https://arxiv.org/html/2605.14277#bib.bib1)\]has been used to develop AI agents for large\-scale imperfect\-information games, most notably poker\[[31](https://arxiv.org/html/2605.14277#bib.bib17),[25](https://arxiv.org/html/2605.14277#bib.bib20),[6](https://arxiv.org/html/2605.14277#bib.bib18),[8](https://arxiv.org/html/2605.14277#bib.bib19)\]\. Variants of CFR have since been proposed to improve its convergence rate\[[32](https://arxiv.org/html/2605.14277#bib.bib2),[7](https://arxiv.org/html/2605.14277#bib.bib4),[12](https://arxiv.org/html/2605.14277#bib.bib38),[34](https://arxiv.org/html/2605.14277#bib.bib5),[35](https://arxiv.org/html/2605.14277#bib.bib57),[36](https://arxiv.org/html/2605.14277#bib.bib56)\], leverage Monte Carlo runouts\[[23](https://arxiv.org/html/2605.14277#bib.bib16)\], and incorporate deep learning\[[3](https://arxiv.org/html/2605.14277#bib.bib28),[24](https://arxiv.org/html/2605.14277#bib.bib29)\]\. In this paper, we propose a generalized parallelization framework, which reconsiders CFR and its variants as a series of linear algebra operations\. Then, we can apply standard techniques for parallelizing linear algebra operations to parallelize CFR and its variants\. While our framework does not necessarily increase the size of the game that can be solved, we experimentally show that it can make finding an optimal solution to these games dramatically faster\. Our experiments involving seven games of varying sizes show that our parallelized implementation on a GPU is up to about 18,889 and 3,413 times faster than Google DeepMind OpenSpiel’s Python and C\+\+ implementations on a CPU, respectively\. The speedup becomes more pronounced as the size of the game being solved grows\.

## 2Notation and background

This section defines the notation used in this paper and reviews CFR\.

### 2\.1Sequence\-form CFR

In this paper, we consider parallelizing the sequence\-form\[[33](https://arxiv.org/html/2605.14277#bib.bib43),[20](https://arxiv.org/html/2605.14277#bib.bib42),[10](https://arxiv.org/html/2605.14277#bib.bib31)\]version of CFR, as opposed to its classical version\[[37](https://arxiv.org/html/2605.14277#bib.bib1)\]\. Note that classical CFR and sequence\-form CFR are equivalent in the sense that they produce identical iterates, so parallelizing sequence\-form CFR also entails indirectly parallelizing classical CFR\. We will give more details about this choice later in our discussions\.

In the sequence\-form version, each player runs CFR to minimize the counterfactual regret of their respectivetree\-form sequential decision processes \(TFSDPs\), which are derived from the game tree\. In a TFSDP, every nodep∈𝒫p\\in\\mathcal\{P\}is either in the set of decision points𝒥\\mathcal\{J\}, the set of observation points𝒦\\mathcal\{K\}, or is an end of the decision process⊥\\bot\. At a decision pointjj, an available actiona∈𝒜ja\\in\\mathcal\{A\}\_\{j\}can be applied, andρ\(j,a\)\\rho\(j,a\)is used to denote the child node reached by applyingaaatjj\. Similarly, a signals∈𝒮ks\\in\\mathcal\{S\}\_\{k\}can be observed at an observation pointkk, and the corresponding child is denoted asρ\(k,s\)\\rho\(k,s\)\.

In every decision pointjj, a local regret minimizerℜj\\mathfrak\{R\}\_\{j\}mixes the set of available actions𝒜j\\mathcal\{A\}\_\{j\}\. In theory, any regret minimization algorithm that operates over the probability simplex can be used as a regret minimizer\. Locally, CFR makes use of regret matching \(RM\)\[[17](https://arxiv.org/html/2605.14277#bib.bib8)\], which outputs L1\-normalized floored regrets as probabilities on each iteration\. Other popular choices include RM\+, used by CFR\+\[[32](https://arxiv.org/html/2605.14277#bib.bib2)\], discounted RM, used by discounted CFR \(DCFR\)\[[7](https://arxiv.org/html/2605.14277#bib.bib4)\], and predictive RM \(PRM\) \(respectively, PRM\+\), used by predictive CFR \(PCFR\) \(respectively, PCFR\+\)\[[12](https://arxiv.org/html/2605.14277#bib.bib38)\]\.

LetΣ\+=\{\(j,a\):∀j∈𝒥,a∈𝒜j\}\\Sigma^\{\+\}=\\left\\\{\(j,a\):\\forall j\\in\\mathcal\{J\},a\\in\\mathcal\{A\}\_\{j\}\\right\\\}be the set of non\-empty sequences andΣ=Σ\+∪\{∅\}\\Sigma=\\Sigma^\{\+\}\\cup\\\{\\emptyset\\\}be the set of \(possibly empty\) sequences\. Here, we use∅\\emptysetto denote the empty sequence\. Withpjp\_\{j\}the parent sequence of a decision pointjj, a sequence\-form polytope can be defined as follows:

𝒳=\{x→∈ℝ≥0Σ:x→\[∅\]=1,∀j∈𝒥:∑a∈𝒜jx→\[\(j,a\)\]=x→\[pj\]\}\.\\mathcal\{X\}=\\left\\\{\\vec\{x\}\\in\\mathbb\{R\}\_\{\\geq 0\}^\{\\Sigma\}:\\vec\{x\}\[\\emptyset\]=1,\\forall j\\in\\mathcal\{J\}:\\sum\_\{a\\in\\mathcal\{A\}\_\{j\}\}\\vec\{x\}\[\(j,a\)\]=\\vec\{x\}\[p\_\{j\}\]\\right\\\}\.The CFR family of algorithms is a family of regret minimizers operating over the sequence\-form polytope\. On each iterationtt, CFR constructs a behavior strategyb→j\(t\)∈Δ\(𝒜j\)∀j∈𝒥\\vec\{b\}\_\{j\}^\{\(t\)\}\\in\\Delta\(\\mathcal\{A\}\_\{j\}\)\\ \\forall j\\in\\mathcal\{J\}from the outputs of its local regret minimizers, which is then converted into a sequence\-form strategyx→\(t\)∈𝒳\\vec\{x\}^\{\(t\)\}\\in\\mathcal\{X\}prior to being returned\. We denote this procedure byNextStrategy\. Then, CFR observes utilityu→\(t\)∈ℝΣ\\vec\{u\}^\{\(t\)\}\\in\\mathbb\{R\}^\{\\Sigma\}which is then converted into counterfactual utilitiesu→j\(t\)∈ℝ𝒜j∀j∈𝒥\\vec\{u\}\_\{j\}^\{\(t\)\}\\in\\mathbb\{R\}^\{\\mathcal\{A\}\_\{j\}\}\\ \\forall j\\in\\mathcal\{J\}to be passed to its local regret minimizers\. Henceforth, we denote this procedure byObserveUtility\. A complete pseudocode description of CFR in sequence\-form is provided in Algorithm[1](https://arxiv.org/html/2605.14277#algorithm1)\. In Line[1](https://arxiv.org/html/2605.14277#algorithm1), we assume that scalar division \(//\) produces an arbitrary strategy when‖\(r→j\(t−1\)\)\+‖1=0\\left\\lVert\\left\(\\vec\{r\}\_\{j\}^\{\(t\-1\)\}\\right\)^\{\+\}\\right\\rVert\_\{1\}=0\.

The per\-iteration time complexity of a player’s instance of sequence\-form CFR is linear with respect to the size of the player’s corresponding TFSDP\. This differs from the per\-iteration time complexity of classical CFR\[[37](https://arxiv.org/html/2605.14277#bib.bib1)\]: linear with respect to the size of the game tree\. The number of nodes in the game tree is often, but not always, greater than the total number of nodes across its player TFSDPs, so the tree\-traversal parts of sequence\-form CFR tend to be more efficient than classical CFR\.

That said, unlike classical CFR, sequence\-form CFR requires an extra step of calculating each player’s utility on each iteration, whose worst\-case time complexity is linear with respect to the size of the game tree\. However, a sparse matrix\-vector multiplication can be carried out in two\-player settings for this purpose \(or tensor operations in multiplayer settings\), which has a small constant factor in practice and can be easily parallelized\. As a result, in practice, sequence\-form CFR tends to yield performance improvements over the classical implementation of CFR\.

### 2\.2Parallelized game solvers

Unpublished works byReis \[[28](https://arxiv.org/html/2605.14277#bib.bib24)\]and byRudolf \[[29](https://arxiv.org/html/2605.14277#bib.bib25)\]have parallelized CFR on CUDA and found orders of magnitude speedups\. However, in Rudolf’s work, threads are assigned to every node, which then moves up the tree; this, in the worst case, results inquadraticwork\. This contrasts with the linear time complexity of CFR\. Reis’s work has several reproducibility issues \(e\.g\., an inaccessible codebase and code snippets containing syntax errors\)\. Both implementations require each thread to perform a “large number of control flow statements”\[[28](https://arxiv.org/html/2605.14277#bib.bib24)\]andgeneralizedkernel instructions\.

Instead, we reframe CFR as linear algebra operations in our framework\. Parallelizing linear algebra operations is a well\-studied problem in the field of systems, and one can frequently take advantage of optimized opcodes\. As it is platform\-independent, our approach is also more general than the aforementioned approaches\. Additionally, unlike Reis’s, our implementation is open\-source and compatible with games in OpenSpiel\[[22](https://arxiv.org/html/2605.14277#bib.bib6)\], a popular software library for game\-theoretic primitives\.

Another notable game\-solving algorithm is the adaptation of the excessive gap technique \(EGT\)\[[26](https://arxiv.org/html/2605.14277#bib.bib48)\]for solving extensive\-form games\[[14](https://arxiv.org/html/2605.14277#bib.bib49)\]\.Hodaet al\.\[[18](https://arxiv.org/html/2605.14277#bib.bib50)\]reduce the memory footprint to a square root of the original, whileGilpin and Sandholm \[[15](https://arxiv.org/html/2605.14277#bib.bib51)\]introduce sampling for speedup\. Both parallelize the matrix\-vector product during the expected utility calculation to achieve speedups \(this is akin to calculating the utilities passed to sequence\-form CFR\)\. Further,Kroeret al\.\[[21](https://arxiv.org/html/2605.14277#bib.bib41)\]introduce GPU implementations of the EGT, but they exploit the unique topology of poker TFSDPs by only parallelizing across the immediate children of the root node in the TFSDP, which can potentiallylimitparallelism in general, particularly when the root has few children\.

Parallelization was instrumental in several imperfect\-information game\-solving breakthroughs, including Libratus\[[6](https://arxiv.org/html/2605.14277#bib.bib18)\]and Cepheus\[[31](https://arxiv.org/html/2605.14277#bib.bib17)\]– superhuman or nearly optimal heads\-up Texas hold’em agents\. We first discuss the architecture used in Libratus\. During its blueprint strategy computation, Libratus’s workload was distributed \(typically\) over 1 \+ 195 compute nodes, depending on the card rollouts\. However, the variant of CFR used by Libratus, Monte Carlo CFR \(MCCFR\)\[[23](https://arxiv.org/html/2605.14277#bib.bib16)\], is a stochastic regret minimizer\[[11](https://arxiv.org/html/2605.14277#bib.bib40)\]\. As it operates under the setting ofstochasticregret minimization, its parallelization scheme cannot be readily adapted for use in CFR or its tabular variants\.

In contrast, the engineering of Cepheus involved parallelizing CFR\+, which is tabular\. During game solving, Cepheus split the game of heads\-up limit Texas Hold’em after the second betting round into a trunk and subgames\. Then, the subtrees were assigned to a unique compute node\. On each iteration, the trunk node sent probabilities to the subtree compute nodes, which, in turn, returned values back to the trunk node so it could finish its calculations\. Here, the subtree nodes wait for the trunk node to send probabilities, and the trunk node hangs until the subtree computations are finished\. Our parallelization scheme is different from that of Cepheus primarily in that a\) we effectively split the game at every depth, potentially yielding higher parallelism, and b\) we offer a domain\-independent parallelization framework\. Most importantly, our contribution isnot mutually exclusivewith Cepheus’s architecture: our technique can be augmented to Cepheus for further parallelism\.

## 3Parallelizing CFR and its tabular variants

1Inputa TFSDP\.

21exNextStrategy\(\)\(\):

//Calculate behavioral strategy

3for*each decision pointj∈𝒥j\\in\\mathcal\{J\}*do

4

Δ\(𝒜j\)∋b→j\(t\)≔\(r→j\(t−1\)\)\+/‖\(r→j\(t−1\)\)\+‖1\\Delta\(\\mathcal\{A\}\_\{j\}\)\\ni\\vec\{b\}\_\{j\}^\{\(t\)\}\\coloneqq\\left\(\\vec\{r\}\_\{j\}^\{\(t\-1\)\}\\right\)^\{\+\}/\\left\\lVert\\left\(\\vec\{r\}\_\{j\}^\{\(t\-1\)\}\\right\)^\{\+\}\\right\\rVert\_\{1\};

5

//Convert to sequence\-form

6

x→\(t\)\[∅\]≔1\\vec\{x\}^\{\(t\)\}\[\\emptyset\]\\coloneqq 1;

7for*each decision pointj∈𝒥j\\in\\mathcal\{J\}, top\-down*do

8for*each actiona∈𝒜ja\\in\\mathcal\{A\}\_\{j\}*do

9

x→\(t\)\[\(j,a\)\]≔x→\(t\)\[pj\]⋅b→j\(t\)\[a\]\\vec\{x\}^\{\(t\)\}\[\(j,a\)\]\\coloneqq\\vec\{x\}^\{\(t\)\}\[p\_\{j\}\]\\cdot\\vec\{b\}\_\{j\}^\{\(t\)\}\[a\];

10

11return

x→\(t\)∈𝒳\\vec\{x\}^\{\(t\)\}\\in\\mathcal\{X\};

12

131exObserveUtility\(u→\(t\)∈ℝΣ\)\\left\(\\vec\{u\}^\{\(t\)\}\\in\\mathbb\{R\}^\{\\Sigma\}\\right\):

//Memoize counterfactual utilities

14

v→\(t\)≔0→∈ℝ𝒫\\vec\{v\}^\{\(t\)\}\\coloneqq\\vec\{0\}\\in\\mathbb\{R\}^\{\\mathcal\{P\}\};

15for*each nodep∈𝒫p\\in\\mathcal\{P\}, bottom\-up*do

16if*𝒥∋j≔p\\mathcal\{J\}\\ni j\\coloneqq p*then

17

v→\(t\)\[j\]≔∑a∈𝒜jbj\(t\)\[a\]⋅\(u→\(t\)\[\(j,a\)\]\+v→\(t\)\[ρ\(j,a\)\]\)\\vec\{v\}^\{\(t\)\}\[j\]\\coloneqq\\sum\_\{a\\in\\mathcal\{A\}\_\{j\}\}b\_\{j\}^\{\(t\)\}\[a\]\\cdot\(\\vec\{u\}^\{\(t\)\}\[\(j,a\)\]\+\\vec\{v\}^\{\(t\)\}\[\\rho\(j,a\)\]\);

18

19else if*𝒦∋k≔p\\mathcal\{K\}\\ni k\\coloneqq p*then

20

v→\(t\)\[k\]≔∑s∈𝒮kv→\(t\)\[ρ\(k,s\)\]\\vec\{v\}^\{\(t\)\}\[k\]\\coloneqq\\sum\_\{s\\in\\mathcal\{S\}\_\{k\}\}\\vec\{v\}^\{\(t\)\}\[\\rho\(k,s\)\];

21

//Update counterfactual regrets

22for*each decision pointj∈𝒥j\\in\\mathcal\{J\}*do

23for*each actiona∈𝒜ja\\in\\mathcal\{A\}\_\{j\}*do

24

u→j\(t\)\[a\]≔u→\(t\)\[\(j,a\)\]\+v→\(t\)\[ρ\(j,a\)\]\\vec\{u\}\_\{j\}^\{\(t\)\}\[a\]\\coloneqq\\vec\{u\}^\{\(t\)\}\[\(j,a\)\]\+\\vec\{v\}^\{\(t\)\}\[\\rho\(j,a\)\];

25

26

ℝ𝒜j∋r→j\(t\)≔r→j\(t−1\)\+u→j\(t\)−\(b→j\(t\)\)⊤u→j\(t\)\\mathbb\{R\}^\{\\mathcal\{A\}\_\{j\}\}\\ni\\vec\{r\}\_\{j\}^\{\(t\)\}\\coloneqq\\vec\{r\}\_\{j\}^\{\(t\-1\)\}\+\\vec\{u\}\_\{j\}^\{\(t\)\}\-\\left\(\\vec\{b\}\_\{j\}^\{\(t\)\}\\right\)^\{\\top\}\\vec\{u\}\_\{j\}^\{\(t\)\};

27

Algorithm 1CFR\[[37](https://arxiv.org/html/2605.14277#bib.bib1),[10](https://arxiv.org/html/2605.14277#bib.bib31)\]
1Input:A TFSDP\.

21exNextStrategy\(\)\(\):

//Calculate behavioral strategy

3

ℝ≥0Σ\+∋b→\(t\)≔\(r→\(t−1\)\)\+⊘\(𝐂⊤\(𝐂\(r→\(t−1\)\)\+\)\)\\mathbb\{R\}\_\{\\geq 0\}^\{\\Sigma^\{\+\}\}\\ni\\vec\{b\}^\{\(t\)\}\\coloneqq\\left\(\\vec\{r\}^\{\(t\-1\)\}\\right\)^\{\+\}\\oslash\\left\(\\mathbf\{C\}^\{\\top\}\\left\(\\mathbf\{C\}\\,\\left\(\\vec\{r\}^\{\(t\-1\)\}\\right\)^\{\+\}\\right\)\\right\);

//Convert to sequence\-form

4

y→\(t\)≔e→p∗∈ℝ≥0𝒫\\vec\{y\}^\{\(t\)\}\\coloneqq\\vec\{e\}\_\{p^\{\*\}\}\\in\\mathbb\{R\}\_\{\\geq 0\}^\{\\mathcal\{P\}\};

5for*each level𝐋\(d\)\\mathbf\{L\}^\{\(d\)\}with weightsb→\(t\)\\vec\{b\}^\{\(t\)\}, top\-down*do

6

y→\(t\)≔y→\(t\)\+\(𝐋\(d\)\)⊤y→\(t\)\\vec\{y\}^\{\(t\)\}\\coloneqq\\vec\{y\}^\{\(t\)\}\+\\left\(\\mathbf\{L\}^\{\(d\)\}\\right\)^\{\\top\}\\vec\{y\}^\{\(t\)\};

7

8

𝒳∋x→\(t\)=𝐀⊤y→\(t\)\\mathcal\{X\}\\ni\\vec\{x\}^\{\(t\)\}=\\mathbf\{A\}^\{\\top\}\\vec\{y\}^\{\(t\)\};

9return

x→\(t\)\\vec\{x\}^\{\(t\)\};

10

111exObserveUtility\(u→\(t\)∈ℝΣ\)\\left\(\\vec\{u\}^\{\(t\)\}\\in\\mathbb\{R\}^\{\\Sigma\}\\right\):

//Memoize counterfactual utilities

12

v→\(t\)≔0→∈ℝ𝒫\\vec\{v\}^\{\(t\)\}\\coloneqq\\vec\{0\}\\in\\mathbb\{R\}^\{\\mathcal\{P\}\};

13

w→\(t\)≔𝐀u→\(t\)∈ℝ𝒫\\vec\{w\}^\{\(t\)\}\\coloneqq\\mathbf\{A\}\\,\\vec\{u\}^\{\(t\)\}\\in\\mathbb\{R\}^\{\\mathcal\{P\}\};

14for*each level𝐋\(d\)\\mathbf\{L\}^\{\(d\)\}with weightsb→\(t\)\\vec\{b\}^\{\(t\)\}, bottom\-up*do

15

v→\(t\)≔v→\(t\)\+𝐋\(d\)w→\(t\)\+𝐋\(d\)v→\(t\)\\vec\{v\}^\{\(t\)\}\\coloneqq\\vec\{v\}^\{\(t\)\}\+\\mathbf\{L\}^\{\(d\)\}\\,\\vec\{w\}^\{\(t\)\}\+\\mathbf\{L\}^\{\(d\)\}\\,\\vec\{v\}^\{\(t\)\};

16

//Update counterfactual regrets

17

ℝΣ\+∋q→\(t\)≔𝐁⊤\(w→\(t\)\+v→\(t\)\)\\mathbb\{R\}^\{\\Sigma^\{\+\}\}\\ni\\vec\{q\}^\{\(t\)\}\\coloneqq\\mathbf\{B\}^\{\\top\}\\left\(\\vec\{w\}^\{\(t\)\}\+\\vec\{v\}^\{\(t\)\}\\right\);

18

ℝΣ\+∋r→\(t\)≔r→\(t−1\)\+q→\(t\)−𝐂⊤\(𝐂\(b→\(t\)⊙q→\(t\)\)\)\\mathbb\{R\}^\{\\Sigma^\{\+\}\}\\ni\\vec\{r\}^\{\(t\)\}\\coloneqq\\vec\{r\}^\{\(t\-1\)\}\+\\vec\{q\}^\{\(t\)\}\-\\mathbf\{C\}^\{\\top\}\\left\(\\mathbf\{C\}\\left\(\\vec\{b\}^\{\(t\)\}\\odot\\vec\{q\}^\{\(t\)\}\\right\)\\right\);

19

4\.55em

Algorithm 2Parallelized CFR \(ours\)

We are now ready to present our results\. As previously alluded to, we parallelize CFR by rewriting it in linear algebra, after which we can apply standard parallelization techniques for linear algebra operations\. Later, we demonstrate that our parallelization framework can easily be applied to other mainstream tabular variants of CFR, such as CFR\+\[[32](https://arxiv.org/html/2605.14277#bib.bib2)\], DCFR\[[7](https://arxiv.org/html/2605.14277#bib.bib4)\], PCFR, and PCFR\+\[[12](https://arxiv.org/html/2605.14277#bib.bib38)\]\. We begin this section with an overview of our technique\.

### 3\.1Conversion via logic matrices

Sequence\-form CFR, whose pseudocode is shown in Algorithm[1](https://arxiv.org/html/2605.14277#algorithm1), involves vectors whose values are associated with either sequences \(i\.e\., inℝΣ\\mathbb\{R\}^\{\\Sigma\}\) or nodes \(i\.e\., inℝ𝒫\\mathbb\{R\}^\{\\mathcal\{P\}\}\)\. For example, at any timett, an output sequence\-form strategyx→\(t\)∈𝒳⊂ℝ≥0Σ\\vec\{x\}^\{\(t\)\}\\in\\mathcal\{X\}\\subset\\mathbb\{R\}^\{\\Sigma\}\_\{\\geq 0\}and observed utilityu→\(t\)∈ℝΣ\\vec\{u\}^\{\(t\)\}\\in\\mathbb\{R\}^\{\\Sigma\}associates each of their element with a sequence\(j,a\)∈Σ\(j,a\)\\in\\Sigma\. Similarly, in theObserveUtilityfunction, memoized counterfactual utilitiesv→∈ℝ𝒫\\vec\{v\}\\in\\mathbb\{R\}^\{\\mathcal\{P\}\}associate each element with a nodep∈𝒫p\\in\\mathcal\{P\}\. As we plan on using these values in linear algebra operations, CFR presents a need to reconcile values in different ‘representations’\. We do so by converting vectors betweenℝΣ\\mathbb\{R\}^\{\\Sigma\}andℝ𝒫\\mathbb\{R\}^\{\\mathcal\{P\}\}using logic matrices\.

Logic matrices are matrices whose entries are in the Boolean domain\{0,1\}\\\{0,1\\\}\. We show how a logic matrix can be used to convert vectors between a per\-sequence representation \(i\.e\.,ℝΣ\\mathbb\{R\}^\{\\Sigma\}\) and a per\-node representation \(i\.e\.,ℝ𝒫\\mathbb\{R\}^\{\\mathcal\{P\}\}\)\. By this, we mean to associate withρ\(j,a\)\\rho\(j,a\)a value previously associated with a sequence\(j,a\)\(j,a\)and vice versa\. Note thatρ\(⋅\)\\rho\(\\cdot\)maps an empty sequence∅\\emptysetto the root node\. We denote this logic matrix by𝐀∈ℝ𝒫×Σ\\mathbf\{A\}\\in\\mathbb\{R\}^\{\\mathcal\{P\}\\times\\Sigma\}and define each of its element as follows:Ap,\(j,a\)=𝟏p=ρ\(j,a\)A\_\{p,\(j,a\)\}=\\operatorname\{\\mathbf\{1\}\}\_\{p=\\rho\(j,a\)\}\. Here,𝟏\(⋅\)\\operatorname\{\\mathbf\{1\}\}\_\{\(\\cdot\)\}is an indicator function, which outputs11when the condition in brackets is true or0if otherwise\. Then, we can trivially convert between a vectorz→∈ℝ𝒫\\vec\{z\}\\in\\mathbb\{R\}^\{\\mathcal\{P\}\}and its per\-sequence representationz→′∈ℝΣ\\vec\{z\}\\,^\{\\prime\}\\in\\mathbb\{R\}^\{\\Sigma\}by calculatingz→=𝐀z→′\\vec\{z\}=\\mathbf\{A\}\\,\\vec\{z\}\\,^\{\\prime\}\. We also havez→′=𝐀⊤z→\\vec\{z\}\\,^\{\\prime\}=\\mathbf\{A\}^\{\\top\}\\vec\{z\}\. We further remark thatnnz\(𝐀\)=\|Σ\|nnz\(\\mathbf\{A\}\)=\|\\Sigma\|, wherennz\(⋅\)nnz\(\\cdot\)is a function that returns the number of non\-zero elements, and each row or column of𝐀\\mathbf\{A\}has at most a single non\-zero value\. These are later used in our complexity analysis\.

Additionally, by definition, a behavioral strategyb→\(t\)\\vec\{b\}^\{\(t\)\}at any timettcan be represented with a vector inℝΣ\+\\mathbb\{R\}^\{\\Sigma^\{\+\}\}\. Recall thatΣ\+\\Sigma^\{\+\}is the set of non\-empty sequences and contains all pairs of decision points and available actions\. Henceforth, we refer to vectors inℝΣ\+\\mathbb\{R\}^\{\\Sigma^\{\+\}\}as being in the behavioral representation\. Behaviorally\-represented vectors have exactly one less element than those in a per\-sequence representation: the empty sequence∅\\emptysetis no longer associated with a value\. The behavioral representation is also sufficient to capture cumulative regretsr→\(t\)\\vec\{r\}^\{\(t\)\}\. As done previously, we can define the following logic matrix𝐁=𝐀:,2:∈ℝ𝒫×Σ\+\\mathbf\{B\}=\\mathbf\{A\}\_\{:,2:\}\\in\\mathbb\{R\}^\{\\mathcal\{P\}\\times\\Sigma^\{\+\}\}\(assuming the first column in𝐀\\mathbf\{A\}corresponds to∅\\emptyset\)\. With this, we can translate fromz→∈ℝ𝒫\\vec\{z\}\\in\\mathbb\{R\}^\{\\mathcal\{P\}\}to its behavioral representationz→′′∈ℝΣ\+\\vec\{z\}\\,^\{\\prime\\prime\}\\in\\mathbb\{R\}^\{\\Sigma^\{\+\}\}by calculatingz→′′=𝐁⊤z→\\vec\{z\}\\,^\{\\prime\\prime\}=\\mathbf\{B\}^\{\\top\}\\,\\vec\{z\}\. Note thatnnz\(𝐁\)=\|Σ\+\|nnz\(\\mathbf\{B\}\)=\|\\Sigma^\{\+\}\|and each row/column of𝐁\\mathbf\{B\}has at most a single non\-zero\.

Next, we show how a logic matrix can be used to combine values associated with available actions in each decision point\. This ability is crucial in calculating behavioral strategiesb→\(t\)\\vec\{b\}^\{\(t\)\}and cumulative regretsr→\(t\)\\vec\{r\}^\{\(t\)\}at anytt\. For example, to calculate the behavioral strategy inNextStrategy, the floored counterfactual regrets must be L1 normalized action\-wise\. This entails obtaining the sums of floored counterfactual regrets associated with available actions in each decision point\. For this purpose, we define𝐂∈ℝ𝒥×Σ\+\\mathbf\{C\}\\in\\mathbb\{R\}^\{\\mathcal\{J\}\\times\\Sigma^\{\+\}\}whereCj,\(j′,a\)=𝟏j=j′C\_\{j,\(j^\{\\prime\},a\)\}=\\operatorname\{\\mathbf\{1\}\}\_\{j=j^\{\\prime\}\}\. Then, givenz→′′∈ℝΣ\+\\vec\{z\}\\,^\{\\prime\\prime\}\\in\\mathbb\{R\}^\{\\Sigma^\{\+\}\}, one can obtain the action\-wise sums by calculating𝐂z→′′\\mathbf\{C\}\\,\\vec\{z\}\\,^\{\\prime\\prime\}\. One can also propagate the collapsed values back with𝐂⊤𝐂z→′′\\mathbf\{C\}^\{\\top\}\\mathbf\{C\}\\,\\vec\{z\}\\,^\{\\prime\\prime\}\. Then,z→′′\\vec\{z\}\\,^\{\\prime\\prime\}can be L1\-normalized action\-wise by calculatingz→′′⊘\(𝐂⊤𝐂z→′′\)\\vec\{z\}\\,^\{\\prime\\prime\}\\oslash\\left\(\\mathbf\{C\}^\{\\top\}\\mathbf\{C\}\\,\\vec\{z\}\\,^\{\\prime\\prime\}\\right\)\. Here, ‘⊘\\oslash’ denotes the Hadamard division operator, which carries out element\-wise division\. Note thatnnz\(𝐂\)=\|Σ\+\|nnz\(\\mathbf\{C\}\)=\|\\Sigma^\{\+\}\|and each row and column of𝐂\\mathbf\{C\}has at mostmaxj∈𝒥⁡\|𝒜j\|\\max\_\{j\\in\\mathcal\{J\}\}\{\|\\mathcal\{A\}\_\{j\}\|\}and11non\-zeros, respectively\.

### 3\.2Parallelized tree traversal

Notably, in sequence\-form CFR, bothNextStrategyandObserveUtilityrequire complete tree traversals over the input TFSDP, top\-down for the former and bottom\-up for the latter\. As is our goal, we convert the tree traversal step in the original algorithm into linear algebra operations\. For this, we take inspiration from GraphBLAS\[[19](https://arxiv.org/html/2605.14277#bib.bib26)\], whose core insight is that multiplying a graph’s adjacency matrix with a vector is akin to a single step in breadth\-first search\.

TFSDPs can be decomposed into level structures\. Define an adjacency matrix𝐋\(d\)∈ℝ𝒫×𝒫\\mathbf\{L\}^\{\(d\)\}\\in\\mathbb\{R\}^\{\\mathcal\{P\}\\times\\mathcal\{P\}\}for each depthdd\. A non\-zero elementLp1,p2\(d\)L^\{\(d\)\}\_\{p\_\{1\},p\_\{2\}\}denotes that there is an edge of that weight connectingp1p\_\{1\}at depthd−1d\-1andp2p\_\{2\}at depthdd\. With these level graphs, a top\-down tree traversal can be achieved by applying the level graphs in the following order:𝐋\(1\),𝐋\(2\),…,𝐋\(k\)\\mathbf\{L\}^\{\(1\)\},\\mathbf\{L\}^\{\(2\)\},\\ldots,\\mathbf\{L\}^\{\(k\)\}wherekkis the height of the tree\. Applying them in the reverse order results in a bottom\-up traversal of the tree\. The number of non\-zeros inL\(d\)L^\{\(d\)\}is equal to the number of edges in the corresponding level structure\. Similarly, the maximum number of entries in any row inL\(d\)L^\{\(d\)\}is equal to the size of the largest set of nodes in depthddthat shares the same parent node\. The maximum number of entries in any column is equal to11\.

### 3\.3Algorithm description

Parallelized CFR, combining the above insights, is presented as Algorithm[2](https://arxiv.org/html/2605.14277#algorithm2)\. In Line[2](https://arxiv.org/html/2605.14277#algorithm2), we assume that the Hadamard division operator ‘⊘\\oslash’ produces an arbitrary strategy forjjwhen‖\(r→j\(t−1\)\)\+‖1=0\\left\\lVert\\left\(\\vec\{r\}\_\{j\}^\{\(t\-1\)\}\\right\)^\{\+\}\\right\\rVert\_\{1\}=0\. In Line[2](https://arxiv.org/html/2605.14277#algorithm2), we use ‘⊙\\odot’ to denote the Hadamard product operator, which carries out element\-wise multiplication\. We ensured that the comments provided inside the pseudocode match those provided in the non\-parallelized CFR’s pseudocode in Algorithm[1](https://arxiv.org/html/2605.14277#algorithm1)so they can be compared easily\.

TheNextStrategyfunction, which returns a sequence\-form strategy for iterationtt, is defined in Line[2](https://arxiv.org/html/2605.14277#algorithm2)\. In Line[2](https://arxiv.org/html/2605.14277#algorithm2), we L1\-normalize the floored counterfactual regrets\(r→\(t\)\)\+\\left\(\\vec\{r\}^\{\(t\)\}\\right\)^\{\+\}at timettto produce the behavioral strategyb→\(t\)\\vec\{b\}^\{\(t\)\}\. The logic matrix𝐂\\mathbf\{C\}is used during this step\.

In the following lines, the behavioral strategyb→\(t\)\\vec\{b\}^\{\(t\)\}is converted to a sequence\-form strategy \(i\.e\., in𝒳\\mathcal\{X\}\)\. To do this, we first definey→\(t\)=e→p∗∈ℝ≥0𝒫\\vec\{y\}^\{\(t\)\}=\\vec\{e\}\_\{p^\{\*\}\}\\in\\mathbb\{R\}^\{\\mathcal\{P\}\}\_\{\\geq 0\}in Line[2](https://arxiv.org/html/2605.14277#algorithm2)\. Here,p∗p^\{\*\}denotes the root of the input TFSDP, ande→p\\vec\{e\}\_\{p\}is used to denote a unit vector whose element corresponding top∈𝒫p\\in\\mathcal\{P\}is set to one while the rest is set to zero\. Then, a top\-down traversal is performed between Lines[2](https://arxiv.org/html/2605.14277#algorithm2)and[2](https://arxiv.org/html/2605.14277#algorithm2)to propagate the probabilities fromb→\(t\)\\vec\{b\}^\{\(t\)\}down the tree\. Here, the edge weights in the adjacency matrix of the level graph are defined depending on the type of edge\. When the edge corresponds to applying an actionaaat a decision pointjj, the weight is set tob→\(j,a\)\(t\)\\vec\{b\}^\{\(t\)\}\_\{\(j,a\)\}\. Otherwise, the edge corresponds to a signal and its weight is set to11\. After the tree traversal,x→\(t\)∈𝒳\\vec\{x\}^\{\(t\)\}\\in\\mathcal\{X\}is calculated fromy→\(t\)\\vec\{y\}^\{\(t\)\}in Line[2](https://arxiv.org/html/2605.14277#algorithm2)by converting it from the per\-node representation to the per\-sequence representation\. The logic matrix𝐀\\mathbf\{A\}is used for this step\. Finally,x→\(t\)\\vec\{x\}^\{\(t\)\}is returned in Line[2](https://arxiv.org/html/2605.14277#algorithm2)\.

ObserveUtility, which, given utilities, updates the cumulative counterfactual regrets, is in Line[2](https://arxiv.org/html/2605.14277#algorithm2)\. In Line[2](https://arxiv.org/html/2605.14277#algorithm2),v→\(t\)\\vec\{v\}^\{\(t\)\}is initialized with zeros\. Then, in Line[2](https://arxiv.org/html/2605.14277#algorithm2), the input utilityu→\(t\)\\vec\{u\}^\{\(t\)\}is converted from a per\-sequence representation to a per\-node representation, and is denoted asw→\(t\)\\vec\{w\}^\{\(t\)\}\. The logic matrix𝐀\\mathbf\{A\}is used during this step\. From Line[2](https://arxiv.org/html/2605.14277#algorithm2)to Line[2](https://arxiv.org/html/2605.14277#algorithm2), a bottom\-up tree traversal is carried out to accumulate the counterfactual utilitiesu→\(t\)\\vec\{u\}^\{\(t\)\}up the tree\. The accumulated values are stored inv→\(t\)\\vec\{v\}^\{\(t\)\}\.

Next,v→\(t\)\\vec\{v\}^\{\(t\)\}is used to update the counterfactual regrets\. Line[2](https://arxiv.org/html/2605.14277#algorithm2)combinesv→\(t\)\\vec\{v\}^\{\(t\)\}andw→\(t\)\\vec\{w\}^\{\(t\)\}, converts the result from a per\-node representation to a behavioral representation using𝐁\\mathbf\{B\}, and denotes it asq→\(t\)\\vec\{q\}^\{\(t\)\}\. It is then used in Line[2](https://arxiv.org/html/2605.14277#algorithm2)to update the counterfactual regrets\. Updating the cumulative counterfactual regrets entails adding the counterfactual values, which, in turn, requires calculating the expected values at each decision point\. These are action\-wise sums of the counterfactual valuesq→\(t\)\\vec\{q\}^\{\(t\)\}, weighted by the probabilities of applying the corresponding actions, given by the behavioral strategyb→\(t\)\\vec\{b\}^\{\(t\)\}\. We can collapse the Hadamard productb→\(t\)⊙q→\(t\)\\vec\{b\}^\{\(t\)\}\\odot\\vec\{q\}^\{\(t\)\}using𝐂\\mathbf\{C\}to obtain the expected values at each decision point\. Just as we did in Line[2](https://arxiv.org/html/2605.14277#algorithm2), the collapsed values are propagated back to obtain the counterfactual valuesq→\(t\)−𝐂⊤\(𝐂\(b→\(t\)⊙q→\(t\)\)\)\\vec\{q\}^\{\(t\)\}\-\\mathbf\{C\}^\{\\top\}\\left\(\\mathbf\{C\}\\left\(\\vec\{b\}^\{\(t\)\}\\odot\\vec\{q\}^\{\(t\)\}\\right\)\\right\), which is then added to the counterfactual regrets\.

### 3\.4Complexity analysis

We use thework\-depth modelin our complexity analysis\. The work \(WW\) refers to the total number of operations performed by the algorithm, while the depth \(DD\) refers to the length of the longest span of operations in sequence\. We usecompressed sparse row \(CSR\)sparse matrix operations in our implementations\. Sparse matrix\-vector multiplication has total work linear in the number of non\-zeros in the matrix and depth proportional to the maximum number of non\-zeros in a row\[[1](https://arxiv.org/html/2605.14277#bib.bib46)\]\. All other linear algebra operations used in Algorithm[2](https://arxiv.org/html/2605.14277#algorithm2)– vector addition, vector subtraction, the Hadamard division, and the Hadamard product – have linear work and constant depth\. With this, we have the following lemma for the complexity of theNextStrategyfunction\.

###### Lemma 1\.

TheNextStrategyfunction in Algorithm[2](https://arxiv.org/html/2605.14277#algorithm2)has the total work ofW=Θ\(\|𝒫\|\)W=\\Theta\(\|\\mathcal\{P\}\|\)and the depth ofD=Θ\(log⁡\(maxj∈𝒥⁡\|𝒜j\|\)\+k\)D=\\Theta\(\\log\{\(\\max\_\{j\\in\\mathcal\{J\}\}\{\|\\mathcal\{A\}\_\{j\}\|\}\)\}\+k\)\.

Recall thatkkis the height of the TFSDP\. Due to space constraints, the proof is relegated to Appendix[A\.1](https://arxiv.org/html/2605.14277#A1.SS1)\. Now, we provide the complexity of theObserveUtilityfunction in Lemma[2](https://arxiv.org/html/2605.14277#Thmlemma2)\. DefiningBBas the degree of the TFSDP,i\.e\., the maximum number of children of any node,

###### Lemma 2\.

TheObserveUtilityfunction in Algorithm[2](https://arxiv.org/html/2605.14277#algorithm2)has the total work ofW=Θ\(\|𝒫\|\)W=\\Theta\(\|\\mathcal\{P\}\|\)and the depth ofD=Θ\(klog⁡B\)D=\\Theta\(k\\log\{B\}\)\.

The proof is in Appendix[A\.2](https://arxiv.org/html/2605.14277#A1.SS2)\. Lemmas[1](https://arxiv.org/html/2605.14277#Thmlemma1)and[2](https://arxiv.org/html/2605.14277#Thmlemma2)can be combined to give the total work and depth of parallelized CFR, described in Algorithm[2](https://arxiv.org/html/2605.14277#algorithm2)\.

###### Theorem 1\.

Each iteration of parallelized sequence\-form CFR, described in Algorithm[2](https://arxiv.org/html/2605.14277#algorithm2), has the total work ofW=Θ\(\|𝒫\|\)W=\\Theta\(\|\\mathcal\{P\}\|\)and the depth ofD=Θ\(klog⁡B\)D=\\Theta\(k\\log\{B\}\)\.

The proof is in Appendix[A\.3](https://arxiv.org/html/2605.14277#A1.SS3)\.

CombiningWWandDD, we obtain the following theoretical speedup potential:

###### Corollary 1\.

The parallelism of each iteration of parallelized sequence\-form CFR, described in Algorithm[2](https://arxiv.org/html/2605.14277#algorithm2), is\|𝒫\|klog⁡B\\frac\{\|\\mathcal\{P\}\|\}\{k\\log\{B\}\}\.

The proof is in Appendix[A\.4](https://arxiv.org/html/2605.14277#A1.SS4)\.

We conclude this section with the space complexity of our parallelized CFR\.

###### Theorem 2\.

The space complexity of parallelized sequence\-form CFR, described in Algorithm[2](https://arxiv.org/html/2605.14277#algorithm2), is linear with respect to the size of the TFSDP, i\.e\.,Θ\(\|𝒫\|\)\\Theta\(\|\\mathcal\{P\}\|\)\.

The proof is in Appendix[A\.5](https://arxiv.org/html/2605.14277#A1.SS5)\. It follows from the fact that the space complexity of sparse matrix\-vector multiplication is proportional to the number of non\-zeros, and every vector operation in Algorithm[2](https://arxiv.org/html/2605.14277#algorithm2)has linear space complexity\.

### 3\.5Parallelizing the tabular variants of CFR

In this section, we demonstrate that our parallelization framework is powerful enough to be applied to modern, mainstream members of the tabular family of CFR variants, including the state\-of\-the\-art\.

#### 3\.5\.1Alternating player updates

The classical implementation of CFR\[[37](https://arxiv.org/html/2605.14277#bib.bib1)\]interleaves the player updates so their strategies are updated simultaneously in each iteration\. Later works found that alternating the player updates, that is, updating the strategy of one player before the other and so on, empirically exhibits much faster convergence\. Alternating the player updates is trivial in the sequence\-form implementation of CFR: one simply needs to avoid interleaving the calls to both players’NextStrategyandObserveUtilityprocedures\. The same can be applied to parallelized sequence\-form CFR\.

1ObserveUtility\*\(u→\(t\)∈ℝΣ\)\\left\(\\vec\{u\}^\{\(t\)\}\\in\\mathbb\{R\}^\{\\Sigma\}\\right\):

//ObserveUtilityfrom parallel CFR

2ObserveUtility\(u→\(t\)\)\(\\vec\{u\}^\{\(t\)\}\);

3

ℝΣ\+∋r→\(t\)≔\(r→\(t\)\)\+\\mathbb\{R\}^\{\\Sigma^\{\+\}\}\\ni\\vec\{r\}^\{\(t\)\}\\coloneqq\\left\(\\vec\{r\}^\{\(t\)\}\\right\)^\{\+\};

Algorithm 3Excerpt of parallelized CFR\+
#### 3\.5\.2CFR\+

CFR\+\[[32](https://arxiv.org/html/2605.14277#bib.bib2)\], unlike CFR, floors the counterfactual regrets at each iteration and involves alternating player updates\. This ensures that unpromising actions are still played with zero probability, but can be quickly explored when they become promising\. Our parallelization framework can be extended to CFR\+simply by appending a single statement at the end ofObserveUtility, as shown in Algorithm[3](https://arxiv.org/html/2605.14277#algorithm3)\.

1ObserveUtility\*\(u→\(t\)∈ℝΣ\)\\left\(\\vec\{u\}^\{\(t\)\}\\in\\mathbb\{R\}^\{\\Sigma\}\\right\):

//ObserveUtilityfrom parallel CFR

2ObserveUtility\(u→\(t\)\)\(\\vec\{u\}^\{\(t\)\}\);

3

ℝΣ\+∋r→\(t\)≔r→\(t\)⊙\(\{tαtα\+1r\(j,a\)\(t\)\>0tβtβ\+1r\(j,a\)\(t\)<01r\(j,a\)\(t\)=0\)\(j,a\)∈Σ\+\\mathbb\{R\}^\{\\Sigma^\{\+\}\}\\ni\\vec\{r\}^\{\(t\)\}\\coloneqq\\vec\{r\}^\{\(t\)\}\\odot\\left\(\\begin\{cases\}\\frac\{t^\{\\alpha\}\}\{t^\{\\alpha\}\+1\}&r^\{\(t\)\}\_\{\(j,a\)\}\>0\\\\ \\frac\{t^\{\\beta\}\}\{t^\{\\beta\}\+1\}&r^\{\(t\)\}\_\{\(j,a\)\}<0\\\\ 1&r^\{\(t\)\}\_\{\(j,a\)\}=0\\\\ \\end\{cases\}\\right\)\_\{\(j,a\)\\in\\Sigma^\{\+\}\};

Algorithm 4Excerpt of parallelized DCFR
#### 3\.5\.3Discounted CFR \(DCFR\)

Discounted CFR \(DCFR\)\[[7](https://arxiv.org/html/2605.14277#bib.bib4)\], unlike CFR, discounts the cumulative counterfactual regrets, depending on their signs\. On each iteration, positive regrets are multiplied bytα/\(tα\+1\)t^\{\\alpha\}/\(t^\{\\alpha\}\+1\)whereas negative regrets are multiplied bytβ/\(tβ\+1\)t^\{\\beta\}/\(t^\{\\beta\}\+1\)\. CFR\+is a special case of DCFR whereα=∞\\alpha=\\inftyandβ=−∞\\beta=\-\\infty\.Brown and Sandholm \[[7](https://arxiv.org/html/2605.14277#bib.bib4)\]remarked that setting\(α,β\)=\(1\.5,0\)\(\\alpha,\\beta\)=\(1\.5,0\), quadratically weighting each iteration’s contributions to the average strategy, and alternating player updates consistently outperforms CFR\+in practice\. Algorithm[4](https://arxiv.org/html/2605.14277#algorithm4)demonstrates how our framework can be applied to parallelize DCFR\.

1NextStrategy\*\(m→\(t\)∈ℝΣ\)\\left\(\\vec\{m\}^\{\(t\)\}\\in\\mathbb\{R\}^\{\\Sigma\}\\right\):

//Back up counterfactual regrets

2

ℝΣ\+∋ρ→\(t\)≔r→\(t\)\\mathbb\{R\}^\{\\Sigma^\{\+\}\}\\ni\\vec\{\\rho\}^\{\(t\)\}\\coloneqq\\vec\{r\}^\{\(t\)\};

//Copy prev\. behavioral strategy

3

ℝ≥0Σ\+∋b→\(t\)≔b→\(t−1\)\\mathbb\{R\}^\{\\Sigma^\{\+\}\}\_\{\\geq 0\}\\ni\\vec\{b\}^\{\(t\)\}\\coloneqq\\vec\{b\}^\{\(t\-1\)\};

//Observe prediction

4ObserveUtility\(m→\(t\)\)\(\\vec\{m\}^\{\(t\)\}\);

//NextStrategyfrom parallel CFR

5

𝒳∋x→\(t\)≔\\mathcal\{X\}\\ni\\vec\{x\}^\{\(t\)\}\\coloneqqNextStrategy\(\)\(\);

//Restore counterfactual regrets

6

ℝΣ\+∋r→\(t\)≔ρ→\(t\)\\mathbb\{R\}^\{\\Sigma^\{\+\}\}\\ni\\vec\{r\}^\{\(t\)\}\\coloneqq\\vec\{\\rho\}^\{\(t\)\};

7return

x→\(t\)\\vec\{x\}^\{\(t\)\};

Algorithm 5Excerpt of PCFR\(\+\)
#### 3\.5\.4Predictive CFR \(PCFR\) and PCFR\+

Predictive CFR \(PCFR\)\(respectively, PCFR\+\)\[[12](https://arxiv.org/html/2605.14277#bib.bib38)\]is the predictive counterpart of CFR \(respectively, CFR\+\)\. In both PCFR and PCFR\+, a prediction vectorm→\(t\)\\vec\{m\}^\{\(t\)\}is used to guess the next iteration’s instantaneous counterfactual regrets\. A popular option is setting the prediction vector to the previous iteration’s counterfactual utilities,i\.e\.,m→\(t\)=u→\(t−1\)\\vec\{m\}^\{\(t\)\}=\\vec\{u\}^\{\(t\-1\)\}, which has been shown to perform quite well in practice\. Empirically, PCFR\+tends to exhibit state\-of\-the\-art performance in most applications\. We demonstrate how our framework can be applied to both PCFR and PCFR\+in Algorithm[5](https://arxiv.org/html/2605.14277#algorithm5)\.

## 4Experiments

![Refer to caption](https://arxiv.org/html/2605.14277v1/x1.png)

![Refer to caption](https://arxiv.org/html/2605.14277v1/x2.png)

Figure 1:Exploitabilities versus wall\-clock time for our CFR implementations and those of OpenSpiel\.For our experiments, we created two CFR implementations: one with parallelism and another without, and tested their runtime performance\. In the parallelized version, we used CuPy\[[27](https://arxiv.org/html/2605.14277#bib.bib15)\]for GPU\-accelerated linear algebra operations\. While there is a lack of good benchmarks for CFR performance, OpenSpiel\[[22](https://arxiv.org/html/2605.14277#bib.bib6)\]offers popular Python and C\+\+ CFR implementations, which we use as baselines to compare against\. We justify the choice of our baselines later in the discussions\. We used double\-precision floating\-point numbers in our implementations for fair comparison, as OpenSpiel uses the aforesaid format\. Our testbench specification is given in Appendix[B](https://arxiv.org/html/2605.14277#A2)\.

Table 1:Speedups \(or slowdowns\) achieved by our parallelized implementation compared to OpenSpiel’s C\+\+ implementation\.Game\# nodesSpeedupKuhn poker58−3\.6×102\-3\.6\\times 10^\{2\}Leduc poker9\.4×1039\.4\\times 10^\{3\}−1\.9\-1\.9Tiny battleship5\.9×1045\.9\\times 10^\{4\}6\.96\.9Liar’s dice2\.9×1052\.9\\times 10^\{5\}1\.51\.5Small battleship2\.3×1062\.3\\times 10^\{6\}2\.1×1022\.1\\times 10^\{2\}Medium battleship1\.5×1071\.5\\times 10^\{7\}1\.2×1031\.2\\times 10^\{3\}Large battleship5\.2×1075\.2\\times 10^\{7\}4\.1×1034\.1\\times 10^\{3\}We conducted our experiments on seven common benchmark games provided by OpenSpiel\[[22](https://arxiv.org/html/2605.14277#bib.bib6)\]: Kuhn poker, Leduc poker, liar’s dice, and four battleship games\[[13](https://arxiv.org/html/2605.14277#bib.bib55)\]of varying sizes, whose initialization parameters are shown in Appendix[C](https://arxiv.org/html/2605.14277#A3)\. For every implementation and game, CFR was run for 1,000 seconds\. We also ensured that at least eight iterations of CFR were executed for each setting\. Note that the minimum number of iterations can be low because our goal is not to solve these games but instead to observe the runtime behavior of different implementations of CFR\. The number of nodes in the game trees of the games tested ranges from 58 to6\.7×1076\.7\\times 10^\{7\}and is shown in Table[1](https://arxiv.org/html/2605.14277#S4.T1)\.

For each run, we recorded the iteration times, wall\-clock times, and exploitabilities\. Exploitability is a standard metric for quantifying the quality of a strategy profile in imperfect\-information games \(lower is better\)\. The computed exploitabilities over the wall\-clock time are plotted in Figure[1](https://arxiv.org/html/2605.14277#S4.F1)\. Note that some lines seem horizontally shifted because the corresponding implementations took substantially more time to produce even a single iterate\. Also, the speedups achieved by our parallelized implementation on a GPU compared to OpenSpiel’s more performant C\+\+ implementation on a CPU are shown in Table[1](https://arxiv.org/html/2605.14277#S4.T1)\. Here, positive \(respectively, negative\) values denote how many times faster \(respectively, slower\) our parallelized implementation is compared to OpenSpiel’s C\+\+ implementation\.

On smaller games, OpenSpiel’s implementations outperform our parallelized implementation\. Here, it is clear that the GPU overhead makes parallelization impractical on smaller scales\. However, as the game size grows, the speed of our parallelized implementation overtakes the others’, and the resulting speedup becomes more significant\. This indicates that, in practice, a high degree of parallelization is achieved by our technique\. Overall, our parallelized implementation is up to about2\.7×1042\.7\\times 10^\{4\}and4\.1×1034\.1\\times 10^\{3\}times faster than OpenSpiel’s Python and C\+\+ implementations, respectively\. We provide additional results, including iteration times and memory usage, in Appendix[D](https://arxiv.org/html/2605.14277#A4)\.

## 5Discussion

Although parallelizing CFR does not necessarily increase the size of the game that can be solved, it can make finding an optimal solution to these games much faster\. One prime use case of our parallelization framework for researchers is in developing new variants of CFR and benchmarking them\. In these settings, games are typically at most modestly large, and from thousands to tens of thousands of CFR iterations are often executed\. Additionally, our framework can be useful in solving games that are played in a repeated setting, whose individual game trees usually differ by little to none across repetitions\. An example of this is poker, whose action depth is determined by stack sizes\.

While we acknowledge that comparing the performance of our parallelized implementation on a GPU to that of OpenSpiel on a CPU may not facilitate a comparison that is entirely fair, our experiments nonetheless showed that parallelism can achieve extreme speedups, even in more regular use cases\. Also worth noting is that our implementation achieves this performance while completely retaining compatibility with games defined in OpenSpiel, which, again, is a widely used software library for game\-theoretic primitives\. This fact alone could be of separate interest to the extensive\-form game\-solving community\.

The degree of parallelism that can be achieved by our implementation also depends on the hardware being used \(e\.g\., the number of CUDA cores on a GPU, the number of CPU threads,etc\.\)\. While CPU multi\-threading can technically be used in place of GPUs to parallelize CFR, we would expect this to be less effective, especially since GPUs have optimized opcodes for performing large\-scale linear algebra operations, unlike most CPUs\.

While our focus was on parallelizing sequence\-form CFR, our ideas can also be applied to parallelize classical CFR, although several differences arise\. Most crucially, the space complexity of parallelized classical CFR would be linear with respect to the number of game tree nodes, which is generally larger than the space complexity of parallelized sequence\-form CFR \(linear with respect to the size of the tree\-form sequential decision processes\)\. Indeed, parallelized classical CFR requires some of the values specific to each player \(e\.g\., player reach probabilities\) to be needlessly duplicated across nodes, whereas this can be handled more efficiently in sequence\-form CFR\. Additionally, parallelized classical CFR requires introducing an additional logic matrix that represents, for each node, the player whose turn it is to act\. This logic matrix would then be used to select the relevant values for each player during the parallelized tree traversal\. To summarize, while our parallelization framework can also be applied to classical CFR, doing so results in an implementation that is less memory efficient than what can be achieved by parallelizing sequence\-form CFR\. Again, note that classical CFR and sequence\-form CFR are equivalent in the sense that they produce identical iterates, so by parallelizing sequence\-form CFR, we have also indirectly parallelized classical CFR\.

## 6Conclusions and future research

In this paper, we provided a generalized framework for parallelizing the tabular members of the CFR family of algorithms, and gave algorithmic descriptions of the parallelized counterparts of CFR, CFR\+, DCFR, PCFR, and PCFR\+\. In our analysis using the work\-depth model, the total work of the resulting parallel algorithm is equivalent to the time complexity of sequence\-form CFR, and the depth scales with the height of the TFSDP\. Our benchmarks show that our implementation on a GPU offers speedups of up to four orders of magnitude compared to Google DeepMind OpenSpiel’s\[[22](https://arxiv.org/html/2605.14277#bib.bib6)\]CFR implementations on a CPU\. While our technique does not necessarily increase the size of the game that can be solved, it can make solving games dramatically faster\. Our technique can be useful in researching new CFR variants and solving games that are played in a repeated setting\.

Possible topics for future work include parallelizing other game\-theoretic algorithms such as pruned CFR\[[4](https://arxiv.org/html/2605.14277#bib.bib44),[2](https://arxiv.org/html/2605.14277#bib.bib45),[5](https://arxiv.org/html/2605.14277#bib.bib47)\], finding best response, and exploitability calculations\.

## Acknowledgements

This work has been supported by the Vannevar Bush Faculty Fellowship ONR N00014\-23\-1\-2876, National Science Foundation grant RI\-2312342, and NIH award A240108S001\. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies\.

## References

- \[1\]\(1996\)Programming parallel algorithms\.39\(3\),pp\. 85–97\.Cited by:[§3\.4](https://arxiv.org/html/2605.14277#S3.SS4.p1.2)\.
- \[2\]N\. Brown, C\. Kroer, and T\. Sandholm\(2017\)Dynamic thresholding and pruning for regret minimization\.InProceedings of the AAAI Conference on Artificial Intelligence \(AAAI\),Cited by:[§6](https://arxiv.org/html/2605.14277#S6.p2.1)\.
- \[3\]N\. Brown, A\. Lerer, S\. Gross, and T\. Sandholm\(2019\)Deep counterfactual regret minimization\.InProceedings of the International Conference on Machine Learning \(ICML\),Cited by:[§1](https://arxiv.org/html/2605.14277#S1.p3.1)\.
- \[4\]N\. Brown and T\. Sandholm\(2015\)Regret\-based pruning in extensive\-form games\.InProceedings of the Annual Conference on Neural Information Processing Systems \(NeurIPS\),Cited by:[§6](https://arxiv.org/html/2605.14277#S6.p2.1)\.
- \[5\]N\. Brown and T\. Sandholm\(2017\)Reduced space and faster convergence in imperfect\-information games via pruning\.InProceedings of the International Conference on Machine Learning \(ICML\),Cited by:[§1](https://arxiv.org/html/2605.14277#S1.p2.1),[§6](https://arxiv.org/html/2605.14277#S6.p2.1)\.
- \[6\]N\. Brown and T\. Sandholm\(2018\)Superhuman AI for heads\-up no\-limit poker: Libratus beats top professionals\.Science359\(6374\),pp\. 418–424\.Cited by:[§1](https://arxiv.org/html/2605.14277#S1.p2.1),[§1](https://arxiv.org/html/2605.14277#S1.p3.1),[§2\.2](https://arxiv.org/html/2605.14277#S2.SS2.p4.1)\.
- \[7\]N\. Brown and T\. Sandholm\(2019\)Solving imperfect\-information games via discounted regret minimization\.InProceedings of the AAAI Conference on Artificial Intelligence \(AAAI\),Cited by:[§1](https://arxiv.org/html/2605.14277#S1.p3.1),[§2\.1](https://arxiv.org/html/2605.14277#S2.SS1.p3.3),[§3\.5\.3](https://arxiv.org/html/2605.14277#S3.SS5.SSS3.p1.5),[§3](https://arxiv.org/html/2605.14277#S3.p1.1)\.
- \[8\]N\. Brown and T\. Sandholm\(2019\)Superhuman AI for multiplayer poker\.Science365\(6456\),pp\. 885–890\.Cited by:[§1](https://arxiv.org/html/2605.14277#S1.p3.1)\.
- \[9\]J\. Dean, G\. Corrado, R\. Monga, K\. Chen, M\. Devin, M\. Mao, M\. Ranzato, A\. Senior, P\. Tucker, K\. Yang, Q\. Le, and A\. Ng\(2012\)Large scale distributed deep networks\.InProceedings of the Annual Conference on Neural Information Processing Systems \(NeurIPS\),Cited by:[§1](https://arxiv.org/html/2605.14277#S1.p1.1)\.
- \[10\]G\. Farina, C\. Kroer, and T\. Sandholm\(2019\)Regret circuits: composability of regret minimizers\.InProceedings of the International Conference on Machine Learning \(ICML\),Cited by:[§2\.1](https://arxiv.org/html/2605.14277#S2.SS1.p1.1),[1](https://arxiv.org/html/2605.14277#algorithm1)\.
- \[11\]G\. Farina, C\. Kroer, and T\. Sandholm\(2020\)Stochastic regret minimization in extensive\-form games\.InProceedings of the International Conference on Machine Learning \(ICML\),Cited by:[§2\.2](https://arxiv.org/html/2605.14277#S2.SS2.p4.1)\.
- \[12\]G\. Farina, C\. Kroer, and T\. Sandholm\(2021\)Faster game solving via predictive Blackwell approachability: connecting regret matching and mirror descent\.InProceedings of the AAAI Conference on Artificial Intelligence \(AAAI\),Cited by:[§1](https://arxiv.org/html/2605.14277#S1.p3.1),[§2\.1](https://arxiv.org/html/2605.14277#S2.SS1.p3.3),[§3\.5\.4](https://arxiv.org/html/2605.14277#S3.SS5.SSS4.p1.2),[§3](https://arxiv.org/html/2605.14277#S3.p1.1)\.
- \[13\]G\. Farina, C\. K\. Ling, F\. Fang, and T\. Sandholm\(2019\)Correlation in extensive\-form games: saddle\-point formulation and benchmarks\.InProceedings of the Annual Conference on Neural Information Processing Systems \(NeurIPS\),Cited by:[§4](https://arxiv.org/html/2605.14277#S4.p2.1)\.
- \[14\]A\. Gilpin, S\. Hoda, J\. Peña, and T\. Sandholm\(2007\)Gradient\-based algorithms for finding Nash equilibria in extensive form games\.InProceedings of the International Workshop on Internet and Network Economics \(WINE\),Cited by:[§2\.2](https://arxiv.org/html/2605.14277#S2.SS2.p3.1)\.
- \[15\]A\. Gilpin and T\. Sandholm\(2010\)Speeding up gradient\-based algorithms for sequential games\.InProceedings of the International Conference on Autonomous Agents and Multiagent Systems \(AAMAS\),Cited by:[§2\.2](https://arxiv.org/html/2605.14277#S2.SS2.p3.1)\.
- \[16\]P\. Goyal, P\. Dollár, R\. Girshick, P\. Noordhuis, L\. Wesolowski, A\. Kyrola, A\. Tulloch, Y\. Jia, and K\. He\(2018\)Accurate, large minibatch SGD: training ImageNet in 1 hour\.Cited by:[§1](https://arxiv.org/html/2605.14277#S1.p1.1)\.
- \[17\]S\. Hart and A\. Mas\-Colell\(2000\)A simple adaptive procedure leading to correlated equilibrium\.Econometrica68\(5\),pp\. 1127–1150\.Cited by:[§2\.1](https://arxiv.org/html/2605.14277#S2.SS1.p3.3)\.
- \[18\]S\. Hoda, A\. Gilpin, J\. Peǹa, and T\. Sandholm\(2010\)Smoothing techniques for computing Nash equilibria of sequential games\.35\(2\),pp\. 494–512\.Cited by:[§2\.2](https://arxiv.org/html/2605.14277#S2.SS2.p3.1)\.
- \[19\]J\. Kepner, P\. Aaltonen, D\. Bader, A\. Buluç, F\. Franchetti, J\. Gilbert, D\. Hutchison, M\. Kumar, A\. Lumsdaine, H\. Meyerhenke, S\. McMillan, C\. Yang, J\. D\. Owens, M\. Zalewski, T\. Mattson, and J\. Moreira\(2016\)Mathematical foundations of the GraphBLAS\.InProceedings of the IEEE High Performance Extreme Computing Conference \(HPEC\),Vol\.\.Cited by:[§3\.2](https://arxiv.org/html/2605.14277#S3.SS2.p1.1)\.
- \[20\]D\. Koller, N\. Megiddo, and B\. von Stengel\(1996\)Efficient computation of equilibria for extensive two\-person games\.14\(2\),pp\. 247–259\.Cited by:[§2\.1](https://arxiv.org/html/2605.14277#S2.SS1.p1.1)\.
- \[21\]C\. Kroer, G\. Farina, and T\. Sandholm\(2018\)Solving large sequential games with the excessive gap technique\.InProceedings of the Annual Conference on Neural Information Processing Systems \(NeurIPS\),pp\.\.Cited by:[§2\.2](https://arxiv.org/html/2605.14277#S2.SS2.p3.1)\.
- \[22\]M\. Lanctot, E\. Lockhart, J\. Lespiau, V\. Zambaldi, S\. Upadhyay, J\. Pérolat, S\. Srinivasan, F\. Timbers, K\. Tuyls, S\. Omidshafiei, D\. Hennes, D\. Morrill, P\. Muller, T\. Ewalds, R\. Faulkner, J\. Kramár, B\. D\. Vylder, B\. Saeta, J\. Bradbury, D\. Ding, S\. Borgeaud, M\. Lai, J\. Schrittwieser, T\. Anthony, E\. Hughes, I\. Danihelka, and J\. Ryan\-Davis\(2019\)OpenSpiel: a framework for reinforcement learning in games\.External Links:1908\.09453Cited by:[§2\.2](https://arxiv.org/html/2605.14277#S2.SS2.p2.1),[§4](https://arxiv.org/html/2605.14277#S4.p1.1),[§4](https://arxiv.org/html/2605.14277#S4.p2.1),[§6](https://arxiv.org/html/2605.14277#S6.p1.1)\.
- \[23\]M\. Lanctot, K\. Waugh, M\. Zinkevich, and M\. Bowling\(2009\)Monte Carlo sampling for regret minimization in extensive games\.InProceedings of the Annual Conference on Neural Information Processing Systems \(NeurIPS\),pp\.\.Cited by:[§1](https://arxiv.org/html/2605.14277#S1.p3.1),[§2\.2](https://arxiv.org/html/2605.14277#S2.SS2.p4.1)\.
- \[24\]S\. M\. McAleer, G\. Farina, M\. Lanctot, and T\. Sandholm\(2023\)ESCHER: eschewing importance sampling in games by computing a history value function to estimate regret\.InProceedings of the International Conference on Learning Representations \(ICLR\),Cited by:[§1](https://arxiv.org/html/2605.14277#S1.p3.1)\.
- \[25\]M\. Moravčík, M\. Schmid, N\. Burch, V\. Lisý, D\. Morrill, N\. Bard, T\. Davis, K\. Waugh, M\. Johanson, and M\. Bowling\(2017\)DeepStack: expert\-level artificial intelligence in heads\-up no\-limit poker\.Science356\(6337\),pp\. 508–513\.Cited by:[§1](https://arxiv.org/html/2605.14277#S1.p3.1)\.
- \[26\]Y\. Nesterov\(2005\)Excessive gap technique in nonsmooth convex minimization\.16\(1\),pp\. 235–249\.Cited by:[§2\.2](https://arxiv.org/html/2605.14277#S2.SS2.p3.1)\.
- \[27\]R\. Okuta, Y\. Unno, D\. Nishino, S\. Hido, and C\. Loomis\(2017\)CuPy: a NumPy\-compatible library for NVIDIA GPU calculations\.InProceedings of the Workshop on Machine Learning Systems \(LearningSys\) in the Annual Conference on Neural Information Processing Systems \(NeurIPS\),Cited by:[§4](https://arxiv.org/html/2605.14277#S4.p1.1)\.
- \[28\]J\. Reis\(2015\)A GPU implementation of counterfactual regret minimization\.Master’s Thesis,University of Porto\.Cited by:[§2\.2](https://arxiv.org/html/2605.14277#S2.SS2.p1.1)\.
- \[29\]J\. Rudolf\(2021\)Counterfactual regret minimization on GPU\.Czech Technical University in Prague\.Cited by:[§2\.2](https://arxiv.org/html/2605.14277#S2.SS2.p1.1)\.
- \[30\]A\. Sergeev and M\. D\. Balso\(2018\)Horovod: fast and easy distributed deep learning in TensorFlow\.Cited by:[§1](https://arxiv.org/html/2605.14277#S1.p1.1)\.
- \[31\]O\. Tammelin, N\. Burch, M\. Johanson, and M\. Bowling\(2015\)Solving heads\-up limit Texas Hold’em\.InProceedings of the International Joint Conference on Artificial Intelligence \(IJCAI\),Cited by:[§1](https://arxiv.org/html/2605.14277#S1.p2.1),[§1](https://arxiv.org/html/2605.14277#S1.p3.1),[§2\.2](https://arxiv.org/html/2605.14277#S2.SS2.p4.1)\.
- \[32\]O\. Tammelin\(2014\)Solving large imperfect information games using CFR\+\.External Links:1407\.5042Cited by:[§1](https://arxiv.org/html/2605.14277#S1.p3.1),[§2\.1](https://arxiv.org/html/2605.14277#S2.SS1.p3.3),[§3\.5\.2](https://arxiv.org/html/2605.14277#S3.SS5.SSS2.p1.1),[§3](https://arxiv.org/html/2605.14277#S3.p1.1)\.
- \[33\]B\. von Stengel\(1996\)Efficient computation of behavior strategies\.14\(2\),pp\. 220–246\.Cited by:[§2\.1](https://arxiv.org/html/2605.14277#S2.SS1.p1.1)\.
- \[34\]H\. Xu, K\. Li, H\. Fu, Q\. Fu, J\. Xing, and J\. Cheng\(2024\)Dynamic discounted counterfactual regret minimization\.InProceedings of the International Conference on Learning Representations \(ICLR\),Cited by:[§1](https://arxiv.org/html/2605.14277#S1.p3.1)\.
- \[35\]B\. H\. Zhang, I\. Anagnostides, and T\. Sandholm\(2026\)Scale\-invariant regret matching and online learning with optimal convergence: bridging theory and practice in zero\-sum games\.Cited by:[§1](https://arxiv.org/html/2605.14277#S1.p3.1)\.
- \[36\]N\. Zhang, S\. McAleer, and T\. Sandholm\(2026\)Faster game solving via hyperparameter schedules\.InProceedings of the AAAI Conference on Artificial Intelligence \(AAAI\),Cited by:[§1](https://arxiv.org/html/2605.14277#S1.p3.1)\.
- \[37\]M\. Zinkevich, M\. Johanson, M\. Bowling, and C\. Piccione\(2007\)Regret minimization in games with incomplete information\.InProceedings of the Annual Conference on Neural Information Processing Systems \(NeurIPS\),pp\.\.Cited by:[§1](https://arxiv.org/html/2605.14277#S1.p3.1),[§2\.1](https://arxiv.org/html/2605.14277#S2.SS1.p1.1),[§2\.1](https://arxiv.org/html/2605.14277#S2.SS1.p5.1),[§3\.5\.1](https://arxiv.org/html/2605.14277#S3.SS5.SSS1.p1.1),[1](https://arxiv.org/html/2605.14277#algorithm1)\.

## Appendix AOmitted proofs in Section[3\.4](https://arxiv.org/html/2605.14277#S3.SS4)

This appendix section contains proofs excluded from the main paper content due to space constraints\.

### A\.1Proof of Lemma[1](https://arxiv.org/html/2605.14277#Thmlemma1)

###### Proof\.

We begin by analyzing the total work\. Line[2](https://arxiv.org/html/2605.14277#algorithm2)features two sparse matrix\-vector multiplications and the Hadamard division operation\. The total work of the sparse matrix\-vector multiplications isΘ\(nnz\(𝐂\)\)=Θ\(\|Σ\+\|\)\\Theta\(nnz\(\\mathbf\{C\}\)\)=\\Theta\(\|\\Sigma^\{\+\}\|\), whereas the Hadamard division has the work ofΘ\(\|Σ\+\|\)\\Theta\(\|\\Sigma^\{\+\}\|\)\. Thus, the total work of Line[2](https://arxiv.org/html/2605.14277#algorithm2)isΘ\(\|Σ\+\|\)\\Theta\(\|\\Sigma^\{\+\}\|\)\. Line[2](https://arxiv.org/html/2605.14277#algorithm2)has the work ofΘ\(\|𝒫\|\)\\Theta\(\|\\mathcal\{P\}\|\)\. Next, we analyze the total work of the top\-down tree traversal between Lines[2](https://arxiv.org/html/2605.14277#algorithm2)to[2](https://arxiv.org/html/2605.14277#algorithm2)\. On each iteration,\(𝐋\(d\)\)⊤y→\(t\)\\left\(\\mathbf\{L\}^\{\(d\)\}\\right\)^\{\\top\}\\vec\{y\}^\{\(t\)\}are added, in\-place, toy→\(t\)\\vec\{y\}^\{\(t\)\}\. First, we look at the sparse matrix\-vector multiplication, whose work isΘ\(nnz\(𝐋\(d\)\)\)\\Theta\\left\(nnz\\left\(\\mathbf\{L\}^\{\(d\)\}\\right\)\\right\)\. Over the entire for\-loop, the total work of the sparse matrix\-vector multiplication isΘ\(nnz\(𝐋\(1\)\)\)\+…\+Θ\(nnz\(𝐋\(k\)\)\)=Θ\(\|𝒫\|\)\\Theta\\left\(nnz\\left\(\\mathbf\{L\}^\{\(1\)\}\\right\)\\right\)\+\\ldots\+\\Theta\\left\(nnz\\left\(\\mathbf\{L\}^\{\(k\)\}\\right\)\\right\)=\\Theta\(\|\\mathcal\{P\}\|\), i\.e\., linear\. Next, we analyze the in\-place addition, which can be implemented efficiently so the total work is linear with respect to only the number of newly added values, which, in turn, is equal to the number of edges in the level structure at depthdd\. The edges don’t overlap across levels, so the total work of the in\-place additions over the for\-loop is proportional to the total number of edges, i\.e\.,Θ\(\|𝒫\|\)\\Theta\(\|\\mathcal\{P\}\|\)\. In sum, Lines[2](https://arxiv.org/html/2605.14277#algorithm2)to[2](https://arxiv.org/html/2605.14277#algorithm2)has the total work ofΘ\(\|𝒫\|\)\\Theta\(\|\\mathcal\{P\}\|\)\. Next, the work in Line[2](https://arxiv.org/html/2605.14277#algorithm2)isΘ\(nnz\(𝐀\)\)=Θ\(Σ\)\\Theta\(nnz\(\\mathbf\{A\}\)\)=\\Theta\(\\Sigma\)\. Note that\|Σ\|=O\(\|𝒫\|\)\|\\Sigma\|=O\(\|\\mathcal\{P\}\|\)and\|Σ\+\|=O\(\|𝒫\|\)\|\\Sigma^\{\+\}\|=O\(\|\\mathcal\{P\}\|\)\. Overall, the total work isW=Θ\(\|𝒫\|\)W=\\Theta\(\|\\mathcal\{P\}\|\)\.

Now, we analyze the depth\. We begin with Line[2](https://arxiv.org/html/2605.14277#algorithm2)\. The depth of the inner and outer CSR sparse matrix\-vector multiplications areΘ\(log⁡\(maxj∈𝒥⁡\|𝒜j\|\)\)\\Theta\(\\log\{\(\\max\_\{j\\in\\mathcal\{J\}\}\{\|\\mathcal\{A\}\_\{j\}\|\}\)\}\)andΘ\(1\)\\Theta\(1\), respectively, whereas the depth of the Hadamard division isΘ\(1\)\\Theta\(1\)\. Thus, the depth of Line[2](https://arxiv.org/html/2605.14277#algorithm2)isΘ\(log⁡\(maxj∈𝒥⁡\|𝒜j\|\)\)\\Theta\(\\log\{\(\\max\_\{j\\in\\mathcal\{J\}\}\{\|\\mathcal\{A\}\_\{j\}\|\}\)\}\)\. The depth of Line[2](https://arxiv.org/html/2605.14277#algorithm2)is triviallyΘ\(1\)\\Theta\(1\)\. The depth of Line[2](https://arxiv.org/html/2605.14277#algorithm2)is alsoΘ\(1\)\\Theta\(1\)\. This is repeatedkktimes, so the depth of the for\-loop between Lines[2](https://arxiv.org/html/2605.14277#algorithm2)and[2](https://arxiv.org/html/2605.14277#algorithm2)isΘ\(k\)\\Theta\(k\)\. Finally, the depth of Line[2](https://arxiv.org/html/2605.14277#algorithm2)isΘ\(1\)\\Theta\(1\)\. Overall, the depth isD=Θ\(log⁡\(maxj∈𝒥⁡\|𝒜j\|\)\+k\)D=\\Theta\(\\log\{\(\\max\_\{j\\in\\mathcal\{J\}\}\{\|\\mathcal\{A\}\_\{j\}\|\}\)\}\+k\)\. ∎

### A\.2Proof of Lemma[2](https://arxiv.org/html/2605.14277#Thmlemma2)

###### Proof\.

We begin by analyzing the total work\. Line[2](https://arxiv.org/html/2605.14277#algorithm2)has the work ofΘ\(\|𝒫\|\)\\Theta\(\|\\mathcal\{P\}\|\), and Line[2](https://arxiv.org/html/2605.14277#algorithm2)has the work ofΘ\(nnz\(𝐀\)\)\+Θ\(\|𝒫\|\)=Θ\(\|Σ\|\)\+Θ\(\|𝒫\|\)=Θ\(\|𝒫\|\)\\Theta\(nnz\(\\mathbf\{A\}\)\)\+\\Theta\(\|\\mathcal\{P\}\|\)=\\Theta\(\|\\Sigma\|\)\+\\Theta\(\|\\mathcal\{P\}\|\)=\\Theta\(\|\\mathcal\{P\}\|\)\. Following the same process used in the proof of Lemma[1](https://arxiv.org/html/2605.14277#Thmlemma1), we can derive the total work of the for\-loop between Lines[2](https://arxiv.org/html/2605.14277#algorithm2)and[2](https://arxiv.org/html/2605.14277#algorithm2)to beΘ\(\|𝒫\|\)\\Theta\(\|\\mathcal\{P\}\|\)\. Line[2](https://arxiv.org/html/2605.14277#algorithm2)has the total work ofΘ\(\|𝒫\|\)\+Θ\(nnz\(𝐁\)\)=Θ\(\|𝒫\|\)\+Θ\(\|Σ\+\|\)=Θ\(\|𝒫\|\)\\Theta\(\|\\mathcal\{P\}\|\)\+\\Theta\(nnz\(\\mathbf\{B\}\)\)=\\Theta\(\|\\mathcal\{P\}\|\)\+\\Theta\(\|\\Sigma^\{\+\}\|\)=\\Theta\(\|\\mathcal\{P\}\|\)\. In Line[2](https://arxiv.org/html/2605.14277#algorithm2), the two sparse matrix\-vector multiplications have the work ofΘ\(nnz\(𝐂\)\)=Θ\(\|Σ\+\|\)\\Theta\(nnz\(\\mathbf\{C\}\)\)=\\Theta\(\|\\Sigma^\{\+\}\|\)each, whereas the rest of the operations have the total work ofΘ\(\|Σ\+\|\)\\Theta\(\|\\Sigma^\{\+\}\|\)\. This yields the work ofΘ\(\|Σ\+\|\)\\Theta\(\|\\Sigma^\{\+\}\|\)for Line[2](https://arxiv.org/html/2605.14277#algorithm2)\. Overall, the total work isW=Θ\(\|𝒫\|\)W=\\Theta\(\|\\mathcal\{P\}\|\)\.

The depths of Lines[2](https://arxiv.org/html/2605.14277#algorithm2)and[2](https://arxiv.org/html/2605.14277#algorithm2)isΘ\(1\)\\Theta\(1\)\. As in the depth analysis in the proof of Lemma[1](https://arxiv.org/html/2605.14277#Thmlemma1), the depth of the in\-place addition in Line[2](https://arxiv.org/html/2605.14277#algorithm2)isΘ\(1\)\\Theta\(1\), and, over the for\-loop,Θ\(k\)\\Theta\(k\)\. The maximum number of non\-zero entries in any row in𝐋\(d\)\\mathbf\{L\}^\{\(d\)\}for any depthddis bounded byBB, the degree of the input TFSDP, so the CSR sparse matrix\-vector multiplications in Line[2](https://arxiv.org/html/2605.14277#algorithm2)isΘ\(log⁡B\)\\Theta\(\\log\{B\}\), and, over the for\-loop,Θ\(klog⁡B\)\\Theta\(k\\log\{B\}\)\. Thus, Lines[2](https://arxiv.org/html/2605.14277#algorithm2)and[2](https://arxiv.org/html/2605.14277#algorithm2)have the depth ofΘ\(klog⁡B\)\\Theta\(k\\log\{B\}\)\. The depth of Line[2](https://arxiv.org/html/2605.14277#algorithm2)isΘ\(1\)\\Theta\(1\)\. Finally, the depth of Line[2](https://arxiv.org/html/2605.14277#algorithm2)isΘ\(log⁡\(maxj∈𝒥⁡\|𝒜j\|\)\)\\Theta\(\\log\{\(\\max\_\{j\\in\\mathcal\{J\}\}\{\|\\mathcal\{A\}\_\{j\}\|\}\)\}\)\. Note thatmaxj∈𝒥⁡\|𝒜j\|=O\(B\)\\max\_\{j\\in\\mathcal\{J\}\}\{\|\\mathcal\{A\}\_\{j\}\|\}=O\(B\)\. So, the overall depth isD=Θ\(klog⁡B\)D=\\Theta\(k\\log\{B\}\)\. ∎

### A\.3Proof of Theorem[1](https://arxiv.org/html/2605.14277#Thmtheorem1)

###### Proof\.

Using Lemmas[1](https://arxiv.org/html/2605.14277#Thmlemma1)and[2](https://arxiv.org/html/2605.14277#Thmlemma2),W=Θ\(\|𝒫\|\)\+Θ\(\|𝒫\|\)=Θ\(\|𝒫\|\)W=\\Theta\(\|\\mathcal\{P\}\|\)\+\\Theta\(\|\\mathcal\{P\}\|\)=\\Theta\(\|\\mathcal\{P\}\|\)andD=Θ\(log⁡\(maxj∈𝒥⁡\|𝒜j\|\)\+k\)\+Θ\(klog⁡B\)=Θ\(klog⁡B\)D=\\Theta\(\\log\{\(\\max\_\{j\\in\\mathcal\{J\}\}\{\|\\mathcal\{A\}\_\{j\}\|\}\)\}\+k\)\+\\Theta\(k\\log\{B\}\)=\\Theta\(k\\log\{B\}\), as required\. ∎

### A\.4Proof of Corollary[1](https://arxiv.org/html/2605.14277#Thmcorollary1)

###### Proof\.

It follows immediately from Theorem[1](https://arxiv.org/html/2605.14277#Thmtheorem1)and the definition of parallelism\. ∎

### A\.5Proof of Theorem[2](https://arxiv.org/html/2605.14277#Thmtheorem2)

###### Proof\.

Since all vectors and sparse matrices involved as operands requireΘ\(\|𝒫\|\)\\Theta\(\|\\mathcal\{P\}\|\)space, it suffices to show that every linear algebra operation in Algorithm[2](https://arxiv.org/html/2605.14277#algorithm2)has a linear space complexity\. The space complexity of sparse matrix\-vector multiplication is proportional to the number of non\-zeros, and every vector operation in Algorithm[2](https://arxiv.org/html/2605.14277#algorithm2)has linear space complexity, as required\. ∎

## Appendix BTestbench specification

Our testbench contains an AMD Ryzen 9 3900X 12\-core, 24\-thread desktop processor, 128 GB memory, and Nvidia GeForce RTX 4090 24 GB VRAM graphics card, containing 16,384 CUDA cores\.

## Appendix CInitialization parameters for battleship

Table 2:Initialization parameters of the four battleship games we used in our benchmarks\. The values shown in the table correspond exactly to the arguments passed to OpenSpiel\.SizeBoardShip\# shotsHeightWidthSizesValuesTiny22\[2\]\[1\]3Small32\[2\]\[1\]3Medium44\[1\]\[1\]2Large33\[1,2\]\[1,1\]2Four battleship games \(plus three others\) were used during the benchmarks of our parallelized CFR implementation against OpenSpiel’s\. Section[4](https://arxiv.org/html/2605.14277#S4)relegated the exact parameters used to initialize the four battleship games to this appendix section\. Table[2](https://arxiv.org/html/2605.14277#A3.T2)tabulates these parameters\.

## Appendix DAdditional experimental results

![Refer to caption](https://arxiv.org/html/2605.14277v1/x3.png)Figure 2:Iteration times and \(CUDA\) memory usage versus the size of the game being solved\.Table 3:Iteration runtimes of various CFR implementations\. For each game, the best performance is bolded\. Each average iteration time value is accompanied by the standard error of the mean\.GameAverage iteration timeOursOpenSpielParallelizedUnparallelizedC\+\+PythonKuhn poker14\.3±0\.114\.3\\pm 0\.1ms\\mathrm\{ms\}791±1791\\pm 1µs\\mathrm\{\\SIUnitSymbolMicro s\}𝟑𝟗±𝟏\\mathbf\{39\\pm 1\}µs\\mathrm\{\\SIUnitSymbolMicro s\}1\.04±0\.011\.04\\pm 0\.01ms\\mathrm\{ms\}Leduc poker25\.2±0\.125\.2\\pm 0\.1ms\\mathrm\{ms\}51\.3±0\.151\.3\\pm 0\.1ms\\mathrm\{ms\}13\.4±0\.1\\mathbf\{13\.4\\pm 0\.1\}ms\\mathrm\{ms\}154±1154\\pm 1ms\\mathrm\{ms\}Tiny battleship24\.2±0\.1\\mathbf\{24\.2\\pm 0\.1\}ms\\mathrm\{ms\}649±1649\\pm 1ms\\mathrm\{ms\}168±1168\\pm 1ms\\mathrm\{ms\}1\.03±0\.011\.03\\pm 0\.01s\\mathrm\{s\}Liar’s dice38\.6±0\.1\\mathbf\{38\.6\\pm 0\.1\}ms\\mathrm\{ms\}1\.31±0\.011\.31\\pm 0\.01s\\mathrm\{s\}60\.8±0\.160\.8\\pm 0\.1ms\\mathrm\{ms\}1\.53±0\.011\.53\\pm 0\.01s\\mathrm\{s\}Small battleship25\.7±0\.1\\mathbf\{25\.7\\pm 0\.1\}ms\\mathrm\{ms\}13\.1±0\.113\.1\\pm 0\.1s\\mathrm\{s\}5\.38±0\.025\.38\\pm 0\.02s\\mathrm\{s\}32\.1±0\.732\.1\\pm 0\.7s\\mathrm\{s\}Medium battleship19\.4±0\.1\\mathbf\{19\.4\\pm 0\.1\}ms\\mathrm\{ms\}8\.45±0\.048\.45\\pm 0\.04s\\mathrm\{s\}22\.8±0\.222\.8\\pm 0\.2s\\mathrm\{s\}181±10181\\pm 10s\\mathrm\{s\}Large battleship22\.0±0\.1\\mathbf\{22\.0\\pm 0\.1\}ms\\mathrm\{ms\}19\.6±0\.219\.6\\pm 0\.2s\\mathrm\{s\}89\.6±4\.789\.6\\pm 4\.7s\\mathrm\{s\}590±47590\\pm 47s\\mathrm\{s\}Table 4:On the left, memory usage of the CFR implementations, and on the right, CUDA memory usage of our parallelized implementation\.GameMemory usageCUDAmemory usageOursOpenSpielParallelizedUnparallelizedC\+\+PythonKuhn poker414 MB9\.24 GB3\.39 GB323 MB50\.1 KBLeduc poker405 MB10\.1 GB208 MB219 MB791 KBTiny battleship460 MB9\.94 GB315 MB483 MB12\.8 MBLiar’s dice510 MB14\.0 GB668 MB1\.08 GB17\.2 MBSmall battleship2\.95 GB9\.72 GB4\.34 GB6\.54 GB295 MBMedium battleship11\.8 GB11\.8 GB20\.2 GB21\.7 GB306 MBLarge battleship38\.0 GB37\.9 GB82\.0 GB85\.2 GB755 MB

We continue from the remarks we made regarding our experimental results in Section[4](https://arxiv.org/html/2605.14277#S4)\. Note that the iterates produced by different CFR implementations diverge for some games\. This is primarily due to the following factors: a\) regret matching involves mathematical operations with high numerical sensitivity, b\) the degree of machine precision differs slightly between a GPU and a CPU, and c\) our implementations have different thresholds for checking zero\-vectors compared to OpenSpiel’s\. This difference is quite striking for the medium battleship game\. While this is not visible in the plot, the peak exploitabilities differ even between the two OpenSpiel implementations by around a factor of three \(0\.050\.05vs\.0\.140\.14\), and the peaks between our implementations also differ by about an order of magnitude\.

Figure[2](https://arxiv.org/html/2605.14277#A4.F2)showcases how the iteration times, memory usage, and CUDA memory usage vary as the size of the game being solved increases\. For the iteration times, we again see that our parallelized implementation is slower than the baselines for smaller games, but, as the game size increases, our implementation begins to outperform all other implementations, and the difference grows with the game size\. The performance of our unparallelized implementation sits roughly between those of OpenSpiel’s Python and C\+\+ implementations\.

The process memory usage and CUDA memory usage \(for our parallelized implementation\) are tabulated in Table[4](https://arxiv.org/html/2605.14277#A4.T4)\. Note that we kept track of average strategies for a logarithmic number of iterations, so the memory consumed by keeping track of the strategy profiles is also reflected by the memory usage\. Our script’s behavior reflects how one would use CFR in real\-life applications, as one would typically save the average strategies across some iterations to keep track of the learning process\. Note that our implementations and OpenSpiel represent games and strategy profiles differently, so it is difficult to directly compare the memory usage\. Here, the process memory usage of the implementations ranges from 208 MB to 85\.2 GB and is shown to scale with the game size\. Similarly, the \(CUDA\) memory usage of our parallelized implementation ranges from 50\.1 KB to 755 MB and rises roughly linearly with respect to the game size\.
Parallelizing Counterfactual Regret Minimization

Similar Articles

Dual GPU llama.cpp speedup

Pipeline parallelism in llama.cpp may be wasting your VRAM

NCCL-Free Tensor Parallelism on Dual Blackwell PCIe llama.cpp b9095 released!

@charles_irl: Rewriting parallelism is a big move and it'd be nice to make it even faster than we can do with CuTe DSL. FA4 is a very…

@raphaelsrty: Computing max similarity (scoring step of colbert, colpali) on gpus can be optimized and this is what @tonywu_71 did. I…

Submit Feedback

Similar Articles

Pipeline parallelism in llama.cpp may be wasting your VRAM
NCCL-Free Tensor Parallelism on Dual Blackwell PCIe llama.cpp b9095 released!
@charles_irl: Rewriting parallelism is a big move and it'd be nice to make it even faster than we can do with CuTe DSL. FA4 is a very…
@raphaelsrty: Computing max similarity (scoring step of colbert, colpali) on gpus can be optimized and this is what @tonywu_71 did. I…