Heuristic Pathologies and Further Variance Reduction via Uncertainty Propagation in the AIVAT Family of Techniques


Summary

This paper identifies vulnerabilities in the AIVAT variance reduction technique when the heuristic value function is not fixed prior to evaluation, and shows how to propagate heuristic uncertainty to further reduce variance, achieving a 43% reduction in the number of samples needed for statistical conclusions.

arXiv:2605.14261v1 Announce Type: new Abstract: How should an agent's performance in a multiagent environment be evaluated when there is a limited sample size or a high cost of running a trial? The AIVAT family of variance reduction techniques was proposed to address this challenge by introducing unbiased low-variance estimators of agents' expected payoffs. An important component of AIVAT is a heuristic value function that discriminates between potentially low- and high-value counterfactual histories. A notable gap in the literature is that there is little to no constraint or guideline on how the heuristic value function should be chosen or how uncertainty in its output should be handled. In our first contribution, we parameterize the heuristic value function to highlight AIVAT's potential vulnerabilities: a) the sample variance can be set pathologically low by directly applying gradient descent on the sample variance, and b) one can p-hack to draw a desired statistical conclusion via gradient descent/ascent on the test statistic. The main takeaway is that the heuristic value function should be fixed prior to observing the evaluation data! In our second contribution, we show how the heuristic uncertainty can be propagated to quantify the uncertainty of AIVAT estimates. It is then possible to further reduce the variance using inverse-variance weighted averaging, but AIVAT's unbiasedness guarantee may have to be sacrificed. In our experiments, we use a dataset of 10,000 poker hands to demonstrate our heuristic pathology and uncertainty results, with the latter yielding a 43.0% reduction in the number of samples (poker hands) needed to draw statistical conclusions.

Cached at: 05/15/26, 06:22 AM

# Heuristic Pathologies and Further Variance Reduction via Uncertainty Propagation in the AIVAT Family of Techniques
Source: [https://arxiv.org/html/2605.14261](https://arxiv.org/html/2605.14261)
Juho Kim \(Computer Science Department, Carnegie Mellon University; juhok@cs\.cmu\.edu\) and Tuomas Sandholm \(Computer Science Department, CMU; Strategic Machine, Inc\.; Strategy Robot, Inc\.; Optimized Markets, Inc\.; sandholm@cs\.cmu\.edu\)

###### Abstract

How should an agent’s performance in a multiagent environment be evaluated when there is a limited sample size or a high cost of running a trial? The AIVAT family of variance reduction techniques was proposed to address this challenge by introducing unbiased low\-variance estimators of agents’ expected payoffs\. An important component of AIVAT is a heuristic value function that discriminates between potentially low\- and high\-value counterfactual histories\. A notable gap in the literature is that there is little to no constraint or guideline on how the heuristic value function should be chosen or how uncertainty in its output should be handled\.

In our first contribution, we parameterize the heuristic value function to highlight AIVAT’s potential vulnerabilities: a\) the sample variance can be set pathologically low by directly applying gradient descent on the sample variance, and b\) one can p\-hack to draw a desired statistical conclusion via gradient descent/ascent on the test statistic\. The main takeaway is that the heuristic value function should be fixed prior to observing the evaluation data\! In our second contribution, we show how the heuristic uncertainty can be propagated to quantify the uncertainty of AIVAT estimates\. It is then possible to further reduce the variance using inverse\-variance weighted averaging, but AIVAT’s unbiasedness guarantee may have to be sacrificed\. In our experiments, we use a dataset of 10,000 poker hands to demonstrate our heuristic pathology and uncertainty results, with the latter yielding a 43\.0% reduction in the number of samples \(poker hands\) needed to draw statistical conclusions\.

## 1 Introduction

Evaluating an agent’s performance in a multiagent environment is often challenging, such as when running each trial is costly or time\-consuming\. This is especially true when demonstrating the superhuman capability of an AI agent, which requires human experts to compete against the agent for a long time\. For example, the evaluation of the superhuman heads\-up poker AI agent Libratus \[[4](https://arxiv.org/html/2605.14261#bib.bib1)\] lasted 20 days, morning to evening, and involved four human professionals competing in parallel for a $200,000 prize pool\. The resource\-intensive process of generating sufficient data to draw statistically significant conclusions cannot be avoided unless a low\-variance estimator of the outcome is used\.

The AIVAT \[[6](https://arxiv.org/html/2605.14261#bib.bib6)\] family of variance reduction techniques was proposed to handle the often high\-variance nature of extensive\-form games\. AIVAT reduces the variance introduced by both nature and player actions in primarily two ways\. First, using a heuristic value function, AIVAT evaluates the potential values of counterfactual histories when counterfactual actions are applied to observed histories with known probabilities\. Second, AIVAT uses the fact that, regardless of a particular player’s hidden information, the other players would have acted exactly as they did in the original observations\. Burch et al\. \[[6](https://arxiv.org/html/2605.14261#bib.bib6)\] showed that the AIVAT estimator is unbiased, and experimentally demonstrated a reduction in the required number of trials by “more than a factor of 10” to make the same statistical claims as when it is not used\. Furthermore, the power of AIVAT increases as the number of players whose strategy is known increases\. Conversely, its power decreases as fewer player strategies are taken into account, and it reduces to MIVAT \[[13](https://arxiv.org/html/2605.14261#bib.bib5)\] when only chance probabilities are known\.

The ‘5 humans \+ 1 AI’ experiment of the poker AI agent Pluribus \[[5](https://arxiv.org/html/2605.14261#bib.bib7)\], evaluations of another poker AI agent DeepStack \[[11](https://arxiv.org/html/2605.14261#bib.bib18)\], and several editions of the Annual Computer Poker Competition \(ACPC\) \[[1](https://arxiv.org/html/2605.14261#bib.bib12)\] represent the landmark applications of the AIVAT family of techniques\. The first application was particularly striking in that, although Pluribus finished the experiment with a negative payoff overall, AIVAT was able to show that Pluribus was, in fact, superhuman\.

### 1\.1 Our contributions

In this paper, we provide two types of contributions with regard to the AIVAT family of variance reduction techniques\. The first type is cautionary\. We expand on the fact that surprisingly little has been said about the constraints on how the heuristic value function can be developed\. In the proof of the unbiasedness of the advantage sum by Zinkevich et al\. \[[15](https://arxiv.org/html/2605.14261#bib.bib3)\], which forms the basis of DIVAT \[[2](https://arxiv.org/html/2605.14261#bib.bib2)\], MIVAT \[[13](https://arxiv.org/html/2605.14261#bib.bib5)\], and AIVAT \[[6](https://arxiv.org/html/2605.14261#bib.bib6)\], the authors state that “any” and “all” heuristic value functions yield an unbiased estimator of the true value\. White and Bowling \[[13](https://arxiv.org/html/2605.14261#bib.bib5)\] suggest learning a linear value function from the sample data, where they optimize for the sample variance as a proxy of the true variance, but offer no additional guidelines on the learning process\. Burch et al\. \[[6](https://arxiv.org/html/2605.14261#bib.bib6)\] use their AI agent’s self\-play values as the arbitrary fixed heuristic value function, where the same agent was also involved in generating the data on which the technique is applied\. In this paper, we highlight AIVAT’s potential vulnerabilities by showing that it is possible to learn a heuristic value function that a\) obtains pathologically low variance or b\) p\-hacks to falsely draw a desired statistical conclusion about an agent’s performance\. Using the gameplay of Pluribus, we train such a function by parameterizing the heuristic outputs and applying gradient descent/ascent on the desired objective\. The main takeaway is that the heuristic value function should be fixed prior to observing the evaluation data\! \(The use of Pluribus data is purely for demonstration and should not be taken as a criticism of Brown and Sandholm \[[5](https://arxiv.org/html/2605.14261#bib.bib7)\]; their results were correct\.\)

In our second contribution, we note that while AIVAT accounts for the uncertainty associated with player actions, its use of the heuristic value function introduces another source of uncertainty, namely, how certain the heuristic value function is in its outputs\. One may be more certain of some outputs of the heuristic value function while being less so about others\. This is certainly the case when the value function outputs are nondeterministically\-approximated game\-theoretic values \(e\.g\., Monte Carlo rollouts and/or randomized clustering during abstraction\) or are learned from existing data \[[13](https://arxiv.org/html/2605.14261#bib.bib5)\] and are predicting from inputs in the low\- versus high\-density region of the training distribution\. We quantify the uncertainty in terms of the variance and demonstrate how the heuristic value function uncertainty can be propagated to the estimate level, obtaining a measure of how uncertain a particular value estimate is\. We also show that inverse\-variance weighting can be applied, where less weight is given to estimates with more uncertainty and vice versa, to achieve further variance reduction, albeit at the risk of incorporating some bias into the estimate\. Nonetheless, we a\) show the necessary condition for this bias to be zero, b\) demonstrate how this bias can be estimated, and c\) contend that it is unlikely for game\-playing agents to manipulate the bias to appear to play better\. Using the gameplay data of Pluribus, we report a reduction of up to 43\.0% in the number of required trials \(i\.e\., poker hands\) to reach statistical conclusions\. Our findings make multiagent evaluation more scalable\.

## 2 Notation and background

In this section, we define the notation used throughout the paper and provide the background on extensive\-form games and the AIVAT family of variance reduction techniques\.

### 2\.1 Extensive\-form games

In this paper, we focus our analysis on extensive\-form games, but ideas from AIVAT can also be applied to other representations\. An extensive\-form game has a finite set of players $P$ \(including chance $p_c$\) and histories $H$\. Every history is a sequence of actions played by each player $i \in P$, and is associated with a player $p(h)$ and a set of available actions $A(h)$\. $h \cdot a = h'$ denotes that applying $a \in A(h)$ at $h$ leads to $h'$\. If $p(h) = p_c$, then $f_c(h, a)$ gives a fixed probability distribution over each available action $a \in A(h)$\. Each terminal history $z \in Z \subseteq H$ has a utility $u_i(z)$ for every player $i$\.

The imperfect information setting is represented by information sets $\mathcal{I}_i$: a partition of histories belonging to a non\-chance player $i \in P \setminus \{p_c\}$\. A player $i$ cannot distinguish between $h, h' \in I \in \mathcal{I}_i$\. Therefore, $A(h) = A(h')$, and we denote the set of available actions at an information set by $A(I)$\. Each player $i$ plays with a strategy $\sigma_i(I)$ which assigns a probability distribution over $A(I)$\. Then a strategy profile $\sigma$ is defined as a tuple of all player strategies\. We use $\pi(h)$ to represent the probability of reaching $h$ given players play using $\sigma$\. The contribution of player $i$ to this probability is $\pi_i(h)$\.

#### 2\.1\.1 Agent evaluation

Agent evaluation in extensive\-form games typically requires obtaining an estimate of the expected utility of a particular player $i$ on a given strategy profile $\sigma$:

$$\operatorname{\mathbb{E}}_{z \in Z}[u_i(z) \mid \sigma] = \sum_{z \in Z} \pi(z)\, u_i(z).$$

Before the advent of variance reduction techniques for extensive\-form games, Monte Carlo samples $z_1, \ldots, z_T$ were drawn independently to calculate the mean player utility,

$$\bar{u}_i = \frac{1}{T} \sum_{t=1}^{T} u_i(z_t),$$

which is an unbiased estimator, i\.e\.,

$$\operatorname{\mathbb{E}}[\bar{u}_i \mid \sigma] = \operatorname{\mathbb{E}}_{z \in Z}[u_i(z) \mid \sigma],$$

and has the following variance:

$$\operatorname{Var}[\bar{u}_i \mid \sigma] = \frac{1}{T} \operatorname{Var}[u_i(z) \mid \sigma].$$

The multiagent environment’s stochasticity and desired statistical significance level influence the choice of $T$\. One is therefore limited by the cost of running each trial, and it may be far too expensive to draw a statistically significant conclusion using the Monte Carlo method\.
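To make the dependence on $T$ concrete, the following is a minimal sketch of the plain Monte Carlo estimate and of how the required number of trials scales with the per\-trial standard deviation\. The function names, the normal\-approximation `z_score` of 1\.96, and the notion of a target `margin` are illustrative assumptions, not part of the paper\.

```python
import numpy as np

def monte_carlo_estimate(utilities):
    """Return the sample mean and its standard error for i.i.d. utilities."""
    u = np.asarray(utilities, dtype=float)
    T = u.size
    mean = u.mean()
    # Standard error of the mean: sample std (ddof=1) divided by sqrt(T).
    se = u.std(ddof=1) / np.sqrt(T)
    return mean, se

def trials_needed(per_trial_std, margin, z_score=1.96):
    """Smallest T with z_score * per_trial_std / sqrt(T) <= margin.

    Assumes a normal approximation; z_score=1.96 is an illustrative choice.
    """
    return int(np.ceil((z_score * per_trial_std / margin) ** 2))

mean, se = monte_carlo_estimate([1.0, 2.0, 3.0, 4.0])
```

Because the standard error shrinks only as $1/\sqrt{T}$, halving the per\-trial standard deviation cuts the required number of trials by roughly a factor of four, which is why a lower\-variance estimator matters when trials are expensive\.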

### 2\.2 Variance reduction techniques

The AIVAT family of variance reduction techniques specializes the control variates method for agent evaluation in extensive\-form games to give a low\-variance estimate of any function $v(z)$, an example of which is player utility\. The current state\-of\-the\-art variance reduction technique is AIVAT \[[6](https://arxiv.org/html/2605.14261#bib.bib6)\], which can be thought of as a combination of its two predecessors: imaginary observations \[[3](https://arxiv.org/html/2605.14261#bib.bib4)\] and the advantage sum \[[15](https://arxiv.org/html/2605.14261#bib.bib3)\]\. The AIVAT family of techniques yields unbiased estimators of the value function\.

#### 2\.2\.1 Control variates

The control variates method \[[7](https://arxiv.org/html/2605.14261#bib.bib14), Ch\. 4\] is a standard way to reduce variance in Monte Carlo methods\. We begin by giving a brief description in the language of agent evaluation\. In this setting, we seek to estimate $\operatorname{\mathbb{E}}_{z \in Z}[v(z) \mid \sigma]$\. Suppose the existence of another value function $w(\cdot)$ where $\omega = \operatorname{\mathbb{E}}_{z \in Z}[w(z) \mid \sigma]$ is known\. Then, the following is an unbiased estimator of $\operatorname{\mathbb{E}}_{z \in Z}[v(z) \mid \sigma]$:

$$\hat{v}(z) = v(z) - c\,(w(z) - \omega)$$

for any choice of constant $c$\. Its variance is as follows:

$$\operatorname{Var}(\hat{v}(z)) = \operatorname{Var}(v(z)) + c^{2} \operatorname{Var}(w(z)) - 2c \operatorname{Cov}(v(z), w(z)). \tag{1}$$

When $c^{2} \operatorname{Var}(w(z)) - 2c \operatorname{Cov}(v(z), w(z)) < 0$, variance reduction is achieved\. The optimal choice of $c$ can be derived by differentiating \([1](https://arxiv.org/html/2605.14261#S2.E1)\) with respect to $c$ and setting the derivative to zero:

$$c^{*} = \frac{\operatorname{Cov}(v(z), w(z))}{\operatorname{Var}(w(z))},$$

which results in

$$\operatorname{Var}(\hat{v}(z)) = \left(1 - \operatorname{Corr}(v(z), w(z))^{2}\right) \operatorname{Var}(v(z)).$$
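As a numerical illustration of the formulas above, here is a small sketch on synthetic data\. The data\-generating process \(a control `w` with known mean `omega = 0` and a target `v` that is linear in `w` plus noise\) is a made\-up assumption for demonstration\. Note that this sketch estimates $c^{*}$ from the same samples it is applied to, which in general introduces a small bias; the theory above treats $c$ as a fixed constant\.

```python
import numpy as np

def control_variate_estimates(v, w, omega):
    """Apply v_hat = v - c* (w - omega), with c* estimated from the samples."""
    v = np.asarray(v, dtype=float)
    w = np.asarray(w, dtype=float)
    # c* = Cov(v, w) / Var(w), using sample (ddof=1) moments.
    c_star = np.cov(v, w, ddof=1)[0, 1] / np.var(w, ddof=1)
    return v - c_star * (w - omega), c_star

# Synthetic example (assumption): E[w] = 0 is known exactly.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 1.0, size=10_000)
v = 1.0 + 2.0 * w + rng.normal(0.0, 0.5, size=10_000)

v_hat, c_star = control_variate_estimates(v, w, omega=0.0)
corr = np.corrcoef(v, w)[0, 1]
print("variance before:", np.var(v, ddof=1))
print("variance after: ", np.var(v_hat, ddof=1))
print("(1 - corr^2) * variance before:", (1.0 - corr**2) * np.var(v, ddof=1))
```

With sample moments used throughout, the last two printed values coincide, mirroring the closed\-form variance at $c^{*}$\.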

#### 2\.2\.2 Advantage sum

The advantage sum technique uses control variates with a heuristic value function to reduce the variance of the estimate, and is of the following form:

$$\hat{v}(z) = v(z) - \hat{v}_c(z),$$

where $\hat{v}(\cdot)$, $v(\cdot)$, and $\hat{v}_c(\cdot)$ denote the estimate, value function, and correction term, respectively\. The correction term, which can be thought of as control variates, is defined as follows:

$$\hat{v}_c(z) = \sum_{h \cdot a \in K(z)} \left( v'(h \cdot a) - \sum_{a' \in A(h)} f_c(h, a')\, v'(h \cdot a') \right),$$

with $v'(\cdot)$ the heuristic value function and $K(z)$ the set of histories preceding $z$ where the probability distribution of available actions at the immediate parent history is known\. MIVAT \[[13](https://arxiv.org/html/2605.14261#bib.bib5)\] represents a special case of the advantage sum where no player’s action probabilities, except those of nature, are known\. In this particular context, the subtraction of the correction term can be thought of as canceling the effect of luck \(or lack thereof\)\. No matter the choice of the heuristic value function $v'(\cdot)$, Zinkevich et al\. \[[15](https://arxiv.org/html/2605.14261#bib.bib3)\] showed that the advantage sum \(and hence MIVAT\) is an unbiased estimator of the expected value, i\.e\., $\operatorname{\mathbb{E}}_{z \in Z}[\hat{v}(z) \mid \sigma] = \operatorname{\mathbb{E}}_{z \in Z}[v(z) \mid \sigma]$\.
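The unbiasedness claim can be checked by exact enumeration on a toy game\. The sketch below assumes a hypothetical two\-step game \(one chance action with known probabilities `f_c`, followed by one player action with probabilities `sigma`\); all numbers are invented for illustration\. Only the chance step contributes to the correction term, as in the MIVAT special case\.

```python
import numpy as np

# Toy game (assumption): chance picks a in {0, 1, 2}, then the player picks b in {0, 1}.
f_c = np.array([0.5, 0.3, 0.2])        # known chance probabilities f_c(h, a)
sigma = np.array([0.6, 0.4])           # player's action probabilities after chance
u = np.array([[10.0, 6.0],
              [0.0, -2.0],
              [-12.0, -8.0]])          # utility u[a, b] at each terminal history
pi = np.outer(f_c, sigma)              # reach probability of each terminal history

def advantage_sum(heuristic):
    """heuristic[a] plays the role of v'(h . a); returns v_hat for every terminal."""
    # Correction: v'(h . a) minus the chance-weighted average of v' over all a'.
    correction = heuristic - f_c @ heuristic
    return u - correction[:, None]

def mean_and_variance(values):
    """Exact mean and variance of `values` under the reach probabilities pi."""
    mean = np.sum(pi * values)
    return mean, np.sum(pi * (values - mean) ** 2)

raw_mean, raw_var = mean_and_variance(u)
# A well-informed heuristic: the true expected utility after each chance action.
good_mean, good_var = mean_and_variance(advantage_sum(u @ sigma))
# An arbitrary, poorly informed heuristic.
bad_mean, bad_var = mean_and_variance(advantage_sum(np.array([100.0, -3.0, 7.0])))
```

All three means agree, since the correction term has zero expectation under `f_c` for any heuristic, while the variance depends heavily on how informative the heuristic is\.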

White and Bowling \[[13](https://arxiv.org/html/2605.14261#bib.bib5)\] proposed using a linear function as the heuristic value function, which is to be trained on existing sample data\. In doing so, they proposed to minimize the sample variance as a proxy of the true variance, hence the following optimization problem for the mean\-squared error:

$$\underset{\hat{v}: Z \mapsto \mathbb{R}}{\textbf{Minimize: }} \sum_{t=1}^{T} \left( \hat{v}(z_t) - \frac{1}{T} \sum_{t'=1}^{T} \hat{v}(z_{t'}) \right)^{2}. \tag{2}$$

They derived a closed\-form formula of an optimal linear value function given a feature engineering function and applied it to heads\-up fixed\-limit, heads\-up no\-limit, and 6\-max fixed\-limit poker, observing reductions of up to about 62%, 23%, and 18%, respectively, in the standard deviation\.

#### 2\.2\.3 Imaginary observations

The imaginary observations technique does not use a heuristic value function or control variate terms\. Instead, given a trial outcome $z$, it directly applies the value function on a subset of terminal nodes and obtains an estimate of the expected payoff using importance sampling\. It is well\-suited for estimating the expected payoff of alternative strategies \(which its authors dubbed the off\-policy case\)\. The imaginary observation estimator is unbiased in both the on\-policy and off\-policy cases with full information; however, this guarantee is lost with partial information \[[3](https://arxiv.org/html/2605.14261#bib.bib4)\]\. Nevertheless, imaginary observations in the partial information case remain a useful tool as they can yield a low\-variance estimate\.

#### 2\.2\.4 AIVAT

AIVAT, visualized in Appendix [A](https://arxiv.org/html/2605.14261#A1), is a generalization of the advantage sum where imaginary observations are applied to each considered history\. It is of the following form:

$$\hat{v}(z) = \hat{v}_b(z) + \hat{v}_c(z),$$

where $\hat{v}_b(z)$ is the base term:

$$\hat{v}_b(z) = \frac{\sum_{z' \in U(z)} \pi(z')\, v(z')}{\sum_{z' \in U(z)} \pi(z')},$$

and $\hat{v}_c(z)$ is the correction term:

$$\hat{v}_c(z) = \sum_{h \cdot a \in K(z)} \left( \frac{\sum_{a' \in A(U(h))} \sum_{h' \in U(h)} \pi(h' \cdot a')\, v'(h' \cdot a')}{\sum_{h' \in U(h)} \pi(h')} - \frac{\sum_{h' \in U(h)} \pi(h' \cdot a)\, v'(h' \cdot a)}{\sum_{h' \in U(h)} \pi(h' \cdot a)} \right).$$

$U(h)$ is defined as a set of histories differing from $h$ only by the private information belonging to $p(h)$\. $A(U(h))$ is the common set of actions available to that player\. Burch et al\. \[[6](https://arxiv.org/html/2605.14261#bib.bib6)\] showed that AIVAT is an unbiased estimator regardless of $v'(h)$ and proposed using an AI agent’s self\-play values as the outputs of the heuristic value function, which can also be used to generate the very data being evaluated\.

## 3 Heuristic pathologies

We are ready to present our new results\. We begin by presenting a cautionary tale, highlighting AIVAT’s potential vulnerabilities\. We show that one can learn a heuristic value function to a\) obtain pathologically low variance or b\) p\-hack to falsely draw a desired statistical conclusion about an agent’s performance\. Our results underscore the need to fix the heuristic value function before evaluation\.

For convenience, we simplify the expression for the AIVAT estimate $\hat{v}(\cdot)$ by rewriting it as an affine function of the outputs of the heuristic value function $v'(\cdot)$, as follows:

$$\hat{v}(z) = b(z) + \sum_{h \in H} \mathbf{c}(z)_h\, v'(h), \tag{3}$$

where $b(z)$ is the affine shift and $\mathbf{c}(z) \in \mathbb{R}^{H}$ is the vector of coefficients of each heuristic output $v'(\cdot)$\. Depending on the game, the size of $H$ may be huge; however, for our purpose, we only need to consider the histories and counterfactual histories $h$ encountered during agent evaluation \(i\.e\., where $\mathbf{c}(z)_h \neq 0$\), which is usually a small fraction of the original game tree size if the game is large\. The detailed derivation, as well as the definitions of $b(z)$ and $\mathbf{c}(z)$, is relegated to Appendix [B](https://arxiv.org/html/2605.14261#A2)\.

In our development of a pathological heuristic value function, we define a vector $\boldsymbol{\theta} \in \mathbb{R}^{H}$ to parameterize each heuristic output as follows:

$$v'_{\boldsymbol{\theta}}(h) = \theta_h. \tag{4}$$

By parameterizing as such, we give the heuristic value function the maximum expressive power\. Again, the size of $H$ can be huge\. However, we only need to parameterize the \(counterfactual\) histories encountered during agent evaluation\. We can further simplify the AIVAT estimate expression:

$$\hat{v}_{\boldsymbol{\theta}}(z) = b(z) + \sum_{h \in H} \mathbf{c}(z)_h\, v'_{\boldsymbol{\theta}}(h) = b(z) + \langle \mathbf{c}(z), \boldsymbol{\theta} \rangle. \tag{5}$$
#### 3\.0\.1 Optimizing for the sample variance

Using \([5](https://arxiv.org/html/2605.14261#S3.E5)\), the proxy objective of the optimization problem given by White and Bowling \[[13](https://arxiv.org/html/2605.14261#bib.bib5)\] in \([2](https://arxiv.org/html/2605.14261#S2.E2)\) can be refined as follows:

$$\underset{\boldsymbol{\theta} \in \mathbb{R}^{H}}{\textbf{Minimize: }} C(\boldsymbol{\theta}) = \sum_{t=1}^{T} \left( \hat{v}_{\boldsymbol{\theta}}(z_t) - \frac{1}{T} \sum_{t'=1}^{T} \hat{v}_{\boldsymbol{\theta}}(z_{t'}) \right)^{2} = \sum_{t=1}^{T} \left( \left( b(z_t) - \bar{b} \right) + \left\langle \mathbf{c}(z_t) - \bar{\mathbf{c}}, \boldsymbol{\theta} \right\rangle \right)^{2},$$

where $\bar{b} = \frac{1}{T} \sum_{t=1}^{T} b(z_t)$ and $\bar{\mathbf{c}} = \frac{1}{T} \sum_{t=1}^{T} \mathbf{c}(z_t)$\. \(A more detailed derivation is given in Appendix [C](https://arxiv.org/html/2605.14261#A3)\.\)

###### Proposition 1\.

On a given set of trials, there exists an optimal parameter vector $\boldsymbol{\theta}^{*}$ that achieves the lowest possible sample variance of the estimates\.

The proof is in Appendix [D\.1](https://arxiv.org/html/2605.14261#A4.SS1)\. It follows from the fact that finding the pathological heuristic outputs that minimize the sample variance of a given dataset is a least\-squares problem\.
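The least\-squares view can be sketched directly\. In the snippet below, the arrays `b` \(affine shifts\) and `C` \(rows are the coefficient vectors $\mathbf{c}(z_t)$\) are random placeholders rather than quantities derived from a real game, and the function name is hypothetical\. The example deliberately uses more parameters than trials to show how the sample variance can be driven to essentially zero on a fixed dataset\.

```python
import numpy as np

def min_sample_variance_theta(b, C):
    """Return theta minimizing sum_t ((b_t - b_bar) + <c_t - c_bar, theta>)^2."""
    b = np.asarray(b, dtype=float)
    C = np.asarray(C, dtype=float)
    b_centered = b - b.mean()
    C_centered = C - C.mean(axis=0)
    # Minimizing ||C_centered @ theta + b_centered||^2 is ordinary least squares
    # with design matrix C_centered and target -b_centered.
    theta, *_ = np.linalg.lstsq(C_centered, -b_centered, rcond=None)
    return theta

# Placeholder data (assumption): T trials, H parameterized histories, H > T.
rng = np.random.default_rng(0)
T, H = 50, 80
b = rng.normal(size=T)
C = rng.normal(size=(T, H))

theta = min_sample_variance_theta(b, C)
v_hat = b + C @ theta
print("sample variance with theta = 0:", np.var(b, ddof=1))
print("sample variance after fitting: ", np.var(v_hat, ddof=1))
```

With at least as many free heuristic outputs as trials, the fitted estimates are nearly constant across trials, which says nothing about the agent's true expected payoff\.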

#### 3\.0\.2 Optimizing for the $t$\-statistic

However, optimizing for the sample variance may not be very interesting, as we do not explicitly control what we can say about the given data\. More specifically, one of the main purposes of using variance reduction techniques is to draw statistical conclusions about an agent’s performance\. This is usually done \[[5](https://arxiv.org/html/2605.14261#bib.bib7)\] via a one\-sided $t$\-test from which a p\-value is computed and compared against a desired significance level\. A p\-value can potentially be hacked by either minimizing or maximizing the $t$\-statistic, as shown in the optimization problem below\.

$$\underset{\boldsymbol{\theta} \in \mathbb{R}^{H}}{\textbf{Optimize: }} \frac{\bar{v}_{\boldsymbol{\theta}} - \mu_0}{s_{\boldsymbol{\theta}} / \sqrt{T}},$$

where $\bar{v}_{\boldsymbol{\theta}} = \frac{1}{T} \sum_{t=1}^{T} \hat{v}_{\boldsymbol{\theta}}(z_t)$ is the sample mean, $s_{\boldsymbol{\theta}}^{2} = \frac{\sum_{t=1}^{T} (\hat{v}_{\boldsymbol{\theta}}(z_t) - \bar{v}_{\boldsymbol{\theta}})^{2}}{T - 1}$ is the sample variance, and $\mu_0$ is the mean under the null hypothesis\. We apply gradient descent/ascent on the parameters to minimize or maximize our objective\.
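The following sketch illustrates the idea on placeholder data\. It is not the paper's training setup: the paper uses the Adam optimizer, whereas this example swaps in plain gradient steps with a simple backtracking rule so that each accepted step provably improves the objective\. The arrays `b` and `C`, the function names, and the step\-size constants are all assumptions\. The gradient is derived by hand from the affine form $\hat{v}_{\boldsymbol{\theta}}(z_t) = b(z_t) + \langle \mathbf{c}(z_t), \boldsymbol{\theta} \rangle$\.

```python
import numpy as np

def t_statistic(theta, b, C, mu0=0.0):
    """One-sample t-statistic of the estimates b + C @ theta against mu0."""
    v = b + C @ theta
    return (v.mean() - mu0) / (v.std(ddof=1) / np.sqrt(v.size))

def t_statistic_grad(theta, b, C, mu0=0.0):
    """Analytic gradient of the t-statistic with respect to theta."""
    v = b + C @ theta
    T = v.size
    m, s = v.mean(), v.std(ddof=1)
    c_bar = C.mean(axis=0)                          # gradient of the sample mean
    ds = (v - m) @ (C - c_bar) / ((T - 1) * s)      # gradient of the sample std
    return np.sqrt(T) * (c_bar / s - (m - mu0) * ds / s**2)

def hack_t_statistic(b, C, maximize=True, steps=100, lr=0.05, mu0=0.0):
    """Gradient ascent (or descent) on the t-statistic with backtracking."""
    sign = 1.0 if maximize else -1.0
    theta = np.zeros(C.shape[1])
    best = sign * t_statistic(theta, b, C, mu0)
    for _step in range(steps):
        grad = sign * t_statistic_grad(theta, b, C, mu0)
        step = lr
        for _halving in range(20):
            candidate = theta + step * grad
            value = sign * t_statistic(candidate, b, C, mu0)
            if value > best:                        # accept only improving steps
                theta, best = candidate, value
                break
            step *= 0.5
    return theta

# Placeholder data (assumption): more trials than parameters, so s stays positive.
rng = np.random.default_rng(0)
T, H = 200, 20
b = rng.normal(size=T)
C = rng.normal(size=(T, H))

t0 = t_statistic(np.zeros(H), b, C)
t_up = t_statistic(hack_t_statistic(b, C, maximize=True), b, C)
t_down = t_statistic(hack_t_statistic(b, C, maximize=False), b, C)
print("t with theta = 0:", t0, " maximized:", t_up, " minimized:", t_down)
```

On the same fixed data, the two runs push the statistic in opposite directions, which is the mechanism behind drawing contradictory conclusions when the heuristic is tuned after the data is observed\.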

### 3\.1 Experiments on heuristic pathologies

We conducted experiments on the publicly released poker hand history data from the Pluribus experiment \[[5](https://arxiv.org/html/2605.14261#bib.bib7)\]\. The rules of Texas hold’em are given in Appendix [E](https://arxiv.org/html/2605.14261#A5)\.

#### 3\.1\.1 Optimizing for the sample variance

Table 1: Results on the Pluribus data in milli\-big blinds per hand \(mbb/h\) using a pathological heuristic value function that minimizes the sample variance\. $SE$ stands for the standard error of the mean\.

In our first experiment, we aim to produce a result that has extremely low variance but does not pertain to reality\. To do so, we optimize on the sample variance as the proxy objective\. Since Pluribus’s action distribution is not known, our AIVAT implementation reduces to MIVAT\. We trained a heuristic value function, outputting an estimate of the expected utilities at every player position\. Although the least\-squares problem could theoretically have been solved directly, the transformed data matrix $[(\mathbf{c}(z_t) - \bar{\mathbf{c}})^{\top}]_{t \in \{1, \dots, T\}}$ was too large to fit into memory, so we applied gradient descent on the parameters for 250 iterations with the Adam optimizer \($\eta = 100, \beta_1 = 0.9, \beta_2 = 0.999, \lambda = 0$\)\. The unusually high learning rate \($\eta$\) was chosen as lower \(and more conventional\) choices \(e\.g\., $0.001$\) converged too slowly\.

Using the learned parameters, we were able to achieve pathologically low variance estimates of all players’ expected utilities, as shown in Table [1](https://arxiv.org/html/2605.14261#S3.T1)\. The results show that every player won by an extremely high margin of over 2,000 mbb/h\. Considering that the poker community typically characterizes win rates of 100 mbb/h as immense, these values are clearly unrealistic\. They also violate the zero\-sum constraint; this is because the heuristic value function was trained without such a constraint\. This experiment shows that, when the data is fixed and the heuristic value function is directly optimized using the proxy objective given by White and Bowling \[[13](https://arxiv.org/html/2605.14261#bib.bib5)\], one can produce a nonsensical result that does not pertain to reality, especially with enough degrees of freedom to be exploited\.

#### 3\.1\.2 Optimizing for the $t$\-statistic

In our second experiment, we explore how to falsely make desired claims about the given data by p\-hacking via gradient descent/ascent to either minimize or maximize the $t$\-statistic\. Here, we again train a parameterized heuristic value function except that, this time, a player’s $t$\-statistic is being optimized\. For each run, when the $t$\-statistic is being minimized, the null hypothesis is that a player did not have a negative win rate\. Conversely, when a player’s $t$\-statistic is being maximized, the null hypothesis is that the player did not have a positive win rate\. The goal of this experiment is to obtain a small enough p\-value for each hypothesis test to draw statistically significant conclusions that every player both won and lost\. The parameters were optimized for 10 iterations with the Adam optimizer \($\eta = 100, \beta_1 = 0.9, \beta_2 = 0.999, \lambda = 0$\)\. Again, the common choices for $\eta$ converged too slowly\.

The results are shown in Table [2](https://arxiv.org/html/2605.14261#S3.T2)\. For every player, it was possible to train separate heuristic value functions, one for winning and another for losing, using which one could show that they both won and lost on the same data with overwhelmingly low p\-values\. The implication of this is that although the AIVAT family of variance reduction techniques is guaranteed to be unbiased, one can still craft a heuristic value function that supports whatever conclusion one wants to draw when the data being evaluated is known a priori\. One interpretation of this result is that, when an adversary controls the heuristic value function and is aware of the data being evaluated, the adversary can make the evaluator draw conclusions that are nonsensical, contradictory, or incorrect\.

Table 2: Results on the Pluribus data using pathological heuristic value functions for each $t$\-statistic\.

Note that we are not criticizing the application of AIVAT by Brown and Sandholm \[[5](https://arxiv.org/html/2605.14261#bib.bib7)\]\. We are simply using their dataset to demonstrate how invalid conclusions can be drawn due to heuristic pathologies\. These pathologies also shed light on the need to fix the heuristic value function prior to evaluation\. While we are unaware of existing AIVAT applications that violate this, to our knowledge, we are the first to point out vulnerabilities of this nature in the AIVAT family of variance reduction techniques\.

In the next section, we provide a practical methodology for training and evaluating the AIVAT family of estimators, especially in the face of data scarcity, which leads to additional variance reduction\.

## 4 Further variance reduction using heuristic uncertainty

We now move to our second contribution, which is a new way of obtaining additional variance reduction on top of AIVAT. We continue our discussion about the heuristic value function in variance reduction techniques by approaching it from a different angle: uncertainty of heuristic outputs. We use variance to quantify uncertainty. We continue from the simplified expression shown in ([3](https://arxiv.org/html/2605.14261#S3.E3)). Defining the covariance matrix $\mathbf{\Sigma}(z)$ where, for each $(h_{1},h_{2})\in H\times H$, $\mathbf{\Sigma}(z)_{h_{1},h_{2}}=\operatorname{Cov}(v'(h_{1}),v'(h_{2}))$,

$$\operatorname{Var}(\hat{v}(z))=\operatorname{Var}\left(b(z)+\sum_{h\in H}\mathbf{c}(z)_{h}v'(h)\right)=\operatorname{Var}\left(\sum_{h\in H}\mathbf{c}(z)_{h}v'(h)\right)=\mathbf{c}(z)^{\top}\mathbf{\Sigma}(z)\mathbf{c}(z).\tag{6}$$

Note that, in practice, $\mathbf{c}(z)$ is sparse (see Appendix [B](https://arxiv.org/html/2605.14261#A2)), and hence $\mathbf{\Sigma}(z)$ can be implemented as sparse, setting irrelevant rows and columns to zeros. One can simplify the expression further, as in the following, by assuming that the heuristic value outputs are uncorrelated:

$$\operatorname{Var}(\hat{v}(z))=\sum_{h\in H}\mathbf{c}(z)_{h}^{2}\operatorname{Var}(v'(h)).\tag{7}$$

Note that the above assumption was not made in our main experimental results, presented later.
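The two propagation formulas can be sketched with sparse containers. This is an illustrative sketch under assumed data structures: the dictionary layout, the function names, and the numbers are hypothetical and not taken from the paper's implementation.

```python
def propagate_variance(c, cov):
    """Return c^T Sigma c for sparse inputs.

    c   -- dict mapping a history label to its coefficient c(z)_h
    cov -- dict mapping (h1, h2) to Cov(v'(h1), v'(h2)); missing pairs count as 0
    """
    return sum(
        c1 * c2 * cov.get((h1, h2), 0.0)
        for h1, c1 in c.items()
        for h2, c2 in c.items()
    )


def propagate_variance_uncorrelated(c, var):
    """Return sum_h c_h^2 Var(v'(h)), the uncorrelated special case."""
    return sum(coef ** 2 * var[h] for h, coef in c.items())


# Hypothetical numbers for two histories touched by one terminal history z.
c = {"h1": 0.5, "h2": -1.0}
cov = {("h1", "h1"): 4.0, ("h2", "h2"): 1.0, ("h1", "h2"): 0.5, ("h2", "h1"): 0.5}
var = {"h1": 4.0, "h2": 1.0}

full = propagate_variance(c, cov)               # uses the off-diagonal covariances
diag = propagate_variance_uncorrelated(c, var)  # ignores them
print(full, diag)
```

Only histories with a nonzero coefficient are visited, which mirrors the sparsity remark above. In this toy example the two results differ because the off-diagonal covariance is nonzero.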

Just as we are given $T$ independent trials $z_{1},\ldots,z_{T}$ and assume the values $v(z_{1}),\ldots,v(z_{T})$ are independent and identically distributed (i.i.d.), we make the standard \[[5](https://arxiv.org/html/2605.14261#bib.bib7)\] assumption that the AIVAT estimates are also i.i.d. Then, the variance of the arithmetic mean of AIVAT estimates $\bar{v}=\frac{1}{T}\sum_{t=1}^{T}\hat{v}(z_{t})$ is as follows:

$$\operatorname{Var}(\bar{v})=\frac{\sum_{t=1}^{T}\operatorname{Var}(\hat{v}(z_{t}))}{T^{2}}.\tag{8}$$

We can treat the variance of the mean as a proxy objective to be minimized. It is possible to improve from ([8](https://arxiv.org/html/2605.14261#S4.E8)) by assigning an (unnormalized) weight $w_{t}$ to each estimate to obtain the weighted average, i.e.,

$$\bar{v}^{\ast}=\frac{\sum_{t=1}^{T}w_{t}\hat{v}(z_{t})}{\sum_{t=1}^{T}w_{t}}.$$

Consider inverse-variance weighting (IVW) \[[8](https://arxiv.org/html/2605.14261#bib.bib8), Ch. 4\], which puts less weight on those with higher uncertainty and vice versa: $w_{t}=\frac{1}{\operatorname{Var}(\hat{v}(z_{t}))}$. The variance of the IVW average is then

$$\operatorname{Var}(\bar{v}^{\ast})=\frac{1}{\sum_{t=1}^{T}\frac{1}{\operatorname{Var}(\hat{v}(z_{t}))}}.\tag{9}$$
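As a small numerical illustration of (8) and (9), the sketch below compares the uniform and the inverse-variance weighted averages. The function names and the three estimates with their variances are hypothetical values chosen for the example.

```python
def uniform_average(estimates, variances):
    """Arithmetic mean and its variance under independence, as in (8)."""
    T = len(estimates)
    return sum(estimates) / T, sum(variances) / T ** 2


def ivw_average(estimates, variances):
    """Inverse-variance weighted mean and its variance, as in (9)."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    mean = sum(w * x for w, x in zip(weights, estimates)) / total
    return mean, 1.0 / total


estimates = [10.0, -20.0, 30.0]  # hypothetical per-hand estimates
variances = [1.0, 4.0, 4.0]      # hypothetical propagated variances

uniform_mean, uniform_var = uniform_average(estimates, variances)
ivw_mean, ivw_var = ivw_average(estimates, variances)
print(uniform_mean, uniform_var)
print(ivw_mean, ivw_var)
```

The IVW variance is never larger than the uniform one, and the two coincide when all per-estimate variances are equal.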
###### Proposition 2\.

Assuming that the AIVAT estimates are independent, their IVW average yields the minimum variance amongst all weighted averages\.

The proof is in Appendix [D.2](https://arxiv.org/html/2605.14261#A4.SS2). It follows from a well-known result in statistics; cf. Hartung et al. \[[8](https://arxiv.org/html/2605.14261#bib.bib8), Ch. 4\]. Thus, IVW improves upon uniform weighting. When the estimates being averaged over are all equally uncertain, ([8](https://arxiv.org/html/2605.14261#S4.E8)) and ([9](https://arxiv.org/html/2605.14261#S4.E9)) are equivalent. Ideally, we would like the IVW average to also be an unbiased estimator. This is true when the weights are independent of the data being averaged over, but not in general, as shown in the following proposition:

###### Proposition 3\.

When the weights are independent of the data being averaged over, the weighted average of AIVAT estimates is unbiased, i.e., $\mathbb{E}[\bar{v}^{\ast}\mid\sigma]=\mathbb{E}_{z\in Z}[v(z)\mid\sigma]$. But when the weights are correlated with the data being averaged over, the weighted average of AIVAT estimates is, in general, biased, i.e., $\mathbb{E}[\bar{v}^{\ast}\mid\sigma]\neq\mathbb{E}_{z\in Z}[v(z)\mid\sigma]$. Assuming that the data points are i.i.d., the asymptotic bias is $\operatorname{Cov}(w,\hat{v}(z)\mid\sigma)/\mathbb{E}[w]$, where $w$ and $\hat{v}(z)$ are random variables representing the unnormalized weights and estimated expected utilities.

The proof is in Appendix [D.3](https://arxiv.org/html/2605.14261#A4.SS3). To address this, we can estimate this bias; the estimator remains useful as long as the estimated bias is much smaller than the reduction in the standard error of the weighted mean. While we can technically use this estimate to offset the IVW mean, doing so adds a highly non-trivial term to the variance.
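A plug-in estimate of the asymptotic bias $\operatorname{Cov}(w,\hat{v}(z)\mid\sigma)/\mathbb{E}[w]$ can be sketched as follows. This is an illustrative sketch rather than the procedure used in the paper: the function name, the $1/T$ covariance normalization, and the sample values are assumptions.

```python
def estimated_asymptotic_bias(weights, estimates):
    """Plug-in estimate of Cov(w, v_hat) / E[w] from paired samples."""
    T = len(weights)
    w_bar = sum(weights) / T
    v_bar = sum(estimates) / T
    # Sample covariance with the 1/T normalization (an assumption of this sketch).
    cov = sum((w - w_bar) * (v - v_bar) for w, v in zip(weights, estimates)) / T
    return cov / w_bar


weights = [1.0, 0.25, 0.25, 0.5]      # hypothetical unnormalized IVW weights
estimates = [10.0, -20.0, 30.0, 0.0]  # hypothetical AIVAT estimates

bias = estimated_asymptotic_bias(weights, estimates)
weighted_mean = sum(w * v for w, v in zip(weights, estimates)) / sum(weights)
uniform_mean = sum(estimates) / len(estimates)
print(bias, weighted_mean - uniform_mean)
```

With the $1/T$ normalization, this plug-in value equals the gap between the weighted and the uniform mean of the same sample, which gives a quick sanity check on an implementation.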

It is important to note that the estimator we introduce does not inherently introduce bias. Indeed, whether or not our estimator is biased depends on the specific learning algorithm used to train the heuristic value function, since this determines the AIVAT estimates we obtain and their corresponding weights. Additionally, fixing the specific learning algorithm can help us derive even stronger theoretical bounds. Later in our experiments, the model we used as the heuristic value function is a Gaussian process regressor (GPR) \[[14](https://arxiv.org/html/2605.14261#bib.bib17)\], which relies on Gaussian assumptions; one fundamental statistical property of such models is that the mean and the variance are structurally independent. The estimated uncertainty depends only on the input space, so the output and estimated uncertainty have zero covariance under Gaussian priors, and thus our estimator is unbiased. The same holds for the other models we use in Appendix [H](https://arxiv.org/html/2605.14261#A8): Bayesian ridge (BR) \[[12](https://arxiv.org/html/2605.14261#bib.bib15)\] and automatic relevance determination (ARD) \[[10](https://arxiv.org/html/2605.14261#bib.bib16)\], which operate under the same foundational statistical assumptions.
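The claim that the estimated uncertainty depends only on the inputs can be checked in the simplest Gaussian model. The sketch below assumes a one-dimensional Bayesian linear model, which corresponds to a Gaussian process with a homogeneous dot-product kernel and white observation noise, with the hyperparameters `prior_var` and `noise_var` held fixed. The function name and all numbers are hypothetical.

```python
def posterior(xs, ys, x_new, prior_var=1.0, noise_var=0.5):
    """Posterior mean and variance of f(x_new) = w * x_new.

    Model: y = w * x + noise, w ~ N(0, prior_var), noise ~ N(0, noise_var).
    """
    precision = 1.0 / prior_var + sum(x * x for x in xs) / noise_var
    w_var = 1.0 / precision  # depends on the inputs only
    w_mean = w_var * sum(x * y for x, y in zip(xs, ys)) / noise_var
    return w_mean * x_new, w_var * x_new ** 2


xs = [-1.0, 0.5, 2.0]         # hypothetical one-dimensional features
payoffs_a = [-0.8, 0.7, 2.1]  # two hypothetical payoff vectors
payoffs_b = [5.0, -3.0, 0.0]

mean_a, var_a = posterior(xs, payoffs_a, x_new=1.5)
mean_b, var_b = posterior(xs, payoffs_b, x_new=1.5)
print(mean_a, var_a)
print(mean_b, var_b)
```

The two payoff vectors give different predictive means but the same predictive variance, because the variance formula never touches `ys`. This relies on the hyperparameters being fixed in advance, which is an assumption of the sketch.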

Besides, it is unclear how a player could even play to manipulate such a weighting scheme, since the bias depends on the correlation between the IVW weights and the expected utility estimates\. It would require them to play differently while predicting the confidence of the heuristic value function chosen by the evaluator in its outputs for different histories\. A similar bias\-variance tradeoff was previously explored in the context of imaginary observations \(a predecessor of AIVAT\) in the case of partial information\. We also demonstrate this tradeoff in our experiments below\.

### 4.1 Experiments on heuristic uncertainty propagation

We conducted experiments on the publicly available poker hand history data from the Pluribus experiment \[[5](https://arxiv.org/html/2605.14261#bib.bib7)\]. Since Pluribus's action distribution is not known to us, our implementation of AIVAT reduces to MIVAT. In learning the heuristic value function for our MIVAT estimator, we chose the Gaussian Process Regressor (GPR) \[[14](https://arxiv.org/html/2605.14261#bib.bib17)\] with Dot Product and White kernels. We refer to this MIVAT estimator as MIVAT-GPR. For feature engineering, we took an approach similar to that of White and Bowling \[[13](https://arxiv.org/html/2605.14261#bib.bib5)\], detailed in Appendix [F](https://arxiv.org/html/2605.14261#A6). The Pluribus data consists of 10,000 poker hands. We applied k-fold cross-validation with $k=10$ to calculate the estimates for the entirety of the data. In each fold, a heuristic value function was trained using all relevant history-payoff pairs in the training set, and then the resulting estimator was evaluated on the test set. While training MIVAT-GPR, in each fold, 1,000 hands were subsampled from the training set due to computational constraints.
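The cross-validation protocol above can be sketched as an index generator. This is a minimal sketch under assumptions the text does not state: hands are shuffled once with a fixed seed, folds are formed by striding through the shuffled indices, and the 1,000 training hands are drawn without replacement. The function name is hypothetical.

```python
import random


def kfold_with_subsample(n_hands, k=10, subsample=1_000, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    rng = random.Random(seed)
    indices = list(range(n_hands))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        # Training pool: every hand outside the held-out fold.
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        # Subsample the pool to keep training tractable.
        yield rng.sample(train, min(subsample, len(train))), test


splits = list(kfold_with_subsample(10_000))
for train, test in splits[:2]:
    print(len(train), len(test))
```

Every hand is held out exactly once, so estimates are obtained for all 10,000 hands while each heuristic value function is trained only on hands outside its test fold.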

Table 3: Results on the Pluribus data in milli-big blinds per hand (mbb/h). $SE$ stands for the standard error of the (weighted) mean, and 'Est.' is an abbreviation for 'Estimated'.

| Estimator | Weighting | Win rate | $SE$ | Est. bias |
|---|---|---|---|---|
| MIVAT-GPR (ours) | Uniform | -25 | 99 | – |
| MIVAT-GPR (ours) | IVW | -22 | 75 | 3 |
| Raw | Uniform | -70 | 88 | – |
| MIVAT-WB \[[13](https://arxiv.org/html/2605.14261#bib.bib5)\] | Uniform | -98 | 85 | – |
| AIVAT \[[6](https://arxiv.org/html/2605.14261#bib.bib6),[5](https://arxiv.org/html/2605.14261#bib.bib7)\] | Uniform | 48 | 25 | – |

Our objective for the experiment is not to draw statistical conclusions about Pluribus's superhuman performance, which, realistically, would require access to Pluribus's strategies. Instead, we seek to demonstrate that IVW gives rise to an estimate that has a lower standard error of the (weighted) mean than uniform weighting (more details are available in Appendix [G](https://arxiv.org/html/2605.14261#A7)). In our results, we also include the results for AIVAT, as reported by Brown and Sandholm \[[5](https://arxiv.org/html/2605.14261#bib.bib7)\]. It would be unfair to compare our performance with theirs, as they had access to Pluribus's action probabilities, which vastly increases AIVAT's variance-reduction power compared to when only the chance probabilities are known (which is what we know). Additionally, the results using MIVAT with the linear heuristic value function obtained through the steps introduced by White and Bowling \[[13](https://arxiv.org/html/2605.14261#bib.bib5)\] were included. We refer to this estimator as MIVAT-WB. In Appendix [I](https://arxiv.org/html/2605.14261#A9), we derive this function for the more general case of AIVAT, but we also critique its assumptions and lack of regularization.

Table [3](https://arxiv.org/html/2605.14261#S4.T3) shows the performance of our MIVAT-GPR estimator, along with that of the baselines. Under uniform averaging (the simple mean), MIVAT-GPR does worse than MIVAT-WB, which is understandable, as we used far less data to train GPR due to computational constraints. It even performs worse than when no estimator is applied (see 'Raw'). Evidently, because of the much smaller training set used for its heuristic value function, GPR is inadequate at calculating the heuristic value of some counterfactual situations. Under IVW averaging, however, MIVAT-GPR far outperforms both its uniformly averaged counterpart and MIVAT-WB. Indeed, MIVAT-GPR achieved an approximately 24.5% reduction in the standard error of the (weighted) mean compared to uniform averaging, corresponding to a roughly 43.0% reduction in the number of hands required to reach the same statistical significance. Additionally, the estimated bias of the IVW estimate is about an order of magnitude smaller than the reduction in the standard error of the weighted mean. This is evidence that IVW, which puts lower weights on outputs with higher uncertainty, can help achieve a low-variance estimate of expected player utilities in games even when heuristics are sometimes poor.
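The link between the two percentages can be made explicit. Assuming that the standard error shrinks as $1/\sqrt{T}$ under both weighting schemes, the number of hands needed to reach a given standard error scales with the square of the standard error obtained at a fixed sample size. The subscripted symbols below are shorthand introduced only for this calculation.

```latex
\frac{T_{\mathrm{IVW}}}{T_{\mathrm{uniform}}}
  \approx \left(\frac{SE_{\mathrm{IVW}}}{SE_{\mathrm{uniform}}}\right)^{2}
  = (1 - 0.245)^{2}
  = 0.755^{2}
  \approx 0.570,
\qquad
1 - 0.570 = 0.430 .
```

A 24.5% reduction in the standard error therefore corresponds to needing roughly 57.0% of the hands, i.e., a reduction of about 43.0%.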

Another result to note is that MIVAT-WB \[[13](https://arxiv.org/html/2605.14261#bib.bib5)\] performs noticeably worse compared to MIVAT-GPR under IVW and not much better than when no estimator is used. At first glance, this is surprising, as their machine learning procedure is supposed to produce a linear heuristic value function that is optimal in the sense that it minimizes the sample variance of the training set. In Appendix [I](https://arxiv.org/html/2605.14261#A9), we note its lack of regularization, and it is clear that this learning process is too aggressive and can be prone to overfitting, which hurts the estimator's performance. To our knowledge, we are the first to note that much simpler procedures of learning the heuristic value function can outperform the linear value function of White and Bowling \[[13](https://arxiv.org/html/2605.14261#bib.bib5)\]. Additional experimental results are detailed in Appendix [H](https://arxiv.org/html/2605.14261#A8).

## 5 Conclusions and future research

In this paper, we studied the roles played by the underspecified heuristic value function in the AIVAT family of variance reduction techniques. First, we developed a heuristic value function that allowed us to pathologically lower the variance and draw desired statistical conclusions about agents' performances. Our results showed that, when the data is fixed and known, an adversary can craft a heuristic value function that can lead to drawing nonsensical, misleading, or contradictory statistical conclusions. Thus, the heuristic value function should be fixed prior to evaluation!

Second, we showed that the uncertainty of the heuristic value function outputs, if known, can be propagated to find the uncertainty of AIVAT estimates\. Our contribution on heuristic uncertainty is particularly notable in that it leverages extra\-game\-theoretical considerations to yield further variance reduction\. We demonstrated that IVW further reduces the variance and yields a 43\.0% reduction in the number of samples \(poker hands\) required to draw statistical conclusions\. While this may come at a cost of possible bias, we can estimate this bias, and fixing the learning algorithm for the heuristic value function can yield stronger theoretical guarantees\.

A possible future direction is developing new methods to quantify the uncertainty of self\-play values approximated using non\-deterministic game\-solving algorithms, which can then be used by AIVAT with IVW\.

## Acknowledgements

This work has been supported by the Vannevar Bush Faculty Fellowship ONR N00014\-23\-1\-2876, National Science Foundation grant RI\-2312342, and NIH award A240108S001\. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies\.

## References

- \[1\] (2013) The annual computer poker competition. AI Magazine 34(2), pp. 112–114.
- \[2\] D. Billings and M. Kan (2006) A tool for the direct assessment of poker decisions. ICGA Journal 29(3), pp. 119–142.
- \[3\] M. Bowling, M. Johanson, N. Burch, and D. Szafron (2008) Strategy evaluation in extensive games with importance sampling. In Proceedings of the International Conference on Machine Learning (ICML).
- \[4\] N. Brown and T. Sandholm (2018) Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science 359(6374), pp. 418–424.
- \[5\] N. Brown and T. Sandholm (2019) Superhuman AI for multiplayer poker. Science 365(6456), pp. 885–890.
- \[6\] N. Burch, M. Schmid, M. Moravcik, D. Morrill, and M. Bowling (2018) AIVAT: a new variance reduction technique for agent evaluation in imperfect information games. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI).
- \[7\] P. Glasserman (2004) Monte Carlo methods in financial engineering. Springer.
- \[8\] J. Hartung, G. Knapp, and B. K. Sinha (2011) Statistical meta-analysis with applications. John Wiley & Sons.
- \[9\] J. W. Kirchner (2006) Data analysis toolkit #12: weighted averages and their uncertainties.
- \[10\] D. J. C. MacKay (1996) Bayesian non-linear modeling for the prediction competition. In Proceedings of the International Workshop on Maximum Entropy and Bayesian Methods of Statistical Analysis.
- \[11\] M. Moravčík, M. Schmid, N. Burch, V. Lisý, D. Morrill, N. Bard, T. Davis, K. Waugh, M. Johanson, and M. Bowling (2017) DeepStack: expert-level artificial intelligence in heads-up no-limit poker. Science 356(6337), pp. 508–513.
- \[12\] M. E. Tipping (2001) Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res. 1, pp. 211–244.
- \[13\] M. White and M. Bowling (2009) Learning a value analysis tool for agent evaluation. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI).
- \[14\] C. K. I. Williams and C. E. Rasmussen (2006) Gaussian processes for machine learning. MIT Press.
- \[15\] M. Zinkevich, M. Bowling, N. Bard, M. Kan, and D. Billings (2006) Optimal unbiased estimators for evaluating agent performance. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI).

## Appendix A Visualization of AIVAT

A simple visualization of AIVAT is given in Figure [1](https://arxiv.org/html/2605.14261#A1.F1).

[Figure 1 (diagram not reproduced): a path through the histories $h_{1},\ldots,h_{4}$ with actions $a_{1},\ldots,a_{4}$ leading to the terminal history $z$. Each $h_{i}$ contributes the terms $+\,\mathbb{E}[v'(h_{i}\cdot a)]$ and $-\,\mathbb{E}[v'(h_{i}\cdot a_{i})]$, which together with $+\,\mathbb{E}[v(z)]$ sum to $\hat{v}(z)$.]

Figure 1: A diagram representing AIVAT. Circles, squares, and diamonds denote chance nodes, player-1 nodes, and player-2 nodes, respectively. This diagram illustrates a situation where only the action probabilities of the nodes belonging to nature and Player 1 are known. The right side of the diagram shows the terms involved in the calculation of the AIVAT estimate. Here, $U(h)=\{h\}$ for the sake of simplicity.

## Appendix B Simplifying the expression for AIVAT

We simplified the expression for the AIVAT estimate $\hat{v}(\cdot)$ by rewriting it as an affine function of the outputs of the value function $v'(\cdot)$ in Section [3](https://arxiv.org/html/2605.14261#S3). The detailed derivation process is given below:

$$\begin{aligned}
\hat{v}(z)&=\hat{v}_{b}(z)+\hat{v}_{c}(z)\\
&=\frac{\sum_{z'\in U(z)}\pi(z')v(z')}{\sum_{z'\in U(z)}\pi(z')}+\sum_{h\cdot a\in K(z)}\left(\frac{\sum_{a'\in A(U(h))}\sum_{h'\in U(h)}\pi(h'\cdot a')v'(h'\cdot a')}{\sum_{h'\in U(h)}\pi(h')}\right.\\
&\qquad\left.-\frac{\sum_{h'\in U(h)}\pi(h'\cdot a)v'(h'\cdot a)}{\sum_{h'\in U(h)}\pi(h'\cdot a)}\right)\\
&=\sum_{z'\in U(z)}\frac{\pi(z')v(z')}{\sum_{z''\in U(z)}\pi(z'')}\\
&\qquad+\sum_{h\cdot a\in K(z)}\sum_{a'\in A(U(h))}\sum_{h'\in U(h)}\frac{\pi(h'\cdot a')}{\sum_{h''\in U(h)}\pi(h'')}v'(h'\cdot a')\\
&\qquad+\sum_{h\cdot a\in K(z)}\sum_{h'\in U(h)}\left(-\frac{\pi(h'\cdot a)}{\sum_{h''\in U(h)}\pi(h''\cdot a)}\right)v'(h'\cdot a).
\end{aligned}$$

For the sake of simplicity, define the affine shift $b(z)$,

$$b(z)=\sum_{z'\in U(z)}\frac{\pi(z')v(z')}{\sum_{z''\in U(z)}\pi(z'')},$$

the sets $S(z)$ and $S'(z)$ of $H$ (note that $S'(z)\subseteq S(z)$),

$$S(z)=\{h'\cdot a':\exists h\cdot a\in K(z),a'\in A(U(h)),h'\in U(h)\},\quad S'(z)=\{h'\cdot a:\exists h\cdot a\in K(z),h'\in U(h)\},$$

and $\mathbf{c}(z)\in\mathbb{R}^{H}$, the vector of coefficients of each $v'(\cdot)$. Explicitly, for each $h\cdot a\in S(z)\setminus S'(z)$,

$$\mathbf{c}(z)_{h\cdot a}=\frac{\pi(h\cdot a)}{\sum_{h'\in U(h)}\pi(h')},$$

for each $h\cdot a\in S'(z)$,

$$\mathbf{c}(z)_{h\cdot a}=\frac{\pi(h\cdot a)}{\sum_{h'\in U(h)}\pi(h')}-\frac{\pi(h\cdot a)}{\sum_{h'\in U(h)}\pi(h'\cdot a)},\tag{10}$$

and finally for the remaining $h\notin S(z)$,

$$\mathbf{c}(z)_{h}=0.$$

Putting it all together, we obtain the following simplified expression:

$$\hat{v}(z)=b(z)+\sum_{h\in H}\mathbf{c}(z)_{h}v'(h).\tag{11}$$
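To make the coefficients concrete, the sketch below treats the simplest case $U(h)=\{h\}$ at a single decision point whose action probabilities are known. Under this assumption, $\pi(h\cdot a)/\pi(h)$ is the probability of action $a$ at $h$, so the coefficient is that probability for an action not taken and that probability minus one for the action taken. The function names and numbers are hypothetical.

```python
def coefficients(action_probs, taken):
    """Coefficients c(z)_{h.a'} contributed by one decision point h.

    action_probs -- dict mapping each action a' at h to its known probability
    taken        -- the action actually observed on the way to z
    """
    return {a: p - (1.0 if a == taken else 0.0) for a, p in action_probs.items()}


def correction(action_probs, heuristic, taken):
    """Expected heuristic value at h minus the heuristic value of the taken action."""
    c = coefficients(action_probs, taken)
    return sum(c[a] * heuristic[a] for a in action_probs)


# Hypothetical chance node with three outcomes and heuristic values v'(h.a).
action_probs = {"a1": 0.5, "a2": 0.3, "a3": 0.2}
heuristic = {"a1": 4.0, "a2": -1.0, "a3": 2.5}

# Averaging the correction over which action is taken gives zero.
expected_correction = sum(
    p * correction(action_probs, heuristic, a) for a, p in action_probs.items()
)
print(correction(action_probs, heuristic, "a1"), expected_correction)
```

The correction term averages to zero over the known action distribution whatever heuristic values are supplied, which is the mechanism behind the unbiasedness of the estimate in this simplified setting.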

## Appendix C Detailed refinement of the optimization problem for sample variance

The detailed derivation process for the refinement of the proxy objective of the optimization problem given by White and Bowling \[[13](https://arxiv.org/html/2605.14261#bib.bib5)\] in ([2](https://arxiv.org/html/2605.14261#S2.E2)) is as follows:

$$\begin{aligned}
\underset{\boldsymbol{\theta}\in\mathbb{R}^{H}}{\text{Minimize: }}\;C(\boldsymbol{\theta})&=\sum_{t=1}^{T}\left(\hat{v}_{\boldsymbol{\theta}}(z_{t})-\frac{1}{T}\sum_{t'=1}^{T}\hat{v}_{\boldsymbol{\theta}}(z_{t'})\right)^{2}\\
&=\sum_{t=1}^{T}\left(b(z_{t})+\langle\mathbf{c}(z_{t}),\boldsymbol{\theta}\rangle-\frac{1}{T}\sum_{t'=1}^{T}\left(b(z_{t'})+\langle\mathbf{c}(z_{t'}),\boldsymbol{\theta}\rangle\right)\right)^{2}\\
&=\sum_{t=1}^{T}\left(b(z_{t})+\langle\mathbf{c}(z_{t}),\boldsymbol{\theta}\rangle-\left(\bar{b}+\langle\bar{\mathbf{c}},\boldsymbol{\theta}\rangle\right)\right)^{2}\\
&=\sum_{t=1}^{T}\left(\left(b(z_{t})-\bar{b}\right)+\left\langle\mathbf{c}(z_{t})-\bar{\mathbf{c}},\boldsymbol{\theta}\right\rangle\right)^{2},
\end{aligned}$$

where $\bar{b}=\frac{1}{T}\sum_{t=1}^{T}b(z_{t})$ and $\bar{\mathbf{c}}=\frac{1}{T}\sum_{t=1}^{T}\mathbf{c}(z_{t})$.

## Appendix DOmitted proofs

This section contains several proofs that were omitted in the main body of the paper\.

### D.1 Proof of Proposition [1](https://arxiv.org/html/2605.14261#Thmproposition1)

###### Proof\.

The cost function $C(\boldsymbol{\theta})=\sum_{t=1}^{T}\left(\left(b(z_{t})-\bar{b}\right)+\left\langle\mathbf{c}(z_{t})-\bar{\mathbf{c}},\boldsymbol{\theta}\right\rangle\right)^{2}$ can be rewritten in the standard least-squares form as follows: $C(\boldsymbol{\theta})=\|\boldsymbol{y}+\boldsymbol{X}\boldsymbol{\theta}\|_{2}^{2}$, where $\boldsymbol{X}=\begin{bmatrix}(\mathbf{c}(z_{t})-\bar{\mathbf{c}})^{\top}\end{bmatrix}_{t\in\{1,\ldots,T\}}\in\mathbb{R}^{T\times H}$ and $\boldsymbol{y}=\begin{bmatrix}b(z_{t})-\bar{b}\end{bmatrix}_{t\in\{1,\ldots,T\}}\in\mathbb{R}^{T}$. When expanded, we have $C(\boldsymbol{\theta})=\boldsymbol{y}^{\top}\boldsymbol{y}+2\boldsymbol{y}^{\top}\boldsymbol{X}\boldsymbol{\theta}+\boldsymbol{\theta}^{\top}\boldsymbol{X}^{\top}\boldsymbol{X}\boldsymbol{\theta}$. Since $\boldsymbol{X}^{\top}\boldsymbol{X}$ is positive semidefinite, $C(\boldsymbol{\theta})$ is convex. Moreover, the normal equations $\boldsymbol{X}^{\top}\boldsymbol{X}\boldsymbol{\theta}=-\boldsymbol{X}^{\top}\boldsymbol{y}$ always admit a solution, because $\boldsymbol{X}^{\top}\boldsymbol{y}$ lies in the column space of $\boldsymbol{X}^{\top}$, which coincides with that of $\boldsymbol{X}^{\top}\boldsymbol{X}$. Any such solution is a stationary point of the convex function $C$ and hence a global minimizer. Therefore, $C$ has a global minimum, and there exists a vector $\boldsymbol{\theta}^{\ast}\in\mathbb{R}^{H}$ that reaches it, as required. ∎
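The least-squares form also shows how the sample variance can be driven pathologically low when there are at least as many free heuristic parameters as trials. The sketch below is illustrative only: it uses synthetic `b` and `C`, hypothetical function names, and plain steepest descent with an exact line search on $C(\boldsymbol{\theta})=\|\boldsymbol{y}+\boldsymbol{X}\boldsymbol{\theta}\|_{2}^{2}$ rather than any solver used in the paper.

```python
import random


def centered(rows):
    """Subtract the column-wise mean from every row."""
    T = len(rows)
    means = [sum(col) / T for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def sample_variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def minimize_sample_variance(b, C, steps=2000):
    """Steepest descent with exact line search on ||y + X theta||^2."""
    X = centered(C)
    b_bar = sum(b) / len(b)
    y = [bt - b_bar for bt in b]
    theta = [0.0] * len(C[0])
    for _ in range(steps):
        r = [yt + sum(x * th for x, th in zip(row, theta)) for row, yt in zip(X, y)]
        g = [2 * sum(row[h] * rt for row, rt in zip(X, r)) for h in range(len(theta))]
        Xg = [sum(x * gh for x, gh in zip(row, g)) for row in X]
        denom = 2 * sum(v * v for v in Xg)
        if denom == 0.0:
            break  # gradient vanished: a minimizer has been reached
        alpha = sum(gh * gh for gh in g) / denom  # exact minimizer along -g
        theta = [th - alpha * gh for th, gh in zip(theta, g)]
    return theta


rng = random.Random(0)
T, H = 6, 12  # hypothetical sizes with more parameters than trials
b = [rng.gauss(0.0, 1.0) for _ in range(T)]
C = [[rng.gauss(0.0, 1.0) for _ in range(H)] for _ in range(T)]

theta = minimize_sample_variance(b, C)
before = sample_variance(b)
after = sample_variance([bt + sum(c * th for c, th in zip(ct, theta)) for bt, ct in zip(b, C)])
print(before, after)
```

Because the parameters are fitted to the very data being evaluated, the resulting sample variance says little about the true uncertainty of the estimate.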

### D.2 Proof of Proposition [2](https://arxiv.org/html/2605.14261#Thmproposition2)

###### Proof\.

Let $\hat{v}(z_{1}),\ldots,\hat{v}(z_{T})$ be independent unbiased AIVAT estimates with variances $\operatorname{Var}(\hat{v}(z_{1})),\ldots,\operatorname{Var}(\hat{v}(z_{T}))$. The weighted average $\bar{v}^{\ast}=\sum_{t=1}^{T}w_{t}^{\ast}\hat{v}(z_{t})$, with normalized weights $w_{t}^{\ast}=w_{t}/\sum_{s=1}^{T}w_{s}$, has variance $\operatorname{Var}(\bar{v}^{\ast})=\sum_{t=1}^{T}(w_{t}^{\ast})^{2}\operatorname{Var}(\hat{v}(z_{t}))$. It is a classical result that this variance is minimized, subject to the normalized weights summing to one, by setting each weight $w_{t}$ to be proportional to $\frac{1}{\operatorname{Var}(\hat{v}(z_{t}))}$; cf. Hartung et al. \[[8](https://arxiv.org/html/2605.14261#bib.bib8), Ch. 4\]. ∎
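For completeness, the classical result can be recovered with a Lagrange multiplier. The shorthand $\nu_{t}=\operatorname{Var}(\hat{v}(z_{t}))$ and the multiplier $\lambda$ are notation introduced only for this sketch; the argument is the standard textbook one and is not taken verbatim from the cited reference.

```latex
\begin{aligned}
\mathcal{L}(w^{\ast},\lambda) &= \sum_{t=1}^{T}(w_{t}^{\ast})^{2}\,\nu_{t}
  - \lambda\left(\sum_{t=1}^{T}w_{t}^{\ast}-1\right), \\
\frac{\partial\mathcal{L}}{\partial w_{t}^{\ast}} &= 2w_{t}^{\ast}\nu_{t}-\lambda = 0
  \;\Longrightarrow\; w_{t}^{\ast}=\frac{\lambda}{2\nu_{t}}, \\
\sum_{t=1}^{T}w_{t}^{\ast}=1 &\;\Longrightarrow\;
  \frac{\lambda}{2}=\frac{1}{\sum_{s=1}^{T}1/\nu_{s}}, \\
w_{t}^{\ast} &= \frac{1/\nu_{t}}{\sum_{s=1}^{T}1/\nu_{s}}, \qquad
\operatorname{Var}(\bar{v}^{\ast})=\sum_{t=1}^{T}(w_{t}^{\ast})^{2}\nu_{t}
  =\frac{1}{\sum_{s=1}^{T}1/\nu_{s}}.
\end{aligned}
```

Since the objective is a convex quadratic in the weights and the constraint is linear, this stationary point is the global minimizer, and the resulting variance matches the expression for the IVW average in Section 4.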

### D.3 Proof of Proposition [3](https://arxiv.org/html/2605.14261#Thmproposition3)

###### Proof\.

We have that

𝔼⁡\[v¯∗\|σ\]\\displaystyle\\operatorname\{\\mathbb\{E\}\}\[\\bar\{v\}^\{\*\}\|\\sigma\]=𝔼⁡\[∑t=1Twt∗​v^​\(zt\)\|σ\]\\displaystyle=\\operatorname\{\\mathbb\{E\}\}\\left\[\\sum\_\{t=1\}^\{T\}w\_\{t\}^\{\*\}\\hat\{v\}\(z\_\{t\}\)\\middle\|\\sigma\\right\]=∑t=1T𝔼⁡\[wt∗​v^​\(zt\)\|σ\]\\displaystyle=\\sum\_\{t=1\}^\{T\}\\operatorname\{\\mathbb\{E\}\}\[w\_\{t\}^\{\*\}\\hat\{v\}\(z\_\{t\}\)\|\\sigma\]=∑t=1T\(𝔼⁡\[wt∗\|σ\]​𝔼⁡\[v^​\(zt\)\|σ\]\+Cov⁡\(wt∗,v^​\(zt\)\|σ\)\)\\displaystyle=\\sum\_\{t=1\}^\{T\}\\left\(\\operatorname\{\\mathbb\{E\}\}\[w\_\{t\}^\{\*\}\|\\sigma\]\\operatorname\{\\mathbb\{E\}\}\[\\hat\{v\}\(z\_\{t\}\)\|\\sigma\]\+\\operatorname\{Cov\}\(w\_\{t\}^\{\*\},\\hat\{v\}\(z\_\{t\}\)\|\\sigma\)\\right\)=∑t=1T𝔼⁡\[wt∗\|σ\]​𝔼⁡\[v^​\(zt\)\|σ\]\+∑t=1TCov⁡\(wt∗,v^​\(zt\)\|σ\)\\displaystyle=\\sum\_\{t=1\}^\{T\}\\operatorname\{\\mathbb\{E\}\}\[w\_\{t\}^\{\*\}\|\\sigma\]\\operatorname\{\\mathbb\{E\}\}\[\\hat\{v\}\(z\_\{t\}\)\|\\sigma\]\+\\sum\_\{t=1\}^\{T\}\\operatorname\{Cov\}\(w\_\{t\}^\{\*\},\\hat\{v\}\(z\_\{t\}\)\|\\sigma\)=𝔼z∈Z⁡\[v​\(z\)\|σ\]​∑t=1T𝔼⁡\[wt∗\|σ\]\+∑t=1TCov⁡\(wt∗,v^​\(zt\)\|σ\)\\displaystyle=\\operatorname\{\\mathbb\{E\}\}\_\{z\\in Z\}\[v\(z\)\|\\sigma\]\\sum\_\{t=1\}^\{T\}\\operatorname\{\\mathbb\{E\}\}\[w\_\{t\}^\{\*\}\|\\sigma\]\+\\sum\_\{t=1\}^\{T\}\\operatorname\{Cov\}\(w\_\{t\}^\{\*\},\\hat\{v\}\(z\_\{t\}\)\|\\sigma\)=𝔼z∈Z⁡\[v​\(z\)\|σ\]​𝔼⁡\[∑t=1Twt∗\|σ\]\+∑t=1TCov⁡\(wt∗,v^​\(zt\)\|σ\)\.\\displaystyle=\\operatorname\{\\mathbb\{E\}\}\_\{z\\in Z\}\[v\(z\)\|\\sigma\]\\operatorname\{\\mathbb\{E\}\}\\left\[\\sum\_\{t=1\}^\{T\}w\_\{t\}^\{\*\}\\middle\|\\sigma\\right\]\+\\sum\_\{t=1\}^\{T\}\\operatorname\{Cov\}\(w\_\{t\}^\{\*\},\\hat\{v\}\(z\_\{t\}\)\|\\sigma\)\.We assume normalized weights, so𝔼⁡\[v¯∗\|σ\]=𝔼z∈Z⁡\[v​\(z\)\|σ\]\+∑t=1TCov⁡\(wt∗,v^​\(zt\)\|σ\)\\operatorname\{\\mathbb\{E\}\}\[\\bar\{v\}^\{\*\}\|\\sigma\]=\\operatorname\{\\mathbb\{E\}\}\_\{z\\in 
Z\}\[v\(z\)\|\\sigma\]\+\\sum\_\{t=1\}^\{T\}\\operatorname\{Cov\}\(w\_\{t\}^\{\*\},\\hat\{v\}\(z\_\{t\}\)\|\\sigma\)\. If∑t=1TCov⁡\(wt∗,v^​\(zt\)\|σ\)=0\\sum\_\{t=1\}^\{T\}\\operatorname\{Cov\}\(w\_\{t\}^\{\*\},\\hat\{v\}\(z\_\{t\}\)\|\\sigma\)=0, then𝔼⁡\[v¯∗\|σ\]=𝔼z∈Z⁡\[v​\(z\)\|σ\]\\operatorname\{\\mathbb\{E\}\}\[\\bar\{v\}^\{\*\}\|\\sigma\]=\\operatorname\{\\mathbb\{E\}\}\_\{z\\in Z\}\[v\(z\)\|\\sigma\], as required\. And if∑t=1TCov⁡\(wt∗,v^​\(zt\)\|σ\)≠0\\sum\_\{t=1\}^\{T\}\\operatorname\{Cov\}\(w\_\{t\}^\{\*\},\\hat\{v\}\(z\_\{t\}\)\|\\sigma\)\\neq 0, then𝔼⁡\[v¯∗\|σ\]≠𝔼z∈Z⁡\[v​\(z\)\|σ\]\\operatorname\{\\mathbb\{E\}\}\[\\bar\{v\}^\{\*\}\|\\sigma\]\\neq\\operatorname\{\\mathbb\{E\}\}\_\{z\\in Z\}\[v\(z\)\|\\sigma\], as required\. LetS=∑t=1TwtS=\\sum\_\{t=1\}^\{T\}w\_\{t\}and letwwandv​\(z\)v\(z\)be random variables forwtw\_\{t\}andv​\(zt\)v\(z\_\{t\}\)\. Since we assume the data points are i\.i\.d\., we have that𝔼⁡\[wt\]=𝔼⁡\[w\]\\operatorname\{\\mathbb\{E\}\}\[w\_\{t\}\]=\\operatorname\{\\mathbb\{E\}\}\[w\],𝔼⁡\[v​\(zt\)\]=𝔼⁡\[v​\(z\)\]\\operatorname\{\\mathbb\{E\}\}\[v\(z\_\{t\}\)\]=\\operatorname\{\\mathbb\{E\}\}\[v\(z\)\], and𝔼⁡\[S\]=T​𝔼⁡\[w\]\\operatorname\{\\mathbb\{E\}\}\[S\]=T\\operatorname\{\\mathbb\{E\}\}\[w\]\. Then, we have that

$$
\begin{aligned}
\operatorname{bias}
&= \sum_{t=1}^{T} \operatorname{Cov}(w_{t}^{*}, \hat{v}(z_{t}) \mid \sigma) \\
&= \sum_{t=1}^{T} \operatorname{Cov}\left(\frac{w_{t}}{S}, \hat{v}(z_{t}) \,\middle|\, \sigma\right) \\
&\approx \sum_{t=1}^{T} \left( \frac{1}{\mathbb{E}[S]} \operatorname{Cov}(w_{t}, \hat{v}(z_{t}) \mid \sigma) - \frac{\mathbb{E}[w_{t}]}{\mathbb{E}[S]^{2}} \operatorname{Cov}(S, \hat{v}(z_{t}) \mid \sigma) \right) \\
&= \frac{T}{\mathbb{E}[S]} \operatorname{Cov}(w, \hat{v}(z) \mid \sigma) - \frac{T\,\mathbb{E}[w]}{\mathbb{E}[S]^{2}} \operatorname{Cov}(S, \hat{v}(z) \mid \sigma) \\
&= \frac{T}{T\,\mathbb{E}[w]} \operatorname{Cov}(w, \hat{v}(z) \mid \sigma) - \frac{T\,\mathbb{E}[w]}{(T\,\mathbb{E}[w])^{2}} \operatorname{Cov}(S, \hat{v}(z) \mid \sigma) \\
&= \frac{1}{\mathbb{E}[w]} \operatorname{Cov}(w, \hat{v}(z) \mid \sigma) - \frac{1}{T\,\mathbb{E}[w]} \operatorname{Cov}(S, \hat{v}(z) \mid \sigma) \\
&\approx \frac{\operatorname{Cov}(w, \hat{v}(z) \mid \sigma)}{\mathbb{E}[w]},
\end{aligned}
$$

as required. ∎
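As a numerical sanity check on the final approximation, the following sketch simulates a self-normalized weighted mean on synthetic data. The distributions, the constants `T` and `R`, and the function name `simulate_bias` are illustrative assumptions and are not taken from the experiments; the weights are deliberately made to correlate with the values so that the bias is visible.

```python
import numpy as np

def simulate_bias(T=100, R=20000, seed=0):
    """Compare the empirical bias of a self-normalized weighted mean with
    the approximation Cov(w, v) / E[w] on synthetic (hypothetical) data."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((R, T))   # R independent datasets of T points
    v = x                             # stand-in for v_hat(z_t); its true mean is 0
    w = np.exp(0.5 * x)               # unnormalized weights, correlated with v
    # Normalized weights w_t* = w_t / S, where S is the per-dataset weight sum.
    estimates = (w * v).sum(axis=1) / w.sum(axis=1)
    empirical_bias = estimates.mean() - 0.0   # true E[v] is 0 by construction
    predicted_bias = np.cov(w.ravel(), v.ravel())[0, 1] / w.mean()
    return empirical_bias, predicted_bias

empirical, predicted = simulate_bias()
print(f"empirical bias: {empirical:.3f}, Cov(w, v)/E[w]: {predicted:.3f}")
```

For this particular construction, $\operatorname{Cov}(w, v)/\mathbb{E}[w] = 0.5$ in closed form, so both printed numbers should land near 0.5, with the remaining gap shrinking as `T` grows.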

## Appendix E Rules of Texas hold’em

In Texas hold’em, the game begins with every player being dealt two private cards\. This is followed by a preflop betting round, where players take turns betting, matching \(i\.e\., checking or calling\), or giving up \(i\.e\., folding\) based on their hidden cards and beliefs about what cards others hold\. After the betting round, the dealer reveals three public cards as part of the flop, which is then followed by another betting round\. After that, a single card is revealed as the turn card, and another betting round ensues\. Finally, the river card is revealed publicly, and the final betting round takes place, after which the game terminates\. The game can also end when only one player remains in the pot \(as others have folded\)\. In no\-limit settings, players are allowed to bet or raise whatever amount they want \(capped by their number of chips\) however many times they want, whereas this amount and the maximum number of bets are determined by the rules of the game in fixed\-limit settings\.

## Appendix F Feature engineering of states in 6-max no-limit Texas hold’em

For the feature engineering of states in 6-max no-limit Texas hold’em, we took an approach similar to that of White and Bowling \[[13](https://arxiv.org/html/2605.14261#bib.bib5)\] in their evaluation of 6-player limit Texas hold’em data. Specifically, we extracted the pot amount, and each player’s hand strength and hand strength squared values. The hand strength is the expected probability of beating another random hand, while the hand strength squared is the expected squared probability of beating a random hand. All hand strength (squared) values were also multiplied by the pot amount and exponentiated by the number of non-folded players.

## Appendix G Standard error of the (weighted) mean

In this section, we provide the definition of the standard error of the (weighted) mean. When unweighted, the standard definition in the field of statistics is

$$
SE = \frac{s}{\sqrt{N}},
$$

where $s$ is the sample standard deviation and $N$ is the sample size. Note that

$$
s^{2} = \frac{\sum_{i=1}^{N} (x_{i} - \bar{x})^{2}}{N - 1},
$$

where $x_{1}, \dots, x_{N}$ are the sample values and $\bar{x}$ is the sample mean. Under inverse-variance weighting, Kirchner \[[9](https://arxiv.org/html/2605.14261#bib.bib13)\] gives the following definition:

$$
SE^{*} = \frac{s^{*}}{\sqrt{N}},
$$

where $s^{*}$ is the weighted sample standard deviation and $w_{1}, \dots, w_{N}$ are the weights corresponding to the sample values $x_{1}, \dots, x_{N}$, respectively. Here,

$$
(s^{*})^{2} = \left( \frac{\sum_{i=1}^{N} w_{i} (x_{i} - \bar{x}^{*})^{2}}{\sum_{i=1}^{N} w_{i}} \right) \left( \frac{N}{N - 1} \right),
$$

where $\bar{x}^{*} = \frac{\sum_{i=1}^{N} w_{i} x_{i}}{\sum_{i=1}^{N} w_{i}}$ is the weighted sample mean.
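The two definitions above translate directly into code. The sketch below is a minimal NumPy version; the function names and the toy inputs are assumptions made for illustration. With uniform weights, the weighted definition reduces to the unweighted one, which gives a simple consistency check.

```python
import numpy as np

def standard_error(x):
    """Unweighted standard error: s / sqrt(N), with s using the N - 1 divisor."""
    x = np.asarray(x, dtype=float)
    n = x.size
    s = np.sqrt(np.sum((x - x.mean()) ** 2) / (n - 1))
    return s / np.sqrt(n)

def weighted_standard_error(x, w):
    """Weighted standard error: s* / sqrt(N), following the definition above."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    n = x.size
    x_bar = np.sum(w * x) / np.sum(w)                     # weighted sample mean
    s2 = (np.sum(w * (x - x_bar) ** 2) / np.sum(w)) * (n / (n - 1))
    return np.sqrt(s2) / np.sqrt(n)

x = [1.0, 2.0, 3.0, 4.0]
se_uniform = standard_error(x)
se_weighted_uniform = weighted_standard_error(x, [1.0, 1.0, 1.0, 1.0])
# Hypothetical per-sample variances; the weights are their inverses.
se_ivw = weighted_standard_error(x, 1.0 / np.array([0.5, 1.0, 2.0, 4.0]))
print(se_uniform, se_weighted_uniform, se_ivw)
```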

## Appendix H Additional experimental results on heuristic uncertainty

In learning the heuristic value function for our MIVAT estimator, we chose two more models that allow the calculation of output variance: Bayesian ridge (BR) \[[12](https://arxiv.org/html/2605.14261#bib.bib15)\] and automatic relevance determination (ARD) regression \[[10](https://arxiv.org/html/2605.14261#bib.bib16)\]. We refer to the two resulting estimators as MIVAT-BR and MIVAT-ARDR, respectively. Due to the inherent limitation of the models, we assumed for the two estimators that the heuristic value outputs are uncorrelated, i.e., \([6](https://arxiv.org/html/2605.14261#S4.E6)\) simplifies to \([7](https://arxiv.org/html/2605.14261#S4.E7)\). This limitation is why we relegated the use of these models to the appendix. Except for subsampling, the training procedure for both estimators was identical to that of MIVAT-GPR.

Table 4: Results on the Pluribus data in milli-big blinds per hand (mbb/h). $SE$ stands for the standard error of the (weighted) mean, and ‘Est.’ is an abbreviation for ‘Estimated’.

| Estimator | Weighting | Win rate | $SE$ | Est. bias |
|---|---|---|---|---|
| MIVAT-BR | Uniform | -80 | 83 | – |
| MIVAT-BR | IVW | 0 | 74 | 80 |
| MIVAT-ARDR | Uniform | -80 | 83 | – |
| MIVAT-ARDR | IVW | 0 | 74 | 80 |
| MIVAT-GPR | Uniform | -25 | 99 | – |
| MIVAT-GPR | IVW | -22 | 75 | 3 |
| Raw | Uniform | -70 | 88 | – |
| MIVAT-WB | Uniform | -98 | 85 | – |
| AIVAT | Uniform | 48 | 25 | – |

Table [4](https://arxiv.org/html/2605.14261#A8.T4) shows the performance of both MIVAT-BR and MIVAT-ARDR, in addition to MIVAT-GPR and our baselines. When taking the simple mean, both MIVAT-BR and MIVAT-ARDR outperform MIVAT-GPR, but this is understandable, as we used far less data to train GPR due to computational constraints. IVW allowed our MIVAT estimators to yield the best performance (save for AIVAT). The difference in the uncertainty achieved between uniform averaging and IVW averaging is smaller for MIVAT-BR and MIVAT-ARDR than for MIVAT-GPR. In general, for all our MIVAT estimators under IVW, the bias is approximately equal to the difference between the estimates under different weighting schemes. Indeed, offsetting the IVW estimates using the estimated bias recovers an estimator that is in line with using uniform averaging, which is unbiased. Also, the IVW win rates for MIVAT-BR and MIVAT-ARDR are not exactly zero; they appear as zero purely because of rounding. Before rounding, the values were approximately $-0.233$ for MIVAT-BR and $-0.217$ for MIVAT-ARDR. Note that the values for MIVAT-BR and MIVAT-ARDR are similar because both BR and ARD are Bayesian linear regression models.
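The relationships described in this paragraph can be checked with simple arithmetic on the rounded values of Table 4. The sketch below is illustrative only: the helper names are hypothetical, the inputs are the rounded table entries, and the sample-size calculation assumes the usual scaling of the standard error with $1/\sqrt{N}$, so its outputs are rough.

```python
# Rounded entries from Table 4 (mbb/h): (win rate, SE) per weighting scheme.
table = {
    "MIVAT-BR":   {"uniform": (-80, 83), "ivw": (0, 74),   "est_bias": 80},
    "MIVAT-ARDR": {"uniform": (-80, 83), "ivw": (0, 74),   "est_bias": 80},
    "MIVAT-GPR":  {"uniform": (-25, 99), "ivw": (-22, 75), "est_bias": 3},
}

def debias(ivw_win_rate, est_bias):
    """Offset an IVW estimate by its estimated bias."""
    return ivw_win_rate - est_bias

def sample_reduction(se_uniform, se_ivw):
    """Rough fraction of samples saved, assuming SE scales as 1 / sqrt(N)."""
    return 1.0 - (se_ivw / se_uniform) ** 2

for name, row in table.items():
    uniform_rate, uniform_se = row["uniform"]
    ivw_rate, ivw_se = row["ivw"]
    offset = debias(ivw_rate, row["est_bias"])
    saved = sample_reduction(uniform_se, ivw_se)
    print(f"{name}: uniform={uniform_rate}, debiased IVW={offset}, "
          f"approx. samples saved={saved:.1%}")
```

On these rounded inputs, the debiased IVW win rates coincide with the uniform win rates, matching the observation that the estimated bias is approximately the gap between the two weighting schemes.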

## Appendix I Linear value function

Define a feature engineering function $\boldsymbol{\phi} : H \mapsto \mathbb{R}^{d}$ that maps a state $h$ to a respective vector of features. When linear, we require the value function to be of the following form:

$$
v_{\boldsymbol{\theta}}^{\prime}(h) = \boldsymbol{\phi}(h)^{\top} \boldsymbol{\theta}, \tag{12}
$$

where $\boldsymbol{\theta} \in \mathbb{R}^{d}$ is the parameter vector. Then,

$$
\begin{aligned}
\hat{v}_{\boldsymbol{\theta}}(z)
&= b(z) + \sum_{h \in H} \mathbf{c}(z)_{h}\, v_{\boldsymbol{\theta}}^{\prime}(h)
 = b(z) + \sum_{h \in H} \mathbf{c}(z)_{h}\, \boldsymbol{\phi}(h)^{\top} \boldsymbol{\theta} \\
&= b(z) + \left( \sum_{h \in H} \mathbf{c}(z)_{h}\, \boldsymbol{\phi}(h) \right)^{\top} \boldsymbol{\theta}
 = b(z) + \boldsymbol{\psi}(z)^{\top} \boldsymbol{\theta},
\end{aligned}
$$

where $\boldsymbol{\psi}(z) = \sum_{h \in H} \mathbf{c}(z)_{h}\, \boldsymbol{\phi}(h)$.
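The rearrangement above says that the per-state features can be aggregated once into $\boldsymbol{\psi}(z)$, after which the estimate is a single inner product with $\boldsymbol{\theta}$. The sketch below checks this numerically on random stand-in data; the array shapes, the function names, and the use of a small enumerated state set in place of $H$ are assumptions made for illustration.

```python
import numpy as np

def psi(c_z, phi):
    """Aggregate features: psi(z) = sum_h c(z)_h * phi(h).

    c_z has shape (num_states,); phi has shape (num_states, d)."""
    return c_z @ phi

def v_hat(b_z, c_z, phi, theta):
    """Linear estimate v_hat_theta(z) = b(z) + psi(z)^T theta."""
    return b_z + psi(c_z, phi) @ theta

rng = np.random.default_rng(0)
num_states, d = 5, 3
phi = rng.normal(size=(num_states, d))   # hypothetical feature vectors phi(h)
c_z = rng.normal(size=num_states)        # hypothetical coefficients c(z)_h
theta = rng.normal(size=d)
b_z = 1.5                                # hypothetical base term b(z)

# Direct evaluation, state by state, using v'_theta(h) = phi(h)^T theta.
per_state = b_z + sum(c_z[h] * (phi[h] @ theta) for h in range(num_states))
aggregated = v_hat(b_z, c_z, phi, theta)
print(per_state, aggregated)
```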

### I.1 Closed-form solution

For MIVAT, White and Bowling \[[13](https://arxiv.org/html/2605.14261#bib.bib5)\] derived a closed-form formula for the optimal linear function parameter $\boldsymbol{\theta}^{*}$ for the optimization problem shown in \([2](https://arxiv.org/html/2605.14261#S2.E2)\). The steps they provide can be closely followed to yield an analogous solution for AIVAT. With $\bar{b} = \frac{1}{T} \sum_{t=1}^{T} b(z_{t})$ and $\overline{\boldsymbol{\psi}} = \frac{1}{T} \sum_{t=1}^{T} \boldsymbol{\psi}(z_{t})$, the parameter of the linear function that optimizes \([2](https://arxiv.org/html/2605.14261#S2.E2)\) is as follows:¹,²

$$
\boldsymbol{\theta}^{*} = \left( \left( \frac{1}{T} \sum_{t=1}^{T} \boldsymbol{\psi}(z_{t})\, \boldsymbol{\psi}(z_{t})^{\top} \right) - \overline{\boldsymbol{\psi}}\, \overline{\boldsymbol{\psi}}^{\top} \right)^{-1} \left( \bar{b}\, \overline{\boldsymbol{\psi}} - \left( \frac{1}{T} \sum_{t=1}^{T} b(z_{t})\, \boldsymbol{\psi}(z_{t}) \right) \right). \tag{13}
$$

¹ A careful reader may notice that the latter factor in \([13](https://arxiv.org/html/2605.14261#A9.E13)\) is negated in the equivalent formulation of White and Bowling \[[13](https://arxiv.org/html/2605.14261#bib.bib5)\]. This is because we hide the negative sign inside $\mathbf{c}(z)_{h^{\prime} \cdot a}$; see \([10](https://arxiv.org/html/2605.14261#A2.E10)\).

² The steps we took to obtain this are isolated in Appendix [J](https://arxiv.org/html/2605.14261#A10).

We can write \([13](https://arxiv.org/html/2605.14261#A9.E13)\) more compactly as

$$
\boldsymbol{\theta}^{*} = \mathbf{X}^{-1} \mathbf{y}, \tag{14}
$$

where $\mathbf{X} = \left( \left( \frac{1}{T} \sum_{t=1}^{T} \boldsymbol{\psi}(z_{t})\, \boldsymbol{\psi}(z_{t})^{\top} \right) - \overline{\boldsymbol{\psi}}\, \overline{\boldsymbol{\psi}}^{\top} \right)$ and $\mathbf{y} = \left( \bar{b}\, \overline{\boldsymbol{\psi}} - \left( \frac{1}{T} \sum_{t=1}^{T} b(z_{t})\, \boldsymbol{\psi}(z_{t}) \right) \right)$. Although left unstated by White and Bowling \[[13](https://arxiv.org/html/2605.14261#bib.bib5)\], for the above closed-form solution to exist, $\mathbf{X}$ must be invertible. We provide a sufficient condition for this to hold.

###### Theorem 1\.

If no hyperplane in $\mathbb{R}^{d}$ contains all of $\{\boldsymbol{\psi}(z_{t})\}_{t=1}^{T}$, then $\mathbf{X}$ is invertible.

###### Proof\.

We prove the contrapositive: if $\mathbf{X}$ is non-invertible, then there is a hyperplane in $\mathbb{R}^{d}$ containing all of $\{\boldsymbol{\psi}(z_{t})\}_{t=1}^{T}$.

If $\mathbf{X}$ is singular, then $\exists\, \mathbf{a} \in \mathbb{R}^{d}$ such that $\mathbf{a} \neq \mathbf{0}$ and $\mathbf{X}\mathbf{a} = \mathbf{0}$. The latter implies $\mathbf{a}^{\top} \mathbf{X} \mathbf{a} = 0$. Using

$$
\mathbf{X} = \left( \left( \frac{1}{T} \sum_{t=1}^{T} \boldsymbol{\psi}(z_{t})\, \boldsymbol{\psi}(z_{t})^{\top} \right) - \overline{\boldsymbol{\psi}}\, \overline{\boldsymbol{\psi}}^{\top} \right) = \frac{1}{T} \left( \sum_{t=1}^{T} (\boldsymbol{\psi}(z_{t}) - \overline{\boldsymbol{\psi}}) (\boldsymbol{\psi}(z_{t}) - \overline{\boldsymbol{\psi}})^{\top} \right),
$$

we obtain

$$
0 = \mathbf{a}^{\top} \mathbf{X} \mathbf{a} = \mathbf{a}^{\top} \frac{1}{T} \left( \sum_{t=1}^{T} (\boldsymbol{\psi}(z_{t}) - \overline{\boldsymbol{\psi}}) (\boldsymbol{\psi}(z_{t}) - \overline{\boldsymbol{\psi}})^{\top} \right) \mathbf{a} = \frac{1}{T} \sum_{t=1}^{T} \left( (\boldsymbol{\psi}(z_{t}) - \overline{\boldsymbol{\psi}})^{\top} \mathbf{a} \right)^{2}.
$$

Since every summand is a nonnegative square, each must vanish: $\forall t \in \{1, \ldots, T\} : (\boldsymbol{\psi}(z_{t}) - \overline{\boldsymbol{\psi}})^{\top} \mathbf{a} = 0$, and hence $\boldsymbol{\psi}(z_{t})^{\top} \mathbf{a} = \overline{\boldsymbol{\psi}}^{\top} \mathbf{a}$. Note that $\overline{\boldsymbol{\psi}}^{\top} \mathbf{a} = c$ is a constant and, for $\mathbf{x} \in \mathbb{R}^{d}$, $\mathbf{x}^{\top} \mathbf{a} = c$ is an equation of a hyperplane, which therefore contains every $\boldsymbol{\psi}(z_{t})$. ∎

The condition in Theorem 1 prohibits one from applying common feature engineering techniques such as using separate linear coefficients for different stages in the game (e.g., in poker: preflop, flop, turn, and river). On top of this, the closed-form solution lacks any regularization, which makes it prone to overfitting. It is also unclear how one can estimate the variance of its outputs.
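The role of the hyperplane condition can be illustrated numerically. The sketch below builds $\mathbf{X}$ from two synthetic feature sets, one generic and one whose rows are forced onto a common hyperplane, and compares their ranks. The data, the helper name `build_X`, and the rank tolerance are assumptions made for illustration; the second case shows the converse direction, which holds because any normal vector of a hyperplane containing all points lies in the null space of $\mathbf{X}$.

```python
import numpy as np

def build_X(psi):
    """X = (1/T) sum_t (psi_t - psi_bar)(psi_t - psi_bar)^T for psi of shape (T, d).

    This centered form equals (1/T) sum_t psi_t psi_t^T - psi_bar psi_bar^T."""
    centered = psi - psi.mean(axis=0)
    return centered.T @ centered / psi.shape[0]

rng = np.random.default_rng(0)
T = 200
generic = rng.normal(size=(T, 3))        # generic points: no common hyperplane

# Force every row onto the hyperplane x_1 + x_2 + x_3 = 1.
on_plane = generic.copy()
on_plane[:, 2] = 1.0 - on_plane[:, 0] - on_plane[:, 1]

rank_generic = np.linalg.matrix_rank(build_X(generic), tol=1e-8)
rank_on_plane = np.linalg.matrix_rank(build_X(on_plane), tol=1e-8)
print(rank_generic, rank_on_plane)       # full rank vs. rank-deficient
```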

## Appendix J Derivation of the closed-form solution

This appendix contains the steps to obtain the optimal linear value function parameter for the optimization problem of White and Bowling \[[13](https://arxiv.org/html/2605.14261#bib.bib5)\] stated in \([2](https://arxiv.org/html/2605.14261#S2.E2)\). Note that this is for the case of the linear heuristic value function shown in \([12](https://arxiv.org/html/2605.14261#A9.E12)\), not the parameterized version in \([4](https://arxiv.org/html/2605.14261#S3.E4)\). Restating the optimization problem,

$$
\begin{aligned}
\underset{\boldsymbol{\theta} \in \mathbb{R}^{d}}{\text{Minimize: }}\; C(\boldsymbol{\theta})
&= \sum_{t=1}^{T} \left( \hat{v}_{\boldsymbol{\theta}}(z_{t}) - \frac{1}{T} \sum_{t^{\prime}=1}^{T} \hat{v}_{\boldsymbol{\theta}}(z_{t^{\prime}}) \right)^{2} \\
&= \sum_{t=1}^{T} \left( \left( b(z_{t}) + \boldsymbol{\psi}(z_{t})^{\top} \boldsymbol{\theta} \right) - \frac{1}{T} \sum_{t^{\prime}=1}^{T} \left( b(z_{t^{\prime}}) + \boldsymbol{\psi}(z_{t^{\prime}})^{\top} \boldsymbol{\theta} \right) \right)^{2} \\
&= \sum_{t=1}^{T} \left( \left( b(z_{t}) - \frac{1}{T} \sum_{t^{\prime}=1}^{T} b(z_{t^{\prime}}) \right) + \left( \boldsymbol{\psi}(z_{t}) - \frac{1}{T} \sum_{t^{\prime}=1}^{T} \boldsymbol{\psi}(z_{t^{\prime}}) \right)^{\top} \boldsymbol{\theta} \right)^{2} \\
&= \sum_{t=1}^{T} \left( (b(z_{t}) - \bar{b}) + \left( \boldsymbol{\psi}(z_{t}) - \overline{\boldsymbol{\psi}} \right)^{\top} \boldsymbol{\theta} \right)^{2}.
\end{aligned}
$$

They then took the derivative of $C$ with respect to $\boldsymbol{\theta}$ and set it to zero:

$$
0 = \left. \frac{\partial C}{\partial \boldsymbol{\theta}} \right|_{\boldsymbol{\theta} = \boldsymbol{\theta}^{*}} = \sum_{t=1}^{T} 2 \left( \boldsymbol{\psi}(z_{t}) - \overline{\boldsymbol{\psi}} \right) \left( (b(z_{t}) - \bar{b}) + \left( \boldsymbol{\psi}(z_{t}) - \overline{\boldsymbol{\psi}} \right)^{\top} \boldsymbol{\theta}^{*} \right).
$$

Simplifying and solving for $\boldsymbol{\theta}^{*}$, the following closed-form solution was obtained:

$$
\boldsymbol{\theta}^{*} = \left( \left( \frac{1}{T} \sum_{t=1}^{T} \boldsymbol{\psi}(z_{t})\, \boldsymbol{\psi}(z_{t})^{\top} \right) - \overline{\boldsymbol{\psi}}\, \overline{\boldsymbol{\psi}}^{\top} \right)^{-1} \left( \bar{b}\, \overline{\boldsymbol{\psi}} - \left( \frac{1}{T} \sum_{t=1}^{T} b(z_{t})\, \boldsymbol{\psi}(z_{t}) \right) \right).
$$
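The closed-form solution above can be checked numerically against the cost $C(\boldsymbol{\theta})$ it is meant to minimize. The following is a minimal sketch on synthetic data, assuming $\mathbf{X}$ is invertible; the function names, the data-generating process, and the use of `np.linalg.solve` in place of an explicit inverse are choices made for illustration.

```python
import numpy as np

def closed_form_theta(b, psi):
    """theta* = X^{-1} y for b of shape (T,) and psi of shape (T, d)."""
    T = psi.shape[0]
    b_bar = b.mean()
    psi_bar = psi.mean(axis=0)
    X = psi.T @ psi / T - np.outer(psi_bar, psi_bar)
    y = b_bar * psi_bar - (b[:, None] * psi).sum(axis=0) / T
    # Solve X theta = y rather than forming the inverse explicitly.
    return np.linalg.solve(X, y)

def cost(theta, b, psi):
    """C(theta): sum of squared deviations of v_hat from its sample mean."""
    v_hat = b + psi @ theta
    return np.sum((v_hat - v_hat.mean()) ** 2)

def cost_gradient(theta, b, psi):
    """Gradient of C, matching the derivative expression above."""
    centered_psi = psi - psi.mean(axis=0)
    residual = (b - b.mean()) + centered_psi @ theta
    return 2.0 * centered_psi.T @ residual

rng = np.random.default_rng(0)
T, d = 500, 4
psi = rng.normal(size=(T, d))                      # hypothetical psi(z_t)
theta_true = np.array([1.0, -2.0, 0.5, 0.0])
b = rng.normal(size=T) - psi @ theta_true          # hypothetical b(z_t)

theta_star = closed_form_theta(b, psi)
print(theta_star)
print(cost(theta_star, b, psi), cost(np.zeros(d), b, psi))
```

In this construction the cost is smallest when $\boldsymbol{\theta}$ cancels the linear part of `b`, so `theta_star` should land close to `theta_true` and the gradient at `theta_star` should be numerically zero.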

## Appendix K Testbench specification

Our testbench contains an AMD Ryzen 9 3900X 12-core, 24-thread desktop processor and 128 GB of memory.
