How Much Due Diligence Before You Bid? Learning in Intractable Takeover Auctions
Summary
This paper models takeover bidding as auction games with imperfect information and studies optimal due diligence strategies using game-theoretic solvers and reinforcement learning, finding that PPO and PPG are effective for large games, and providing a computable threshold for when additional diligence ceases to be beneficial.
View Cached Full Text
Cached at: 06/30/26, 05:33 AM
# 1 Introduction
Source: [https://arxiv.org/html/2606.29457](https://arxiv.org/html/2606.29457)
How Much Due Diligence Before You Bid? Learning in Intractable Takeover Auctions
Zain Naboulsi Principal AI Engineer, Sparq
zain\.naboulsi@teamsparq\.com
How much due diligence should a bidder buy before a takeover contest? Deal teams answer this by judgment; we turn it into a computable, defensible number, cheap enough to run on a commodity laptop\. A bidder acts on private*due diligence*: noisy, costly estimates of a target whose true value no one observes\. We study how to bid well under that imperfect information, and what happens when the game grows past the reach of exact solvers\. The two are linked by one lever: each unit of diligence is another private signal, and each signal multiplies the strategy space, so the same quantity that sets how much diligence is worth buying also governs when the game outruns exact solvers\. Our answer is that the optimal amount is finite and falls as the per\-signal cost rises, including in an equilibrium where both bidders choose diligence, where competition makes the smart amount more conservative still\. We cast takeover bidding as a small family of two\-player zero\-sum auction games from the takeover\-auction literature \(common\-value with a winner’s curse, common\-value with a toehold, and independent\-private\-value\), built on a reusable abstraction over OpenSpiel, and benchmark nine solvers spanning exact, tabular, and deep families\. Recent work argues that simple generic policy\-gradient methods rival specialized game\-solving machinery \(counterfactual regret minimization, fictitious play, double oracle\); we find that on commodity CPU with no frontier\-model spend, PPO and PPG are by far the strongest*learning*solvers, but only within that family: wherever the game is small enough to tabulate, the exact solvers \(CFR, MMD, PSRO\) are both lower in exploitability and faster\. A scaling study shows the learning methods’ flat per\-target cost becomes an advantage only beyond exact enumeration\. Solving the auction’s genuine own\-profit Bayes\-Nash equilibrium exactly, we then read off the value of a bidder’s diligence and the cost at which acquiring more of it stops paying\. Finally, on a multi\-signal auction too large to enumerate, where exact solvers cannot run, PPO and PPG drive a calibrated learned\-best\-response exploitability estimate to its resolution floor, below a naive unshaded bidder; we report it as a lower bound, not a Nash certificate\. We release the games, solvers, and experiments\.[https://github\.com/zainnab\-sparq/imperfect\-information\-deal\-games](https://github.com/zainnab-sparq/imperfect-information-deal-games)![[Uncaptioned image]](https://arxiv.org/html/2606.29457v1/figures/sparq-logo.png)Sparq
Keywords:imperfect\-information games, auctions, mergers and acquisitions, policy gradient, PPO, counterfactual regret minimization, exploitability, self\-play, OpenSpiel
A merger or acquisition is rarely a posted price\. It is a contested process in which an acquirer and a target, or two competing acquirers, hold private information and act strategically under uncertainty about each other and about the asset\. Economists model these as games of incomplete information\(Harsanyi,[1967–1968](https://arxiv.org/html/2606.29457#bib.bib13)\): takeover contests as auctions with private or common values, negotiations as bargaining under asymmetric information\. The computational game\-solving community has produced powerful algorithms for large imperfect\-information games, but has concentrated on recreational benchmarks rather than economic ones\.
The question in our title is a real one for a deal team, and it carries a hidden computational edge\. Each unit of due diligence a bidder performs is one more private signal it brings to the auction, and each signal multiplies the number of distinct situations it must plan for, so the strategy space grows exponentially in the amount of diligence\. The economic question, how much diligence is worth buying, and the computational one, when the game stops being exactly solvable, are therefore governed by the same lever: the number of signals a bidder carries\. They are not the same number, though; we find the economically optimal amount of diligence is modest, while the game outruns exact solvers only at much larger signal counts, so the two questions meet on one axis but at different points along it\. Deal teams answer the diligence question by judgment\. This paper answers the economic side directly, solving the auction’s own\-profit equilibrium to find that the optimal amount is finite and falls with the per\-signal cost, which turns that judgment call into a defensible, model\-backed number and points to where a firm that keeps researching is spending past the point it pays\. It does so cheaply, on a commodity laptop with no frontier\-model spend, and uses the computational side to stress\-test cheap generic solvers right at the boundary where exact methods give out\.
Two facts motivate this paper\. First, the standard specialized solvers \(counterfactual regret minimization, fictitious play, double oracle, and their deep descendants\) are expensive and delicate to deploy\. Second, recent work\(Rudolph et al\.,[2026](https://arxiv.org/html/2606.29457#bib.bib27)\)argues that in two\-player zero\-sum imperfect\-information games,*simple generic policy\-gradient methods*such as PPO are competitive with or superior to that machinery\. If so, modeling deal\-making under imperfect information need not require frontier models or elaborate solvers; a cheap self\-play loop on commodity hardware may suffice\.
We test this in the deal\-making setting\. Our contributions are: \(1\) a small family of zero\-sum deal games grounded in the takeover\-auction literature, sharing a reusableDealGameabstraction; \(2\) a like\-for\-like benchmark of nine solvers under exact exploitability, on iterations*and*wall\-clock, spanning the exact \(CFR, MMD, PSRO\), tabular\-learning \(REINFORCE\), and deep\-learning \(PPO, PPG, a generic deep policy gradient, Deep CFR, NFSP\) families; \(3\) evidence that the generic policy\-gradient methods PPO and PPG are the strongest learning solvers, dominating deep CFR and deep fictitious play at equal episode budgets, while the exact solvers remain best wherever the game is small enough to tabulate; \(4\) a scaling study isolating why the learning methods are nonetheless of interest \(their per\-target wall\-clock is roughly constant in game size where exact CFR’s grows steeply\), with the honest caveat that within enumerable sizes PPO never overtakes CFR and fails to reach the target on the largest; together with a first step into the genuinely intractable regime, a multi\-signal auction too large to enumerate, where PPO and PPG drive a*calibrated*learned\-best\-response exploitability estimate to its resolution floor, below a naive unshaded bidder, in a regime exact solvers cannot enter; and \(5\) economic results read off the auction’s genuine own\-profit Bayes\-Nash equilibrium, which we solve exactly rather than through the zero\-sum rendering: the value of a bidder’s diligence is positive on net, and, charging a per\-signal cost, the profit\-maximizing amount of due diligence is finite and falls as cost rises across every parameterization we check, including when both bidders choose diligence in a symmetric acquisition equilibrium \(which competition makes more conservative than a one\-sided calculation\), giving a model\-internal answer to the question in our title \(its fine schedule is game\-specific and sensitive to the bid\-grid resolution\); the toehold’s robust effect at the true equilibrium is on value rather than on equilibrium bidding\. Everything runs on a commodity laptop CPU, with no frontier\-model API calls: individual experiments take minutes to tens of minutes, the full benchmark suite a few hours, and the economic equilibria solve in seconds\.
Relative toRudolph et al\. \([2026](https://arxiv.org/html/2606.29457#bib.bib27)\), who establish the policy\-gradient thesis on recreational benchmarks at tabulatable scale, our contribution is to port it to economic deal games, to tie the exact\-versus\-learning crossover to tree\-enumerability through a wall\-clock scaling law, and to take a first step past enumeration with a learned\-best\-response lower bound that we calibrate against exact exploitability\.
## 2Related work
### Solving imperfect\-information games\.
Counterfactual regret minimization\(Zinkevich et al\.,[2007](https://arxiv.org/html/2606.29457#bib.bib33)\)is the tabular standard; Monte Carlo CFR\(Lanctot et al\.,[2009](https://arxiv.org/html/2606.29457#bib.bib18)\)samples it, predictive \(optimistic\) variants accelerate it by extrapolating the next gradient\(Farina et al\.,[2021](https://arxiv.org/html/2606.29457#bib.bib11)\), and Deep CFR\(Brown et al\.,[2019](https://arxiv.org/html/2606.29457#bib.bib4)\)scales it with neural advantage and strategy networks\. ESCHER\(McAleer et al\.,[2023](https://arxiv.org/html/2606.29457#bib.bib22)\)and DREAM\(Steinberger et al\.,[2020](https://arxiv.org/html/2606.29457#bib.bib30)\)are later variance\-reduced deep\-CFR\-family methods\. Neural Fictitious Self\-Play \(NFSP\)\(Heinrich and Silver,[2016](https://arxiv.org/html/2606.29457#bib.bib14)\)and Policy\-Space Response Oracles \(PSRO\)\(Lanctot et al\.,[2017](https://arxiv.org/html/2606.29457#bib.bib19)\)bring fictitious play and double oracle to the deep setting; CFR\-family search underlies the superhuman poker agent Libratus\(Brown and Sandholm,[2018](https://arxiv.org/html/2606.29457#bib.bib3)\), and ReBeL\(Brown et al\.,[2020](https://arxiv.org/html/2606.29457#bib.bib5)\)combines RL with search\. Magnetic mirror descent \(MMD\)\(Sokota et al\.,[2023](https://arxiv.org/html/2606.29457#bib.bib29)\)is a regularized method competitive as both a tabular and a deep solver; the same regularization idea, scaled with deep RL as regularized Nash dynamics, masters Stratego model\-free and without tabulation\(Pérolat et al\.,[2022](https://arxiv.org/html/2606.29457#bib.bib25)\)\. PPO \(Proximal Policy Optimization\)\(Schulman et al\.,[2017](https://arxiv.org/html/2606.29457#bib.bib28)\)and its phasic variant PPG \(Phasic Policy Gradient\)\(Cobbe et al\.,[2021](https://arxiv.org/html/2606.29457#bib.bib9)\)are the generic policy\-gradient methods thatRudolph et al\. \([2026](https://arxiv.org/html/2606.29457#bib.bib27)\)argue are competitive with all of these; we evaluate both\. We build on OpenSpiel\(Lanctot et al\.,[2019](https://arxiv.org/html/2606.29457#bib.bib20)\)\.
### Auctions and takeovers\.
Auction theory characterizes bidding under private and common values\(Klemperer,[1999](https://arxiv.org/html/2606.29457#bib.bib16)\), from the common\-value model ofWilson \([1977](https://arxiv.org/html/2606.29457#bib.bib32)\)to the affiliated\-values generalization ofMilgrom and Weber \([1982](https://arxiv.org/html/2606.29457#bib.bib23)\), which is closest to the game we build\. Endogenous information acquisition before bidding, the formal version of our diligence question, is studied byPersico \([2000](https://arxiv.org/html/2606.29457#bib.bib26)\)and, in the broader mechanism\-design and auction\-format literature, byBergemann and Välimäki \([2002](https://arxiv.org/html/2606.29457#bib.bib1)\)andCompte and Jehiel \([2007](https://arxiv.org/html/2606.29457#bib.bib10)\); we take signal precision as exogenous and add a per\-signal cost only in the diligence\-cutoff study\. The takeover literature studies*sequential*preemptive and jump bidding\(Fishman,[1988](https://arxiv.org/html/2606.29457#bib.bib12)\)and toehold effects\(Bulow et al\.,[1999](https://arxiv.org/html/2606.29457#bib.bib6)\); our sealed simultaneous auction deliberately abstracts away the sequential signaling that drives preemption, and isolates the toehold’s payment\-side channel\. Bilateral trade under private information is bounded by impossibility results\(Myerson and Satterthwaite,[1983](https://arxiv.org/html/2606.29457#bib.bib24)\)\. A line of work learns equilibria of auction games directly, with neural pseudogradient ascent\(Bichler et al\.,[2021](https://arxiv.org/html/2606.29457#bib.bib2)\)and first\-order gradient dynamics\(Kohring et al\.,[2023](https://arxiv.org/html/2606.29457#bib.bib17)\), and learning has likewise been applied to negotiation and multi\-issue bargaining\(Chang,[2020](https://arxiv.org/html/2606.29457#bib.bib8)\)\. We borrow these structures as game definitions rather than deriving closed\-form equilibria\.
## 3Deal games
### Reusable abstraction\.
ADealGamelayer centralizes the two pieces identical across deal games and easy to get wrong: information\-set construction \(a player’s information state is built from its own private tokens plus public tokens, so an opponent’s private draw never leaks in, which would silently invalidate exploitability\), and the zero\-sum payoff contract\. Concrete games supply only their move protocol and economics\.
### Common\-value takeover auction\.
Two bidders compete for a target of common valueWW, unknown to both; each receives a private signal ofWWwith per\-bidder noise \(the affiliated common\-value setting ofMilgrom and Weber,[1982](https://arxiv.org/html/2606.29457#bib.bib23)\)\. Sealed first\-price bids are submitted simultaneously; the higher bid wins, pays its bid, and earnsW−bidW\-\\text\{bid\}; the loser earns nothing\. The common\-value structure induces a winner’s curse\(Capen et al\.,[1971](https://arxiv.org/html/2606.29457#bib.bib7); Kagel and Levin,[1986](https://arxiv.org/html/2606.29457#bib.bib15)\): conditioning on winning is bad news aboutWW, so a bidder that does not shade overpays in expectation\.
### A note on the zero\-sum rendering\.
We score each game zero\-sum as the*difference*in profit, the head\-to\-head benchmark ofRudolph et al\. \([2026](https://arxiv.org/html/2606.29457#bib.bib27)\), so that exact two\-player exploitability applies\. This is a deliberate modeling choice with a cost: unlike an equal\-split of a fixed pie, subtracting the opponent’s \(endogenous\) profit is not an incentive\-preserving affine shift; it adds a rivalry term, so the equilibrium we compute is the Nash equilibrium of the relativized game, not the Bayes\-Nash equilibrium of the underlying general\-sum auction\. We use it for methodological reach in the solver benchmark, where exact two\-player exploitability is the metric; but where an economic quantity is at stake \(the information\-asymmetry, diligence, and toehold studies, §[5](https://arxiv.org/html/2606.29457#S5)\) we read the result off the auction’s genuine own\-profit Bayes\-Nash equilibrium, which we solve exactly \(§[4](https://arxiv.org/html/2606.29457#S4.SS0.SSS0.Px5)\), not the relativized one\. TheDealGameabstraction supports the general\-sum contract directly\.
### Toehold variant\.
A bidder holding a toeholdθ\\thetabuys only the remaining\(1−θ\)\(1\-\\theta\)if it wins and collectsθ\\thetaof the price if it loses;θ=0\\theta=0recovers the base auction\.
### Private\-value auction\.
A structurally different game: each bidder draws and observes its own independent private value, with no common value and no winner’s curse\. Including it guards against artifacts of the common\-value structure\.
## 4Methods
We benchmark nine solvers, grouped into three families\. All are evaluated by the same exact metric: OpenSpiel exploitability, i\.e\. NashConv halved, the average over the two players of how much each could gain by best\-responding to the other\. It is zero at a Nash equilibrium and lower is better; we compute it by exact tree traversal\.
### Exact and tabular\.
CFR\(Zinkevich et al\.,[2007](https://arxiv.org/html/2606.29457#bib.bib33)\); magnetic mirror descent \(MMD\)\(Sokota et al\.,[2023](https://arxiv.org/html/2606.29457#bib.bib29)\)as a Nash solver; PSRO\(Lanctot et al\.,[2017](https://arxiv.org/html/2606.29457#bib.bib19)\)with exact best\-response oracles and a projected\-replicator\-dynamics meta\-solver \(exact oracles are the strongest, and appropriate for games small enough to tabulate\); and a from\-scratch tabular policy gradient \(REINFORCE with a per\-information\-state baseline and entropy bonus\)\.
### Deep \(function approximation\)\.
PPO\(Schulman et al\.,[2017](https://arxiv.org/html/2606.29457#bib.bib28)\), the seed paper’s method, implemented as clipped self\-play with a value baseline, multiple epochs over minibatches, and entropy regularization; PPG\(Cobbe et al\.,[2021](https://arxiv.org/html/2606.29457#bib.bib9)\), its phasic variant, which decouples policy and value optimization with periodic auxiliary value\-distillation phases under a behavioral\-cloning constraint; a generic deep policy gradient \(OpenSpiel’s PolicyGradient agent with the regret\-policy\-gradient loss, “DeepPG” in tables\), the deep analogue of our tabular REINFORCE; Deep CFR\(Brown et al\.,[2019](https://arxiv.org/html/2606.29457#bib.bib4)\)\(“DeepCFR” in tables\), the deep counterfactual\-regret baseline standing in for the deep\-CFR family \(including ESCHERMcAleer et al\.,[2023](https://arxiv.org/html/2606.29457#bib.bib22)and DREAMSteinberger et al\.,[2020](https://arxiv.org/html/2606.29457#bib.bib30)\); and NFSP\(Heinrich and Silver,[2016](https://arxiv.org/html/2606.29457#bib.bib14)\), the fictitious\-play baseline\. Because each bidder takes exactly one action per episode \(its bid\), the policy\-gradient advantage reduces to the Monte\-Carlo return minus the value baseline, with no multi\-step bootstrapping; the clipped surrogate, minibatch epochs, and entropy bonus are otherwise standard\.
### Tuning and fairness\.
We selected the learning rate of the policy\-gradient family \(PPO, PPG, DeepPG\) by a coarse sweep, because the generic policy gradient is sensitive to it \(see §[5](https://arxiv.org/html/2606.29457#S5)\); the other deep baselines \(Deep CFR, NFSP\) use author\-recommended settings adapted to the short CPU budget rather than a per\-game sweep\. We report this asymmetry plainly so the comparison is not misread as exhaustively tuned on all sides\. The policy\-gradient methods \(PPO, PPG, DeepPG\) and NFSP share the same 64\-unit single\-hidden\-layer scale; Deep CFR uses its standard two\-layer\(64,64\)\(64,64\)advantage and policy networks\. All methods are evaluated identically\.
### Iterate averaging\.
In two\-player zero\-sum games the last iterate of policy gradient cycles around the equilibrium and does not converge in exploitability; only the time\-average converges\. We therefore evaluate the deep policy\-gradient methods \(PPO, PPG, DeepPG\) on a tabular tail\-average of the network’s per\-information\-state action probabilities \(the games are small enough to tabulate a neural policy exactly\), reporting that average as the convergent metric\. The other deep baselines are evaluated on their own native convergent objects for the same reason: NFSP on its average\-policy network and Deep CFR on its average\-strategy network\. The tail\-average discards the first half of training \(a burn\-in\) so the random\-init transient does not pollute it; we note that this gives the policy\-gradient family a head start the always\-on averages of NFSP/Deep CFR do not get, and report both the tail\-average and the oscillating last iterate\. To confirm the headline ordering is not an artifact of that head start, we also score the policy\-gradient family with*no*burn\-in \(a full average that includes the random\-init transient, the same treatment the NFSP/Deep CFR averages receive\): on the headline game PPG and PPO then read0\.0110\.011and0\.0220\.022\(three seeds\), essentially unchanged from their tail\-averages and still far below NFSP \(0\.5070\.507\) and Deep CFR \(0\.1210\.121\), so the gap is a method effect rather than a burn\-in advantage\. Deep methods are stochastic, so we report mean±\\pmstandard deviation over five seeds; we record wall\-clock seconds alongside episodes for a compute\-normalized comparison\.
In the intractable regime the policy cannot be tabulated, so there we tail\-average the network*weights*instead\. Averaging the weights of a nonlinear network is not the same object as the time\-average of the induced policy, and the convergence argument above is about the latter, so we validate weight\-averaging on tractable CV\-large, where exact exploitability is available: from a single PPO trajectory we average the same post\-burn\-in iterates both ways\. The weight tail\-average reaches exact exploitability0\.026±0\.0060\.026\\pm 0\.006, close to the tabular time\-average’s0\.021±0\.0090\.021\\pm 0\.009and far below the cycling last iterate’s0\.052±0\.0260\.052\\pm 0\.026\(three seeds\)\. Weight\-averaging is thus a sound stand\-in for the convergent time\-average where the latter cannot be built\.
### General\-sum equilibrium\.
The economic comparative statics are properties of the underlying*general\-sum*auction, not of the zero\-sum rendering the solver benchmark uses, so for them we solve the auction’s own\-profit Bayes\-Nash equilibrium exactly\. The auction factorizes: terminal profit depends only on the common value and the two bids, and each bidder’s signals are conditionally independent given the value, so the game collapses to small dense tensors \(one row per information set\) and the equilibrium is found by fictitious play on each bidder’s*own*profit, with no tree traversal and no Monte\-Carlo noise\. We report the achieved NashConv \(the summed own\-profit gain from a unilateral deviation; below∼\\sim10−210^\{\-2\}throughout, and∼\\sim10−410^\{\-4\}except at the fully\-informed corner\) so convergence is measured rather than assumed, and pin the tensor model to the OpenSpiel game with a unit test that matches expected own profit on several game shapes\. This lets each economic claim be read off the genuine equilibrium rather than a one\-sided deviation or the relativized objective; because the tensors are tiny it also runs in a fraction of a second, including the per\-bidder asymmetric\-signal configurations the OpenSpiel game cannot express\.
## 5Experiments
All experiments run on CPU in a Docker image that pins the Python base image, OpenSpiel, and PyTorch versions and fixes the CPU thread count \(OMP\_NUM\_THREADSandtorch\.set\_num\_threads\) so multithreaded float reductions are bounded; the residual nondeterminism that remains is why we still report over seeds\. Full hyperparameters are in Appendix[A](https://arxiv.org/html/2606.29457#A1)\. The deep self\-play, cross\-game, zero\-sum toehold, and scaling studies are driven by one benchmark script; the tabular convergence figure comes from a small auxiliary script; and the general\-sum economic results \(information asymmetry, diligence, and the toehold at the own\-profit equilibrium\) come from the exact equilibrium solver of §[4](https://arxiv.org/html/2606.29457#S4.SS0.SSS0.Px5)\.
### Tabular convergence\.
On the small common\-value auction \(from the auxiliary script\) all exact/tabular methods drive exploitability toward zero \(Fig\.[5](https://arxiv.org/html/2606.29457#S5.F5)\)\. Final exploitability: CFR0\.0025, MMD0\.015, tabular REINFORCE0\.0006\. On this small game the from\-scratch policy gradient is competitive with the specialized solvers; the cross\-game table below shows it degrades on larger games at a fixed budget\.
### Deep self\-play \(headline\)\.
On a larger common\-value auction \(5 values, 6 bid levels\), over five seeds, the two generic policy\-gradient methods are the strongest learners: PPG converges to tail\-averaged exploitability0\.013±0\.0040\.013\\pm 0\.004and PPO to0\.016±0\.0080\.016\\pm 0\.008, an order of magnitude below the generic deep policy gradient’s0\.213±0\.0290\.213\\pm 0\.029, roughly8×8\\timesbelow Deep CFR’s0\.122±0\.0070\.122\\pm 0\.007, and about30×30\\timesbelow NFSP’s0\.496±0\.0040\.496\\pm 0\.004\(Fig\.[6](https://arxiv.org/html/2606.29457#S5.F6)\)\. They are also markedly more sample\-efficient, reaching their plateau in far fewer episodes\. The exact references remain both lower and faster: CFR0\.0037in69s and PSRO0\.0048±0\.00040\.0048\\pm 0\.0004in44s, against roughly300300s for PPO/PPG\. Indeed the policy\-gradient methods are the most expensive*per episode*\(PPO/PPG≈300\\approx 300s for300300k episodes versus DeepPG141141s and NFSP187187s\), so their edge over the deep baselines is at equal*episode*budgets, not equal wall\-clock; on a game this small the exact solvers dominate on both axes\. A practical caveat behind the generic policy gradient: with an aggressive learning rate its last iterate diverges into a limit cycle near exploitability0\.400\.40; a smaller learning rate \(which we use\) restores monotone convergence, and the tail\-average \(the convergent quantity in zero\-sum games\) gives the stable, low\-variance numbers reported here\.
### Across games\.
Table[2](https://arxiv.org/html/2606.29457#S5.T2)reports final exploitability for every method on every game\. Two patterns hold\. Among the learning methods, PPO and PPG are the strongest and trade places \(PPG lowest on CV\-small and CV\-large, PPO on CV\-toehold and PV\), and both sit far below the generic deep policy gradient, Deep CFR, and NFSP on all four games\. The gaps are large and consistent across five seeds, so we read these as robust orderings rather than precise estimates\. As expected for games small enough to tabulate, the exact solvers \(CFR, MMD, PSRO\) achieve the lowest exploitability overall; tabular REINFORCE is competitive only on the smallest game and degrades sharply on the larger ones\. The case for the learning methods is therefore not that they win here, but that they carry over unchanged as games grow \(§[5](https://arxiv.org/html/2606.29457#S5.SS0.SSS0.Px9)\)\.
### Information asymmetry\.
We read the value of information off the auction’s genuine own\-profit Bayes\-Nash equilibrium \(§[4](https://arxiv.org/html/2606.29457#S4.SS0.SSS0.Px5)\), not the zero\-sum rendering or a one\-sided deviation\. Holding bidder 1’s signal noise at0\.50\.5and sweeping bidder 0’s, bidder 0’s equilibrium own profit falls monotonically as its signal degrades, from0\.93with a perfect signal \(the one corner the solver brings only to a near\-equilibrium, NashConv∼\\sim8×10−38\\times 10^\{\-3\}; §[4](https://arxiv.org/html/2606.29457#S4.SS0.SSS0.Px5)\) to0\.30with a useless one, while the now\-better\-informed rival’s value rises to meet it; at equal \(useless\) information the two split the surplus \(0\.30each, Fig\.[7](https://arxiv.org/html/2606.29457#S5.F7)\)\. This is the value of information measured in a real equilibrium: positive, steepest at intermediate signal quality, and economically meaningful even at the floor, because a poorly informed bidder still captures real surplus\. The zero\-sum rendering and a one\-sided own\-profit best response agree on the direction of the decline \(the zero\-sum value falls from\+0\.389to\+0\.002, the one\-sided best response from\+0\.389to\+0\.168\), but both overstate the collapse: the rivalry term and the fixed\-opponent assumption each understate the residual surplus the uninformed bidder retains in the true equilibrium\. This is exactly the gap the exact general\-sum solver closes\.
### How much due diligence?
The value of information is what a bidder pays*for*; the title’s question is how much to buy\. We model diligence as the number of independent signalskka bidder acquires before bidding\. At the own\-profit equilibrium \(rival fixed at one signal\), in our base game \(five values, six bids, signal noise0\.50\.5\) bidder 0’s equilibrium profit rises with its ownkk, from0\.46atk=1k=1to0\.82atk=5k=5\(Fig\.[8](https://arxiv.org/html/2606.29457#S5.F8), left\), with a*lumpy*marginal: here the third signal is worth more than the second \(\+0\.18\+0\.18versus\+0\.08\+0\.08\)\. Charging a per\-signal diligence costcc, the profit\-maximizing amount of diligence isk⋆=argmaxk\[value\(k\)−ck\]k^\{\\star\}=\\arg\\max\_\{k\}\\,\[\\text\{value\}\(k\)\-c\\,k\]\(Fig\.[8](https://arxiv.org/html/2606.29457#S5.F8), right\): in this game it steps down fromk⋆=5k^\{\\star\}=5when signals are nearly free tok⋆=1k^\{\\star\}=1onceccexceeds about0\.140\.14per signal, skippingk=2k=2entirely\. This is the cost\-benefit cutoff the title asks for\. The load\-bearing claim is not merely that the optimum is finite \(with a strictly positive per\-signal cost and a bounded marginal value of information, finiteness is close to automatic\) but that it*falls monotonically*as the cost rises; the specific schedule \(whichkkare skipped\) is the fragile part, as the next paragraphs show\.
### Is the cutoff robust?
Because that cutoff is read off one parameterization, we re\-solved it across game sizes \(four, five, and six common values\) and signal qualities \(noise0\.30\.3,0\.50\.5,0\.70\.7\); Fig\.[9](https://arxiv.org/html/2606.29457#S5.F9)\. Two things hold and one does not\.*Robust:*the optimal amount of diligence is finite and*falls*as the per\-signal cost rises in every converged parameterization \(7/77/7\), and acquiring diligence is valuable on net \(equilibrium value atk=5k=5exceeds that atk=1k=1in6/76/7\)\.*Not robust:*the fine structure\. The marginal value of signals is lumpy but its shape is parameterization\-specific: the odd\-signal bump that drives the base game’sk=2k=2skip appears in only3/73/7cases, and in3/73/7the value is non\-monotone inkk\(more signals can*lower*a bidder’s equilibrium profit at somekk\)\. We initially read this non\-monotonicity as a common\-value equilibrium effect, but it does not survive refinement of the bid grid: re\-solving a non\-monotone cell \(four values, noise0\.50\.5\) with the bid grid refined from55to99and1717levels over the same bid range turns thek=1→2k\{=\}1\\\!\\to\\\!2value*decline*\(\+0\.12\+0\.12\) into an*increase*\(about−0\.07\-0\.07to−0\.10\-0\.10; Fig\.[10](https://arxiv.org/html/2606.29457#S5.F10)\), so the non\-monotonicity is a coarse\-grid quantization artifact, not an equilibrium property \(the refined solves carry more bid actions and settle at NashConv∼\\sim2×10−42\\times 10^\{\-4\}rather than10−410^\{\-4\}, but the reversal is far larger than that residual\)\. We therefore report the specific schedule \(whichkka cutoff skips\) as a model\-internal illustration that is itself sensitive to the bid grid; the transferable answer is the qualitative one \(finite, cost\-decreasing diligence\), stated in the model’s terms \(a stylized constant per\-signal cost\), not for a specific transaction\.
### When both bidders choose diligence\.
The curves above fix the rival at one signal, so they are a one\-sided best response, and the toehold result \(below\) warns that one\-sided analyses overstate effects\. We therefore let*both*bidders choose how much to acquire\. Solving the bidding equilibrium for every pair\(k0,k1\)\(k\_\{0\},k\_\{1\}\)on the base game gives a value matrix in which a bidder’s own profit rises with its own signals but*falls*as the rival acquires more \(own value with five signals drops from0\.82against a one\-signal rival to0\.58against a five\-signal rival\), so diligence is partly rivalrous\. At each per\-signal cost we then find the symmetric acquisition equilibrium, thek⋆k^\{\\star\}at which neither bidder gains by deviating when both acquirek⋆k^\{\\star\}\(Fig\.[11](https://arxiv.org/html/2606.29457#S5.F11)\)\. The one\-sided calculation*overstates*how much diligence is bought: at intermediate costs the symmetric equilibrium settles atk⋆=2k^\{\\star\}=2where the one\-sided cutoff reads33–44, because each extra signal is worth less when the rival is also well\-informed\. The title’s answer nonetheless survives the move to a genuine acquisition equilibrium:k⋆k^\{\\star\}is still finite and still falls monotonically with cost \(from55when signals are nearly free, to22at intermediate cost, to11onceccexceeds about0\.140\.14\)\.
### Toeholds\.
Bidder 0’s expected bid rises with its toehold under the relativized analyses \(Fig\.[12](https://arxiv.org/html/2606.29457#S5.F12)\): in the zero\-sum equilibrium \(CFR\) it rises from2\.03atθ=0\\theta=0to2\.27atθ=0\.4\\theta=0\.4\(then edging down slightly to2\.25atθ=0\.6\\theta=0\.6\), and its own\-profit\-maximizing bid against a fixed opponent rises monotonically from1\.40to2\.40\. At the genuine own\-profit equilibrium, however, the picture is sharper: a toehold raises bidder 0’s equilibrium profit monotonically \(from0\.46atθ=0\\theta=0to1\.97atθ=0\.6\\theta=0\.6\), but its equilibrium*bid*is essentially flat once the opponent re\-optimizes \(Fig\.[13](https://arxiv.org/html/2606.29457#S5.F13)\)\. This flatness is not an artifact of the coarse bid grid: refining the grid from66to2020levels keeps the equilibrium bid’s total variation acrossθ∈\[0,0\.6\]\\theta\\in\[0,0\.6\]under0\.0020\.002, so it is an equilibrium property\. The robust toehold effect is thus on value, not on equilibrium bidding: the aggressiveness that the zero\-sum rendering and a one\-sided best response both show is largely competed away in the symmetric equilibrium\. The aggressiveness channel emphasized byBulow et al\. \([1999](https://arxiv.org/html/2606.29457#bib.bib6)\)operates in sequential and asymmetric contests; in this sealed, simultaneous, symmetric auction it is competed away at the fixed point, so the flat equilibrium bid is a matter of model scope rather than a refutation of that channel\. That the two objectives disagree on this comparative static is itself the case for solving the true general\-sum equilibrium rather than reading economics off the zero\-sum benchmark\.
### Scaling: do the learning methods earn their keep?
The headline and cross\-game results show the exact solvers winning wherever the game is tabulatable, so the case for the learning methods must be about*scale*\. We test it directly \(Table[1](https://arxiv.org/html/2606.29457#S5.T1)\): on common\-value auctions of growing size we measure the training wall\-clock each method needs to drive exploitability to0\.050\.05\. Exact CFR traverses the whole game tree every iteration, so its time\-to\-target grows steeply with the tree, from2\.02\.0s at∼\\sim2k states to1,5501\{,\}550s at∼\\sim365k states \(faster than linear: a∼\\sim180\-fold growth in states costs a∼\\sim780\-fold growth in time, because the larger trees also need more iterations to reach the target, not only more time per iteration\)\. PPO samples a fixed number of episodes per update regardless of tree size, so its time\-to\-target is nearly flat \(∼\\sim190–280s\)\. The cost trends diverge sharply, which is the structural reason to care about the learning methods\. But within the range we can enumerate this does*not*become a PPO win: CFR is faster on every game we could build, and on the largest \(S12\) PPO, with the configuration tuned on the small game and not re\-tuned per scale, fails to reach the target at all \(its best exploitability over three seeds is0\.130\.13, more than double the0\.050\.05target\) while CFR reaches it in1,5501\{,\}550s\. The constant per\-iteration cost of the learning methods is a real advantage, but it pays off only past the point where the tree can be enumerated, where CFR cannot run and exploitability must be estimated approximately\. We therefore do not claim the learning methods beat exact solvers on solvable deal games: we observe*no*crossover within the enumerable range \(the two cost curves diverge but never cross\), and past that range the comparison is no longer like\-for\-like, because the success metric must change from exact exploitability to a calibrated lower bound\. What we claim is the weaker, defensible statement: the cost trends make the learning methods the strongest*available*option once exact methods become infeasible, and exhibiting that regime \(next paragraph\) is the natural next step\.
Table 1:Scaling: training wall\-clock \(seconds\) to reach exploitability0\.050\.05on common\-value auctions of growing size\. Exact exploitability is still computed by tree enumeration\. CFR’s cost grows steeply with the tree; PPO’s is roughly flat but does not reach the target on the largest game\. PPO seconds are mean±\\pmstd over the seeds that reached the target \(seeds reaching in parentheses\); the last column is PPO’s best exploitability reached, mean±\\pmstd over all three seeds\. PPO uses the small\-game configuration, not re\-tuned per scale\.
### The intractable regime\.
The scaling study stops where the tree can still be enumerated\. To reach the regime the learning methods are actually for, we make the game genuinely intractable: a*multi\-signal*common\-value auction in which each bidder receiveskki\.i\.d\. noisy signals ofWW\(independent due\-diligence estimates\)\. An information set is the bidder’skk\-tuple of signals, so the number of information sets per bidder isvaluesk\\text\{values\}^\{k\}and the tree isvalues2k\+1⋅bids2\\text\{values\}^\{2k\+1\}\\cdot\\text\{bids\}^\{2\}, both exploding withkk\. At66values,66bids, andk=8k=8signals the game has about6\.1×10146\.1\\times 10^\{14\}histories and1\.68×1061\.68\\times 10^\{6\}information sets per bidder: exact CFR and exact exploitability cannot run, and the strategy cannot be tabulated, but a network generalizes across signal vectors \(the relevant statistic is roughly their mean\)\. We train PPO and PPG with weight tail\-averaging \(no tabulation\) and measure exploitability with a*learned best response*\(Lockhart et al\.,[2019](https://arxiv.org/html/2606.29457#bib.bib21); Timbers et al\.,[2022](https://arxiv.org/html/2606.29457#bib.bib31)\): freeze the policy, train a PPO best\-responder against it for each player, and Monte\-Carlo the gain from deviating; half the summed gains estimates exploitability\. A learned BR underestimates the true best response, so this is a lower bound\. We calibrate it on CV\-large, where exact exploitability is available as ground truth, by measuring it for policies of*known*exploitability built by mixing the CFR equilibrium with uniform play \(Fig\.[2](https://arxiv.org/html/2606.29457#S5.F2), five best\-response seeds per point\)\. The learned BR tracks the exact value tightly across more than a decade, from0\.0430\.043\(estimate0\.039±0\.0030\.039\\pm 0\.003\) up to0\.9000\.900\(estimate0\.903±0\.0050\.903\\pm 0\.005\), with small across\-seed spread throughout, so the tightness is not single\-seed luck\. Two distinct scales matter below about0\.040\.04: a*Monte\-Carlo noise floor*of roughly0\.010\.01\(the standard error of the estimate at2020k evaluation episodes\), and a higher*reliable\-tracking threshold*of about0\.040\.04, below which the estimate begins to under\-read before it has reached the noise floor \(at exact0\.0200\.020it already reads0\.0110\.011\)\. At the CFR equilibrium itself the estimate is essentially zero \(0\.0010\.001against the exact0\.00370\.0037, below the noise floor\)\. The estimator is thus tight in the range that matters and saturates only below the tracking threshold, so a reading of0\.0000\.000means*exploitability below about0\.040\.04*, not a best response too weak to find a deviation\.
One concern remains: CV\-large is a single\-signal game, whereas the intractable target carries thekk\-signal information\-set structure \(kk\-tuples rather than scalars\)\. To check that the estimator’s tightness is a property of the learned BR and not of single\-signal geometry, we repeat the calibration on the multi\-signal auction itself atk=1,2,3k=1,2,3, the largestkkstill tractable for exact ground truth \(55values,66bids;k=3k=3is already2\.8×1062\.8\\times 10^\{6\}histories\), with five best\-response seeds per mix point \(Fig\.[3](https://arxiv.org/html/2606.29457#S5.F3)\)\. The estimate tracks the exact value closely across all threekkin the range that matters, from near1\.01\.0down to the low tenths \(atk=3k=3, exact0\.1840\.184reads0\.1570\.157and exact0\.9660\.966reads0\.9420\.942\)\. The reliable\-tracking threshold does rise withkk: atk=1k=1the estimate is still tight at exact0\.050\.05\(0\.0390\.039\), but atk=3k=3the same point reads only0\.0280\.028, so reliable tracking holds down to about0\.040\.04atk=1k=1and about0\.100\.10atk=3k=3\(the equilibrium reads0\.0000\.000against an exact0\.0080\.008atk=3k=3\)\. The calibrated threshold therefore rises only modestly withkkover the range we can check, rather than blowing up, which is the property thek=8k=8reading relies on: the estimator’s tightness survives the multi\-signal information\-set structure, not just the single\-signal game\. The evidence is strongest atk≤3k\\leq 3, where we have exact ground truth on the headline family; thek=8k=8reading extrapolates the floor’s slow growth past that point, and should be read as such\. As direct support, on a smaller game \(33values\) solvable tok=4k=4the floor does not grow inkk\(it edges down from about0\.00510\.0051to0\.00370\.0037; Fig\.[4](https://arxiv.org/html/2606.29457#S5.F4)\), so the floor does not blow up as the information\-set count grows\.
### Result\.
On this intractable game, where no exact solver can run, we compare against two tabulation\-free anchors: uniform play and a*naive*bidder that bids its posterior mean ofWWwithout conditioning on the winning event \(no winner’s\-curse shading\)\. The same learned BR exploits uniform play almost completely \(approximate exploitability0\.999\) and the naive bidder partially \(0\.071, which measures the value of conditioning on winning that the naive bidder forgoes, one component of the winner’s curse rather than its full magnitude\), but finds no profitable deviation against the trained policies: against both PPO and PPG its estimate sits at the resolution floor \(clamped to0\.000\) over three seeds \(Fig\.[1](https://arxiv.org/html/2606.29457#S5.F1)\)\. The informative comparison is the gap to the anchors: the learned policies sit far below the naive bidder, which is itself far below uniform play\. Three checks make this credible rather than an artifact of a weak BR: the same BR clearly exploits both references; doubling its training and widening its network leaves the anchor estimates unchanged \(the BR is at its power ceiling\); and, most directly, the calibrations above show this BR tracks exact exploitability tightly down to about0\.040\.04on the single\-signal game and across the multi\-signal structure atk=1,2,3k=1,2,3\(Figs\.[2](https://arxiv.org/html/2606.29457#S5.F2),[3](https://arxiv.org/html/2606.29457#S5.F3)\), so a0\.0000\.000reading places the learned policies below that resolution and well below the naive bidder’s0\.0710\.071, in a regime exact methods cannot enter\. This is the evidence the scaling study set up: once the game is too large to enumerate, the learning methods still produce a low\-exploitability policy at fixed cost\. We report approximate \(lower\-bound\) exploitability and do not claim a Nash certificate; certifying equilibrium quality here is open\.
Figure 1:Intractable multi\-signal common\-value auction \(66values,66bids,k=8k=8signals;∼\\sim6\.1×10146\.1\\times 10^\{14\}histories, exact methods infeasible\)\. Approximate \(learned\-best\-response\) exploitability of uniform play, a naive unshaded posterior\-mean bidder, and the trained PPO and PPG policies \(weight tail\-average, mean±\\pmstd over three seeds\)\. The same BR exploits the references but not the learned policies\. Lower is better\.Figure 2:Calibrating the learned\-best\-response estimator on CV\-large, where exact exploitability is available\. Policies of known exploitability are built by mixing the CFR equilibrium with uniform play; markers are the mean over five best\-response seeds and bars the across\-seed standard deviation\. The estimate tracks the exact value along the diagonal from0\.0430\.043\(0\.0390\.039\) to0\.9000\.900\(0\.9030\.903\), staying close down to a reliable\-tracking threshold of about0\.040\.04and saturating toward zero below that as it approaches a∼\\sim0\.010\.01Monte\-Carlo noise floor\. This is why a0\.0000\.000reading on the intractable game means exploitability below the resolution floor, not a weak best response\.Figure 3:Calibrating the same learned\-best\-response estimator on the*multi\-signal*auction atk=1,2,3k=1,2,3signals \(55values,66bids\), the largestkkstill tractable for exact ground truth\. As in Fig\.[2](https://arxiv.org/html/2606.29457#S5.F2), policies of known exploitability are built by mixing the CFR equilibrium with uniform play; markers are the mean over five best\-response seeds and bars are the across\-seed standard deviation\. The estimate tracks the exact value along the diagonal across all threekkin the range that matters; the resolution floor rises withkk\(reliable tracking from about0\.040\.04atk=1k=1to about0\.100\.10atk=3k=3\) but not catastrophically\. The tightness is thus a property of the estimator, not of single\-signal structure, supporting the lower\-bound reading on thek=8k=8game\.Figure 4:Does the resolution floor explode askkgrows? On a smaller game \(33values,44bids\), we track the floor \(the learned\-BR estimator’s Monte\-Carlo standard error at the near\-zero\-exploitability CFR equilibrium\) overk=1,…,4k=1,\\dots,4\. It does not grow with the information\-set count \(3k3^\{k\}\); it edges*down*from about0\.00510\.0051to0\.00370\.0037, and the estimator still tracks a known\-exploitability mixed policy at everykk\(Appendix\)\. The floor magnitude is not comparable across game families, but its flat trend inkkis the property thek=8k=8extrapolation on the headline family relies on\.Table 2:Final exploitability by method and game \(NashConv halved; lower is better\)\. Exact/tabular solvers \(CFR, MMD, PSRO, REINFORCE\) are deterministic at fixed budget; the deep methods \(PPO, PPG, DeepPG, DeepCFR, NFSP\) are mean±\\pmstd over five seeds\. Deep learners use a fixed episode budget per game \(150150k episodes\), smaller than the300300k\-episode headline deep self\-play run; this is why CV\-large here \(0\.0180\.018–0\.0190\.019\) sits slightly above the headline figures \(0\.0130\.013–0\.0160\.016\) for the same game\. Best learning method per game in bold\.Figure 5:Tabular convergence on the small common\-value auction: exploitability \(NashConv halved, log scale, lower is better\) vs\. iterations for CFR, MMD, and our from\-scratch REINFORCE policy gradient\. From the auxiliary experiment script\.Figure 6:Deep self\-play on the common\-value auction \(5 values, 6 bids\): tail\-averaged exploitability \(log scale, lower is better\) vs\. episodes \(left\) and training wall\-clock seconds \(right\), mean±\\pmstd over five seeds, for PPO, PPG, DeepPG, and NFSP, with exact CFR/PSRO and Deep CFR shown as horizontal references \(their wall\-clock in the legend\)\.Figure 7:Value of information at the genuine own\-profit Bayes\-Nash equilibrium \(solved exactly, §[4](https://arxiv.org/html/2606.29457#S4.SS0.SSS0.Px5)\), not the zero\-sum rendering\. As bidder 0’s signal noise rises \(bidder 1 fixed at noise0\.50\.5\), bidder 0’s equilibrium own profit falls and the better\-informed rival’s rises, meeting where both are uninformed\. The value of information is positive, steepest at intermediate signal quality, and the uninformed bidder still captures real surplus\.Figure 8:How much due diligence, in the base game \(five values, six bids, noise0\.50\.5\)\. Left: bidder 0’s equilibrium own profit as it acquires more independent diligence signalskk\(rival fixed at one signal\), rising with a lumpy marginal\. Right: with a per\-signal diligence costcc, the profit\-maximizing amount of diligencek⋆=argmaxk\[value\(k\)−ck\]k^\{\\star\}=\\arg\\max\_\{k\}\[\\text\{value\}\(k\)\-c\\,k\], a step function falling from55to11as cost rises\. The cost\-benefit cutoff the title asks for; its robustness across parameterizations is Fig\.[9](https://arxiv.org/html/2606.29457#S5.F9)\.Figure 9:Robustness of the diligence cutoff across game sizes \(four, five, six common values\) and signal qualities \(noise0\.30\.3,0\.50\.5,0\.70\.7\); dotted lines are parameterizations the fictitious\-play solver did not bring to tolerance and are excluded from the counts\. Left: the value of diligence is positive on net but not always monotone inkk\. Right: the marginal value is lumpy, but its shape \(which signals are worth most\) is parameterization\-specific\. The cutoff itself is finite and falls with cost in every converged case \(7/77/7\); the specifickkit skips is not a general feature\.Figure 10:The value non\-monotonicity is a bid\-grid artifact\. Re\-solving a non\-monotone robustness cell \(four values, noise0\.50\.5\) with the bid grid refined from55to99and1717levels over the same bid range: the coarse\-grid decline fromk=1k=1tok=2k=2reverses into an increase once the grid is refined, so “more signals lower profit” does not survive refinement and is not an equilibrium property\.Figure 11:Diligence when both bidders choose it\. Left: a bidder’s equilibrium own profit rises with its own signals but falls as the rival acquires more, so diligence is partly rivalrous\. Right: the profit\-maximizing amount of diligence per signal\-cost, in the symmetric acquisition equilibrium \(both bidders choosekk\) versus the one\-sided cutoff \(rival fixed at one\)\. Competition makes the symmetric equilibrium more conservative, but both schedules are finite and fall monotonically with cost\.Figure 12:Toeholds under the relativized analyses\. Both the zero\-sum CFR equilibrium bid and the bid that maximizes bidder 0’s*own*profit against a fixed opponent rise with the toeholdθ\\theta\. At the genuine own\-profit equilibrium this aggressiveness is largely competed away \(Fig\.[13](https://arxiv.org/html/2606.29457#S5.F13)\)\.Figure 13:The toehold at the genuine own\-profit equilibrium\. Left: bidder 0’s equilibrium profit rises monotonically with its toeholdθ\\theta\. Right: its equilibrium expected bid is essentially flat, so the aggressiveness seen under the relativized analyses \(Fig\.[12](https://arxiv.org/html/2606.29457#S5.F12)\) is competed away once the opponent re\-optimizes\. The robust toehold effect is on value, not on equilibrium bidding\.
## 6Discussion and limitations
The games are small and stylized abstractions, not calibrated transaction models, and they are one auction family \(common\- and private\-value first\-price\), not a broad suite\. The common\-value signals are conditionally independent given the asset value under a specific discretized noise model; affiliated or correlated signal structures \(the generalMilgrom and Weber,[1982](https://arxiv.org/html/2606.29457#bib.bib23)case\) are not explored, and some mechanisms \(for instance that the payoff\-relevant statistic is roughly the signal mean\) may be specific to that structure\. Our PSRO uses exact best\-response oracles rather than RL oracles, and our Deep CFR stands in for the broader deep\-CFR family \(ESCHERMcAleer et al\.,[2023](https://arxiv.org/html/2606.29457#bib.bib22), DREAMSteinberger et al\.,[2020](https://arxiv.org/html/2606.29457#bib.bib30)\)\. The zero\-sum relative\-scoring rendering distorts the underlying general\-sum auction by adding a rivalry term \(§[5](https://arxiv.org/html/2606.29457#S5)\); we therefore read the economic comparative statics off the auction’s exact own\-profit Bayes\-Nash equilibrium instead \(§[4](https://arxiv.org/html/2606.29457#S4.SS0.SSS0.Px5)\), but the solver benchmark itself still runs on the zero\-sum rendering, where exact two\-player exploitability applies, so the head\-to\-head method comparison and the economic results are computed on different objectives\. For the intractable regime we report a learned\-best\-response*lower bound*on exploitability, calibrated against exact values where those are available \(§[5](https://arxiv.org/html/2606.29457#S5)\); it is not a Nash certificate, and a certified upper bound there \(for example via a structured or local best response\) remains open\. The intractable\-regime policy is evaluated through a weight \(Polyak\) tail\-average, which we validate against the tabular time\-average only on CV\-large over three seeds \(§[5](https://arxiv.org/html/2606.29457#S5)\); we treat it as a sound stand\-in rather than a proven equivalence at larger scale\. Other genuinely intractable instances \(continuous bids/signals or many\-round contests\) would stress the same estimator further\. Diligence signals are free in the auction games themselves; the cost side enters only in the diligence\-cutoff study \(§[5](https://arxiv.org/html/2606.29457#S5)\), where we impose a per\-signal cost and read off the optimal amount, both against a fixed one\-signal rival and in a symmetric acquisition equilibrium where both bidders choose their diligence\. That cutoff uses a stylized constant per\-signal cost, not an estimated cost schedule for a particular deal, so it answers “how much diligence” in the model’s terms rather than as a transaction recommendation\. Claims about which method is best are empirical statements on these games, not theorems\.
## 7Conclusion
On zero\-sum auction games from the takeover literature, the generic policy\-gradient methods PPO and PPG are the strongest learning solvers, clearly beating the deep fictitious\-play and deep\-CFR baselines on commodity CPU with no frontier\-model spend; this gap survives scoring every method without the policy\-gradient family’s burn\-in, and the one tuning asymmetry \(we swept the policy\-gradient learning rate, not the baselines’\) is stated plainly\. Solving the auction’s own\-profit Bayes\-Nash equilibrium exactly, we quantify the value of due diligence and the cost at which acquiring more of it stops paying\. The honest qualification is that on every game small enough to tabulate the exact solvers \(CFR, MMD, PSRO\) are both more accurate and faster: the policy\-gradient methods win the learning bracket, not the overall race\. A scaling study shows why they may still matter, their per\-target cost staying flat while exact CFR’s grows steeply with the game tree, and we take the first step into the regime that settles the question: on a multi\-signal auction too large to enumerate, where exact solvers cannot run, PPO and PPG drive a learned\-best\-response exploitability estimate to its resolution floor, below a naive unshaded bidder, with the estimator calibrated against exact values on the games where both can be computed\. Certifying equilibrium quality there, beyond a lower bound, is the open problem; the shared abstraction is a foundation for these richer games\.
## Appendix AHyperparameters
All deep methods use a single hidden layer of6464units and the Adam optimizer, except Deep CFR, which uses two\(64,64\)\(64,64\)networks \(its advantage and policy nets\); full solver settings are in Table[3](https://arxiv.org/html/2606.29457#A1.T3)\. Means are over five seeds\{0,1,2,3,4\}\\\{0,1,2,3,4\\\}for the headline and cross\-game studies and three seeds\{0,1,2\}\\\{0,1,2\\\}for the scaling, intractable, and weight\-averaging studies\. Both the single\-signal calibration \(Fig\.[2](https://arxiv.org/html/2606.29457#S5.F2)\) and the multi\-signal calibration \(Fig\.[3](https://arxiv.org/html/2606.29457#S5.F3)\) use five best\-response seeds per mix point with across\-seed standard\-deviation bars\. All tail\-averages discard the first50%50\\%of training as burn\-in\.
### Episode budgets\.
Headline deep self\-play300300k; cross\-game table150150k; scaling\-study PPO600600k \(time\-to\-target is reached well before the cap\); intractable\-regime PPO/PPG300300k\.
### General\-sum equilibrium solver\.
The economic comparative statics are read off the auction’s own\-profit Bayes\-Nash equilibrium \(§[4](https://arxiv.org/html/2606.29457#S4.SS0.SSS0.Px5)\), found by fictitious play on the factorized tensor model \(linear averaging of pure best responses, up to6060k iterations with an early stop at NashConv<10−4<10^\{\-4\}\)\. The information\-asymmetry study sweeps bidder 0’s signal noise over\{0\.0,0\.25,0\.5,0\.75,1\.0\}\\\{0\.0,0\.25,0\.5,0\.75,1\.0\\\}with bidder 1 fixed at0\.50\.5; the toehold study sweepsθ\\thetaover\{0\.0,0\.2,0\.4,0\.6\}\\\{0\.0,0\.2,0\.4,0\.6\\\}; the diligence study sweeps bidder 0’s number of signals over\{1,2,3,4,5\}\\\{1,2,3,4,5\\\}\(each of noise0\.50\.5\) with the rival fixed at one signal, and the cost cutoff is taken over a per\-signal cost grid in steps of0\.020\.02up to0\.300\.30\. All three use55values and66bids\. The achieved NashConv is∼\\sim10−410^\{\-4\}except at the fully\-informed corner \(noise0\), where it is∼\\sim8×10−38\\times 10^\{\-3\}\. The diligence\-cutoff robustness sweep \(Fig\.[9](https://arxiv.org/html/2606.29457#S5.F9)\) repeats the diligence study overvalues∈\{4,5,6\}\\text\{values\}\\in\\\{4,5,6\\\}\(withbids=values\+1\\text\{bids\}=\\text\{values\}\+1\) andnoise∈\{0\.3,0\.5,0\.7\}\\text\{noise\}\\in\\\{0\.3,0\.5,0\.7\\\}, allowing up to400400k fictitious\-play iterations;22of the99cells \(both at noise0\.30\.3\) did not reach tolerance and are excluded from the robustness counts\. To firm up thek=8k=8extrapolation of the estimator’s resolution floor, which the calibration only reaches atk≤3k\\leq 3on the headline family, we additionally track the floor overk=1,…,4k=1,\\dots,4on a smaller game \(33values,44bids\), with33seeds \(Fig\.[4](https://arxiv.org/html/2606.29457#S5.F4)\);k=4k=4here \(∼\\sim3\.1×1053\.1\\times 10^\{5\}histories\) extends a full step past the headline calibration, andk=5k=5\(∼\\sim2\.8×1062\.8\\times 10^\{6\}histories\), though enumerable in principle, is already impractically slow for exact CFR, itself an instance of the scaling wall\. The zero\-sum toehold bids in Fig\.[12](https://arxiv.org/html/2606.29457#S5.F12)are the CFR equilibrium \(600600iterations\) and the own\-profit best response against theθ=0\\theta\{=\}0CFR equilibrium opponent, retained for comparison\. The symmetric acquisition equilibrium \(Fig\.[11](https://arxiv.org/html/2606.29457#S5.F11)\) solves the bidding equilibrium for every signal pair\(k0,k1\)\(k\_\{0\},k\_\{1\}\)withk0,k1∈\{1,…,5\}k\_\{0\},k\_\{1\}\\in\\\{1,\\dots,5\\\}on the base game and, at each per\-signal cost, reports the symmetric pure\-strategy acquisition equilibriak⋆k^\{\\star\}\(no unilateral deviation inkkimproves net payoff\)\. The grid\-refinement check \(Fig\.[10](https://arxiv.org/html/2606.29457#S5.F10)\) re\-solves a non\-monotone robustness cell \(44values, noise0\.50\.5\) with the bid grid set to55,99, and1717levels evenly spaced over the same bid range\[0,values\]\[0,\\text\{values\}\]; the larger action sets converge more slowly \(NashConv∼\\sim2×10−42\\times 10^\{\-4\}\)\.
### Software and threading\.
All experiments run in a pinned Docker image: Python3\.12\.133\.12\.13, CPU\-only PyTorch2\.12\.12\.12\.1, OpenSpiel1\.6\.151\.6\.15, NumPy<2\.0<2\.0\. To make the wall\-clock comparisons reproducible we pin threading to eight cores \(OMP\_NUM\_THREADS=MKL\_NUM\_THREADS=8\\texttt\{OMP\\\_NUM\\\_THREADS\}=\\texttt\{MKL\\\_NUM\\\_THREADS\}=8in the image, and the deep benchmark script additionally callstorch\.set\_num\_threads\(8\)\\texttt\{torch\.set\\\_num\\\_threads\}\(8\)\), so floating\-point reduction order is fixed across runs\.
Table 3:Solver hyperparameters\. The policy\-gradient family \(PPO, PPG, DeepPG\) shares a swept learning rate; the other deep baselines use author\-recommended settings adapted to the short CPU budget \(§[5](https://arxiv.org/html/2606.29457#S5)\)\.
### Intractable\-regime estimator\.
The learned best response trains for250250update batches of256256episodes each \(same PPO settings,6464\-wide\); approximate exploitability is Monte\-Carlo’d over2020k evaluation episodes; the evaluated policy is the weight tail\-average\. The calibration study \(Fig\.[2](https://arxiv.org/html/2606.29457#S5.F2)\) uses the same estimator settings, with the reference policies built from a400400\-iteration CFR equilibrium mixed with uniform play\. The multi\-signal calibration \(Fig\.[3](https://arxiv.org/html/2606.29457#S5.F3)\) uses the same estimator and mix grid on the55\-value,66\-bid auction atk=1,2,3k=1,2,3signals \(up to2\.8×1062\.8\\times 10^\{6\}histories atk=3k=3\), with the reference CFR run for400400iterations atk=1,2k\{=\}1,2and200200atk=3k=3, and five best\-response seeds per mix point\.
## References
- Bergemann and Välimäki \[2002\]Dirk Bergemann and Juuso Välimäki\.Information acquisition and efficient mechanism design\.*Econometrica*, 70\(3\):1007–1033, 2002\.
- Bichler et al\. \[2021\]Martin Bichler, Maximilian Fichtl, Stefan Heidekrüger, Nils Kohring, and Paul Sutterer\.Learning equilibria in symmetric auction games using artificial neural networks\.*Nature Machine Intelligence*, 3:687–695, 2021\.
- Brown and Sandholm \[2018\]Noam Brown and Tuomas Sandholm\.Superhuman AI for heads\-up no\-limit poker: Libratus beats top professionals\.*Science*, 359\(6374\):418–424, 2018\.
- Brown et al\. \[2019\]Noam Brown, Adam Lerer, Sam Gross, and Tuomas Sandholm\.Deep counterfactual regret minimization\.In*International Conference on Machine Learning \(ICML\)*, 2019\.arXiv:1811\.00164\.
- Brown et al\. \[2020\]Noam Brown, Anton Bakhtin, Adam Lerer, and Qucheng Gong\.Combining deep reinforcement learning and search for imperfect\-information games\.In*Advances in Neural Information Processing Systems \(NeurIPS\)*, 2020\.arXiv:2007\.13544\.
- Bulow et al\. \[1999\]Jeremy Bulow, Ming Huang, and Paul Klemperer\.Toeholds and takeovers\.*Journal of Political Economy*, 107\(3\):427–454, 1999\.
- Capen et al\. \[1971\]E\. C\. Capen, R\. V\. Clapp, and W\. M\. Campbell\.Competitive bidding in high\-risk situations\.*Journal of Petroleum Technology*, 23\(6\):641–653, 1971\.
- Chang \[2020\]Ho\-Chun Herbert Chang\.Multi\-issue bargaining with deep reinforcement learning\.*arXiv preprint arXiv:2002\.07788*, 2020\.
- Cobbe et al\. \[2021\]Karl Cobbe, Jacob Hilton, Oleg Klimov, and John Schulman\.Phasic policy gradient\.In*International Conference on Machine Learning \(ICML\)*, 2021\.arXiv:2009\.04416\.
- Compte and Jehiel \[2007\]Olivier Compte and Philippe Jehiel\.Auctions and information acquisition: Sealed Bid or Dynamic Formats?*RAND Journal of Economics*, 38\(2\):355–372, 2007\.
- Farina et al\. \[2021\]Gabriele Farina, Christian Kroer, and Tuomas Sandholm\.Faster game solving via predictive Blackwell approachability: Connecting regret matching and mirror descent\.In*Proceedings of the AAAI Conference on Artificial Intelligence*, volume 35, pages 5363–5371, 2021\.arXiv:2007\.14358\.
- Fishman \[1988\]Michael J\. Fishman\.A theory of preemptive takeover bidding\.*RAND Journal of Economics*, 19\(1\):88–101, 1988\.
- Harsanyi \[1967–1968\]John C\. Harsanyi\.Games with incomplete information played by “Bayesian” players, I–III\.*Management Science*, 14:159–182, 320–334, 486–502, 1967–1968\.
- Heinrich and Silver \[2016\]Johannes Heinrich and David Silver\.Deep reinforcement learning from self\-play in imperfect\-information games\.In*NIPS 2016 Deep Reinforcement Learning Workshop*, 2016\.arXiv:1603\.01121\.
- Kagel and Levin \[1986\]John H\. Kagel and Dan Levin\.The winner’s curse and public information in common value auctions\.*American Economic Review*, 76\(5\):894–920, 1986\.
- Klemperer \[1999\]Paul Klemperer\.Auction theory: A guide to the literature\.*Journal of Economic Surveys*, 13\(3\):227–286, 1999\.
- Kohring et al\. \[2023\]Nils Kohring, Fabian Raoul Pieroth, and Martin Bichler\.Enabling first\-order gradient\-based learning for equilibrium computation in markets\.In*Proceedings of the 40th International Conference on Machine Learning \(ICML\)*, volume 202 of*Proceedings of Machine Learning Research*, pages 17327–17342, 2023\.arXiv:2303\.09500\.
- Lanctot et al\. \[2009\]Marc Lanctot, Kevin Waugh, Martin Zinkevich, and Michael Bowling\.Monte carlo sampling for regret minimization in extensive games\.In*Advances in Neural Information Processing Systems \(NeurIPS\)*, 2009\.
- Lanctot et al\. \[2017\]Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien Pérolat, David Silver, and Thore Graepel\.A unified game\-theoretic approach to multiagent reinforcement learning\.In*Advances in Neural Information Processing Systems \(NeurIPS\)*, 2017\.arXiv:1711\.00832\.
- Lanctot et al\. \[2019\]Marc Lanctot, Edward Lockhart, Jean\-Baptiste Lespiau, et al\.Openspiel: A framework for reinforcement learning in games\.*arXiv preprint arXiv:1908\.09453*, 2019\.
- Lockhart et al\. \[2019\]Edward Lockhart, Marc Lanctot, Julien Pérolat, Jean\-Baptiste Lespiau, Dustin Morrill, Finbarr Timbers, and Karl Tuyls\.Computing approximate equilibria in sequential adversarial games by exploitability descent\.In*Proceedings of the Twenty\-Eighth International Joint Conference on Artificial Intelligence \(IJCAI\)*, 2019\.arXiv:1903\.05614\.
- McAleer et al\. \[2023\]Stephen McAleer, Gabriele Farina, Marc Lanctot, and Tuomas Sandholm\.ESCHER: Eschewing importance sampling in games by computing a history value function to estimate regret\.In*International Conference on Learning Representations \(ICLR\)*, 2023\.arXiv:2206\.04122\.
- Milgrom and Weber \[1982\]Paul R\. Milgrom and Robert J\. Weber\.A theory of auctions and competitive bidding\.*Econometrica*, 50\(5\):1089–1122, 1982\.
- Myerson and Satterthwaite \[1983\]Roger B\. Myerson and Mark A\. Satterthwaite\.Efficient mechanisms for bilateral trading\.*Journal of Economic Theory*, 29\(2\):265–281, 1983\.
- Pérolat et al\. \[2022\]Julien Pérolat, Bart De Vylder, Daniel Hennes, et al\.Mastering the game of Stratego with model\-free multiagent reinforcement learning\.*Science*, 378\(6623\):990–996, 2022\.
- Persico \[2000\]Nicola Persico\.Information acquisition in auctions\.*Econometrica*, 68\(1\):135–148, 2000\.
- Rudolph et al\. \[2026\]Max Rudolph, Nathan Lichtlé, Sobhan Mohammadpour, Alexandre Bayen, J\. Zico Kolter, Amy Zhang, Gabriele Farina, Eugene Vinitsky, and Samuel Sokota\.Reevaluating policy gradient methods for imperfect\-information games\.In*International Conference on Learning Representations \(ICLR\)*, 2026\.arXiv:2502\.08938\.
- Schulman et al\. \[2017\]John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov\.Proximal policy optimization algorithms\.*arXiv preprint arXiv:1707\.06347*, 2017\.
- Sokota et al\. \[2023\]Samuel Sokota, Ryan D’Orazio, J\. Zico Kolter, Nicolas Loizou, Marc Lanctot, Ioannis Mitliagkas, Noam Brown, and Christian Kroer\.A unified approach to reinforcement learning, quantal response equilibria, and two\-player zero\-sum games\.In*International Conference on Learning Representations \(ICLR\)*, 2023\.arXiv:2206\.05825\.
- Steinberger et al\. \[2020\]Eric Steinberger, Adam Lerer, and Noam Brown\.DREAM: Deep regret minimization with advantage baselines and model\-free learning\.*arXiv preprint arXiv:2006\.10410*, 2020\.
- Timbers et al\. \[2022\]Finbarr Timbers, Nolan Bard, Edward Lockhart, Marc Lanctot, Martin Schmid, Neil Burch, Julian Schrittwieser, Thomas Hubert, and Michael Bowling\.Approximate exploitability: Learning a best response\.In*Proceedings of the Thirty\-First International Joint Conference on Artificial Intelligence \(IJCAI\)*, pages 3487–3493, 2022\.arXiv:2004\.09677\.
- Wilson \[1977\]Robert Wilson\.A bidding model of perfect competition\.*The Review of Economic Studies*, 44\(3\):511–518, 1977\.
- Zinkevich et al\. \[2007\]Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione\.Regret minimization in games with incomplete information\.In*Advances in Neural Information Processing Systems \(NeurIPS\)*, 2007\.Similar Articles
A3M: Adaptive, Adversarial and Multi-Objective Learning for Strategic Bidding in Repeated Auctions
Introduces A3M, a framework combining adaptive deep reinforcement learning, adversarial reasoning, and multi-objective reward design for strategic bidding in repeated auctions, achieving 30-40% regret reduction.
PACT, head-to-head LLM negotiation benchmark. 20-round buyer-seller bargaining game: each round the AIs can message, the buyer submits a bid and the seller submits an ask. If bid ≥ ask, trade clears at the midpoint. Thousands of matchups.
PACT introduces a head-to-head negotiation benchmark for LLMs using a 20-round buyer-seller bargaining game to test persuasion and adaptation. Top performers include GPT-5.5 and Opus 4.7, with ratings computed via Glicko-2 on an Elo-like scale.
DRIVE: Distributional and Retrieval-Augmented Bidding with Value Evaluation
This paper introduces DRIVE, a unified Transformer-based framework for offline auto-bidding that decouples candidate action generation from decision making, combining distributional action modeling, retrieval-augmented candidate generation, and value-based evaluation to improve bidding performance under budget and cost constraints.
Generative Auto-Bidding with Unified Modeling and Exploration
This paper introduces Guide, a framework that combines a Decision Transformer with Q-value guidance and an inverse dynamics module to balance exploration and safety in automated bidding for digital advertising, demonstrating effectiveness on public datasets and simulated auctions.
A Contextual-Bandit Oversight Game with Two-Sided Informational Asymmetry
This paper introduces a contextual-bandit team game with two-sided informational asymmetry for runtime human oversight of AI agents, characterizing gaps between team-optimal and myopic human oversight strategies.