Self-Play Reinforcement Learning under Imperfect Information in Big 2
Summary
This paper presents a self-play reinforcement learning framework for the four-player imperfect-information card game Big 2, comparing policy-gradient and value-based methods and finding that PPO with entropy regularization outperforms others.
# Self-Play Reinforcement Learning under Imperfect Information in Big 2
Source: [https://arxiv.org/html/2605.28863](https://arxiv.org/html/2605.28863)
###### Abstract
Imperfect\-information multiplayer games test whether agents can act under hidden information, sparse rewards, and non\-stationary opponents\. We study these challenges in Big 2, a four\-player imperfect\-information card game\. We develop a self\-play RL framework for Big 2 that enables controlled comparisons between policy\-gradient and value\-approximating agents\. Under a common environment, input representation, training budget, and evaluation protocol, PPO outperforms Monte Carlo Q approximation, SARSA, and Q\-learning against random, greedy, and heuristic Big 2 opponents\. We further find that moderate entropy regularization improves PPO by preventing the policy from becoming overly deterministic, and that current\-policy self\-play provides a stronger finite\-budget curriculum than checkpoint self\-play or fixed\-opponent training\. Together, these results show that Big 2 is a useful controlled setting for studying deep RL under imperfect information, multiplayer interaction, delayed rewards, and variable action sets\.
Reinforcement Learning, Self\-Play, Imperfect\-Information Games, Card Games
## 1 Introduction
Games are a useful testbed for reinforcement learning \(RL\) because they provide precise rules, rewards, and evaluation protocols\. In perfect\-information games, self\-play RL and search have produced numerous successes, from AlphaGo to AlphaZero and MuZero \(Silver et al\., [2016](https://arxiv.org/html/2605.28863#bib.bib19), [2018](https://arxiv.org/html/2605.28863#bib.bib20); Schrittwieser et al\., [2020](https://arxiv.org/html/2605.28863#bib.bib21)\)\. Imperfect\-information games are harder: agents must act from partial observations, infer hidden state from public behavior, and learn under non\-stationary opponent distributions induced by self\-play\. Progress in poker, Stratego, and general game\-playing systems has shown the power of combining learning with search, regret minimization, or game\-theoretic reasoning \(Heinrich and Silver, [2016](https://arxiv.org/html/2605.28863#bib.bib3); Moravcik et al\., [2017](https://arxiv.org/html/2605.28863#bib.bib4); Brown and Sandholm, [2018](https://arxiv.org/html/2605.28863#bib.bib5); Brown et al\., [2019](https://arxiv.org/html/2605.28863#bib.bib2); Brown and Sandholm, [2019](https://arxiv.org/html/2605.28863#bib.bib6); Brown et al\., [2020](https://arxiv.org/html/2605.28863#bib.bib7); Perolat et al\., [2022](https://arxiv.org/html/2605.28863#bib.bib11); Schmid et al\., [2023](https://arxiv.org/html/2605.28863#bib.bib12)\)\. Recent work on action abstraction and policy\-gradient theory highlights the need for better understanding of learning dynamics in imperfect\-information games \(Li et al\., [2024](https://arxiv.org/html/2605.28863#bib.bib13); Liu et al\., [2025](https://arxiv.org/html/2605.28863#bib.bib14)\)\.
Multiplayer card games provide a challenge for game\-theoretic learning algorithms, as they involve hidden information, sparse terminal rewards, and action spaces that change sharply between turns of play\. For example, the game DouDizhu features three\-player competition and cooperation and a large variable action space \(Zha et al\., [2021](https://arxiv.org/html/2605.28863#bib.bib8)\), Mahjong requires reasoning about hidden information across four players \(Li et al\., [2020](https://arxiv.org/html/2605.28863#bib.bib9)\), and Pluribus showed that moving beyond heads\-up poker introduces qualitatively new strategic issues \(Brown and Sandholm, [2019](https://arxiv.org/html/2605.28863#bib.bib6)\)\.
We study Big 2, a four\-player card\-shedding game\. Each player observes only their own hand and the public play history, while the other three hands must be inferred from actions, passes, and remaining card counts\. The game’s legal actions are hand\-specific combinations such as singles, pairs, triples, straights, flushes, full houses, four\-of\-a\-kind hands, straight flushes, and passes\. Prior Big 2 work has demonstrated the difficulty of mastering the game due to multiplayer dynamics, large state and action spaces, and short\-term versus long\-term strategic tradeoffs \(Chen and Lu, [2022](https://arxiv.org/html/2605.28863#bib.bib16); Luo and Tan, [2024](https://arxiv.org/html/2605.28863#bib.bib17); Chen and Lu, [2025](https://arxiv.org/html/2605.28863#bib.bib18)\)\. Big 2 is particularly challenging because playing a strong short\-term action may greatly reduce a player’s future options or allow an opponent to take control of the game\. Therefore, the game tests whether an agent can choose the long\-term strategic action over the locally optimal action\.
Prior Big 2 agents have used self\-play PPO, Monte Carlo tree search\-based opponent prediction, Monte Carlo training with opponent modeling and action filtering, and MDP\-style decompositions of scoring, risk, prediction, and control \(Charlesworth, [2018](https://arxiv.org/html/2605.28863#bib.bib15); Chen and Lu, [2022](https://arxiv.org/html/2605.28863#bib.bib16); Luo and Tan, [2024](https://arxiv.org/html/2605.28863#bib.bib17); Chen and Lu, [2025](https://arxiv.org/html/2605.28863#bib.bib18)\)\. Our goal is complementary: we study compute\-efficient deep RL methods in the full four\-player environment, avoiding engineered opponent models, tree search, and heuristic action pruning beyond legal\-action filtering\.
Despite recent progress, there has not yet been a controlled study to investigate whether policy\-gradient or value\-based objectives learn more effectively in Big 2 under the same interface and limited training budget, and how training design choices affect stability and final performance\. We compare PPO, Monte Carlo Q approximation, SARSA, and target\-network Q\-learning under a common environment, state and action representation, architecture, training budget, and evaluation protocol\. This limited\-compute setting lets us study sample and compute efficiency rather than performance gains from scale alone\. We find that PPO performs best among the methods tested, and we analyze two training factors that substantially influence its performance: entropy regularization, which affects policy stochasticity, and opponent curriculum, which changes the learning signal\. Together, these contributions provide the first controlled empirical study of RL objectives and training design choices for Big 2, as well as an accessible baseline for future work on search, abstraction, opponent modeling, and larger training budgets\.
## 2 Game Formulation
We model Big 2 as a finite\-horizon, turn\-based, imperfect\-information game with $N=4$ players and a standard 52\-card deck\. Card ranks are ordered 3, 4, 5, 6, 7, 8, 9, 10, J, Q, K, A, 2 from lowest to highest, and suits break ties in the order diamonds $<$ clubs $<$ hearts $<$ spades\. Each player receives 13 private cards, the player holding $3\diamondsuit$ opens, and players act clockwise until one player empties their hand and wins\. A trick is the current combination that other players must beat or pass on\. Legal non\-pass tricks are singles, pairs, triples, and five\-card hands; five\-card hands are ordered as straight $<$ flush $<$ full house $<$ four\-of\-a\-kind $<$ straight flush\. A response to a single, pair, or triple must be a trick of the same category with higher value, while a response to a five\-card hand must either beat it within the same category or use a higher five\-card category\. If all other players pass after a non\-pass play, the trick is cleared and the last player to play regains control, meaning they may lead another round of play with any legal non\-pass combination\. The goal is for a player to be the first to discard all of their cards\.
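The response rule above can be sketched in a few lines. This is an illustrative sketch, not the paper's simulator: the `(category, key)` trick representation, the category names, and the assumption that `key` is an integer that already orders tricks within a category are all hypothetical.

```python
# Hypothetical trick representation: (category, key), where a larger key
# means a stronger trick within the same category.
FIVE_CARD_ORDER = ["straight", "flush", "full_house", "four_of_a_kind", "straight_flush"]
SMALL_CATEGORIES = {"single", "pair", "triple"}


def beats(response, active):
    """Return True if `response` is a legal non-pass answer to `active`."""
    r_cat, r_key = response
    a_cat, a_key = active
    if a_cat in SMALL_CATEGORIES:
        # Singles, pairs, and triples must be answered in the same category.
        return r_cat == a_cat and r_key > a_key
    if r_cat not in FIVE_CARD_ORDER:
        # A five-card trick can only be answered by another five-card trick.
        return False
    r_level = FIVE_CARD_ORDER.index(r_cat)
    a_level = FIVE_CARD_ORDER.index(a_cat)
    if r_level != a_level:
        # A higher five-card category wins regardless of the key.
        return r_level > a_level
    return r_key > a_key
```

Under these assumptions, any flush answers any straight, while a single can never answer a pair.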
## 3 Methods
### 3\.1 Game Environment
We developed a simulator whose observation state $o_i$ accurately reflects the information available to each player during the game\. At each decision point, the acting player observes their own private hand and the public game history, including the active trick, previously played cards, remaining card counts of each opponent, and the current pass count\. The simulator also returns the current legal candidate action set $\mathcal{A}(o_i)$ by enumerating legal combinations in the acting player’s hand and filtering to those that would be valid given the active trick\. This avoids invalid\-action exploration and makes policy\-gradient and value\-based methods directly comparable despite the variable action set\. Each candidate action is represented as a feature vector containing rank\-and\-suit indicator features, a bit for whether the action is a pass, the trick type, and trick rank features\. Further details are provided in Appendix [A](https://arxiv.org/html/2605.28863#A1)\.
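As a rough illustration of enumerate-then-filter, the sketch below handles only singles and pairs. It assumes the integer card encoding from Appendix A (card ID = 4 × rank index + suit index), compares pairs by their highest card, and represents a pass as the empty tuple; these choices, the function name, and the omission of triples, five-card hands, and the opening $3\diamondsuit$ constraint are simplifications, not the paper's implementation.

```python
from itertools import combinations


def legal_actions(hand, active):
    """Enumerate legal singles/pairs for `hand` (a list of card IDs).

    `active` is None when the player leads, otherwise a tuple of card IDs
    holding the single or pair that must be beaten.
    """
    singles = [(c,) for c in sorted(hand)]
    # Two cards form a pair when they share a rank index (card ID // 4).
    pairs = [p for p in combinations(sorted(hand), 2) if p[0] // 4 == p[1] // 4]
    if active is None:
        # Leading: any non-pass combination is allowed, passing is not.
        return singles + pairs
    pool = singles if len(active) == 1 else pairs
    # Responding: pass (empty tuple) plus same-category tricks that are higher.
    return [()] + [a for a in pool if max(a) > max(active)]
```

Because the function returns only legal candidates, a policy or Q-network that scores this list never needs to assign values to illegal moves.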
### 3\.2 Neural Architecture
Our architecture separately encodes the information state and the legal candidate actions, then scores each state\-action pair\. The acting player’s hand is represented as card IDs, embedded with a shared card embedding table, passed through self\-attention over the held cards, and pooled into a hand summary\. Public card sets such as the current trick, seen cards, and opponents’ played cards are embedded with the same table, concatenated with opponent card counts and the pass count, and projected into a state embedding\. Legal actions are encoded from their 80\-dimensional combination features and scored against the state embedding by a dot product, so the policy and Q\-network rank only the actions that are legal at the current decision point\. This differs notably from previous Big 2 PPO architectures, which feed a hand\-engineered 412\-bit state vector into fully connected layers and predict over a fixed 1695\-action head that is masked to legal moves \(Charlesworth, [2018](https://arxiv.org/html/2605.28863#bib.bib15)\); our model represents cards directly, allows held cards to interact before pooling, and avoids predicting scores for illegal actions\.
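The scoring step can be illustrated without a deep learning framework. The sketch below assumes the state and each legal action have already been encoded into vectors of equal length; the function names and the toy two-dimensional embeddings are hypothetical, and the encoders themselves (card embeddings, self-attention, projection) are not shown.

```python
import math


def score_actions(state_emb, action_embs):
    """Dot-product score of the state embedding against each legal action."""
    return [sum(s * a for s, a in zip(state_emb, emb)) for emb in action_embs]


def softmax(logits):
    """Numerically stable softmax over a variable-length list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: one state embedding and three legal candidate actions.
state = [1.0, 0.0]
actions = [[2.0, 5.0], [0.0, 1.0], [1.0, 1.0]]
scores = score_actions(state, actions)  # usable directly as Q-values
probs = softmax(scores)                 # usable as policy probabilities
```

The output length always equals the number of legal candidates, which is what lets the same scorer serve both the policy head (scores as logits) and the Q-network (scores as values).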
The policy network treats the state\-action scores as logits over the legal actions and adds a separate MLP as a value head\. The Q\-network uses the same state\-action scores directly as Q\-values\.
Figure 1: Neural architectures for different learning algorithms\. The PPO policy and Q\-network share the same card\-aware state encoder, action encoder, and dot product scorer\. The PPO policy network includes a value head\.
### 3\.3 Learning Algorithms
We use PPO as our policy gradient baseline to train the policy network\. We compare PPO with three value\-based algorithms that collect trajectories using $\epsilon$\-greedy action selection over legal actions and are trained using mean squared error on the predicted value of the selected action\. Our reward signal is based on the Big 2 game score: the winner of a game receives a game score equal to the sum of the remaining cards in the losers’ hands, and each loser receives a score equal to the negative of their remaining card count\. For training the value\-based algorithms only, the reward is divided by 13 as explained in Appendix [C](https://arxiv.org/html/2605.28863#A3)\.
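The game score described above is simple enough to write out directly. The following sketch follows the stated rule; the function name and the list-based interface are illustrative assumptions.

```python
def game_scores(remaining):
    """Compute per-player game scores from remaining card counts.

    `remaining[i]` is the number of cards left in player i's hand at the end
    of the game; exactly one entry is 0 (the winner).
    """
    winner = remaining.index(0)
    scores = [-float(n) for n in remaining]   # each loser: minus cards left
    scores[winner] = float(sum(remaining))    # winner: sum of losers' cards
    return scores


scores = game_scores([0, 3, 5, 1])
# Scaled reward used only for the value-based agents (division by 13).
scaled = [s / 13 for s in scores]
```

Note that the scores always sum to zero, so an agent with no edge over its opponents has an expected score of zero.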
The Monte Carlo Q variant uses the full discounted return from each model\-controlled trajectory,

$$y_t = \sum_{k=t}^{T} \gamma^{k-t} r_k.$$

The SARSA variant uses the one\-step on\-policy target,

$$y_t = r_t + \gamma\, Q_{\mathrm{target}}(o_{t+1}, a_{t+1}),$$

where $a_{t+1}$ is the next action actually selected by the behavior policy at the next model\-controlled decision point\. The Q\-learning variant uses the corresponding max target,

$$y_t = r_t + \gamma \max_{a \in \mathcal{A}(o_{t+1})} Q_{\mathrm{target}}(o_{t+1}, a).$$

For the SARSA and Q\-learning variants, a delayed target network is periodically synchronized with the online Q\-network\.
All three value\-based agents behave greedily during evaluation\.
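The three regression targets differ only in how they summarize the future. The sketch below computes each one for plain Python numbers; the function names are illustrative, and dropping the bootstrap term at terminal steps is a standard convention assumed here rather than a detail stated in the text.

```python
def monte_carlo_targets(rewards, gamma):
    """Discounted return y_t = sum_{k>=t} gamma^(k-t) r_k for every step t."""
    targets = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        targets.append(running)
    return targets[::-1]


def sarsa_target(reward, gamma, q_next_selected, terminal):
    """One-step on-policy target using the action actually taken next."""
    return reward if terminal else reward + gamma * q_next_selected


def q_learning_target(reward, gamma, q_next_all, terminal):
    """One-step target using the max over the next legal action set."""
    return reward if terminal else reward + gamma * max(q_next_all)
```

In the SARSA and Q-learning cases, `q_next_selected` and `q_next_all` would come from the delayed target network evaluated on the legal actions at the next model-controlled decision point.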
## 4 Experimental Setup
#### Training configuration\.
To study learning dynamics, we train each agent in a limited\-compute setting: 5,000 batches at 64 episodes per batch\. Across all algorithms evaluated, these 5,000 batches took between 7 hours and 13 hours to train on a single 6\-core Intel i7 laptop\.
For PPO, we use 4 PPO epochs per update, clip $\epsilon=0.2$, learning rate $3\times 10^{-5}$, $\gamma=0.99$, and $\lambda=0.95$\. We use learning rate warmup and cosine learning rate decay\. Additional implementation details for PPO and value\-based training are in Appendix [C](https://arxiv.org/html/2605.28863#A3)\.
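A warmup-plus-cosine schedule of the kind mentioned above might look like the following. Only the peak learning rate ($3\times 10^{-5}$) and the 5,000-batch budget come from the text; the linear warmup shape, the 100-step warmup length, the zero final learning rate, and the assumption of one schedule step per batch are placeholders chosen for illustration.

```python
import math


def learning_rate(step, total_steps=5000, warmup_steps=100,
                  peak_lr=3e-5, min_lr=0.0):
    """Linear warmup to `peak_lr`, then cosine decay to `min_lr`."""
    if step < warmup_steps:
        # Warmup: ramp linearly so the last warmup step reaches the peak.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    # Cosine decay: factor goes from 1 at progress=0 to 0 at progress=1.
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The warmup length and floor are the parameters most likely to differ from the actual configuration, which is described in the paper's Appendix C.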
During current\-policy self\-play, the current policy controls all four seats, and training examples are collected from every model\-controlled decision point across seats\. When training against a fixed opponent, the policy occupies one randomly chosen seat and the fixed opponent controls the other three\.
#### Evaluation protocol\.
To provide a consistent evaluation baseline, we implement three heuristic opponents of varying difficulty\. The first is a "Random" baseline, which simply chooses uniformly from the legal action set\. The second is a "Greedy" baseline, described in Algorithm [1](https://arxiv.org/html/2605.28863#alg1), which plays the weakest legal non\-pass combination available\. The third is a "Smart" baseline, described in Algorithm [2](https://arxiv.org/html/2605.28863#alg2), which is a stronger hand\-aware rule\-based policy\. It scores each legal non\-pass action using lightweight strategic features, including immediate wins, number of cards shed, and whether it leaves low orphan singles\. It only passes in narrow situations, such as to avoid expensive early use of 2s and conserve valuable five\-card combinations\.
At evaluation time, we roll out 1,000 four\-player games where one seat is held by the agent being evaluated, and the other three seats are held by players of a certain opponent class\. The evaluated agent’s seat is randomized across games, and all reported metrics are averaged over these seat\-randomized deals\. We track win rate and the average game score \(reward\) against each opponent class\. Because four equally skilled players would each win 25% of games and the game score is zero\-sum, we consider an agent successful against an opponent pool when its win rate exceeds 25% and its average score is positive\. Unless otherwise noted, each reported evaluation result is from a single training and evaluation seed\. Because our result tables do not report uncertainty across independent seeds, we interpret small differences cautiously\.
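The two reported metrics and the success criterion can be computed from per-game outcomes as in this small sketch. The `(won, score)` tuple format and the function name are hypothetical; the 25% and zero thresholds are the ones stated above.

```python
def summarize(results):
    """Aggregate evaluation games for the evaluated seat.

    `results` is a list of (won, score) tuples, one per game, where `won`
    is a bool and `score` is the evaluated agent's game score.
    Returns (win_rate, average_score, successful).
    """
    n = len(results)
    win_rate = sum(1 for won, _ in results if won) / n
    avg_score = sum(score for _, score in results) / n
    # Success: better than the 25% chance win rate and a positive mean score.
    successful = win_rate > 0.25 and avg_score > 0
    return win_rate, avg_score, successful
```

For example, two wins worth 9 and 6 points and two losses costing 3 and 2 points give a 50% win rate and an average score of 2.5.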
## 5 Results
### 5\.1 PPO outperforms value\-based methods under self\-play
Figure 2: Learning curves for PPO, Monte Carlo Q, SARSA, and Q\-learning in four\-player Big 2\. Each checkpoint is evaluated against fixed random, greedy, and smart heuristic opponent pools\. We report win rate and average score for the model\-controlled seat\.

Figure [2](https://arxiv.org/html/2605.28863#S5.F2) compares the training dynamics of the policy\-gradient and value\-based agents under a shared simulator, representation, and evaluation protocol\. PPO is the strongest and most consistent method over the training budget analyzed, and Table [1](https://arxiv.org/html/2605.28863#S5.T1) shows that it has the best final\-checkpoint win rate and average score against all three opponent classes\. It improves rapidly early in training and remains competitive across all three opponent pools, with especially clear gains against the greedy and smart heuristic opponents\. This finding is consistent with prior Big 2 work showing that self\-play PPO can learn robust strategies in the game \(Charlesworth, [2018](https://arxiv.org/html/2605.28863#bib.bib15)\), but our comparison extends that observation by evaluating PPO alongside multiple value\-based alternatives under the same legal\-candidate scoring interface and compute budget\.
The value\-based methods learn useful policies, but they do not match PPO’s overall performance within the same training horizon\. Among these methods, Monte Carlo Q is the strongest final\-checkpoint value\-based baseline, outperforming SARSA and Q\-learning against the greedy and smart opponents\. The gap between PPO and the value\-based methods contrasts with DouZero, where Monte Carlo value approximation was highly effective for DouDizhu self\-play \(Zha et al\., [2021](https://arxiv.org/html/2605.28863#bib.bib8)\), and with Big 2 DMC variants that achieve strong performance through longer training, opponent modeling, and action\-set filtering \(Luo and Tan, [2024](https://arxiv.org/html/2605.28863#bib.bib17)\)\. One plausible explanation is that Big 2’s large, four\-player, hidden\-information state space makes value estimation slow to stabilize: strategically important states are visited rarely, and terminal returns must assign credit across long sequences of combinatorial actions\. Under this interpretation, PPO’s clipped policy\-gradient objective and value baseline provide a more sample\-efficient learning signal for the training budgets we study, while Monte Carlo Q may require longer training or additional structure such as opponent modeling and action pruning to close the gap\.
The results also suggest that self\-play is not merely overfitting to a single opponent distribution\. Although training uses self\-play rather than direct supervised imitation of the evaluation opponents, performance improves against Random, Greedy, and Smart heuristics\. This cross\-opponent improvement suggests that the agents, and PPO in particular, learn transferable Big 2 strategies\.
Table 1: Win rates and average score across algorithms\.
### 5\.2 Moderate entropy regularization improves PPO
The PPO results in Section [5\.1](https://arxiv.org/html/2605.28863#S5.SS1) were obtained from a run with no entropy regularization\. After inspecting model outputs, we found that the average policy entropy decreased steadily over training, as shown in Appendix Figure [3](https://arxiv.org/html/2605.28863#A4.F3)\. Evaluation rollouts confirmed this: the model often sampled its top action with at least 90% probability, including at ambiguous but strategically important decision points such as the first trick of the game\. This behavior suggested that the lack of entropy regularization may have made the policy too deterministic\. In an imperfect\-information game such as Big 2, stochastic policies may perform better because they account for uncertainty about hidden information and avoid becoming predictable\.
We therefore ablate the effect of explicitly encouraging stochasticity in PPO\. For these runs, we add the standard entropy term to the PPO minimization objective,

$$L_{\mathrm{PPO}} = L_{\mathrm{policy}} + c_v L_{\mathrm{value}} - \beta_{\mathrm{ent}}\, \mathbb{E}_{o}\left[H\left(\pi(\cdot \mid o)\right)\right],$$

where $\beta_{\mathrm{ent}}$ controls the strength of the entropy incentive\. Appendix Figure [3](https://arxiv.org/html/2605.28863#A4.F3) shows that increasing $\beta_{\mathrm{ent}}$ does in fact make the trained policy maintain stochasticity throughout training\.
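To make the sign convention concrete, the sketch below evaluates this objective for a single decision point using plain Python numbers. The clip range of 0.2 and $\beta_{\mathrm{ent}}=0.05$ are values mentioned in the text; the value coefficient `c_v=0.5`, the squared-error value loss, and the function names are assumptions for illustration, and a real implementation would average over a batch and differentiate through a neural network.

```python
import math


def entropy(probs):
    """Shannon entropy H(pi) of a categorical distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def ppo_loss(ratio, advantage, value_pred, value_target, probs,
             clip_eps=0.2, c_v=0.5, beta_ent=0.05):
    """Single-sample PPO loss to be minimized, with an entropy bonus."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Clipped surrogate objective, negated so that lower is better.
    policy_loss = -min(ratio * advantage, clipped_ratio * advantage)
    value_loss = (value_pred - value_target) ** 2
    # Subtracting the entropy term rewards more stochastic policies.
    return policy_loss + c_v * value_loss - beta_ent * entropy(probs)
```

Because the entropy term is subtracted, a uniform policy over two legal actions receives a lower loss than a deterministic one when everything else is equal, which is the pressure toward stochasticity that the ablation varies.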
Table [2](https://arxiv.org/html/2605.28863#S5.T2) reports the final performance of PPO agents trained using different entropy incentives\. The setting $\beta_{\mathrm{ent}}=0.05$ achieves the best performance, suggesting that moderate entropy regularization helps, but only up to a point: too strong an entropy incentive competes with learning better policies and can lead the model to take suboptimal actions\.
Table 2: Results of PPO entropy ablation\.
### 5\.3 Current\-policy self\-play outperforms alternative curricula
We next ablate the opponent distribution used during training\. The default setting trains against the current policy, which exposes the learner to an opponent distribution that changes as the agent improves\. We compare this setting to checkpoint self\-play, where opponents are sampled from earlier saved policies in the same training run, and to a fixed\-opponent curriculum in which the learning agent plays only against the deterministic Smart strategy\. Results use $\beta_{\mathrm{ent}}=0.05$ from above\.
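The three curricula differ only in who fills the opposing seats. The sketch below is one hypothetical way to express that choice: the mode names, the uniform and independent sampling of checkpoints per seat, and the fallback to the current policy when no checkpoint exists yet are all assumptions rather than details from the paper.

```python
import random


def choose_opponents(mode, current, checkpoints, smart, rng):
    """Pick the three opponents for one training game.

    `current`, `smart`, and the entries of `checkpoints` are opaque policy
    handles; `rng` is a `random.Random` instance.
    """
    if mode == "current":
        # Current-policy self-play: the learner also controls the other seats.
        return [current] * 3
    if mode == "checkpoint":
        # Checkpoint self-play: sample earlier saved policies from this run.
        pool = checkpoints if checkpoints else [current]
        return [rng.choice(pool) for _ in range(3)]
    if mode == "fixed":
        # Fixed-opponent curriculum: always the rule-based Smart strategy.
        return [smart] * 3
    raise ValueError(f"unknown curriculum mode: {mode}")
```

Keeping the choice in one function makes it straightforward to hold everything else in the training loop fixed while varying only the opponent distribution.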
Table [3](https://arxiv.org/html/2605.28863#S5.T3) shows that current\-policy self\-play performs best for both PPO and Monte Carlo Q under the training budgets we study\. This result is somewhat counterintuitive for evaluation against the Smart opponent\. In the limit of sufficient exploration, data, representation capacity, and optimization, a Smart\-only curriculum could in principle learn a best response to Smart\.
However, we hypothesize that training only against Smart produces a narrower distribution of states and legal action sets because the deterministic opponent repeatedly drives games through the parts of the game tree favored by its heuristic\. This can reduce exploration and action\-value coverage, especially for Monte Carlo Q, where terminal returns provide high\-variance labels only for the actions actually sampled\. Current\-policy self\-play, by contrast, acts as a moving curriculum: the agent sees weak opponents early and increasingly stronger opponents as its own policy improves\. This keeps the opponent distribution close to the learner’s current skill level, so rollouts tend to expose mistakes that are still relevant to the current policy\.
Checkpoint self\-play also adds opponent diversity, but it weakens this adaptive pressure\. Older checkpoints can represent behaviors that the current policy has already learned to beat, so part of the training budget is spent collecting gradients against stale mistakes rather than against the learner’s present strategic weaknesses\. The checkpoint pool therefore trades the sharper learning signal from current\-policy opponents for broader but less targeted opponent coverage\. Under longer training this diversity may improve robustness, but in our limited\-training\-budget setting the diluted learning signal is slightly worse than training directly against the current policy\.
Table 3: Opponent\-curriculum ablation\. Entries show win rate and average score\.
## 6 Discussion
Our results show that direct deep RL can learn useful policies for Big 2, and that algorithm choice matters considerably in the limited\-compute setting\. In our evaluation, PPO outperforms Monte Carlo Q\-approximation, SARSA, and target\-network Q\-learning under the same simulator, architecture, and training budget\. One possible explanation is that value\-based methods must estimate noisy, delayed returns for many rare state\-action pairs, while PPO can improve the policy from trajectory\-level advantage estimates before the value landscape has fully stabilized\. In a high\-variance, imperfect\-information, multiplayer game with non\-stationary self\-play rewards, this makes value approximation slower to converge and less competitive within the training budget we study\.
We also find that controlled stochasticity improves policy learning\. PPO without entropy regularization becomes increasingly deterministic, while an intermediate entropy incentive improves performance against Random, Greedy, and Smart opponents\. This suggests that imperfect\-information card games reward stochastic policies: they encourage exploration during training and give the agent higher success when acting under uncertainty\. A prematurely deterministic policy can over\-commit to suboptimal action preferences\. Our results also show that excessive entropy can make it difficult to exploit learned strategies, meaning that RL approaches must tune the entropy hyperparameter carefully\.
The opponent curriculum results show that current\-policy self\-play is more effective than self\-play against previous policy checkpoints or training against the Smart strategy, even when evaluating against the Smart strategy\. This result suggests that the best way to exploit a heuristic opponent is not necessarily to train only against that opponent\. A fixed deterministic opponent exposes the learner to a narrow slice of the game tree, while current\-policy self\-play creates an adaptive curriculum whose difficulty tracks the agent’s own progress\. Checkpoint self\-play adds diversity, but under a short training horizon it can dilute the learning signal by spending experience on older opponents the current policy may already beat\.
These findings make Big 2 a useful benchmark for studying RL in imperfect\-information games\. By holding the environment, action representation, architecture, and evaluation protocol fixed, we isolate factors that are often confounded in larger game\-playing systems: algorithm choice, policy stochasticity, and opponent distribution\. These findings are relevant to real\-world multi\-agent RL settings, which involve partial observation, delayed rewards, changing opponents, and limited training budgets\. Our study is still narrower than the full space of imperfect\-information methods: we do not compare against CFR or Deep CFR \(Zinkevich et al\., [2007](https://arxiv.org/html/2605.28863#bib.bib1); Brown et al\., [2019](https://arxiv.org/html/2605.28863#bib.bib2)\), nor against search\-augmented or opponent\-modeling agents\. Future work should test whether those methods improve robustness in Big 2, whether value\-based methods close the gap with longer training, and whether cross\-play among independently trained agents reveals additional strategic weaknesses\.
## Impact Statement
This paper presents work whose goal is to advance reinforcement learning for multiplayer imperfect\-information games\. We do not deploy the system in real\-world decision\-making settings; potential societal concerns are limited to the general risks of game\-playing AI and strategic agents\.
## References
- N\. Brown, A\. Lerer, S\. Gross, and T\. Sandholm \(2019\) Deep counterfactual regret minimization\. In Proceedings of the 36th International Conference on Machine Learning, pp\. 793–802\.
- N\. Brown, A\. Lerer, S\. Gross, and T\. Sandholm \(2020\) Combining deep reinforcement learning and search for imperfect\-information games\. In Advances in Neural Information Processing Systems, Vol\. 33, pp\. 17057–17069\.
- N\. Brown and T\. Sandholm \(2018\) Superhuman AI for heads\-up no\-limit poker: Libratus beats top professionals\. Science 359\(6374\), pp\. 418–424\.
- N\. Brown and T\. Sandholm \(2019\) Superhuman AI for multiplayer poker\. Science 365\(6456\), pp\. 885–890\.
- H\. Charlesworth \(2018\) A self\-play reinforcement learning approach to Big2\. arXiv:1808\.10442, [Link](https://arxiv.org/abs/1808.10442)\.
- L\. Chen and Y\. Lu \(2022\) Challenging artificial intelligence with multiopponent and multimovement prediction for the card game Big2\. IEEE Access 10, pp\. 40661–40676\. DOI: [10\.1109/ACCESS\.2022\.3166932](https://dx.doi.org/10.1109/ACCESS.2022.3166932)\.
- L\. Chen and Y\. Lu \(2025\) Markov decision process\-based artificial intelligence with card\-playing strategy and free\-playing right exploration for four\-player card game Big2\. IEEE Transactions on Games 17\(2\), pp\. 267–281\. DOI: [10\.1109/TG\.2024\.3424431](https://dx.doi.org/10.1109/TG.2024.3424431)\.
- J\. Heinrich and D\. Silver \(2016\) Deep reinforcement learning from self\-play in imperfect\-information games\. arXiv:1603\.01121, [Link](https://arxiv.org/abs/1603.01121)\.
- B\. Li, Z\. Fang, and L\. Huang \(2024\) RL\-CFR: improving action abstraction for imperfect information extensive\-form games with reinforcement learning\. In Proceedings of the 41st International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol\. 235, pp\. 27752–27770\. [Link](https://proceedings.mlr.press/v235/li24t.html)\.
- J\. Li, S\. Koyamada, Q\. Ye, G\. Liu, C\. Wang, R\. Yang, L\. Zhao, T\. Qin, T\. Liu, and H\. Hon \(2020\) Suphx: mastering Mahjong with deep reinforcement learning\. arXiv:2003\.13590, [Link](https://arxiv.org/abs/2003.13590)\.
- M\. Liu, G\. Farina, and A\. E\. Ozdaglar \(2025\) A policy\-gradient approach to solving imperfect\-information games with best\-iterate convergence\. In International Conference on Learning Representations\. [Link](https://openreview.net/forum?id=ZW4MRZrmSA)\.
- Q\. Luo and T\. Tan \(2024\) Improved learning efficiency of deep Monte\-Carlo for complex imperfect\-information card games\. Applied Soft Computing 158, pp\. 111545\. DOI: [10\.1016/j\.asoc\.2024\.111545](https://dx.doi.org/10.1016/j.asoc.2024.111545)\.
- M\. Moravcik, M\. Schmid, N\. Burch, V\. Lisy, D\. Morrill, N\. Bard, T\. Davis, K\. Waugh, M\. Johanson, and M\. Bowling \(2017\) DeepStack: expert\-level artificial intelligence in heads\-up no\-limit poker\. Science 356\(6337\), pp\. 508–513\.
- J\. Perolat, B\. De Vylder, D\. Hennes, E\. Tarassov, et al\. \(2022\) Mastering the game of Stratego with model\-free multiagent reinforcement learning\. Science 378\(6623\), pp\. 990–996\. DOI: [10\.1126/science\.add4679](https://dx.doi.org/10.1126/science.add4679)\.
- M\. Schmid, M\. Moravcik, N\. Burch, R\. Kadlec, J\. Davidson, K\. Waugh, N\. Bard, F\. Timbers, M\. Lanctot, G\. Z\. Holland, E\. Davoodi, A\. Christianson, and M\. Bowling \(2023\) Student of games: a unified learning algorithm for both perfect and imperfect information games\. Science Advances 9\(46\), pp\. eadg3256\. DOI: [10\.1126/sciadv\.adg3256](https://dx.doi.org/10.1126/sciadv.adg3256)\.
- J\. Schrittwieser, I\. Antonoglou, T\. Hubert, K\. Simonyan, L\. Sifre, S\. Schmitt, A\. Guez, E\. Lockhart, D\. Hassabis, T\. Graepel, T\. Lillicrap, and D\. Silver \(2020\) Mastering Atari, Go, chess and shogi by planning with a learned model\. Nature 588\(7839\), pp\. 604–609\.
- D\. Silver, A\. Huang, C\. J\. Maddison, A\. Guez, L\. Sifre, G\. van den Driessche, J\. Schrittwieser, I\. Antonoglou, V\. Panneershelvam, M\. Lanctot, S\. Dieleman, D\. Grewe, J\. Nham, N\. Kalchbrenner, I\. Sutskever, T\. Lillicrap, M\. Leach, K\. Kavukcuoglu, T\. Graepel, and D\. Hassabis \(2016\) Mastering the game of Go with deep neural networks and tree search\. Nature 529\(7587\), pp\. 484–489\.
- D\. Silver, T\. Hubert, J\. Schrittwieser, I\. Antonoglou, M\. Lai, A\. Guez, M\. Lanctot, L\. Sifre, D\. Kumaran, T\. Graepel, T\. Lillicrap, K\. Simonyan, and D\. Hassabis \(2018\) A general reinforcement learning algorithm that masters chess, shogi, and Go through self\-play\. Science 362\(6419\), pp\. 1140–1144\.
- D\. Zha, J\. Xie, W\. Ma, S\. Zhong, J\. Liu, J\. Hu, P\. Zhang, H\. Liu, X\. Gao, J\. Wu, and Y\. Guo \(2021\) DouZero: mastering DouDizhu with self\-play deep reinforcement learning\. In Proceedings of the 38th International Conference on Machine Learning, pp\. 12333–12344\.
- M\. Zinkevich, M\. Johanson, M\. Bowling, and C\. Piccione \(2007\) Regret minimization in games with incomplete information\. In Advances in Neural Information Processing Systems, Vol\. 20\.
## Appendix A Big 2 Game Details
### A.1 Game Definition
The simulator represents each card as an integer in $\{0,\ldots,51\}$, ordered first by rank and then by suit. Ranks increase in the order 3, 4, 5, 6, 7, 8, 9, 10, J, Q, K, A, 2, and suits increase as diamonds $<$ clubs $<$ hearts $<$ spades. Thus card $0$ is $3\diamondsuit$, the lowest card in the game.
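As a minimal sketch of this encoding, the rank and suit can be recovered from a card id by integer division and remainder. The formula `card = 4 * rank + suit` is an assumption that follows from ordering by rank first and suit second; the names `RANKS`, `SUITS`, `card_rank`, `card_suit`, and `card_name` are illustrative and not taken from the simulator.

```python
# Rank and suit labels in increasing order, as described in the text.
RANKS = ["3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A", "2"]
SUITS = ["♢", "♣", "♡", "♠"]  # diamonds < clubs < hearts < spades


def card_rank(card: int) -> int:
    """Rank index in 0..12 (assumes card = 4 * rank + suit)."""
    return card // 4


def card_suit(card: int) -> int:
    """Suit index in 0..3 (assumes card = 4 * rank + suit)."""
    return card % 4


def card_name(card: int) -> str:
    """Human-readable label such as '3♢' for card 0."""
    return RANKS[card_rank(card)] + SUITS[card_suit(card)]
```

Under this assumption, comparing two single cards reduces to comparing their integer ids.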
At the beginning of an episode, the deck is uniformly shuffled and dealt evenly, giving each player 13 private cards. The player holding $3\diamondsuit$ acts first and must include that card in the opening play. Players then act clockwise until one player empties their hand. The first player to empty their hand is the winner, and the episode terminates immediately.
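The deal and the choice of starting player can be sketched as follows. This is an illustration under stated assumptions, not the simulator's code: the function name `deal` and the use of `random.Random` for shuffling are hypothetical.

```python
import random
from typing import List, Tuple


def deal(rng: random.Random) -> Tuple[List[List[int]], int]:
    """Shuffle a 52-card deck, deal 13 cards to each of 4 players,
    and return the hands together with the seat holding card 0 (3 of diamonds)."""
    deck = list(range(52))
    rng.shuffle(deck)
    hands = [sorted(deck[13 * p: 13 * (p + 1)]) for p in range(4)]
    starter = next(p for p, hand in enumerate(hands) if 0 in hand)
    return hands, starter
```

The returned `starter` seat must include card 0 in its opening play; enforcing that constraint belongs to legal-action generation and is not shown here.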
### A.2 State, Observations, and Information
The full simulator state consists of each player’s private hand, the current player index, the current active trick, the set of public cards that have already been played, the number of consecutive passes, and the cards played by each player. This full state is not observed by any learning agent. Instead, at a decision point for player $i$, the environment returns an information-state observation containing:
- player $i$’s current hand, padded to length 13 with pad integers;
- a 52-dimensional indicator for the current active trick, if one exists;
- a 52-dimensional indicator for all cards seen so far;
- the remaining card counts for the other players in clockwise order;
- the current number of consecutive passes;
- per-opponent 52-dimensional indicators for cards that each opponent has already played.
For the standard four-player game, this produces a fixed-length observation vector of dimension

$$13+52+52+3+1+3\cdot 52=277.$$

The observation therefore combines the acting player’s private hand with public history, but never exposes the unplayed cards in opponents’ hands.
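A minimal sketch of how such an observation could be assembled is shown below. The pad value `PAD = 52`, the field order, and the function names are assumptions made for illustration; only the field sizes and the total of 277 come from the text.

```python
PAD = 52  # hypothetical pad id, chosen outside the valid card range 0..51


def indicator(cards):
    """52-dimensional 0/1 vector marking the given card ids."""
    vec = [0] * 52
    for c in cards:
        vec[c] = 1
    return vec


def build_observation(hand, trick, seen, opp_counts, passes, opp_played):
    """Concatenate the observation fields for the acting player.

    hand:       card ids held by the acting player (at most 13)
    trick:      card ids in the active trick (empty if none)
    seen:       all card ids played so far
    opp_counts: remaining card counts of the 3 opponents, clockwise
    passes:     current number of consecutive passes
    opp_played: list of 3 collections of card ids, one per opponent
    """
    padded_hand = sorted(hand) + [PAD] * (13 - len(hand))
    obs = padded_hand + indicator(trick) + indicator(seen)
    obs += list(opp_counts) + [passes]
    for played in opp_played:
        obs += indicator(played)
    return obs
```

With three opponents, the result has 13 + 52 + 52 + 3 + 1 + 156 = 277 entries, matching the dimension stated above.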
### A.3 Actions and Legal Candidate Generation
At every decision point, the simulator enumerates all candidate combinations in the acting player’s current hand. It then filters this set according to the active trick. If there is no active trick, the player has control and may lead any non-pass legal combination. If the active trick is a single, pair, or triple, the player may only play the same type with a higher comparison key. If the active trick is a five-card hand, the player may play a stronger hand in the same category or any legal five-card hand in a higher category. Passing is legal only when there is an active non-pass trick. This produces a variable-length legal action set $\mathcal{A}(o_i)$ for each observation $o_i$.
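The enumerate-then-filter procedure can be sketched for singles, pairs, and triples as follows. This is a simplified illustration: five-card hands are omitted, the card encoding `rank = card // 4` is assumed, the comparison key is assumed to be the highest card id in the combination, and `PASS`, `same_rank_combos`, and `legal_actions` are hypothetical names.

```python
from itertools import combinations

PASS = ()  # hypothetical encoding of the pass action (empty tuple)


def same_rank_combos(hand, size):
    """All combinations of `size` cards that share one rank."""
    by_rank = {}
    for c in hand:
        by_rank.setdefault(c // 4, []).append(c)  # assumes rank = card // 4
    out = []
    for cards in by_rank.values():
        out.extend(combinations(sorted(cards), size))
    return out


def legal_actions(hand, trick):
    """Legal singles, pairs, and triples; five-card hands are not handled."""
    if not trick:
        # The player has control: any combination may be led, passing is illegal.
        return [a for size in (1, 2, 3) for a in same_rank_combos(hand, size)]
    size = len(trick)
    if size not in (1, 2, 3):
        raise NotImplementedError("five-card categories are omitted in this sketch")
    # Same type only, with a strictly higher comparison key (assumed: max card id).
    beats = [a for a in same_rank_combos(hand, size) if max(a) > max(trick)]
    return [PASS] + beats
```

Because the candidates are derived from the exact cards held, the size of the returned list varies from one decision point to the next, which is why the policy scores a variable-length candidate set rather than a fixed action vector.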
Big 2’s action space is structurally combinatorial in a way that differs from poker\-style betting games\. In no\-limit poker, much of the action\-space challenge comes from abstracting over bet sizes\. In Big 2, actions are subsets of the player’s private hand, and the same card can participate in many incompatible future combinations\. Furthermore, each card in the hand is different as both rank and suit matter\. This makes fixed action abstraction difficult, because the useful action set depends heavily on the exact cards in the current hand and on the current trick\.
This structure makes even local decisions strategically ambiguous\. When a player has control and may lead a new trick, the legal set can contain several qualitatively different plans: low singles that probe opponents’ responses, pairs or triples that shed duplicated ranks, and five\-card hands that may either unlock or destroy future structure\. The best lead is therefore not determined only by immediate combination strength\. A high card or strong five\-card hand can win control now, but spending it may remove the player’s only answer to a later threat; conversely, saving it may allow an opponent to seize control\. Strong play also depends on hidden\-hand inference\. Because opponents’ hands are observed only through their plays, passes, and remaining card counts, a player must avoid playing tricks that are likely to match an opponent’s remaining cards, such as opening a pair or five\-card category that lets an opponent shed an otherwise difficult holding\.
To quantify this structure, we sampled 10,000 complete games using random legal play, yielding 752,677 decision points\. The visited legal action count is highly state\-dependent: many response states are tightly constrained, but the 99th percentile has 19 legal actions and the largest observed decision has 132 legal actions\. The branching factor is especially large when a player has control, where the mean legal action count is 8\.1 and the 95th percentile is 20\.
### A.4 Transition Dynamics
When a player plays a non\-pass combination, the simulator removes those cards from the player’s hand, marks them as seen, records them in that player’s public played\-card history, and sets the active trick to that combination\. When a player passes, the consecutive\-pass counter increases\. If all other players pass after a non\-pass play, the active trick is cleared and the last player who played a non\-pass combination takes control on the next turn\.
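The transition rule can be sketched as a small state update. This is an illustrative reading of the paragraph, not the simulator's implementation: the class `TrickState`, the function `step`, the empty-tuple pass encoding, and the choice to reset the pass counter on every non-pass play are assumptions, and terminal detection is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class TrickState:
    hands: list                      # one list of card ids per player
    current: int = 0                 # index of the player to act
    trick: tuple = ()                # active trick; empty means no active trick
    last_player: int = -1            # last player who made a non-pass play
    passes: int = 0                  # consecutive passes
    seen: set = field(default_factory=set)
    played: list = field(default_factory=lambda: [set() for _ in range(4)])


def step(state: TrickState, action: tuple) -> TrickState:
    """Apply one action; an empty tuple denotes a pass."""
    n = len(state.hands)
    if action:
        me = state.current
        state.hands[me] = [c for c in state.hands[me] if c not in action]
        state.seen.update(action)        # mark cards as publicly seen
        state.played[me].update(action)  # per-player public history
        state.trick, state.last_player, state.passes = tuple(action), me, 0
        state.current = (me + 1) % n
    else:
        state.passes += 1
        if state.passes == n - 1:
            # Everyone else passed: clear the trick and return control.
            state.trick, state.passes = (), 0
            state.current = state.last_player
        else:
            state.current = (state.current + 1) % n
    return state
```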
## Appendix B Architectural Details
The state encoder uses a shared card embedding table for all card\-valued observation fields\. The acting player’s hand is represented as a padded list of card ids, encoded with masked self\-attention, and pooled over valid cards\. The current trick, the set of seen cards, and each opponent’s public played\-card history are represented as 52\-dimensional indicators and projected through the same card embedding table before being passed through small feed\-forward encoders\. The remaining\-card counts for the other players and the current pass count are encoded separately and concatenated with the pooled card features\. A projection, layer normalization, and residual feed\-forward block produce the final state embedding\.
Each legal action is represented by an 80-dimensional candidate feature vector containing the cards involved in the trick and the features of the trick itself, and it is encoded by a two-layer multilayer perceptron. A learned linear projection maps the state embedding into the action-embedding space, and a scaled dot product produces one scalar per legal candidate. In the PPO policy this scalar is a logit; in the Q-network it is $Q(o_i,a)$. The PPO policy additionally applies a multilayer value head to the state embedding to estimate $V(o_i)$.
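The scoring step can be illustrated with plain Python lists. This sketch assumes the state embedding has already been projected into the action-embedding space and that the scale factor is the square root of the embedding dimension; the function names are hypothetical and the encoders themselves are not shown.

```python
import math


def score_candidates(state_emb, action_embs):
    """Scaled dot product between one state embedding and each candidate embedding."""
    scale = math.sqrt(len(state_emb))
    return [sum(s * a for s, a in zip(state_emb, emb)) / scale for emb in action_embs]


def softmax(logits):
    """Numerically stable softmax over a variable-length list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

For PPO, `softmax(score_candidates(...))` would give a distribution over exactly the legal candidates, so no masking of illegal actions is needed; for the Q-network, the raw scores would be read as action values.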
## Appendix C Training Implementation Details
#### Rollout ownership and seating\.
Each training episode is a full four-player game. A model-controlled seat is a seat whose action is selected by the learned policy or Q-network and whose decision records are used for optimization. In current-policy self-play, the same parameterized policy controls all four seats. Gradients are aggregated from every model-controlled decision point. In fixed-opponent training, a learner seat is sampled for each episode and the remaining seats use the fixed opponent. In checkpoint self-play, non-learner seats are sampled from the current policy or up to 20 saved checkpoints according to the stated mixture, but only learner/model-controlled seats contribute stored training records. Seat assignment is randomized independently of the random deal, and the player holding $3\diamondsuit$ still starts the game.
#### Rewards\.
Rewards are assigned from each model\-controlled seat’s own perspective\. Evaluation uses the unshaped terminal Big 2 score: the winner receives the number of cards left in the other hands, and each loser receives the negative number of cards left in their own hand\. Under the score\-defined environment reward, intermediate rewards are zero and the terminal score is assigned only when the game ends\.
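The terminal score described above can be written as a short function. The name `terminal_scores` and the list-based interface are illustrative assumptions.

```python
def terminal_scores(cards_left):
    """Per-seat terminal score given each seat's remaining card count.

    Exactly one entry is expected to be 0 (the winner). The winner receives
    the total number of cards left in the other hands; each loser receives
    the negative of their own remaining count.
    """
    winner = cards_left.index(0)
    total = sum(cards_left)
    return [total if seat == winner else -n for seat, n in enumerate(cards_left)]
```

Under this definition the scores sum to zero across seats, so one player's gain is exactly the others' combined loss.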
#### PPO training details\.
PPO uses generalized advantage estimation with $\lambda=0.95$ and $\gamma=0.99$. Advantages are normalized within the update batch. The value loss coefficient is $c_v=0.5$; the entropy coefficient is the $\beta_{\mathrm{ent}}$ reported for each PPO condition, with $\beta_{\mathrm{ent}}=0$ in the no-entropy main comparison. The implementation uses clipped policy ratios with $\epsilon=0.2$, clipped value loss with the same clipping range, and global gradient-norm clipping at $0.5$. Training involves 64 full games per batch, and PPO uses a minibatch size of $256$.
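For reference, generalized advantage estimation and the clipped surrogate can be sketched in a few lines. This follows the standard PPO formulation with the hyperparameters above; it assumes a single seat's trajectory that ends at a terminal state (bootstrap value 0), and the function names are illustrative rather than taken from the paper's code.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one terminated trajectory.

    rewards[t] and values[t] refer to the t-th decision of a single seat;
    the value after the final step is taken to be 0.
    """
    advantages = [0.0] * len(rewards)
    next_value, running = 0.0, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO objective: min of the unclipped and clipped terms."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)
```

Because intermediate rewards are zero under the score-defined reward, every nonzero TD residual before the final step comes from changes in the value estimate rather than from the reward itself.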
#### Value\-based training details\.
The value-based agents use Adam with learning rate $3\times 10^{-5}$, $\gamma=0.99$, 64 full games per batch, and one optimizer update per collected batch. They do not use a replay buffer; updates are on-policy with respect to the $\epsilon$-greedy behavior policy used to collect that batch. Epsilon decays linearly from $0.5$ to $0$ over training. SARSA and Q-learning use a delayed target network synchronized every 10 batches. The loss is mean squared error on the selected action’s Q value. Gradients are clipped to norm $1.0$. Terminal rewards are divided by 13 (the number of cards per hand) to compress into a more natural range for the dot-product scorer.
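The schedule, reward scaling, and bootstrap targets can be sketched as follows. This is a minimal illustration under stated assumptions: decay is indexed by training batch, `next_q` stands for target-network Q values over the legal actions at the seat's next decision point, and all function names are hypothetical.

```python
def epsilon_at(batch, total_batches, start=0.5, end=0.0):
    """Linear epsilon decay from `start` to `end` over training (assumed per batch)."""
    frac = min(1.0, batch / max(1, total_batches))
    return start + frac * (end - start)


def scaled_reward(score, cards_per_hand=13):
    """Terminal score divided by the number of cards per hand."""
    return score / cards_per_hand


def td_target(reward, next_q, next_action, done, gamma=0.99, method="q_learning"):
    """One-step target for the selected action's Q value.

    next_q:      target-network Q values for the legal actions at the next decision
    next_action: index actually chosen at the next decision (used by SARSA)
    """
    if done or not next_q:
        return reward
    if method == "sarsa":
        bootstrap = next_q[next_action]   # on-policy: value of the action taken
    else:
        bootstrap = max(next_q)           # Q-learning: greedy value
    return reward + gamma * bootstrap
```

The Monte Carlo variant would instead regress toward the observed discounted return, with no bootstrap term.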
## Appendix D PPO Entropy Ablation
Figure 3: Average policy entropy during PPO current-policy self-play training for different entropy coefficients. Entropy is sampled every 100 training batches. Larger entropy coefficients keep the policy more stochastic over the course of training, while the run with no entropy bonus becomes increasingly deterministic.
## Appendix E Heuristic Baselines
### E.1 Greedy Heuristic Baseline
The greedy baseline is a deterministic rule-based policy used as a simple non-learning opponent. If the player is forced to take the only legal action, the policy returns it. Otherwise, it excludes Pass and chooses the minimum non-pass candidate under the simulator’s combination ordering. This ordering sorts first by combination type and then by the combination comparison key, so the policy plays the weakest legal non-pass action available rather than preserving hand structure or reasoning about future tricks.
Algorithm 1: Greedy heuristic action selection
Require: legal candidates $\mathcal{A}$
1: if $\lvert\mathcal{A}\rvert \leq 1$ then
2:   return the only legal action in $\mathcal{A}$
3: end if
4: $\mathcal{B} \leftarrow \{a \in \mathcal{A} : a \neq \textsc{Pass}\}$
5: return $\min_{a \in \mathcal{B}} a$ under the simulator’s combination ordering
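Algorithm 1 translates almost directly into code. In the sketch below, the `PASS` sentinel and the `sort_key` argument are assumptions: `sort_key` stands in for the simulator's combination ordering, which is not specified beyond sorting by combination type and then by comparison key.

```python
PASS = "PASS"  # hypothetical sentinel for the pass action


def greedy_action(legal, sort_key):
    """Return the weakest legal non-pass action, or the forced action if there is one.

    legal:    non-empty list of legal candidates, possibly including PASS
    sort_key: function mapping an action to a sortable key (assumed ordering)
    """
    if len(legal) <= 1:
        return legal[0]
    non_pass = [a for a in legal if a != PASS]
    return min(non_pass, key=sort_key)
```

As a usage example, `greedy_action([PASS, (8,), (40,)], lambda a: (len(a), max(a)))` selects `(8,)`, using combination size and highest card id as a stand-in ordering.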
### E.2 Smart Heuristic Strategy
The Smart strategy is a deterministic rule\-based policy used as a stronger non\-learning opponent\. It scores each non\-pass legal action and chooses the minimum\-scoring action, with lower scores corresponding to more desirable plays\. The heuristic favors immediate wins, shedding more cards, preserving future combinations, avoiding early use of 2s, and avoiding low orphan cards\. Passing is considered only in narrow cases: when the best early\-game response would spend multiple 2s, or when the current trick is a four\-of\-a\-kind or straight flush\.
Algorithm 2: Smart heuristic action selection
Require: legal candidates $\mathcal{A}$, current hand $H$, active trick $T$
1: $\mathcal{B} \leftarrow \{a \in \mathcal{A} : a \neq \textsc{Pass}\}$
2: if $\mathcal{B} = \emptyset$ or $\lvert\mathcal{A}\rvert = 1$ then
3:   return the only legal action
4: end if
5: for all $a \in \mathcal{B}$ do
6:   if $\lvert a\rvert = \lvert H\rvert$ then
7:     $s(a) \leftarrow -1000$ {win immediately}
8:   else
9:     $s(a) \leftarrow 0.8 \sum_{c \in a} \mathrm{rank}(c)$
10:     if phase$(H)$ is early then
11:       $s(a) \leftarrow s(a) + 10 \cdot \#\{2\text{s in } a\}$
12:     end if
13:     if phase$(H)$ is mid then
14:       $s(a) \leftarrow s(a) + 5 \cdot \#\{2\text{s in } a\}$
15:     end if
16:     $s(a) \leftarrow s(a) + \mathrm{BreakPenalty}(a, H)$
17:     $s(a) \leftarrow s(a) + 6 \cdot \mathrm{LowOrphans}(H \setminus a)$
18:     $s(a) \leftarrow s(a) - 4\lvert a\rvert$
19:     if phase$(H)$ is late then
20:       $s(a) \leftarrow s(a) - 10$
21:     end if
22:     if $T$ is very strong and phase$(H)$ is late then
23:       $s(a) \leftarrow s(a) - 10$
24:     end if
25:     if $T$ is four-of-a-kind or straight flush then
26:       $s(a) \leftarrow s(a) + 25$
27:     end if
28:   end if
29: end for
30: $a^{\star} \leftarrow \arg\min_{a \in \mathcal{B}} s(a)$
31: if Pass $\in \mathcal{A}$ and phase$(H)$ is early and $a^{\star}$ uses at least two 2s and $s(a^{\star}) > 30$ then
32:   return Pass
33: end if
34: if Pass $\in \mathcal{A}$ and $T$ is four-of-a-kind or straight flush then
35:   return Pass
36: end if
37: return $a^{\star}$
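A rough Python rendering of Algorithm 2 is given below as a sketch under several assumptions. The phase thresholds follow the "Game phases" paragraph; `rank(c)` is taken to be `c // 4` (so 2s have rank 12), which may differ from the authors' numeric scale; `break_penalty` and `low_orphans` are passed in as hypothetical callables because their full logic is not reproduced here; and the "very strong" and four-of-a-kind or straight-flush tests on the active trick are supplied as booleans.

```python
PASS = "PASS"  # hypothetical sentinel for the pass action


def phase(hand_size):
    """Game phase from the acting player's hand size."""
    if hand_size > 10:
        return "early"
    return "mid" if hand_size >= 6 else "late"


def count_twos(action):
    return sum(1 for c in action if c // 4 == 12)  # assumes rank = card // 4


def smart_score(action, hand, trick_is_bomb, trick_very_strong,
                break_penalty, low_orphans):
    """Lower is better; mirrors lines 6-28 of Algorithm 2."""
    if len(action) == len(hand):
        return -1000.0  # win immediately
    ph = phase(len(hand))
    score = 0.8 * sum(c // 4 for c in action)
    if ph == "early":
        score += 10 * count_twos(action)
    if ph == "mid":
        score += 5 * count_twos(action)
    score += break_penalty(action, hand)
    remaining = [c for c in hand if c not in action]
    score += 6 * low_orphans(remaining)
    score -= 4 * len(action)
    if ph == "late":
        score -= 10
    if trick_very_strong and ph == "late":
        score -= 10
    if trick_is_bomb:  # active trick is four-of-a-kind or straight flush
        score += 25
    return score


def smart_action(legal, hand, trick_is_bomb=False, trick_very_strong=False,
                 break_penalty=lambda a, h: 0.0, low_orphans=lambda rest: 0):
    """Select an action following lines 1-5 and 30-37 of Algorithm 2."""
    non_pass = [a for a in legal if a != PASS]
    if not non_pass or len(legal) == 1:
        return legal[0]
    scores = {a: smart_score(a, hand, trick_is_bomb, trick_very_strong,
                             break_penalty, low_orphans) for a in non_pass}
    best = min(non_pass, key=scores.get)
    if PASS in legal:
        if phase(len(hand)) == "early" and count_twos(best) >= 2 and scores[best] > 30:
            return PASS
        if trick_is_bomb:
            return PASS
    return best
```

Actions are represented as tuples of card ids so they can serve as dictionary keys. With the default zero-valued helpers, the policy reduces to preferring low-rank plays that shed more cards, which is enough to exercise the control flow but not the structure-preserving behavior described in the text.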
#### Break penalty\.
$\mathrm{BreakPenalty}$ returns $8$ in the early game or $4$ in the mid game when an action breaks a remaining pair or triple. When an action breaks a potential five-card structure, the penalty is $20$ in the early game, $8$ in the mid game, and $4$ in the late game. The implementation checks four-of-a-kind, full-house components, flushes with at least five cards of a suit, and five-rank straight windows excluding rank 2.
#### Game phases\.
The heuristic defines early, mid, and late game by the acting player’s hand size: early if $\lvert H\rvert > 10$, mid if $6 \leq \lvert H\rvert \leq 10$, and late if $\lvert H\rvert \leq 5$.