Mahjax: A GPU-Accelerated Mahjong Simulator for Reinforcement Learning in JAX

arXiv cs.AI Papers

Summary

This paper introduces Mahjax, a fully vectorized Riichi Mahjong simulator implemented in JAX for GPU-accelerated reinforcement learning, achieving high throughput and enabling tabula rasa training.

arXiv:2605.20577v1 Announce Type: new Abstract: Riichi Mahjong is a multi-player, imperfect-information game characterized by stochasticity and high-dimensional state spaces. These attributes present a unique combination of challenges that mirror complex real-world decision-making problems in reinforcement learning. While prior research has heavily relied on supervised learning from human play logs to pre-train the policy, algorithms capable of learning tabula rasa (from scratch) offer greater potential for general applicability, as evidenced by the AlphaZero lineage. To facilitate such research, we introduce Mahjax, a fully vectorized Riichi Mahjong environment implemented in JAX to enable large-scale rollout parallelization on Graphics Processing Units (GPUs). We also provide a high-quality visualization tool to streamline debugging and interaction with trained agents. Experimental results demonstrate that Mahjax achieves throughputs of up to 2 million and 1 million steps per second on eight NVIDIA A100 GPUs under the no-red and red rules, respectively. Furthermore, we validate the environment's utility for reinforcement learning by showing that agents can be trained effectively to improve their rank against baseline policies.

# Mahjax: A GPU-Accelerated Mahjong Simulator for Reinforcement Learning in JAX
Source: [https://arxiv.org/html/2605.20577](https://arxiv.org/html/2605.20577)
Authors: Soichiro Nishimori \(1, 2\), Shinri Okano \(3\), Keigo Habara \(4\), Sotetsu Koyamada \(5, 6, 7\), Eason Yu \(8\), and Masashi Sugiyama \(2, 1\)\.

Affiliations: \(1\) The University of Tokyo, Tokyo, Japan; \(2\) RIKEN AIP, Tokyo, Japan; \(3\) Nara Institute of Science and Technology, Nara, Japan; \(4\) Independent Researcher; \(5\) Kobe University, Kobe, Japan; \(6\) Kyoto University, Kyoto, Japan; \(7\) ATR, Kyoto, Japan; \(8\) The University of Sydney, Sydney, Australia\.

Corresponding author: Soichiro Nishimori\. Email: nishimori@ms\.k\.u\-tokyo\.ac\.jp\. This work has been submitted to the IEEE for possible publication\. Copyright may be transferred without notice, after which this version may no longer be accessible\.

###### Abstract

Riichi Mahjong is a multi\-player, imperfect\-information game characterized by stochasticity and high\-dimensional state spaces\. These attributes present a unique combination of challenges that mirror complex real\-world decision\-making problems in reinforcement learning\. While prior research has heavily relied on supervised learning from human play logs to pre\-train the policy, algorithms capable of learning tabula rasa \(from scratch\) offer greater potential for general applicability, as evidenced by the AlphaZero lineage\. To facilitate such research, we introduce Mahjax, a fully vectorized Riichi Mahjong environment implemented in JAX to enable large\-scale rollout parallelization on Graphics Processing Units \(GPUs\)\. We also provide a high\-quality visualization tool to streamline debugging and interaction with trained agents\. Experimental results demonstrate that Mahjax achieves throughputs of up to 2 million and 1 million steps per second on eight NVIDIA A100 GPUs under the no\-red and red rules, respectively\. Furthermore, we validate the environment’s utility for reinforcement learning by showing that agents can be trained effectively to improve their rank against baseline policies\. The code is available at [https://github\.com/nissymori/mahjax](https://github.com/nissymori/mahjax)\.

## I Introduction

Riichi Mahjong is a popular tile\-based game where players compete to form a winning hand under imperfect information \[[11](https://arxiv.org/html/2605.20577#bib.bib6)\]\. The game exemplifies complex real\-world decision\-making problems characterized by multi\-agent interaction, high\-dimensional state spaces, and stochasticity\. Consequently, it has been extensively studied in the field of reinforcement learning \(RL\) \[[11](https://arxiv.org/html/2605.20577#bib.bib6),[27](https://arxiv.org/html/2605.20577#bib.bib7),[12](https://arxiv.org/html/2605.20577#bib.bib8),[16](https://arxiv.org/html/2605.20577#bib.bib9),[6](https://arxiv.org/html/2605.20577#bib.bib10)\]\.

A significant milestone in this domain is Suphx \[[11](https://arxiv.org/html/2605.20577#bib.bib6)\], the first AI to achieve top human\-level performance in Mahjong\. While subsequent works have demonstrated strong results, they also predominantly rely on supervised learning \(SL\) from human logs \[[11](https://arxiv.org/html/2605.20577#bib.bib6)\] or offline RL \[[10](https://arxiv.org/html/2605.20577#bib.bib5)\] for pre\-training\. In contrast, the AlphaZero family of algorithms \[[20](https://arxiv.org/html/2605.20577#bib.bib22),[22](https://arxiv.org/html/2605.20577#bib.bib23),[21](https://arxiv.org/html/2605.20577#bib.bib21)\] demonstrated that complex games can be mastered via tabula rasa self\-play without human priors\. This approach has recently been extended to solving fundamental algorithmic problems \[[4](https://arxiv.org/html/2605.20577#bib.bib20)\]\. Inspired by these achievements, solving Mahjong from scratch via pure RL remains a promising yet underexplored frontier\.

However, self\-play in complex environments necessitates a vast amount of trial\-and\-error experience\. For instance, AlphaHoldem \[[26](https://arxiv.org/html/2605.20577#bib.bib30)\] required 6\.5 billion training steps to master heads\-up no\-limit poker\. Given that Mahjong involves four players and longer horizons than poker, existing Central Processing Unit \(CPU\) based simulators create a computational bottleneck for practical training \[[8](https://arxiv.org/html/2605.20577#bib.bib11)\]\. To address the data throughput challenge, the RL community has shifted toward hardware\-accelerated environments \[[9](https://arxiv.org/html/2605.20577#bib.bib12),[1](https://arxiv.org/html/2605.20577#bib.bib13),[5](https://arxiv.org/html/2605.20577#bib.bib14),[18](https://arxiv.org/html/2605.20577#bib.bib16),[14](https://arxiv.org/html/2605.20577#bib.bib15),[17](https://arxiv.org/html/2605.20577#bib.bib17)\]\. These vectorized environments enable agents to collect experience in massive batches directly on a Graphics Processing Unit \(GPU\), often yielding speedups exceeding $100\times$ over CPU baselines \[[9](https://arxiv.org/html/2605.20577#bib.bib12),[14](https://arxiv.org/html/2605.20577#bib.bib15)\]\. Moreover, they facilitate novel algorithms that leverage massively parallel interactions \[[7](https://arxiv.org/html/2605.20577#bib.bib24),[13](https://arxiv.org/html/2605.20577#bib.bib25)\]\.

Among existing frameworks, Pgx \[[9](https://arxiv.org/html/2605.20577#bib.bib12)\] provides a suite of JAX\-based board games but currently lacks a comprehensive implementation of complex imperfect\-information games like Riichi Mahjong\. In this work, we introduce Mahjax, a fully vectorizable Riichi Mahjong environment written in JAX \[[2](https://arxiv.org/html/2605.20577#bib.bib4)\], designed to enable large\-scale pure RL research\.

Our contributions are summarized as follows: 1\) Vectorized Environment: We provide a high\-performance Mahjong environment adopting the Pgx Application Programming Interface \(API\), ensuring compatibility with modern JAX\-based RL pipelines\. 2\) Performance: Mahjax scales efficiently across multiple GPUs, achieving up to 2 million and 1 million steps per second on eight NVIDIA A100 GPUs under the no\-red and red rules, respectively\. 3\) Usability: We offer visualization tools to facilitate debugging and analysis\. 4\) Validation: We validate the environment through successful RL training, demonstrating its readiness for research\.

## II Related Work

We review related work in the fields of Mahjong AI and GPU\-accelerated RL environments\.

Mahjong in RL\. Mahjong has been studied extensively in the RL literature\. Among agent\-focused efforts, the most notable milestone is Suphx \[[11](https://arxiv.org/html/2605.20577#bib.bib6)\], the first AI to achieve top human\-level performance on Tenhou, the most popular Mahjong platform in Japan \[[23](https://arxiv.org/html/2605.20577#bib.bib26)\]\. Since then, several agents have been developed by both commercial and open\-source communities\. For example, NAGA \([https://dmv\.nico/en/articles/mahjong\_ai\_naga/](https://dmv.nico/en/articles/mahjong_ai_naga/)\), developed by Dwango Media Village, reached the highest rank in Tenhou\. Mortal \[[3](https://arxiv.org/html/2605.20577#bib.bib29)\] serves as an open\-source framework for training Mahjong agents\. A common feature of these works is that they employ SL or offline RL to pre\-train the policy on human data collected from Tenhou, followed by fine\-tuning via deep RL\. Several works have also explored variants of Mahjong\. For instance, Zhao and Holden \[[27](https://arxiv.org/html/2605.20577#bib.bib7)\] developed an agent for 3\-player Mahjong \(Sanma\)\. Additionally, Ogami et al\. \[[16](https://arxiv.org/html/2605.20577#bib.bib9)\] proposed a method to improve player evaluation\.

Regarding simulation infrastructure, Mjx \[[8](https://arxiv.org/html/2605.20577#bib.bib11)\] offers a fast C\+\+ simulator with a throughput of roughly 40k games per hour\. Similarly, Mortal provides a fast Rust\-based simulator named Libriichi \[[3](https://arxiv.org/html/2605.20577#bib.bib29)\], which achieves comparable speeds\. However, these CPU\-based simulators face scalability limitations when attempting to leverage the massive parallelization required for large\-scale self\-play training\.

GPU\-Accelerated Environments\. Recently, there has been active development of environments written natively in JAX \[[2](https://arxiv.org/html/2605.20577#bib.bib4)\] \[[9](https://arxiv.org/html/2605.20577#bib.bib12),[1](https://arxiv.org/html/2605.20577#bib.bib13),[5](https://arxiv.org/html/2605.20577#bib.bib14),[18](https://arxiv.org/html/2605.20577#bib.bib16),[14](https://arxiv.org/html/2605.20577#bib.bib15),[17](https://arxiv.org/html/2605.20577#bib.bib17)\]\. Pgx \[[9](https://arxiv.org/html/2605.20577#bib.bib12)\] provides classic board games like Go and Shogi, achieving speeds 10–100$\times$ faster than their CPU counterparts\. Other domains cover combinatorial optimization in Jumanji \[[1](https://arxiv.org/html/2605.20577#bib.bib13)\], differentiable physics in Brax \[[5](https://arxiv.org/html/2605.20577#bib.bib14)\], and multi\-agent tasks in JaxMARL \[[18](https://arxiv.org/html/2605.20577#bib.bib16)\]\. More recently, environments such as Craftax \[[14](https://arxiv.org/html/2605.20577#bib.bib15)\] for open\-ended learning, Navix \[[17](https://arxiv.org/html/2605.20577#bib.bib17)\] for grid\-world navigation, and XLand\-Minigrid \[[15](https://arxiv.org/html/2605.20577#bib.bib18)\] for meta\-RL in grid\-worlds have been introduced\. These vectorized environments not only accelerate simulation but also facilitate novel RL algorithms that leverage massively parallel interactions, such as parallel Q\-learning \(PQN\) \[[7](https://arxiv.org/html/2605.20577#bib.bib24)\]\.

## III Mahjax Overview

![Refer to caption](https://arxiv.org/html/2605.20577v1/x1.png)

Figure 1: Example code snippet demonstrating the Mahjax API\.

In this section, we describe the design choices and implementation details of Mahjax\.

### III\-A API Design and Implementation

Mahjax adopts the API design of Pgx \[[9](https://arxiv.org/html/2605.20577#bib.bib12)\] to ensure compatibility with fully vectorizable environments\. Figure [1](https://arxiv.org/html/2605.20577#S3.F1) illustrates a typical usage example\. To align with the JAX framework \[[2](https://arxiv.org/html/2605.20577#bib.bib4)\], we strictly adhere to a functional programming paradigm: the State dataclass stores all game information, including hands, scores, winds, melds, and masks, as immutable JAX arrays\. This design contrasts with prior Mahjong simulators, which typically employ stateful, object\-oriented architectures \[[8](https://arxiv.org/html/2605.20577#bib.bib11)\] that are difficult to port to JAX\.
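
As a minimal sketch of this functional style \(not Mahjax's actual State or API\), the toy environment below keeps its state in an immutable container of JAX arrays and exposes pure init and step functions that are batched with jax\.vmap and compiled with jax\.jit\. The names ToyState, init, and step, as well as the counting dynamics, are hypothetical and chosen only for illustration\.

```python
from typing import NamedTuple

import jax
import jax.numpy as jnp


class ToyState(NamedTuple):
    """Hypothetical stand-in for a game state: every field is a JAX array."""
    counter: jnp.ndarray     # running total, shape ()
    terminated: jnp.ndarray  # episode-over flag, shape ()


def init(key):
    # Draw a random starting value so that batched states differ.
    start = jax.random.randint(key, (), 0, 5)
    return ToyState(counter=start, terminated=jnp.array(False))


def step(state, action):
    # Pure function: builds a new state instead of mutating the old one.
    counter = state.counter + action
    return ToyState(counter=counter, terminated=counter >= 10)


# vmap adds a leading batch axis; jit compiles the batched function.
batched_init = jax.jit(jax.vmap(init))
batched_step = jax.jit(jax.vmap(step))

keys = jax.random.split(jax.random.PRNGKey(0), 4)
states = batched_init(keys)
next_states = batched_step(states, jnp.ones(4, dtype=jnp.int32))
```

Because step returns a new ToyState rather than mutating its input, the same function can be traced once and applied to many independent games in parallel\.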

Crucially, implementing game logic as pure functions is essential for JAX Just\-In\-Time \(JIT\) compilation\. However, Mahjong logic involves complex conditional branching, which can hinder parallel performance on GPUs\. To mitigate this, we employed two primary optimization techniques: 1\) Vectorized Logic: We replaced control flow divergence \(e\.g\., if\-else statements\) with matrix operations wherever feasible\. 2\) Caching: We implemented caching for computationally intensive evaluations, such as Yaku \(hand value\) calculation\. Specifically, we pre\-computed the relevant statistics for all possible suit combinations and encoded them into a bitmask\.
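
The caching idea can be illustrated with a deliberately small toy, which is an assumption of this sketch and not the table layout used in Mahjax: a suit with only 4 tile kinds \(a real suit has 9\), whose count vector is encoded as a base\-5 key into a table pre\-computed offline\. Each table entry is a bitmask with three hypothetical flags \(HAS\_PAIR, HAS\_TRIPLET, HAS\_RUN\)\. At run time, the branching collapses to a single array lookup\.

```python
import itertools

import jax
import jax.numpy as jnp
import numpy as np

NUM_KINDS = 4  # toy suit with 4 tile kinds (a real suit has 9)
MAX_COUNT = 4  # at most four copies of each tile

HAS_PAIR, HAS_TRIPLET, HAS_RUN = 1, 2, 4  # hypothetical bit flags


def encode(counts):
    # Base-5 key: each digit is the count of one tile kind.
    weights = (MAX_COUNT + 1) ** jnp.arange(NUM_KINDS)
    return jnp.sum(counts * weights)


def build_table():
    # Offline, in plain Python: enumerate every count vector once.
    table = np.zeros((MAX_COUNT + 1) ** NUM_KINDS, dtype=np.int32)
    for counts in itertools.product(range(MAX_COUNT + 1), repeat=NUM_KINDS):
        key = sum(c * (MAX_COUNT + 1) ** i for i, c in enumerate(counts))
        bits = 0
        if max(counts) >= 2:
            bits |= HAS_PAIR
        if max(counts) >= 3:
            bits |= HAS_TRIPLET
        if any(min(counts[i:i + 3]) >= 1 for i in range(NUM_KINDS - 2)):
            bits |= HAS_RUN
        table[key] = bits
    return jnp.asarray(table)


TABLE = build_table()


@jax.jit
def features(counts):
    # The if-statements above are replaced by one gather at run time.
    return TABLE[encode(counts)]


# Batched lookup over several hands at once.
hands = jnp.array([[2, 0, 0, 0], [1, 1, 1, 0], [3, 1, 1, 1]])
flags = jax.vmap(features)(hands)
has_run = (flags & HAS_RUN) != 0  # decode one flag without branching
```

For the three example hands, flags evaluates to 1, 4, and 7, i\.e\. pair only, run only, and all three flags set\.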

![Refer to caption](https://arxiv.org/html/2605.20577v1/x2.png)

Figure 2: The SVG\-based visualization of the Mahjax game state\.

### III\-B RL Environment Design

Here, we describe the specific configurations of Mahjax as an RL environment\.

Rules\. We adhere to the standard rules of four\-player East\-South Riichi Mahjong \(rules reference: [http://mahjong\-europe\.org/portal/images/docs/Riichi\-rules\-2025\-EN\.pdf](http://mahjong-europe.org/portal/images/docs/Riichi-rules-2025-EN.pdf)\)\. We support two major variants:

- Tenhou \(Red\) Rules: The standard rules of four\-player East\-South Riichi Mahjong as used on the Tenhou platform \[[23](https://arxiv.org/html/2605.20577#bib.bib26)\], including the red fives\. Previous research has mainly focused on this variant \[[11](https://arxiv.org/html/2605.20577#bib.bib6),[8](https://arxiv.org/html/2605.20577#bib.bib11),[3](https://arxiv.org/html/2605.20577#bib.bib29)\]\. We validated the correctness of the implementation using downloaded play logs following Koyamada et al\. \[[8](https://arxiv.org/html/2605.20577#bib.bib11)\]\.
- No\-Red Rules: A variant of the game where red tiles are not used\. For simplicity and higher throughput, we removed several complex rules such as abortive draw\.

Game Modes\. To provide varying difficulty levels, we offer three modes: single, east, and half\. In single mode, the episode terminates after a single Kyoku, emphasizing immediate hand efficiency\. Conversely, east mode continues for up to 4 rounds \(East\-round only\)\. In half mode, the episode continues for up to 8 rounds \(East and South\), requiring long\-term strategic planning, such as rank defense and temporary cooperation \[[11](https://arxiv.org/html/2605.20577#bib.bib6)\]\.

Action Space\. The action space comprises discrete identifiers, covering discards, Kan, and special moves \(e\.g\., Riichi, Ron, Pon, and Pass\)\. A legal\_action\_mask is provided to filter invalid logits\. To enforce strict rule adherence, executing an illegal action triggers immediate termination with a penalty \(default $-1.0$\)\.
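
A common way to use such a mask, sketched here under the assumption that the mask is a boolean vector over action identifiers, is to push the logits of illegal actions to a very negative value before sampling\. The helper names sample\_legal\_action and illegal\_action\_outcome are hypothetical; the second function only mirrors the termination\-with\-penalty behaviour described above and is not the Mahjax implementation\.

```python
import jax
import jax.numpy as jnp


def sample_legal_action(key, logits, legal_action_mask):
    # Illegal logits become hugely negative, so they are never sampled.
    neg = jnp.finfo(logits.dtype).min
    masked_logits = jnp.where(legal_action_mask, logits, neg)
    return jax.random.categorical(key, masked_logits)


def illegal_action_outcome(action, legal_action_mask, penalty=-1.0):
    # An illegal action ends the episode and yields the penalty as reward.
    is_illegal = ~legal_action_mask[action]
    reward = jnp.where(is_illegal, penalty, 0.0)
    return reward, is_illegal


key = jax.random.PRNGKey(0)
logits = jnp.array([5.0, 0.1, -0.3, 2.0])
mask = jnp.array([False, True, True, False])

action = sample_legal_action(key, logits, mask)             # 1 or 2
reward, terminated = illegal_action_outcome(action, mask)   # no penalty
bad_reward, bad_terminated = illegal_action_outcome(jnp.array(0), mask)
```

Masking at sampling time keeps a learning agent from ever hitting the penalty, while the penalty itself guards against policies that ignore the mask\.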

Observation Space\. Mahjax provides a structured dictionary observation for Transformer\-based agents \[[12](https://arxiv.org/html/2605.20577#bib.bib8)\]\. It contains tokenized inputs such as hand indices, action history, and scalar properties \(e\.g\., shanten number and scores\)\. All observations are ego\-centric to the current player\.
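
One simple way to make per\-seat quantities ego\-centric is to rotate them so that index 0 always belongs to the acting player\. The sketch below assumes seat\-indexed arrays and uses hypothetical dictionary keys \(hand, scores\); the actual keys and tokenization in Mahjax are not specified in the text\.

```python
import jax.numpy as jnp


def ego_centric(per_seat_values, current_player):
    # Rotate so that index 0 is the acting player, index 1 the next seat, ...
    return jnp.roll(per_seat_values, -current_player, axis=0)


def make_observation(scores, hand_tiles, current_player):
    # Hypothetical dictionary layout, for illustration only.
    return {
        "hand": hand_tiles[current_player],
        "scores": ego_centric(scores, current_player),
    }


scores = jnp.array([25000, 31000, 18000, 26000])  # indexed by absolute seat
hands = jnp.arange(4 * 13).reshape(4, 13) % 34    # dummy tile indices
obs = make_observation(scores, hands, 2)
```

With current\_player set to 2, the rotated score vector starts with seat 2's score \(18000\), followed by seats 3, 0, and 1\.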

### III\-C Visualization and UI

Mahjax includes a Scalable Vector Graphics \(SVG\)\-based visualization tool \(Figure [2](https://arxiv.org/html/2605.20577#S3.F2)\) and a web\-based user interface\. These tools enable users to qualitatively analyze agent behaviors, debug the environment, and play against trained agents interactively\. To facilitate international research, the visualization supports English localization for users unfamiliar with traditional tile ideograms, as shown in Figure [2](https://arxiv.org/html/2605.20577#S3.F2)\.

## IV Experiments

In this section, we evaluate the computational efficiency of Mahjax and validate its efficacy as a research platform for RL\.

### IV\-A Speed Benchmark

![Refer to caption](https://arxiv.org/html/2605.20577v1/x3.png) \(a\) A100 x 1
![Refer to caption](https://arxiv.org/html/2605.20577v1/x4.png) \(b\) A100 x 8

Figure 3: Throughput comparison \(steps per second\) between Mahjax \(red and no\-red rules\), Pgx Shogi, and Libriichi across varying batch sizes\. In the single\-GPU setting, Mahjax reaches a throughput plateau around batch size $2^{10}$\. In contrast, on eight GPUs it continues to scale to larger batch sizes for both rule sets\. Mahjax achieves peak throughputs of 2 million and 1 million steps per second on 8 GPUs for the no\-red and red rules, respectively, outperforming Libriichi by over $10\times$ and surpassing Pgx Shogi\.

Setup\. We compared Mahjax against two baselines: 1\) Libriichi \[[3](https://arxiv.org/html/2605.20577#bib.bib29)\], a Rust\-based CPU simulator for the red\-rule variant used in the Mortal project \[[3](https://arxiv.org/html/2605.20577#bib.bib29)\]; and 2\) Pgx \(Shogi\) \[[9](https://arxiv.org/html/2605.20577#bib.bib12)\], the Shogi environment in Pgx\. In the absence of other GPU\-accelerated Mahjong simulators, we included Pgx Shogi as a reference point to evaluate Mahjax’s scalability\. Benchmarks were conducted on computing nodes equipped with two Intel Xeon Platinum 8360Y CPUs and eight NVIDIA A100 GPUs\. We measured throughput using jax\.pmap for parallelization across devices for JAX environments, while utilizing Rayon \([https://github\.com/rayon\-rs/rayon](https://github.com/rayon-rs/rayon)\) for multi\-threaded execution in Libriichi\. For Mahjax, we reported results for both the no\-red and red\-rule variants on a single GPU and eight GPUs to evaluate scalability\. Simulations ran for 100 batch steps using a random policy, with batch sizes ranging from 2 to 16,384 \(starting from 8 for the 8\-GPU setting\)\.
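
The measurement protocol can be sketched as follows, assuming a single device and a placeholder transition function toy\_step in place of a real environment\. This is not the benchmark script used for Figure 3: it omits the jax\.pmap multi\-device layer, and the names measure\_sps and toy\_step are hypothetical\. It shows the two details that matter for a fair timing in JAX, namely excluding compilation with a warm\-up call and waiting for asynchronous dispatch with block\_until\_ready\.

```python
import time

import jax
import jax.numpy as jnp


def toy_step(state, action):
    # Placeholder dynamics standing in for one environment transition.
    return (state * 31 + action) % 1009


def measure_sps(batch_size, num_steps=100, seed=0):
    step = jax.jit(jax.vmap(toy_step))
    state = jnp.zeros(batch_size, dtype=jnp.int32)
    key = jax.random.PRNGKey(seed)

    # Warm-up call so that compilation time is excluded from the timing.
    step(state, state).block_until_ready()

    start = time.perf_counter()
    for _ in range(num_steps):
        key, subkey = jax.random.split(key)
        # Random policy: uniformly sampled action identifiers.
        actions = jax.random.randint(subkey, (batch_size,), 0, 8)
        state = step(state, actions)
    state.block_until_ready()  # wait for asynchronous dispatch to finish
    elapsed = time.perf_counter() - start
    return batch_size * num_steps / elapsed


for batch_size in (2, 64, 1024):
    print(batch_size, f"{measure_sps(batch_size):.0f} steps/s")
```

Throughput is reported as batch size times number of batch steps divided by wall\-clock time, so larger batches raise the figure only as long as the device is not yet saturated\.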

Results\. Figure [3](https://arxiv.org/html/2605.20577#S4.F3) shows the results\. On a single GPU, Mahjax scales with batch size up to roughly $2^{10}$ environments, after which throughput largely saturates, while Libriichi plateaus around $2^{3}$ due to CPU compute limitations\. In contrast, the 8\-GPU configuration continues to scale beyond this regime for both rule variants, demonstrating effective multi\-GPU parallelization\. Mahjax achieves peak throughputs of 2 million SPS and 1 million SPS on eight NVIDIA A100 GPUs for the no\-red and red rules, respectively, outperforming Libriichi by over $10\times$ and surpassing Pgx Shogi\. These results confirm that Mahjax efficiently leverages GPU parallelism, making it suitable for large\-scale training\.

![Refer to caption](https://arxiv.org/html/2605.20577v1/x5.png)

Figure 4: The plot shows the moving average rank against three fixed BC opponents over 1,000 evaluation games \(lower is better\)\. The solid line and shaded area represent the mean and standard deviation across three random seeds, respectively\. The horizontal dotted line at 2\.5 indicates the theoretical expected rank for players of equal skill\.

### IV\-B RL Experiment

To validate the environment’s stability for learning, we trained a policy using standard RL algorithms\.

Setup\. We utilized the no\-red rule and the single\-round mode to accelerate experimental iteration\. To ensure training stability, we first initialized the policy via Behavioral Cloning \(BC\) \[[11](https://arxiv.org/html/2605.20577#bib.bib6)\] using 500k samples generated by a heuristic rule\-based agent\. The agent architecture consists of a Transformer encoder \[[24](https://arxiv.org/html/2605.20577#bib.bib28)\] processing high\-dimensional states \(hand, discards, and other melds\) into a latent representation, followed by separate multi\-layer perceptron \(MLP\) heads for policy and value estimation\.

Following BC pre\-training, we fine\-tuned the agent using Proximal Policy Optimization \(PPO\) \[[19](https://arxiv.org/html/2605.20577#bib.bib2)\] with Kullback\-Leibler \(KL\) regularization towards the BC policy \[[25](https://arxiv.org/html/2605.20577#bib.bib27)\]\. Training employed 1,024 parallel environments with a rollout length of 256 steps\. Hyperparameters were set as follows: discount factor $\gamma=1.0$, Generalized Advantage Estimation parameter $\lambda=0.95$, learning rate $\eta=3\times 10^{-4}$, clip range $\epsilon=0.2$, entropy coefficient $c_{\text{ent}}=0.01$, value coefficient $c_{\text{vf}}=0.5$, and KL penalty $c_{\text{KL}}=0.2$\. The training run spanned 100 million environment steps, taking approximately 5\.8 hours on a single NVIDIA GH200 Grace Hopper GPU\. To evaluate performance, we played the trained agent against three fixed BC policies \(1 vs\. 3\) over 1,000 games\. Performance is measured by the average rank; since the expected rank for players of equal strength is 2\.5, a lower value indicates superior performance\. We reported the average and standard deviation of three independent runs\.
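
One plausible form of such an objective, given as a sketch under stated assumptions and not as the exact loss used in the experiments, combines the PPO clipped surrogate with a value loss, an entropy bonus, and a penalty on the KL divergence from the current policy to the frozen BC policy\. The function name ppo\_kl\_loss, the direction of the KL term, and the squared\-error value loss are assumptions; the default coefficients are the values listed above\.

```python
import jax
import jax.numpy as jnp


def ppo_kl_loss(logits, bc_logits, old_log_prob, action, advantage,
                value, value_target,
                clip_eps=0.2, c_ent=0.01, c_vf=0.5, c_kl=0.2):
    log_probs = jax.nn.log_softmax(logits)
    bc_log_probs = jax.nn.log_softmax(bc_logits)
    probs = jnp.exp(log_probs)

    # Clipped surrogate objective from PPO.
    ratio = jnp.exp(log_probs[action] - old_log_prob)
    clipped = jnp.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_loss = -jnp.minimum(ratio * advantage, clipped * advantage)

    value_loss = (value - value_target) ** 2
    entropy = -jnp.sum(probs * log_probs)
    # KL(pi || pi_BC); the direction is an assumption of this sketch.
    kl_to_bc = jnp.sum(probs * (log_probs - bc_log_probs))

    return policy_loss + c_vf * value_loss - c_ent * entropy + c_kl * kl_to_bc


def mean_loss(*batch):
    # Average the per-sample loss over a batch of transitions.
    return jnp.mean(jax.vmap(ppo_kl_loss)(*batch))


# Tiny example: uniform policy over 4 actions, identical to the BC policy.
logits = jnp.zeros((2, 4))
loss = mean_loss(logits, logits, jnp.full(2, jnp.log(0.25)),
                 jnp.array([0, 1]), jnp.ones(2), jnp.zeros(2), jnp.zeros(2))
```

In the example the KL term vanishes because both policies coincide, so the loss reduces to the surrogate term minus the entropy bonus\. In practice, logits of illegal actions would be masked before this computation\.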

Results\. Figure [4](https://arxiv.org/html/2605.20577#S4.F4) presents the training trajectory\. The agent consistently achieves an average rank better \(lower\) than the neutral 2\.5 baseline, indicating successful policy improvement over the BC initialization\. While aiming for state\-of\-the\-art performance is beyond the scope of this paper, these results confirm that Mahjax provides a stable implementation for training deep RL agents\.
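
For reference, the average\-rank metric can be computed from final scores as sketched below\. The neutral value of 2\.5 is the mean of the four possible ranks, $(1+2+3+4)/4$\. The function name ranks and the tie handling \(ties resolved in the evaluated player's favour\) are assumptions of this sketch; actual tie\-breaking rules are not covered here\.

```python
import jax.numpy as jnp


def ranks(final_scores, player=0):
    # final_scores has shape (games, 4); rank 1 is the best placement.
    own = final_scores[:, player][:, None]          # shape (games, 1)
    return 1 + jnp.sum(final_scores > own, axis=1)  # shape (games,)


# Dummy final scores for three games; the evaluated agent sits at index 0.
scores = jnp.array([
    [32000, 25000, 24000, 19000],  # agent finishes first
    [21000, 30000, 26000, 23000],  # agent finishes last
    [27000, 29000, 22000, 22000],  # agent finishes second
])
per_game = ranks(scores)
average_rank = jnp.mean(per_game)
```

Here the per\-game ranks are 1, 4, and 2, giving an average rank of about 2\.33, slightly better than the neutral 2\.5\.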

## V Concluding Remarks

In this work, we introduced Mahjax, a fully vectorizable Riichi Mahjong environment implemented in JAX\. Our experiments demonstrated that Mahjax achieves throughputs of up to 2 million and 1 million steps per second on eight NVIDIA A100 GPUs for the no\-red and red rules, respectively, significantly outperforming the existing CPU\-based simulator\. We further validated the environment’s utility for RL by successfully training agents using PPO with KL regularization\.

Limitations and Future Work\. Our current release supports both no\-red and red\-rule variants, but RL evaluation is still limited to the single\-round setting and does not yet cover the full\-game modes\. Future work will expand rule support to include more game modes such as 3\-player modes\. Furthermore, while our current RL training, leveraging BC pre\-training, successfully demonstrates the simulator’s reliability and meets the primary scope of this work, we aim to move toward learning from scratch\.

## References

- \[1\] C\. Bonnet, D\. Luo, D\. Byrne, S\. Surana, S\. Abramowitz, P\. Duckworth, V\. Coyette, L\. I\. Midgley, E\. Tegegn, T\. Kalloniatis, O\. Mahjoub, M\. Macfarlane, A\. P\. Smit, N\. Grinsztajn, R\. Boige, C\. N\. Waters, M\. A\. Mimouni, U\. A\. M\. Sob, R\. de Kock, S\. Singh, D\. Furelos\-Blanco, V\. Le, A\. Pretorius, and A\. Laterre \(2024\) Jumanji: a diverse suite of scalable reinforcement learning environments in JAX\. In International Conference on Learning Representations\.
- \[2\] JAX: composable transformations of Python\+NumPy programs\. External Links: [Link](http://github.com/jax-ml/jax)\.
- \[3\] Mortal\. External Links: [Link](https://github.com/Equim-chan/Mortal)\.
- \[4\] A\. Fawzi, M\. Balog, A\. Huang, T\. Hubert, B\. Romera\-Paredes, M\. Barekatain, A\. Novikov, F\. J\. R\. Ruiz, J\. Schrittwieser, G\. Swirszcz, D\. Silver, D\. Hassabis, and P\. Kohli \(2022\) Discovering faster matrix multiplication algorithms with reinforcement learning\. Nature 610\(7930\), pp\. 47–53\.
- \[5\] \(2021\) Brax \- A differentiable physics engine for large scale rigid body simulation\. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual\.
- \[6\] H\. Fu, W\. Liu, S\. Wu, Y\. Wang, T\. Yang, K\. Li, J\. Xing, B\. Li, B\. Ma, Q\. Fu, and W\. Yang \(2022\) Actor\-critic policy optimization in a large\-scale imperfect\-information game\. In International Conference on Learning Representations\.
- \[7\] M\. Gallici, M\. Fellows, B\. Ellis, B\. Pou, I\. Masmitja, J\. N\. Foerster, and M\. Martin \(2025\) Simplifying deep temporal difference learning\. In International Conference on Learning Representations\.
- \[8\] S\. Koyamada, K\. Habara, N\. Goto, S\. Okano, S\. Nishimori, and S\. Ishii \(2022\) Mjx: a framework for mahjong ai research\. In IEEE Conference on Games \(CoG\), pp\. 504–507\.
- \[9\] S\. Koyamada, S\. Okano, S\. Nishimori, Y\. Murata, K\. Habara, H\. Kita, and S\. Ishii \(2023\) Pgx: hardware\-accelerated parallel game simulators for reinforcement learning\. In Advances in Neural Information Processing Systems, Vol\. 36, pp\. 45716–45743\.
- \[10\] S\. Levine, A\. Kumar, G\. Tucker, J\. Fu, and C\. Finn \(2020\) Offline reinforcement learning\. arXiv preprint arXiv:2005\.01643\.
- \[11\] J\. Li, S\. Koyamada, Q\. Ye, G\. Liu, C\. Wang, R\. Yang, L\. Zhao, T\. Qin, T\. Liu, and H\. Hon \(2020\) Suphx: mastering mahjong with deep reinforcement learning\. arXiv preprint arXiv:2003\.13590\.
- \[12\] X\. Li, B\. Liu, Z\. Wei, Z\. Wang, and L\. Wu \(2024\) Tjong: a transformer\-based mahjong ai via hierarchical decision\-making and fan backward\. CAAI Transactions on Intelligence Technology 9\(4\), pp\. 982–995\.
- \[13\] M\. Macfarlane, E\. Toledo, D\. Byrne, P\. Duckworth, and A\. Laterre \(2024\) SPO: sequential monte carlo policy optimisation\. In Advances in Neural Information Processing Systems, Vol\. 37, pp\. 1019–1057\.
- \[14\] M\. T\. Matthews, M\. Beukman, B\. Ellis, M\. Samvelyan, M\. T\. Jackson, S\. Coward, and J\. N\. Foerster \(2024\) Craftax: A lightning\-fast benchmark for open\-ended reinforcement learning\. In International Conference on Machine Learning\.
- \[15\] A\. Nikulin, V\. Kurenkov, I\. Zisman, A\. Agarkov, V\. Sinii, and S\. Kolesnikov \(2024\) XLand\-minigrid: scalable meta\-reinforcement learning environments in JAX\. In Advances in Neural Information Processing Systems, Vol\. 37, pp\. 43809–43835\.
- \[16\] T\. Ogami, K\. Amano, and Y\. Tsuruoka \(2024\) MJ\-dlvat: a deep learning value assessment technique for mahjong\. In IEEE Conference on Games \(CoG\), pp\. 1–8\.
- \[17\] E\. Pignatelli, J\. Liesen, R\. T\. Lange, C\. Lu, P\. S\. Castro, and L\. Toni \(2024\) NAVIX: scaling minigrid environments with JAX\. arXiv preprint arXiv:2407\.19396\.
- \[18\] A\. Rutherford, B\. Ellis, M\. Gallici, J\. Cook, A\. Lupu, G\. Ingvarsson, T\. Willi, R\. Hammond, A\. Khan, C\. S\. de Witt, A\. Souly, S\. Bandyopadhyay, M\. Samvelyan, M\. Jiang, R\. T\. Lange, S\. Whiteson, B\. Lacerda, N\. Hawes, T\. Rocktäschel, C\. Lu, and J\. N\. Foerster \(2024\) JaxMARL: multi\-agent RL environments and algorithms in JAX\. In Advances in Neural Information Processing Systems, Vol\. 37, pp\. 50925–50951\.
- \[19\] J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov \(2017\) Proximal policy optimization algorithms\. arXiv preprint arXiv:1707\.06347\.
- \[20\] D\. Silver, A\. Huang, C\. J\. Maddison, A\. Guez, L\. Sifre, G\. van den Driessche, J\. Schrittwieser, I\. Antonoglou, V\. Panneershelvam, M\. Lanctot, S\. Dieleman, D\. Grewe, J\. Nham, N\. Kalchbrenner, I\. Sutskever, T\. P\. Lillicrap, M\. Leach, K\. Kavukcuoglu, T\. Graepel, and D\. Hassabis \(2016\) Mastering the game of go with deep neural networks and tree search\. Nature 529\(7587\), pp\. 484–489\.
- \[21\] D\. Silver, T\. Hubert, J\. Schrittwieser, I\. Antonoglou, M\. Lai, A\. Guez, M\. Lanctot, L\. Sifre, D\. Kumaran, T\. Graepel, et al\. \(2018\) A general reinforcement learning algorithm that masters chess, shogi, and go through self\-play\. Science 362\(6419\), pp\. 1140–1144\.
- \[22\] D\. Silver, J\. Schrittwieser, K\. Simonyan, I\. Antonoglou, A\. Huang, A\. Guez, T\. Hubert, L\. Baker, M\. Lai, A\. Bolton, Y\. Chen, T\. P\. Lillicrap, F\. Hui, L\. Sifre, G\. van den Driessche, T\. Graepel, and D\. Hassabis \(2017\) Mastering the game of go without human knowledge\. Nature 550\(7676\), pp\. 354–359\.
- \[23\] Tenhou\. External Links: [Link](https://tenhou.net/)\.
- \[24\] A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, L\. Kaiser, and I\. Polosukhin \(2017\) Attention is all you need\. In Advances in Neural Information Processing Systems, Vol\. 30\.
- \[25\] E\. Yu, T\. H\. Liu, Y\. Wang, C\. L\. Canonne, N\. H\. Tran, and C\. Xu \(2025\) Nash policy gradient: a policy gradient method with iteratively refined regularization for finding nash equilibria\. arXiv preprint arXiv:2510\.18183\.
- \[26\] E\. Zhao, R\. Yan, J\. Li, K\. Li, and J\. Xing \(2022\) AlphaHoldem: high\-performance artificial intelligence for heads\-up no\-limit poker via end\-to\-end reinforcement learning\. In Proceedings of the AAAI Conference on Artificial Intelligence\.
- \[27\] X\. Zhao and S\. B\. Holden \(2022\) Building a 3\-player mahjong ai using deep reinforcement learning\. arXiv preprint arXiv:2202\.12847\.
