Learn to Match: Two-Sided Matching with Temporally Extended Feedback

arXiv cs.LG 06/08/26, 04:00 AM Papers
Summary
This paper introduces a framework for two-sided matching with temporally extended feedback, formulating it as a partially observable Markov game with costly screening, noisy observations, and evolving latent profiles. The authors present Learn2Match, a multi-agent reinforcement learning benchmark, and show that independent PPO outperforms bandit baselines in social welfare but incurs higher information-friction loss.
arXiv:2606.06744v1 Announce Type: new Abstract: Two-sided matching markets often involve information that unfolds over time through interviews, repeated interaction, learning, and separation. Existing matching models typically reduce this process to immediate sub-Gaussian feedback about fixed preferences, missing settings where payoff-relevant information is revealed gradually and changes future matching decisions. We introduce a framework with temporally extended feedback, that formulates two-sided matching as a partially observable Markov game with costly pre-match screening, noisy post-match observations, evolving latent profiles, and endogenous continuation or dissolution. We instantiate this framework in Learn2Match, a multi-agent reinforcement-learning benchmark for dynamic matching markets. Learn2Match supports decentralized decision making over whom to interview, whom to match with, and when to dissolve a match, while evaluating policies using regret, social welfare, and an information-friction loss that measures the welfare gap caused by incomplete revelation of latent preferences. We find that independent PPO achieves higher cumulative social welfare and lower cumulative regret than the bandit-style CA-ETC baseline under temporally extended feedback, demonstrating the promise of MARL for dynamic matching markets. However, PPO still incurs higher information-friction loss, revealing that end-to-end MARL does not yet provide the coordinated exploration structure of matching-bandit methods. These results position Learn2Match as a benchmark for developing the next generation of matching-market algorithms: methods that are adaptive like RL agents, statistically disciplined like bandit algorithms, and structurally aware like stable-matching mechanisms.
Original Article
View Cached Full Text
Cached at: 06/08/26, 09:18 AM
# Learn to Match: Two-Sided Matching with Temporally Extended Feedback
Source: [https://arxiv.org/html/2606.06744](https://arxiv.org/html/2606.06744)
Haijing Zong∗1Yancheng Liang∗2Boyang Zhou2Natasha Jaques2 1Department of Economics, University of Washington, Seattle, United States 2Paul G\. Allen School of Computer Science & Engineering, University of Washington, Seattle, United States ∗Equal contribution\. Correspondence:zhaijing@uw\.edu

###### Abstract

Two\-sided matching markets often involve information that unfolds over time through interviews, repeated interaction, learning, and separation\. Existing matching models typically reduce this process to immediate sub\-Gaussian feedback about fixed preferences, missing settings where payoff\-relevant information is revealed gradually and changes future matching decisions\. We introduce a framework withtemporally extended feedback, that formulates two\-sided matching as a partially observable Markov game with costly pre\-match screening, noisy post\-match observations, evolving latent profiles, and endogenous continuation or dissolution\. We instantiate this framework inLearn2Match, a multi\-agent reinforcement\-learning benchmark for dynamic matching markets\.Learn2Matchsupports decentralized decision making over whom to interview, whom to match with, and when to dissolve a match, while evaluating policies using regret, social welfare, and an information\-friction loss that measures the welfare gap caused by incomplete revelation of latent preferences\. We find that independent PPO achieves higher cumulative social welfare and lower cumulative regret than the bandit\-style CA\-ETC baseline under temporally extended feedback, demonstrating the promise of MARL for dynamic matching markets\. However, PPO still incurs higher information\-friction loss, revealing that end\-to\-end MARL does not yet provide the coordinated exploration structure of matching\-bandit methods\. These results positionLearn2Matchas a benchmark for developing the next generation of matching\-market algorithms: methods that are adaptive like RL agents, statistically disciplined like bandit algorithms, and structurally aware like stable\-matching mechanisms\.

## 1Introduction

![Refer to caption](https://arxiv.org/html/2606.06744v1/x1.png)Figure 1:Overview ofLearn2Match, a dynamic two\-sided matching framework with temporally extended feedback\. Agents interview, match, learn gradually during tenure, and decide whether to retain or dissolve relationships, in contrast to traditional bandit matching with immediate one\-step feedback\.Two\-sided matching studies how agents on opposite sides of a market form pairwise relationships based on preferences over one another\. Its central solution concept is stability\[[15](https://arxiv.org/html/2606.06744#bib.bib99)\]: no unmatched pair should mutually prefer each other to their assigned partners\. The classical Gale–Shapley framework and deferred\-acceptance algorithm form the foundation for this line of work, with applications in labor markets, residency matching, school choice, marriage and dating market\[[15](https://arxiv.org/html/2606.06744#bib.bib99),[47](https://arxiv.org/html/2606.06744#bib.bib53),[48](https://arxiv.org/html/2606.06744#bib.bib100),[1](https://arxiv.org/html/2606.06744#bib.bib112),[18](https://arxiv.org/html/2606.06744#bib.bib9),[56](https://arxiv.org/html/2606.06744#bib.bib10)\]\.

However, many real\-world matching markets depart from the static Gale–Shapley model\[[1](https://arxiv.org/html/2606.06744#bib.bib112),[21](https://arxiv.org/html/2606.06744#bib.bib5),[6](https://arxiv.org/html/2606.06744#bib.bib56),[57](https://arxiv.org/html/2606.06744#bib.bib57),[52](https://arxiv.org/html/2606.06744#bib.bib111)\]\. Recent work has therefore extended stable matching to incorporate learning, decentralization, uncertainty, and contextual information\[[22](https://arxiv.org/html/2606.06744#bib.bib61),[31](https://arxiv.org/html/2606.06744#bib.bib40),[9](https://arxiv.org/html/2606.06744#bib.bib42),[25](https://arxiv.org/html/2606.06744#bib.bib118),[45](https://arxiv.org/html/2606.06744#bib.bib120),[41](https://arxiv.org/html/2606.06744#bib.bib121),[60](https://arxiv.org/html/2606.06744#bib.bib126),[43](https://arxiv.org/html/2606.06744#bib.bib117),[5](https://arxiv.org/html/2606.06744#bib.bib124),[26](https://arxiv.org/html/2606.06744#bib.bib125),[19](https://arxiv.org/html/2606.06744#bib.bib103),[30](https://arxiv.org/html/2606.06744#bib.bib65)\]\. These models narrow the gap between classical matching theory and operational matching systems, but they typically assume that each matching decision generates an immediate reward, observation, or noisy signal\.

In many markets, however, the payoff\-relevant information generated by a matching decision is realized only through future states\. We call this phenomenon*temporally extended feedback*\. For example, in labor markets, match quality may evolve after employment begins as workers gain experience, firms learn about productivity, and both sides decide whether to continue or separate\[[14](https://arxiv.org/html/2606.06744#bib.bib95),[3](https://arxiv.org/html/2606.06744#bib.bib114),[27](https://arxiv.org/html/2606.06744#bib.bib43),[49](https://arxiv.org/html/2606.06744#bib.bib94),[55](https://arxiv.org/html/2606.06744#bib.bib96),[40](https://arxiv.org/html/2606.06744#bib.bib113)\]\. Similarly, pre\-match processes such as search, screening, interviewing, and negotiation may consume substantial calendar time and effort before any match is formed\[[34](https://arxiv.org/html/2606.06744#bib.bib44),[13](https://arxiv.org/html/2606.06744#bib.bib45),[39](https://arxiv.org/html/2606.06744#bib.bib46),[44](https://arxiv.org/html/2606.06744#bib.bib50),[11](https://arxiv.org/html/2606.06744#bib.bib97)\]\. Thus, matching decisions can affect future opportunities, information, and welfare through persistent state changes rather than through one\-period feedback alone\. In labor markets, such persistence is often described as*career path dependence*: a match can shape future outcomes even without frequent rematching, as employment relationships become more durable with tenure\[[14](https://arxiv.org/html/2606.06744#bib.bib95),[55](https://arxiv.org/html/2606.06744#bib.bib96)\]\. This path dependence arises because workers learn about fit and accumulate occupation\-specific human capital over time, so current matches affect future productivity, information, and mobility costs\[[35](https://arxiv.org/html/2606.06744#bib.bib34),[23](https://arxiv.org/html/2606.06744#bib.bib35),[16](https://arxiv.org/html/2606.06744#bib.bib36)\]\.

We propose to model two\-sided matching markets with temporally extended feedback explicitly\. In contrast to matching\-bandit formulations, where each matching decision yields one\-shot feedback about fixed underlying preferences, we cast the market as a partially observable Markov game: a sequential decision problem in which a latent state governs how observations, rewards, and matching opportunities evolve over time\. This formulation captures three sources of temporal feedback: evolving agent profiles, costly and persistent pre\-match or post\-match stages, and noisy partial observations of latent match quality\. Agents must therefore decide not only whom to match with, but also when to search, screen, continue, separate, and explore\.

We instantiate this framework throughLearn2Match, a benchmark for dynamic two\-sided matching markets\.Learn2Matchmodels agents with evolving latent profiles, costly pre\-match and post\-match decisions, noisy observations, and endogenous continuation or separation\. We evaluate multi\-agent reinforcement\-learning methods\[[33](https://arxiv.org/html/2606.06744#bib.bib28),[46](https://arxiv.org/html/2606.06744#bib.bib27),[59](https://arxiv.org/html/2606.06744#bib.bib13),[58](https://arxiv.org/html/2606.06744#bib.bib29),[50](https://arxiv.org/html/2606.06744#bib.bib30)\]against bandit\-style baselines adapted from the matching literature\[[42](https://arxiv.org/html/2606.06744#bib.bib33),[38](https://arxiv.org/html/2606.06744#bib.bib119)\]\. Our evaluation includes matching regret, social welfare, and an*information\-friction loss*: the gap between acting on current observed or believed preferences and the stable matching induced by the true latent preferences revealed through sufficient interaction\.

We test independent PPO, a state\-of\-the\-art multi\-agent reinforcement learning \(MARL\) algorithm\[[50](https://arxiv.org/html/2606.06744#bib.bib30),[58](https://arxiv.org/html/2606.06744#bib.bib29)\], onLearn2Match, and compare it with a bandit\-style baseline CA\-ETC\[[42](https://arxiv.org/html/2606.06744#bib.bib33)\]\. We show that MARL agents perform better in cumulative social welfare and reduce the cumulative regret, showing the promise of MARL for decision\-making in matching markets under temporally extended feedback\. Notably, despite having higher performance, the information\-friction loss of MARL remains higher compared to the bandit\-style algorithm, which is designed to have a coordinated exploration process by its algorithm\. More broadly,Learn2Matchopens a shared research agenda between MARL and matching theory: it challenges MARL community to build agents that coordinate exploration and exploitation in structured markets, while challenging the matching and bandit communities to move beyond one\-shot preference estimation toward algorithms that remain effective when information unfolds through time\. In this sense,Learn2Matchis not only a benchmark for evaluating existing methods, but also a call to design the next generation of learning algorithms for matching markets: algorithms that are as adaptive as RL agents, as statistically disciplined as bandit methods, and as structurally aware as stable\-matching mechanisms\.

Our contributions are threefold\. First, we introduceLearn2Match, a benchmark for dynamic two\-sided matching with temporally extended feedback, featuring multi\-round information acquisition, evolving agent profiles, and endogenous proposal, response, and dissolution decisions\. Second, we provide an initial empirical study comparing multi\-agent reinforcement learning with bandit\-based baselines\. Third,Learn2Matchopens a shared research agenda at the intersection of multi\-agent reinforcement learning and matching theory, and flexibly supports real\-world application modeling—from labor markets and residency matching to school choice, dating, and ride\-hailing dispatch\. Progress on this benchmark can translate into better outcomes wherever the static, fully revealed assumptions of classical matching theory fail to hold\.

## 2Related work

Stable matching with static information\.Two\-sided matching has classically been studied through the lens of stability, beginning with the Gale–Shapley framework and its applications to labor markets, school choice, and other centralized assignment settings\[[15](https://arxiv.org/html/2606.06744#bib.bib99),[47](https://arxiv.org/html/2606.06744#bib.bib53),[48](https://arxiv.org/html/2606.06744#bib.bib100),[1](https://arxiv.org/html/2606.06744#bib.bib112)\]\. Existing work studies how stable matchings or preferences can be learned from noisy observations, but typically assumes an underlying static match value or preference structure\.\[[22](https://arxiv.org/html/2606.06744#bib.bib61),[31](https://arxiv.org/html/2606.06744#bib.bib40),[9](https://arxiv.org/html/2606.06744#bib.bib42),[25](https://arxiv.org/html/2606.06744#bib.bib118),[41](https://arxiv.org/html/2606.06744#bib.bib121),[45](https://arxiv.org/html/2606.06744#bib.bib120),[60](https://arxiv.org/html/2606.06744#bib.bib126),[43](https://arxiv.org/html/2606.06744#bib.bib117),[26](https://arxiv.org/html/2606.06744#bib.bib125),[5](https://arxiv.org/html/2606.06744#bib.bib124),[30](https://arxiv.org/html/2606.06744#bib.bib65),[19](https://arxiv.org/html/2606.06744#bib.bib103)\]\. These models typically reduce uncertainty to noisy \(like zero\-mean sub\-Gaussian\) observations of fixed latent preferences, with feedback revealed immediately after each interaction\. This is insufficient for settings where match quality evolves through interaction and history\. For example, in a dating market, partners’ profiles evolve as they age and accumulate life experience\. Our work builds on this literature while shifting attention from stable\-outcome identification alone to a richer sequential environment in which agents choose how to acquire information, whether to match, and whether to remain matched over time\.

Frictional search and post\-match learningare widely studied by a large economics literature\[[34](https://arxiv.org/html/2606.06744#bib.bib44),[13](https://arxiv.org/html/2606.06744#bib.bib45),[39](https://arxiv.org/html/2606.06744#bib.bib46),[44](https://arxiv.org/html/2606.06744#bib.bib50),[53](https://arxiv.org/html/2606.06744#bib.bib89),[11](https://arxiv.org/html/2606.06744#bib.bib97)\], which emphasizes that employment matching are not instantaneous assignment\. These markets are shaped by costly vacancy posting, screening, and hiring frictions\[[8](https://arxiv.org/html/2606.06744#bib.bib59),[10](https://arxiv.org/html/2606.06744#bib.bib98),[20](https://arxiv.org/html/2606.06744#bib.bib101)\], and are followed by employer learning, human\-capital accumulation, turnover, and costly post\-match adjustment\[[37](https://arxiv.org/html/2606.06744#bib.bib93),[14](https://arxiv.org/html/2606.06744#bib.bib95),[3](https://arxiv.org/html/2606.06744#bib.bib114),[27](https://arxiv.org/html/2606.06744#bib.bib43),[49](https://arxiv.org/html/2606.06744#bib.bib94),[55](https://arxiv.org/html/2606.06744#bib.bib96),[40](https://arxiv.org/html/2606.06744#bib.bib113),[54](https://arxiv.org/html/2606.06744#bib.bib60)\]\. Recent works on stable matching\[[4](https://arxiv.org/html/2606.06744#bib.bib37),[2](https://arxiv.org/html/2606.06744#bib.bib123),[38](https://arxiv.org/html/2606.06744#bib.bib119),[7](https://arxiv.org/html/2606.06744#bib.bib122)\]treat screening and in,formation as complementary pre\-match signals to reveal the underlying preference and fails to capture that information can arrive after the match initiates\.Learn2Matchfills that gap and turns the above mechanisms into a controlled multi\-agent learning environment\. These models, however, are stylized analytical tools that isolate single mechanisms under tractability assumptions to explain aggregate behavior; they are not controlled environments in which agents*learn*matching strategies from interaction\.Learn2Matchfills that gap: a simulatable benchmark grounded in this literature\. Although labor markets are our leading example, the same structural features—persistent state, costly information acquisition, and gradual revelation of match quality—arise in dating, school choice, and platform\-based matching, where the framework applies equally

Dynamic matching and multi\-agent reinforcement learning \(MARL\)\.Minet al\.\[[36](https://arxiv.org/html/2606.06744#bib.bib92)\]initiated a theoretical study of RL on matching markets\. While their model allows the market context to evolve across rounds, each round still implements a centralized myopic stable matching planner and receives immediate noisy utility feedback for the matched pairs\. Our worker instead adopts a fully decentralized decision\-making paradigm with temporally extended feedback\.\[[32](https://arxiv.org/html/2606.06744#bib.bib41)\]study welfare maximization in Markov exchange economies, and\[[51](https://arxiv.org/html/2606.06744#bib.bib62)\]develop MARL frameworks for off\-policy evaluation in two\-sided markets\. More broadly, modern MARL provides tools for decentralized partially observable multi\-agent decision\-making\[[33](https://arxiv.org/html/2606.06744#bib.bib28),[46](https://arxiv.org/html/2606.06744#bib.bib27),[59](https://arxiv.org/html/2606.06744#bib.bib13),[58](https://arxiv.org/html/2606.06744#bib.bib29),[50](https://arxiv.org/html/2606.06744#bib.bib30)\], and has been used to simulate market\-scale systems such as tax policy and mechanism design\[[61](https://arxiv.org/html/2606.06744#bib.bib32),[24](https://arxiv.org/html/2606.06744#bib.bib106)\]\. Our contribution differs in three concrete axes\. First, we treat the market as a partially observable Markov*game*—decentralized two\-sided learningrather than a centralized planner with fully observable states\. Second, we model the specific mechanisms that produce temporally extended feedback in real matching markets: multi\-round pre\-match screening, gradual post\-match revelation through tenure, and endogenous dissolution, none of which appear together in the works above\.

## 3Preliminary

We study a two\-sided dynamic matching market with strategic agents on both sides or on one side only\. There areN𝒲N^\{\\mathcal\{W\}\}workers on side𝒲\\mathcal\{W\}andNℱN^\{\\mathcal\{F\}\}firms on sideℱ\\mathcal\{F\}\(which can also represent, e\.g\., students and schools, patients and hospitals, or women and men\)\. Agents on each side may propose to, accept, reject, or dissolve matches with agents on the opposite side\.

We denote\[N\]=\{1,2,…,N\}\[N\]=\\\{1,2,\.\.\.,N\\\}\. For a setSS,Δ\(S\)\\Delta\(S\)is the set of all probability distributions overSS\. If not stated otherwise, agentiiwithout any specification refers to an agent either on the𝒲\\mathcal\{W\}orℱ\\mathcal\{F\}side, for notation simplicity\.

To capture such delayed and sequentially revealed interaction feedback, we model the matching market as a partially observable Markov game\. Compared to repeated tabular games or bandit algorithms, a key property of Markov games is the capability to model sequential dynamics, which is necessary for a matching market with delayed feedback\. We formulate the environment as:

𝒢=\(N𝒲,Nℱ,𝒮,𝒫,\{𝒪i𝒲,𝒜i𝒲,Ri𝒲\}i=1N𝒲,\{𝒪jℱ,𝒜jℱ,Rjℱ\}j=1Nℱ,ρ0\)\.\\mathcal\{G\}=\\left\(N^\{\\mathcal\{W\}\},N^\{\\mathcal\{F\}\},\\mathcal\{S\},\\mathcal\{P\},\\\{\\mathcal\{O\}\_\{i\}^\{\\mathcal\{W\}\},\\mathcal\{A\}\_\{i\}^\{\\mathcal\{W\}\},R\_\{i\}^\{\\mathcal\{W\}\}\\\}\_\{i=1\}^\{N^\{\\mathcal\{W\}\}\},\\\{\\mathcal\{O\}\_\{j\}^\{\\mathcal\{F\}\},\\mathcal\{A\}\_\{j\}^\{\\mathcal\{F\}\},R\_\{j\}^\{\\mathcal\{F\}\}\\\}\_\{j=1\}^\{N^\{\\mathcal\{F\}\}\},\\rho\_\{0\}\\right\)\.
Here𝒮\\mathcal\{S\}is the state space, and𝒜𝒲=𝒜1𝒲×…×𝒜N𝒲𝒲\\mathcal\{A\}^\{\\mathcal\{W\}\}=\\mathcal\{A\}^\{\\mathcal\{W\}\}\_\{1\}\\times\.\.\.\\times\\mathcal\{A\}^\{\\mathcal\{W\}\}\_\{N^\{\\mathcal\{W\}\}\}and𝒜ℱ=𝒜1ℱ×…×𝒜Nℱℱ\\mathcal\{A\}^\{\\mathcal\{F\}\}=\\mathcal\{A\}^\{\\mathcal\{F\}\}\_\{1\}\\times\.\.\.\\times\\mathcal\{A\}^\{\\mathcal\{F\}\}\_\{N^\{\\mathcal\{F\}\}\}denote the joint action spaces of workers and firms, respectively\. On statess, each agentiireceives the observationoi=𝒪i\(s\)∈Oio\_\{i\}=\\mathcal\{O\}\_\{i\}\(s\)\\in O\_\{i\}by the observation mapping function𝒪i\\mathcal\{O\}\_\{i\}and takes an actionaia\_\{i\}following the policyπi:Oi→Δ\(𝒜i\)\\pi\_\{i\}:O\_\{i\}\\to\\Delta\(\\mathcal\{A\}\_\{i\}\)\. With the joint action𝐚𝒲=\(a1𝒲,…,aN𝒲𝒲\)\\mathbf\{a\}^\{\\mathcal\{W\}\}=\\left\(a^\{\\mathcal\{W\}\}\_\{1\},\.\.\.,a^\{\\mathcal\{W\}\}\_\{N^\{\\mathcal\{W\}\}\}\\right\)and𝐚ℱ=\(a1ℱ,…,aNℱℱ\)\\mathbf\{a\}^\{\\mathcal\{F\}\}=\\left\(a^\{\\mathcal\{F\}\}\_\{1\},\.\.\.,a^\{\\mathcal\{F\}\}\_\{N^\{\\mathcal\{F\}\}\}\\right\), the next states′∼𝒫\(s,𝐚𝒲,𝐚ℱ\)s^\{\\prime\}\\sim\\mathcal\{P\}\(s,\\mathbf\{a\}^\{\\mathcal\{W\}\},\\mathbf\{a\}^\{\\mathcal\{F\}\}\)follows the transition function𝒫:S×𝒜𝒲×𝒜ℱ→Δ\(𝒮\)\\mathcal\{P\}:S\\times\\mathcal\{A\}^\{\\mathcal\{W\}\}\\times\\mathcal\{A\}^\{\\mathcal\{F\}\}\\to\\Delta\(\\mathcal\{S\}\)\. And each agentiireceives a rewardrir\_\{i\}given by the reward functionRi:𝒮×𝒜𝒲×𝒜ℱ→ℝR\_\{i\}:\\mathcal\{S\}\\times\\mathcal\{A\}^\{\\mathcal\{W\}\}\\times\\mathcal\{A\}^\{\\mathcal\{F\}\}\\to\\mathbb\{R\}\.

## 4Problem setting

Bandit\-based matching formulations assume immediate feedback, in which the consequences of proposals, acceptances, and separations are realized within the same round\. Instead, our formulation allows delayed feedback throughout the matching process\. We instead thread partial observability through every stage; An offer need not be resolved immediately but may enter an intermediate evaluation stage whose outcome is revealed only after several periods; And even after a match is formed, match quality may remain partially observed and continue to evolve over time\. Continuation and dissolution decisions therefore depend on information that arrives gradually rather than immediately\.

State\.At each step, the market is in a state

s=\(X\(s\),Y\(s\),M\(s\),H\(s\)\)\)s=\(X\(s\),Y\(s\),M\(s\),H\(s\)\)\)Here,X\(s\)X\(s\)andY\(s\)Y\(s\)represent the information \(which can be latent, believed, or observed\) of the agents of side𝒲\\mathcal\{W\}orℱ\\mathcal\{F\}, respectively\. To be specific,xi,yj∈ℝdx\_\{i\},y\_\{j\}\\in\\mathbb\{R\}^\{d\}are the latent profiles of agentiion side𝒲\\mathcal\{W\}and agentjjon sideℱ\\mathcal\{F\}, respectively\. On statess, agentjjon sideℱ\\mathcal\{F\}has an observation or belief of agentiion side𝒲\\mathcal\{W\}, which is denoted asx^ij∈ℝd\\hat\{x\}\_\{ij\}\\in\\mathbb\{R\}^\{d\}\.y^ij\\hat\{y\}\_\{ij\}is defined similarly\.

The matching set

M\(s\)∈\[N𝒲\]×\[Nℱ\]M\(s\)\\in\[N^\{\\mathcal\{W\}\}\]\\times\[N^\{\\mathcal\{F\}\}\]denotes all matched pairs\(i,j\)\(i,j\)for the current statess\.

The history termH\(s\)H\(s\)summarizes the relevant past interactions in the market, such as previous matches, past rewards, interview outcomes, proposal and rejection decisions, and other observations generated by the environment\. We do not give a strict form ofH\(s\)H\(s\)here to allow for flexible control of the environment dynamics\.

Asymptotic Revelation\.Motivated by economics literature\[[12](https://arxiv.org/html/2606.06744#bib.bib2),[14](https://arxiv.org/html/2606.06744#bib.bib95),[27](https://arxiv.org/html/2606.06744#bib.bib43)\], we assume that if a pair\(i,j\)\(i,j\)is matched for infinite times, the latent variable will be revealed for both agent \(up to a constantϵ\)\\epsilon\)\.

∀i∈\[N𝒲\],j∈\[Nℱ\],\(s1,s2,…\)∈𝒮∞,\\displaystyle\\forall i\\in\[N^\{\\mathcal\{W\}\}\],j\\in\[N^\{\\mathcal\{F\}\}\],\(s\_\{1\},s\_\{2\},\.\.\.\)\\in\\mathcal\{S\}^\{\\infty\},\(∑t=0T\[\(i,j\)∈M\(st\)\]\)=∞⇒limt→∞\|x^i,j\(st\)−xi\|≤ϵandlimt→∞\|y^i,j\(st\)−yj\|≤ϵ\.\\displaystyle\\left\(\\sum\_\{t=0\}^\{T\}\[\(i,j\)\\in M\(s\_\{t\}\)\]\\right\)=\\infty\\Rightarrow\\lim\_\{t\\to\\infty\}\|\\hat\{x\}\_\{i,j\}\(s\_\{t\}\)\-x\_\{i\}\|\\leq\\epsilon~\\text\{and\}~\\lim\_\{t\\to\\infty\}\|\\hat\{y\}\_\{i,j\}\(s\_\{t\}\)\-y\_\{j\}\|\\leq\\epsilon\.
Action\.We allow three broad categories of actions:*proposal*,*response*, and*dissolution*\.

Proposal\.An agent may initiate an interaction with an agent on the opposite side\. There could be multiple pre\-matching screening stages\. For example, in the labor market, the interview is the main screening stage\. In school application, there can be multiple screening stages, like application review and interview\.

Response\.Upon receiving a proposal, the receiving agent chooses whether to*accept*or*reject*it\. If a screening invitation is accepted, the two agents enter the screening stage; if a match proposal is accepted by both sides under the market rules \(optional environment design choice, e\.g\., match can only be accepted after interview\), the pair becomes matched and will be added toM\(s\)M\(s\)\.

Dissolution\.Either side may dissolve an existing match at any time, potentially incurring a cost\. This captures quitting, firing, churn, or other forms of separation\.

In this work, we assume only agents on side𝒲\\mathcal\{W\}can propose a matching\. \(Such constraints do not apply to screening proposals\.\) This assumption follows the population agent–arm formulation commonly adopted in bandit learning for two\-sided matching markets\[[31](https://arxiv.org/html/2606.06744#bib.bib40),[9](https://arxiv.org/html/2606.06744#bib.bib42),[30](https://arxiv.org/html/2606.06744#bib.bib65),[25](https://arxiv.org/html/2606.06744#bib.bib118),[42](https://arxiv.org/html/2606.06744#bib.bib33),[28](https://arxiv.org/html/2606.06744#bib.bib67),[38](https://arxiv.org/html/2606.06744#bib.bib119)\], and enforces the one\-to\-one matching constraint at the environment level by preventing duplicate assignments, i\.e\., matching one agent to multiple agents on the other side in the same round\.

Step or Time Unit\.In this work the time stepttcan be regarded as the minimal time unit or division \(e\.g\., a calendar month\) for matching\. That is, no changes of matching can happen inside the samett\. Except from that, we note that within the samettthere can be multiple actions taken “simultaneously”\. For example, a pair of agents can initiate a screening and accept it in the same step\. We consider a finite horizonTTfor each episode\.

Reward\.We consider a linear reward model with unobservable latent profiles of agents and observable empirical variables of agents\. To be specific, for a matched pair\(i,j\)\(i,j\), agentiion side𝒲\\mathcal\{W\}receivesRi𝒲\(s\)=⟨xi,y^ij\(s\)⟩R^\{\\mathcal\{W\}\}\_\{i\}\(s\)=\\langle x\_\{i\},\\hat\{y\}\_\{ij\}\(s\)\\rangleand agentjjon sideℱ\\mathcal\{F\}receives a rewardRjℱ\(s\)=⟨x^ij\(s\),yj⟩R^\{\\mathcal\{F\}\}\_\{j\}\(s\)=\\langle\\hat\{x\}\_\{ij\}\(s\),y\_\{j\}\\rangleon statess\. If an agent is not matched on the current steptt, it receives zero reward\. We remark that linear rewards are widely used in real\-world matching markets such as recommendation and advertising\[[29](https://arxiv.org/html/2606.06744#bib.bib6),[17](https://arxiv.org/html/2606.06744#bib.bib15)\], and are general enough to represent complete ordinal preference rankings, i\.e\., strict preference lists, as used in the classic Gale–Shapley stable matching model\[[15](https://arxiv.org/html/2606.06744#bib.bib99)\]\.

For MARL, the objective \(value function\) ofπi\\pi\_\{i\}is

Viπi,π−i=𝔼\[∑t=1TR\(st,𝐚t\)∣\(s1,a1,…,sT,aT\)∼πi,π−i\]\.V\_\{i\}^\{\\pi\_\{i\},\\pi\_\{\-i\}\}=\\mathbb\{E\}\\left\[\\sum\_\{t=1\}^\{T\}R\(s\_\{t\},\\mathbf\{a\}\_\{t\}\)\\mid\(s\_\{1\},\\mathrm\{a\}\_\{1\},\.\.\.,s\_\{T\},\\mathrm\{a\}\_\{T\}\)\\sim\\pi\_\{i\},\\pi\_\{\-i\}\\right\]\.
Therefore, the value ofπi\\pi\_\{i\}depends not on the policy itself but alsoπ−i\\pi\_\{\-i\}, the policies of all other agents on both sides, as is typical of multi\-agent Markov games\.

Evaluation metric\.Following prior work on bandit learning in decentralized matching markets\[[31](https://arxiv.org/html/2606.06744#bib.bib40),[9](https://arxiv.org/html/2606.06744#bib.bib42),[25](https://arxiv.org/html/2606.06744#bib.bib118),[42](https://arxiv.org/html/2606.06744#bib.bib33)\], we adopt the worker\-optimal regret and firm\-pessimal regret\. LetM∗M^\{\*\}be the𝒲\\mathcal\{W\}\-optimal stable matching, equivalently the worker\-optimal and firm\-pessimal stable matching, as characterized by the Gale–Shapley framework and the lattice structure of stable matchings\[[15](https://arxiv.org/html/2606.06744#bib.bib99),[47](https://arxiv.org/html/2606.06744#bib.bib53)\]\. The regrets are

Regret𝒲¯=\\displaystyle\\overline\{\\textit\{Regret\}^\{\\mathcal\{W\}\}\}=∑t=1T\(∑\(i,j\)∈M∗N𝒲⟨xi,yj⟩−∑\(i,j\)∈M\(st\)⟨xi,y^ij\(st\)⟩\),\\displaystyle\\sum\_\{t=1\}^\{T\}\\left\(\\sum\_\{\(i,j\)\\in M^\{\*\}\}^\{N^\{\\mathcal\{W\}\}\}\\langle x\_\{i\},y\_\{j\}\\rangle\-\\sum\_\{\(i,j\)\\in M\(s\_\{t\}\)\}\\langle x\_\{i\},\\hat\{y\}\_\{ij\}\(s\_\{t\}\)\\rangle\\right\),Regretℱ¯=\\displaystyle\\underline\{\\textit\{Regret\}^\{\\mathcal\{F\}\}\}=∑t=1T\(∑\(i,j\)∈M∗⟨xi,yj⟩−∑\(i,j\)∈M\(st\)⟨x^ij\(st\),yj⟩\)\.\\displaystyle\\sum\_\{t=1\}^\{T\}\\left\(\\sum\_\{\(i,j\)\\in M^\{\*\}\}\\langle x\_\{i\},y\_\{j\}\\rangle\-\\sum\_\{\(i,j\)\\in M\(s\_\{t\}\)\}\\langle\\hat\{x\}\_\{ij\}\(s\_\{t\}\),y\_\{j\}\\rangle\\right\)\.
We also considerSocial Welfareas the total reward of all agents\.

Social Welfare=∑t=1T∑\(i,j\)∈M\(st\)⟨xi,y^ij\(st\)⟩\+⟨x^ij\(st\),yj⟩\.\\textit\{Social Welfare\}=\\sum\_\{t=1\}^\{T\}\\sum\_\{\(i,j\)\\in M\(s\_\{t\}\)\}\\langle x\_\{i\},\\hat\{y\}\_\{ij\}\(s\_\{t\}\)\\rangle\+\\langle\\hat\{x\}\_\{ij\}\(s\_\{t\}\),y\_\{j\}\\rangle\.
To capture temporally extended feedback, we consider a specificFriction Lossas the loss of social walfare due to incorrect belief, assessment or matching decisions\. LetM∗\(st\)M^\{\*\}\(s\_\{t\}\)be the𝒲\\mathcal\{W\}worker\-optimal stable matching under the preference defined by⟨xi,y^ij\(st\)⟩,⟨x^ij\(st\),yj⟩\\langle x\_\{i\},\\hat\{y\}\_\{ij\}\(s\_\{t\}\)\\rangle,\\langle\\hat\{x\}\_\{ij\}\(s\_\{t\}\),y\_\{j\}\\rangle\.

Fricition Loss=∑t=1T\(∑\(i,j\)∈M∗⟨xi,yj⟩−∑\(i,j\)∈M∗\(st\)⟨xi,yj⟩\)\.\\textit\{Fricition Loss\}=\\sum\_\{t=1\}^\{T\}\\left\(\\sum\_\{\(i,j\)\\in M^\{\*\}\}\\langle x\_\{i\},y\_\{j\}\\rangle\-\\sum\_\{\(i,j\)\\in M^\{\*\}\(s\_\{t\}\)\}\\langle x\_\{i\},y\_\{j\}\\rangle\\right\)\.
TheFriction Losscaptures how the gap on delayed observation can cause a gap on the final outcome \(reward or social walfare\)\.

## 5Learn2Match: A MARL Benchmark for Dynamic Two\-Sided Matching

We instantiate the framework in Section[4](https://arxiv.org/html/2606.06744#S4)asLearn2Match, a multi\-agent reinforcement\-learning benchmark for two\-sided matching with temporally extended feedback\. The environment models a market in which agents screen potential partners, form matches, learn gradually from interaction, and may endogenously dissolve matches\.

### 5\.1Order of action and reward settlement

Each stepttinLearn2Matchconsists of five phases\. First, unmatched agents on both sides simultaneously propose interviews\. Second, each agent accepts at most one received interview proposal; accepted pairs\(i,j\)\(i,j\)generate noisy observations but no match or production reward, so screening only refines beliefs\[[38](https://arxiv.org/html/2606.06744#bib.bib119),[4](https://arxiv.org/html/2606.06744#bib.bib37),[2](https://arxiv.org/html/2606.06744#bib.bib123),[7](https://arxiv.org/html/2606.06744#bib.bib122)\]\. Third, unmatched workers may propose matches only to previously interviewed firms\. Fourth, an unmatched firm accepts or rejects the match proposal; accepted pairs are added to the current matching\. Finally, either party in a matched pair may dissolve the match\. Rewards are earned only by pairs that remain matched through steptt\.

### 5\.2Life cycle of a matching

At the beginning of an episode, the latent profilesxix\_\{i\}andyjy\_\{j\}are sampled i\.i\.d\. fromN\(0,Id\)N\(0,I\_\{d\}\)for each agent\. In this work, we considerd=3d=3, but people can feel free to try different parameters\.

We will show the dynamics of the environment from the perspective of a pair\(i,j\)\(i,j\)going through the entire process of interview, matching, and dissolution\. For any statess, lettenureij\(H\(s\)\)\\text\{tenure\}\_\{ij\}\(H\(s\)\)be the number of steps that\(i,j\)\(i,j\)are matched until statess\.

Suppose at stept1t\_\{1\},\(i,j\)\(i,j\)is not matched\. \(That is,\(i,j\)∉M\(st\)\(i,j\)\\notin M\(s\_\{t\}\)\.\) Theniican initialize an interview proposal tojj\. Ifjjaccepts the interview, both agents observes

x^ij\(st1\)=xi\+εinterview,ij𝒲\(st1\),y^ij\(st1\)=yj\+εinterview,ijℱ\(st1\),\\hat\{x\}\_\{ij\}\(s\_\{t\_\{1\}\}\)=x\_\{i\}\+\\varepsilon^\{\\mathcal\{W\}\}\_\{\\text\{interview\},ij\}\(s\_\{t\_\{1\}\}\),\\qquad\\hat\{y\}\_\{ij\}\(s\_\{t\_\{1\}\}\)=y\_\{j\}\+\\varepsilon^\{\\mathcal\{F\}\}\_\{\\text\{interview\},ij\}\(s\_\{t\_\{1\}\}\),where all noise termsεinterview\\varepsilon\_\{\\text\{interview\}\}are i\.i\.d\. sampled from𝒩\(0,σinterview2\)\\mathcal\{N\}\(0,\\sigma\_\{\\text\{interview\}\}^\{2\}\)\. As the number of interviews increases, the accumulated noisy observations become increasingly concentrated around the underlying latent profiles, enabling agents to infer the latent profiles more accurately from the interaction history\.

Based on the observation history through the interview, the agents can decide to form a match, proposed by agents on side𝒲\\mathcal\{W\}and responded by agents on sideℱ\\mathcal\{F\}\. Suppose on timet2≥t1t\_\{2\}\\geq t\_\{1\},\(i,j\)∈M\(st2\)\(i,j\)\\in M\(s\_\{t\_\{2\}\}\), both agents observe

x^ij\(st2\)\\displaystyle\\hat\{x\}\_\{ij\}\(s\_\{t\_\{2\}\}\)=xi1\+e−λreveal⋅tenureij\(H\(st2\)\)\+ϵmatch,ij𝒲\(st2\),\\displaystyle=\\frac\{x\_\{i\}\}\{1\+e^\{\-\\lambda\_\{\\mathrm\{reveal\}\}\\cdot\\text\{tenure\}\_\{ij\}\(H\(s\_\{t\_\{2\}\}\)\)\}\}\+\\epsilon^\{\\mathcal\{W\}\}\_\{\\text\{match\},ij\}\(s\_\{t\_\{2\}\}\),y^ij\(st2\)\\displaystyle\\hat\{y\}\_\{ij\}\(s\_\{t\_\{2\}\}\)=yj1\+e−λreveal⋅tenureij\(H\(st2\)\)\+ϵmatch,ijℱ\(st2\),σinterview2\>σmatch2,\\displaystyle=\\frac\{y\_\{j\}\}\{1\+e^\{\-\\lambda\_\{\\mathrm\{reveal\}\}\\cdot\\text\{tenure\}\_\{ij\}\(H\(s\_\{t\_\{2\}\}\)\)\}\}\+\\epsilon^\{\\mathcal\{F\}\}\_\{\\text\{match\},ij\}\(s\_\{t\_\{2\}\}\),~~~\\sigma\_\{\\text\{interview\}\}^\{2\}\>\\sigma\_\{\\text\{match\}\}^\{2\},and receives rewards according to the above observation\. Such dynamics models how the agent’s profile, belief, or observation may vary over time and how that converges to the latent profile with longer interaction\. Therefore, a match may be locally acceptable under current beliefs but later become undesirable as new information arrives or states evolve\. Thus, post\-match learning and dissolution cannot be reduced to a one\-shot stable\-matching outcome, consistent with employer\-learning and turnover models\[[14](https://arxiv.org/html/2606.06744#bib.bib95),[3](https://arxiv.org/html/2606.06744#bib.bib114),[27](https://arxiv.org/html/2606.06744#bib.bib43),[55](https://arxiv.org/html/2606.06744#bib.bib96)\]\. The parametersλreveal\\lambda\_\{\\mathrm\{reveal\}\}control how quickly realized states approach latent capacities while matched, capturing learning\-by\-doing, human\-capital accumulation, firm growth, or related productivity changes\[[37](https://arxiv.org/html/2606.06744#bib.bib93),[14](https://arxiv.org/html/2606.06744#bib.bib95),[3](https://arxiv.org/html/2606.06744#bib.bib114),[27](https://arxiv.org/html/2606.06744#bib.bib43),[49](https://arxiv.org/html/2606.06744#bib.bib94)\]\.

The pair\(i,j\)\(i,j\)can dissolve at some future stept3≥t2t\_\{3\}\\geq t\_\{2\}\. Each matched agent observes its current partner, tenure, latest estimate, and recent reward, then chooses whether to keep or dissolve the match\. Dissolution occurs if either side chooses to dissolve\. A dissolved agent becomes unmatched before entering a new match; stricter constraints, such as requiring exit before new interviews, can also be imposed\.

Rewards are computed only after the dissolution phase\. \(In the example above, rewarding period ist2,t2\+1,…,t3−1t\_\{2\},t\_\{2\}\+1,\.\.\.,t\_\{3\}\-1\. Hence, a match that forms and dissolves within the same time step generates no reward, imposing an opportunity cost on failed match formation\.

## 6Experiments

### 6\.1Algorithms

We useLearn2Matchto study how temporally extended feedback affects different decision\-making algorithms\. We compare one classic bandit\-style stable matching algorithm with an end\-to\-end MARL algorithm\.

Independent PPO\.We use the state\-of\-the\-art MARL algorithm independent proximal policy optimization \(PPO\)\[[50](https://arxiv.org/html/2606.06744#bib.bib30),[58](https://arxiv.org/html/2606.06744#bib.bib29)\]and adopt a decentralized implementation\. Each agent observes its local state—current match status, available proposals, interview history, belief estimates, tenure, and recent rewards—and outputs actions over interview proposals, match proposals or responses, and dissolution\. Agents are trained to maximize cumulative reward in an episode\.

CA\-ETC\[[42](https://arxiv.org/html/2606.06744#bib.bib33)\]\.We adapt CA\-ETC\[[42](https://arxiv.org/html/2606.06744#bib.bib33)\]toLearn2Matchby treating workers as players and firms as arms, executed entirely through the environment’s native action protocol\. In the exploration stage of CA\-ETC, a collision\-avoiding round\-robin schedule routes every worker–firm pair through the legal interview→\\rightarrowmatch\-proposal→\\rightarrowacceptance sequence so that empirical estimates ofRi𝒲R\_\{i\}^\{\\mathcal\{W\}\}andRjℱR\_\{j\}^\{\\mathcal\{F\}\}are recorded once at match formation\. In the exploitation stage of CA\-ETC, agents construct preference lists from estimated reward and execute the worker\-proposing Gale–Shapley by issuing the corresponding proposals and acceptances\. Subsequent dissolution is governed by theLearn2Matchtransition dynamics and does not affect CA\-ETC’s preference estimation; the adaptation thus preserves the original explore\-then\-commit structure while respecting the benchmark’s interaction constraints\. Implementation details are in the Appendix\.

### 6\.2Environment Setups

In the experiments, small market usesN𝒲=Nℱ=5N^\{\\mathcal\{W\}\}=N^\{\\mathcal\{F\}\}=5,d=3d=3,T=200T=200and large market usesN𝒲=Nℱ=20N^\{\\mathcal\{W\}\}=N^\{\\mathcal\{F\}\}=20,d=5d=5,T=600T=600\. We consider the following two settings:

Low\-noise / near\-static\.A near\-noiseless configuration withσinterview=10−6\\sigma\_\{\\text\{interview\}\}=10^\{\-6\},σmatch=0\\sigma\_\{\\text\{match\}\}=0, andλreveal=100\\lambda\_\{\\text\{reveal\}\}=100\. In the notation of Section[5](https://arxiv.org/html/2606.06744#S5), the largeλreveal\\lambda\_\{\\text\{reveal\}\}makes the sigmoid factor1/\(1\+e−λreveal⋅tenure\)1/\(1\+e^\{\-\\lambda\_\{\\mathrm\{reveal\}\}\\cdot\\text\{tenure\}\}\)saturate to one essentially instantaneously, so latent profiles are recoverable from a single interview and post\-match observations converge to the latent profiles immediately\. This setting matches the information structure assumed by decentralized matching\-bandit algorithms and is intentionally favorable to CA\-ETC\.

Temporally extended feedback\.The full benchmark withσinterview=1\\sigma\_\{\\text\{interview\}\}=1,σmatch=0\.6\\sigma\_\{\\text\{match\}\}=0\.6, andλreveal=1\\lambda\_\{\\text\{reveal\}\}=1\. We run such a setting with both small and large markets\. Thus,x^ij\\hat\{x\}\_\{ij\}andy^ij\\hat\{y\}\_\{ij\}approach their latent values only after several periods of tenure rather than instantaneously\. The latent stable matching is no longer recoverable from a single round of interviews\.

### 6\.3Evaluation metrics

We evaluate both PPO agents and CA\-ETC by worker\-side regret, firm\-side regret, social welfare, and friction loss as defined in Section[4](https://arxiv.org/html/2606.06744#S4), and provide training curves for PPO\.

### 6\.4Results

#### Low\-noise / near\-static setting\.

Even in this favorable regime for CA\-ETC, PPO achieves lower realized regret and higher realized welfare over the finite horizon \(Figure[2](https://arxiv.org/html/2606.06744#S6.F2)\)\. Att=200t=200, cumulative firm and worker regret are roughly halved relative to CA\-ETC, showing the advantage of MARL algorithms on long\-term planning\. CA\-ETC achieves near\-zero cumulative friction loss, while PPO accumulates about∼400\{\\sim\}400\. This is consistent with the design of CA\-ETC: when feedback is nearly immediate and noiseless, explicit per\-pair empirical averages and repeated Gale–Shapley recomputation recover a belief\-induced stable matching that closely matches the latent stable benchmark\.

![Refer to caption](https://arxiv.org/html/2606.06744v1/x2.png)\(a\)Firm regret
![Refer to caption](https://arxiv.org/html/2606.06744v1/x3.png)\(b\)Worker regret
![Refer to caption](https://arxiv.org/html/2606.06744v1/x4.png)\(c\)Social welfare
![Refer to caption](https://arxiv.org/html/2606.06744v1/x5.png)\(d\)Friction loss

Figure 2:Comparison ofLearn2Match\(PPO\) against CA\-ETC in Low\-noise / near\-static setting in the small market\. CA\-ETC has near\-zero cumulative friction loss\. However, PPO still outperforms CA\-ETC in both regret and social welfare\.![Refer to caption](https://arxiv.org/html/2606.06744v1/x6.png)\(a\)Worker regret
![Refer to caption](https://arxiv.org/html/2606.06744v1/x7.png)\(b\)Firm regret
![Refer to caption](https://arxiv.org/html/2606.06744v1/x8.png)\(c\)Social welfare
![Refer to caption](https://arxiv.org/html/2606.06744v1/x9.png)\(d\)Friction loss

Figure 3:Comparison ofLearn2Match\(PPO\) against CA\-ETC in the temporally extended feedback setting in the small market\. PPO outperforms CA\-ETC in both regret and social welfare, but CA\-ETC has lower friction loss\.![Refer to caption](https://arxiv.org/html/2606.06744v1/x10.png)\(a\)Worker regret
![Refer to caption](https://arxiv.org/html/2606.06744v1/x11.png)\(b\)Firm regret
![Refer to caption](https://arxiv.org/html/2606.06744v1/x12.png)\(c\)Social welfare
![Refer to caption](https://arxiv.org/html/2606.06744v1/x13.png)\(d\)Friction loss

Figure 4:PPO learning curves in the large market, temporally extended feedback setting\. Worker regret and firm regret decrease over training; social welfare increases and then converges\. Friction loss converges to a non\-zero value\.![Refer to caption](https://arxiv.org/html/2606.06744v1/x14.png)\(a\)Worker regret
![Refer to caption](https://arxiv.org/html/2606.06744v1/x15.png)\(b\)Firm regret
![Refer to caption](https://arxiv.org/html/2606.06744v1/x16.png)\(c\)Social welfare
![Refer to caption](https://arxiv.org/html/2606.06744v1/x17.png)\(d\)Friction loss

Figure 5:Comparison ofLearn2Match\(PPO\) against CA\-ETC in the temporally extended feedback setting in the large market\. The result is consistent with the small market\. PPO outperforms CA\-ETC in both regret and social welfare, but CA\-ETC has lower friction loss\.
#### Temporally extended feedback\.

The main benchmark setting restores the structure motivated in the introduction: interviews are noisy, post\-match observations are noisy, and latent profiles are revealed only gradually through tenure\. In this regime, PPO again outperforms CA\-ETC on three of the four evaluation metrics, both in the small market \(Figure[3](https://arxiv.org/html/2606.06744#S6.F3)\) and the large market \(Figure[5](https://arxiv.org/html/2606.06744#S6.F5)\)\.

This is the regime where temporally extended feedback matters most\. CA\-ETC estimates preferences from noisy, short\-run observations and then executes Gale–Shapley\-style exploitation from those estimates\. The welfare and regret gains therefore support the paper’s central claim that matching markets with delayed revelation require algorithms that act while information is still unfolding, rather than algorithms that reduce the problem to one\-shot preference estimation\.

Notably, although PPO achieves higher performance, the friction loss does not converge to zero, indicating insufficient exploration\. Figure[6](https://arxiv.org/html/2606.06744#S6.F6)explains this gap: PPO interviews a few pairs and then commits for long periods to promising matches, echoing the*career path dependence*documented in labor economics, where relationships become more persistent with tenure and current matches shape future learning and mobility costs\[[14](https://arxiv.org/html/2606.06744#bib.bib95),[55](https://arxiv.org/html/2606.06744#bib.bib96),[35](https://arxiv.org/html/2606.06744#bib.bib34),[23](https://arxiv.org/html/2606.06744#bib.bib35)\]\. This improves realized welfare through tenure\-based learning, but leaves many unobserved pairs near their priors, highlighting the need to combine PPO\-like adaptivity with CA\-ETC\-like coordinated exploration and stable\-matching structure\.

![Refer to caption](https://arxiv.org/html/2606.06744v1/small_heat.png)Figure 6:Left:interview coverage—the fraction that each pair\(i,j\)\(i,j\)was interviewed at least once by the end of the episode across all evaluation environments\.Right:mean cumulative tenure of each pair at the final period\. Both figures are from the large market setting\.

## 7Conclusion

We introducedLearn2Match, a benchmark for studying two\-sided matching markets with temporally extended feedback\. Unlike standard learning\-based matching models that treat proposals or matches as producing immediate noisy feedback about fixed preferences, our formulation models matching as a partially observable Markov game in which information is revealed gradually through sequential interaction\. Our experiments show that reinforcement\-learning policies better exploit temporal structure than bandit\-style baselines by allocating interviews toward informative future matches, rematching early under uncertainty, and reducing information\-friction loss over time\.

## References

- \[1\]\(2003\)School choice: a mechanism design approach\.American economic review93\(3\),pp\. 729–747\.Cited by:[§1](https://arxiv.org/html/2606.06744#S1.p1.1),[§1](https://arxiv.org/html/2606.06744#S1.p2.1),[§2](https://arxiv.org/html/2606.06744#S2.p1.1)\.
- \[2\]M\. Allman, I\. Ashlagi, A\. Saberi, and S\. H\. Yu\(2025\)From signaling to interviews in random matching markets\.InProceedings of the 57th Annual ACM Symposium on Theory of Computing,pp\. 1556–1567\.Cited by:[§2](https://arxiv.org/html/2606.06744#S2.p2.1),[§5\.1](https://arxiv.org/html/2606.06744#S5.SS1.p1.3)\.
- \[3\]J\. G\. Altonji and C\. R\. Pierret\(2001\)Employer learning and statistical discrimination\.The quarterly journal of economics116\(1\),pp\. 313–350\.Cited by:[§1](https://arxiv.org/html/2606.06744#S1.p3.1),[§2](https://arxiv.org/html/2606.06744#S2.p2.1),[§5\.2](https://arxiv.org/html/2606.06744#S5.SS2.p4.5)\.
- \[4\]I\. Ashlagi, J\. Chen, M\. Roghani, and A\. Saberi\(2025\)Stable matching with interviews\.In16th Innovations in Theoretical Computer Science Conference \(ITCS 2025\),pp\. 12–1\.Cited by:[§2](https://arxiv.org/html/2606.06744#S2.p2.1),[§5\.1](https://arxiv.org/html/2606.06744#S5.SS1.p1.3)\.
- \[5\]A\. Athanasopoulos, A\. George, and C\. Dimitrakakis\(2025\)Probably correct optimal stable matching for two\-sided markets under uncertainty\.arXiv preprint arXiv:2501\.03018\.Cited by:[§1](https://arxiv.org/html/2606.06744#S1.p2.1),[§2](https://arxiv.org/html/2606.06744#S2.p1.1)\.
- \[6\]X\. Azagirre, A\. Balwally, G\. Candeli, N\. Chamandy, B\. Han, A\. King, H\. Lee, M\. Loncaric, S\. Martin, V\. Narasiman,et al\.\(2024\)A better match for drivers and riders: reinforcement learning at lyft\.INFORMS Journal on Applied Analytics54\(1\),pp\. 71–83\.Cited by:[§1](https://arxiv.org/html/2606.06744#S1.p2.1)\.
- \[7\]M\. Babaioff, R\. Gil, and A\. Romm\(2026\)Efficient interview scheduling for stable matching\.arXiv preprint arXiv:2602\.20358\.Cited by:[§2](https://arxiv.org/html/2606.06744#S2.p2.1),[§5\.1](https://arxiv.org/html/2606.06744#S5.SS1.p1.3)\.
- \[8\]J\. M\. Barron, M\. C\. Berger, and D\. A\. Black\(1997\)Employer search, training, and vacancy duration\.Economic inquiry35\(1\),pp\. 167–192\.Cited by:[§2](https://arxiv.org/html/2606.06744#S2.p2.1)\.
- \[9\]S\. Basu, K\. A\. Sankararaman, and A\. Sankararaman\(2021\)Beyondlog2⁡\(T\)\\log^\{2\}\(T\)regret for decentralized bandits in matching markets\.InInternational Conference on Machine Learning,pp\. 705–715\.Cited by:[§1](https://arxiv.org/html/2606.06744#S1.p2.1),[§2](https://arxiv.org/html/2606.06744#S2.p1.1),[§4](https://arxiv.org/html/2606.06744#S4.p10.1),[§4](https://arxiv.org/html/2606.06744#S4.p15.2)\.
- \[10\]M\. Blatter, S\. Muehlemann, and S\. Schenker\(2012\)The costs of hiring skilled workers\.European Economic Review56\(1\),pp\. 20–35\.Cited by:[§2](https://arxiv.org/html/2606.06744#S2.p2.1)\.
- \[11\]C\. Carrillo\-Tudela, H\. Gartner, and L\. Kaas\(2023\)Recruitment policies, job\-filling rates, and matching efficiency\.Journal of the European Economic Association21\(6\),pp\. 2413–2459\.Cited by:[§1](https://arxiv.org/html/2606.06744#S1.p3.1),[§2](https://arxiv.org/html/2606.06744#S2.p2.1)\.
- \[12\]M\. W\. Cripps, J\. C\. Ely, G\. J\. Mailath, and L\. Samuelson\(2008\)Common learning\.Econometrica76\(4\),pp\. 909–933\.Cited by:[§4](https://arxiv.org/html/2606.06744#S4.p5.2)\.
- \[13\]P\. A\. Diamond\(1982\)Aggregate demand management in search equilibrium\.Journal of political Economy90\(5\),pp\. 881–894\.Cited by:[§1](https://arxiv.org/html/2606.06744#S1.p3.1),[§2](https://arxiv.org/html/2606.06744#S2.p2.1)\.
- \[14\]H\. S\. Farber and R\. Gibbons\(1996\)Learning and wage dynamics\.The Quarterly Journal of Economics111\(4\),pp\. 1007–1047\.Cited by:[§1](https://arxiv.org/html/2606.06744#S1.p3.1),[§2](https://arxiv.org/html/2606.06744#S2.p2.1),[§4](https://arxiv.org/html/2606.06744#S4.p5.2),[§5\.2](https://arxiv.org/html/2606.06744#S5.SS2.p4.5),[§6\.4](https://arxiv.org/html/2606.06744#S6.SS4.SSS0.Px2.p3.1)\.
- \[15\]D\. Gale and L\. S\. Shapley\(1962\)College admissions and the stability of marriage\.The American mathematical monthly69\(1\),pp\. 9–15\.Cited by:[§1](https://arxiv.org/html/2606.06744#S1.p1.1),[§2](https://arxiv.org/html/2606.06744#S2.p1.1),[§4](https://arxiv.org/html/2606.06744#S4.p12.9),[§4](https://arxiv.org/html/2606.06744#S4.p15.2)\.
- \[16\]F\. Groes, P\. Kircher, and I\. Manovskii\(2015\)The u\-shapes of occupational mobility\.The Review of Economic Studies82\(2\),pp\. 659–692\.Cited by:[§1](https://arxiv.org/html/2606.06744#S1.p3.1)\.
- \[17\]L\. Guo, J\. Jin, H\. Zhang, Z\. Zheng, Z\. Yang, Z\. Xing, F\. Pan, L\. Niu, F\. Wu, H\. Xu,et al\.\(2021\)We know what you want: an advertising strategy recommender system for online advertising\.InProceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining,pp\. 2919–2927\.Cited by:[§4](https://arxiv.org/html/2606.06744#S4.p12.9)\.
- \[18\]G\. J\. Hitsch, A\. Hortaçsu, and D\. Ariely\(2010\)Matching and sorting in online dating\.American Economic Review100\(1\),pp\. 130–163\.Cited by:[§1](https://arxiv.org/html/2606.06744#S1.p1.1)\.
- \[19\]H\. Hosseini, S\. Roy, and D\. Zhang\(2024\)Putting gale & shapley to work: guaranteeing stability through learning\.Advances in Neural Information Processing Systems37,pp\. 69043–69068\.Cited by:[§1](https://arxiv.org/html/2606.06744#S1.p2.1),[§2](https://arxiv.org/html/2606.06744#S2.p1.1)\.
- \[20\]F\. Huang and P\. Cappelli\(2006\)Employee screening: theory and evidence\.National Bureau of Economic Research Cambridge, Mass\., USA\.Cited by:[§2](https://arxiv.org/html/2606.06744#S2.p2.1)\.
- \[21\]N\. Immorlica, B\. Lucier, V\. Manshadi, and A\. Wei\(2021\)Designing approximately optimal search on matching platforms\.InProceedings of the 22nd ACM Conference on Economics and Computation,pp\. 632–633\.Cited by:[§1](https://arxiv.org/html/2606.06744#S1.p2.1)\.
- \[22\]M\. Jagadeesan, A\. Wei, Y\. Wang, M\. Jordan, and J\. Steinhardt\(2021\)Learning equilibria in matching markets from bandit feedback\.Advances in Neural Information Processing Systems34,pp\. 3323–3335\.Cited by:[§1](https://arxiv.org/html/2606.06744#S1.p2.1),[§2](https://arxiv.org/html/2606.06744#S2.p1.1)\.
- \[23\]G\. Kambourov and I\. Manovskii\(2009\)Occupational mobility and wage inequality\.The Review of Economic Studies76\(2\),pp\. 731–759\.Cited by:[§1](https://arxiv.org/html/2606.06744#S1.p3.1),[§6\.4](https://arxiv.org/html/2606.06744#S6.SS4.SSS0.Px2.p3.1)\.
- \[24\]S\. Karten, W\. Li, Z\. Ding, S\. Kleiner, Y\. Bai, and C\. Jin\(2025\)Llm economist: large population models and mechanism design in multi\-agent generative simulacra\.arXiv preprint arXiv:2507\.15815\.Cited by:[§2](https://arxiv.org/html/2606.06744#S2.p3.1)\.
- \[25\]F\. Kong and S\. Li\(2023\)Player\-optimal stable regret for bandit learning in matching markets\.InProceedings of the 2023 Annual ACM\-SIAM Symposium on Discrete Algorithms \(SODA\),pp\. 1512–1522\.Cited by:[§1](https://arxiv.org/html/2606.06744#S1.p2.1),[§2](https://arxiv.org/html/2606.06744#S2.p1.1),[§4](https://arxiv.org/html/2606.06744#S4.p10.1),[§4](https://arxiv.org/html/2606.06744#S4.p15.2)\.
- \[26\]F\. Kong, J\. Tang, M\. Li, P\. Lu, J\. C\. Lui, and S\. Li\(2025\)Bandit learning in matching markets with indifference\.InThe Thirteenth International Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2606.06744#S1.p2.1),[§2](https://arxiv.org/html/2606.06744#S2.p1.1)\.
- \[27\]F\. Lange\(2007\)The speed of employer learning\.Journal of Labor Economics25\(1\),pp\. 1–35\.Cited by:[§1](https://arxiv.org/html/2606.06744#S1.p3.1),[§2](https://arxiv.org/html/2606.06744#S2.p2.1),[§4](https://arxiv.org/html/2606.06744#S4.p5.2),[§5\.2](https://arxiv.org/html/2606.06744#S5.SS2.p4.5)\.
- \[28\]S\. Li, Z\. Wang, and F\. Kong\(2025\)A survey on bandit learning in matching markets\.InProceedings of the Thirty\-Fourth International Joint Conference on Artificial Intelligence,pp\. 10546–10554\.Cited by:[§4](https://arxiv.org/html/2606.06744#S4.p10.1)\.
- \[29\]Y\. Li, Y\. Wang, X\. Chen, and Y\. Zhou\(2021\)Tight regret bounds for infinite\-armed linear contextual bandits\.InInternational Conference on Artificial Intelligence and Statistics,pp\. 370–378\.Cited by:[§4](https://arxiv.org/html/2606.06744#S4.p12.9)\.
- \[30\]Y\. Li, C\. Wang, G\. Cheng, and W\. W\. Sun\(2022\)Dynamic matching bandit for two\-sided online markets\.arXiv preprint arXiv:2205\.03699\.Cited by:[§1](https://arxiv.org/html/2606.06744#S1.p2.1),[§2](https://arxiv.org/html/2606.06744#S2.p1.1),[§4](https://arxiv.org/html/2606.06744#S4.p10.1)\.
- \[31\]L\. T\. Liu, F\. Ruan, H\. Mania, and M\. I\. Jordan\(2021\)Bandit learning in decentralized matching markets\.Journal of Machine Learning Research22\(211\),pp\. 1–34\.Cited by:[§1](https://arxiv.org/html/2606.06744#S1.p2.1),[§2](https://arxiv.org/html/2606.06744#S2.p1.1),[§4](https://arxiv.org/html/2606.06744#S4.p10.1),[§4](https://arxiv.org/html/2606.06744#S4.p15.2)\.
- \[32\]Z\. Liu, M\. Lu, Z\. Wang, M\. Jordan, and Z\. Yang\(2022\)Welfare maximization in competitive equilibrium: reinforcement learning for markov exchange economy\.InInternational Conference on Machine Learning,pp\. 13870–13911\.Cited by:[§2](https://arxiv.org/html/2606.06744#S2.p3.1)\.
- \[33\]R\. Lowe, Y\. I\. Wu, A\. Tamar, J\. Harb, O\. Pieter Abbeel, and I\. Mordatch\(2017\)Multi\-agent actor\-critic for mixed cooperative\-competitive environments\.Advances in neural information processing systems30\.Cited by:[§1](https://arxiv.org/html/2606.06744#S1.p5.1),[§2](https://arxiv.org/html/2606.06744#S2.p3.1)\.
- \[34\]J\. J\. McCall\(1970\)Economics of information and job search\.The Quarterly Journal of Economics84\(1\),pp\. 113–126\.Cited by:[§1](https://arxiv.org/html/2606.06744#S1.p3.1),[§2](https://arxiv.org/html/2606.06744#S2.p2.1)\.
- \[35\]R\. A\. Miller\(1984\)Job matching and occupational choice\.Journal of Political economy92\(6\),pp\. 1086–1120\.Cited by:[§1](https://arxiv.org/html/2606.06744#S1.p3.1),[§6\.4](https://arxiv.org/html/2606.06744#S6.SS4.SSS0.Px2.p3.1)\.
- \[36\]Y\. Min, T\. Wang, R\. Xu, Z\. Wang, M\. Jordan, and Z\. Yang\(2022\)Learn to match with no regret: reinforcement learning in markov matching markets\.Advances in Neural Information Processing Systems35,pp\. 19956–19970\.Cited by:[§2](https://arxiv.org/html/2606.06744#S2.p3.1)\.
- \[37\]J\. A\. Mincer\(1974\)Schooling and earnings\.InSchooling, experience, and earnings,pp\. 41–63\.Cited by:[§2](https://arxiv.org/html/2606.06744#S2.p2.1),[§5\.2](https://arxiv.org/html/2606.06744#S5.SS2.p4.5)\.
- \[38\]A\. Mirfakhar, X\. Wang, M\. Xu, H\. Beyhaghi, and M\. Hajiesmaili\(2026\)Bandit learning in matching markets with interviews\.arXiv preprint arXiv:2602\.12224\.Cited by:[§1](https://arxiv.org/html/2606.06744#S1.p5.1),[§2](https://arxiv.org/html/2606.06744#S2.p2.1),[§4](https://arxiv.org/html/2606.06744#S4.p10.1),[§5\.1](https://arxiv.org/html/2606.06744#S5.SS1.p1.3)\.
- \[39\]D\. T\. Mortensen and C\. A\. Pissarides\(1994\)Job creation and job destruction in the theory of unemployment\.The review of economic studies61\(3\),pp\. 397–415\.Cited by:[§1](https://arxiv.org/html/2606.06744#S1.p3.1),[§2](https://arxiv.org/html/2606.06744#S2.p2.1)\.
- \[40\]L\. Munasinghe\(2000\)Wage growth and the theory of turnover\.Journal of Labor Economics18\(2\),pp\. 204–220\.Cited by:[§1](https://arxiv.org/html/2606.06744#S1.p3.1),[§2](https://arxiv.org/html/2606.06744#S2.p2.1)\.
- \[41\]T\. Pagare and A\. Ghosh\(2023\)Two\-sided bandit learning in fully\-decentralized matching markets\.InICML 2023 Workshop The Many Facets of Preference\-Based Learning,Cited by:[§1](https://arxiv.org/html/2606.06744#S1.p2.1),[§2](https://arxiv.org/html/2606.06744#S2.p1.1)\.
- \[42\]T\. Pagare and A\. Ghosh\(2024\)Explore\-then\-commit algorithms for decentralized two\-sided matching markets\.In2024 IEEE International Symposium on Information Theory \(ISIT\),pp\. 2092–2097\.Cited by:[§A\.2](https://arxiv.org/html/2606.06744#A1.SS2.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2606.06744#S1.p5.1),[§1](https://arxiv.org/html/2606.06744#S1.p6.1),[§4](https://arxiv.org/html/2606.06744#S4.p10.1),[§4](https://arxiv.org/html/2606.06744#S4.p15.2),[§6\.1](https://arxiv.org/html/2606.06744#S6.SS1.p3.4),[§6\.1](https://arxiv.org/html/2606.06744#S6.SS1.p3.4.1)\.
- \[43\]S\. Parikh, S\. Basu, A\. Ghosh, and A\. Sankararaman\(2024\)Competing bandits in decentralized contextual matching markets\.arXiv preprint arXiv:2411\.11794\.Cited by:[§1](https://arxiv.org/html/2606.06744#S1.p2.1),[§2](https://arxiv.org/html/2606.06744#S2.p1.1)\.
- \[44\]C\. A\. Pissarides\(2000\)Equilibrium unemployment theory\.MIT press\.Cited by:[§1](https://arxiv.org/html/2606.06744#S1.p3.1),[§2](https://arxiv.org/html/2606.06744#S2.p2.1)\.
- \[45\]G\. Pokharel and S\. Das\(2023\)Converging to stability in two\-sided bandits: the case of unknown preferences on both sides of a matching market\.arXiv preprint arXiv:2302\.06176\.Cited by:[§1](https://arxiv.org/html/2606.06744#S1.p2.1),[§2](https://arxiv.org/html/2606.06744#S2.p1.1)\.
- \[46\]T\. Rashid, M\. Samvelyan, C\. S\. De Witt, G\. Farquhar, J\. Foerster, and S\. Whiteson\(2020\)Monotonic value function factorisation for deep multi\-agent reinforcement learning\.Journal of Machine Learning Research21\(178\),pp\. 1–51\.Cited by:[§1](https://arxiv.org/html/2606.06744#S1.p5.1),[§2](https://arxiv.org/html/2606.06744#S2.p3.1)\.
- \[47\]A\. E\. Roth and M\. Sotomayor\(1992\)Two\-sided matching\.Handbook of game theory with economic applications1,pp\. 485–541\.Cited by:[§1](https://arxiv.org/html/2606.06744#S1.p1.1),[§2](https://arxiv.org/html/2606.06744#S2.p1.1),[§4](https://arxiv.org/html/2606.06744#S4.p15.2)\.
- \[48\]A\. E\. Roth\(1996\)The national residency matching program as a labor market\.\.JAMA275\(13\),pp\. 1054–1056\.Cited by:[§1](https://arxiv.org/html/2606.06744#S1.p1.1),[§2](https://arxiv.org/html/2606.06744#S2.p1.1)\.
- \[49\]U\. Schönberg\(2007\)Testing for asymmetric employer learning\.Journal of Labor Economics25\(4\),pp\. 651–691\.Cited by:[§1](https://arxiv.org/html/2606.06744#S1.p3.1),[§2](https://arxiv.org/html/2606.06744#S2.p2.1),[§5\.2](https://arxiv.org/html/2606.06744#S5.SS2.p4.5)\.
- \[50\]J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov\(2017\)Proximal policy optimization algorithms\.arXiv preprint arXiv:1707\.06347\.Cited by:[§1](https://arxiv.org/html/2606.06744#S1.p5.1),[§1](https://arxiv.org/html/2606.06744#S1.p6.1),[§2](https://arxiv.org/html/2606.06744#S2.p3.1),[§6\.1](https://arxiv.org/html/2606.06744#S6.SS1.p2.1)\.
- \[51\]C\. Shi, R\. Wan, G\. Song, S\. Luo, H\. Zhu, and R\. Song\(2023\)A multiagent reinforcement learning framework for off\-policy evaluation in two\-sided markets\.The Annals of Applied Statistics17\(4\),pp\. 2701–2722\.Cited by:[§2](https://arxiv.org/html/2606.06744#S2.p3.1)\.
- \[52\]P\. Shi\(2025\)Optimal match recommendations in two\-sided marketplaces with endogenous prices\.Management Science71\(9\),pp\. 7431–7448\.Cited by:[§1](https://arxiv.org/html/2606.06744#S1.p2.1)\.
- \[53\]R\. Shimer\(2005\)The cyclical behavior of equilibrium unemployment and vacancies\.American economic review95\(1\),pp\. 25–49\.Cited by:[§2](https://arxiv.org/html/2606.06744#S2.p2.1)\.
- \[54\]J\. I\. Silva and M\. Toledo\(2009\)Labor turnover costs and the cyclical behavior of vacancies and unemployment\.Macroeconomic Dynamics13\(S1\),pp\. 76–96\.Cited by:[§2](https://arxiv.org/html/2606.06744#S2.p2.1)\.
- \[55\]R\. H\. Topel and M\. P\. Ward\(1992\)Job mobility and the careers of young men\.The Quarterly Journal of Economics107\(2\),pp\. 439–479\.Cited by:[§1](https://arxiv.org/html/2606.06744#S1.p3.1),[§2](https://arxiv.org/html/2606.06744#S2.p2.1),[§5\.2](https://arxiv.org/html/2606.06744#S5.SS2.p4.5),[§6\.4](https://arxiv.org/html/2606.06744#S6.SS4.SSS0.Px2.p3.1)\.
- \[56\]K\. Tu, B\. Ribeiro, D\. Jensen, D\. Towsley, B\. Liu, H\. Jiang, and X\. Wang\(2014\)Online dating recommendations: matching markets and learning preferences\.InProceedings of the 23rd international conference on world wide web,pp\. 787–792\.Cited by:[§1](https://arxiv.org/html/2606.06744#S1.p1.1)\.
- \[57\]R\. Yan, R\. Le, Y\. Song, T\. Zhang, X\. Zhang, and D\. Zhao\(2019\)Interview choice reveals your preference on the market: to improve job\-resume matching through profiling memories\.InProceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining,pp\. 914–922\.Cited by:[§1](https://arxiv.org/html/2606.06744#S1.p2.1)\.
- \[58\]C\. Yu, A\. Velu, E\. Vinitsky, J\. Gao, Y\. Wang, A\. Bayen, and Y\. Wu\(2022\)The surprising effectiveness of ppo in cooperative multi\-agent games\.Advances in neural information processing systems35,pp\. 24611–24624\.Cited by:[§1](https://arxiv.org/html/2606.06744#S1.p5.1),[§1](https://arxiv.org/html/2606.06744#S1.p6.1),[§2](https://arxiv.org/html/2606.06744#S2.p3.1),[§6\.1](https://arxiv.org/html/2606.06744#S6.SS1.p2.1)\.
- \[59\]K\. Zhanget al\.\(2021\)Multi\-agent reinforcement learning: a selective overview\.Foundations and Trends in Machine Learning\.Cited by:[§1](https://arxiv.org/html/2606.06744#S1.p5.1),[§2](https://arxiv.org/html/2606.06744#S2.p3.1)\.
- \[60\]Y\. Zhang and Z\. Fang\(2024\)Decentralized two\-sided bandit learning in matching market\.InThe 40th Conference on Uncertainty in Artificial Intelligence,Cited by:[§1](https://arxiv.org/html/2606.06744#S1.p2.1),[§2](https://arxiv.org/html/2606.06744#S2.p1.1)\.
- \[61\]S\. Zheng, A\. Trott, S\. Srinivasa, N\. Naik, M\. Gruesbeck, D\. C\. Parkes, and R\. Socher\(2020\)The ai economist: improving equality and productivity with ai\-driven tax policies\.arXiv preprint arXiv:2004\.13332\.Cited by:[§2](https://arxiv.org/html/2606.06744#S2.p3.1)\.

## Appendix AImplementation details

### A\.1PPO implementation details

Both small and large markets use an outside\-option penalty of−109\-10^\{9\}\. Latent profilesxix\_\{i\}andyjy\_\{j\}are drawn i\.i\.d\. from𝒩\(0,Id\)\\mathcal\{N\}\(0,I\_\{d\}\)and truncated to be non\-negative\.

The PPO implementation follows the following hyperparameters:

Table 1:PPO hyperparameters used for all three multi\-seed runs\.HyperparameterValuebatch size4096chunk length16Recurrent layer typeGRUnumber of Recurrent layers1GAEλ\\lambda0\.95discounting factorγ\\gamma0\.997hidden size64learning rate3×10−43\\times 10^\{\-4\}parallel environments128ppo epochs4ratio clip0\.2
### A\.2For CA\-ETC

Table 2:CA\-ETC baseline hyperparameters used for all runs\.#### CA\-ETC baseline inLearn2Match\.

We adapt the epoch\-based CA\-ETC algorithm toLearn2Matchby treating agents on side𝒲\\mathcal\{W\}as players and agents on sideℱ\\mathcal\{F\}as arms\. The algorithm proceeds in epochs, each consisting of an exploration block followed by an exploitation block\. The exploration block collects empirical rewards for worker–firm pairs, and the exploitation block computes and executes a Gale–Shapley matching using the empirical preferences induced by these rewards\.

For each pair\(i,j\)∈\[N𝒲\]×\[Nℱ\]\(i,j\)\\in\[N^\{\\mathcal\{W\}\}\]\\times\[N^\{\\mathcal\{F\}\}\], the baseline maintains the empirical average rewards received by workeriiand firmjjwhen they are matched\. These averages estimate

Ri𝒲\(s\)=⟨xi,y^ij\(s\)⟩,Rjℱ\(s\)=⟨x^ij\(s\),yj⟩\.R\_\{i\}^\{\\mathcal\{W\}\}\(s\)=\\langle x\_\{i\},\\hat\{y\}\_\{ij\}\(s\)\\rangle,\\qquad R\_\{j\}^\{\\mathcal\{F\}\}\(s\)=\\langle\\hat\{x\}\_\{ij\}\(s\),y\_\{j\}\\rangle\.They are updated only from realized reward feedback\. The CA\-ETC baseline records reward feedback at the moment a scheduled worker–firm pair forms a match\. Specifically, once workeriiproposes to firmjjand firmjjaccepts, the baseline records the match\-induced reward signals

Ri𝒲\(st\),Rjℱ\(st\),R\_\{i\}^\{\\mathcal\{W\}\}\(s\_\{t\}\),\\qquad R\_\{j\}^\{\\mathcal\{F\}\}\(s\_\{t\}\),and updates the empirical averages for pair\(i,j\)\(i,j\)\. Whether the pair later remains matched through the dissolution phase affects the realized payoff inLearn2Match, but it does not affect whether CA\-ETC has obtained a reward sample for preference estimation\.

During the exploration block of epochℓ\\ell, workers select firms according to a collision\-avoidance round\-robin schedule, which underT0≥⌈Nw/Nf⌉⋅NfT\_\{0\}\\geq\\lceil N\_\{w\}/N\_\{f\}\\rceil\\cdot N\_\{f\}visits every worker–firm pair\(i,j\)\(i,j\)at least once per block\. Each scheduled pull is embedded into oneLearn2Matchmarket period: workeriiconducts the scheduled interview with firmjj, drawing the noisy belief signalsy^i,j=yj\+εi,j\(f\)\\hat\{y\}\_\{i,j\}=y\_\{j\}\+\\varepsilon^\{\(f\)\}\_\{i,j\}andx^i,j=xi\+εi,j\(w\)\\hat\{x\}\_\{i,j\}=x\_\{i\}\+\\varepsilon^\{\(w\)\}\_\{i,j\}withεi,j\(⋅\)∼𝒩\(0,σinterview2Id\)\\varepsilon^\{\(\\cdot\)\}\_\{i,j\}\\sim\\mathcal\{N\}\(0,\\sigma\_\{\\mathrm\{interview\}\}^\{2\}I\_\{d\}\); workeriithen proposes a match and firmjjaccepts, forming the tentative pair\. The belief\-induced utilities⟨xi,y^i,j⟩\\langle x\_\{i\},\\hat\{y\}\_\{i,j\}\\rangleand⟨x^i,j,yj⟩\\langle\\hat\{x\}\_\{i,j\},y\_\{j\}\\rangleare recorded as the CA\-ETC sample for\(i,j\)\(i,j\)and incorporated into the running empirical means\. The tentative pair is released at the subsequent retention step, as the exploration block aims only to measure⟨xi,y^i,j⟩\\langle x\_\{i\},\\hat\{y\}\_\{i,j\}\\rangleand⟨x^i,j,yj⟩\\langle\\hat\{x\}\_\{i,j\},y\_\{j\}\\ranglefor each scheduled pair; releasing the agents frees them for the next round\-robin pull\.

After the exploration block, every agent holds an empirical estimate of its preferences over the opposite side, which is passed to the Gale–Shapley routine that begins the exploitation block\.

During the exploitation block, the baseline runs worker\-proposing Gale–Shapley with the current empirical preference lists\. The resulting matching is not imposed directly on the environment state; it is executed through theLearn2Matchaction protocol\. Each worker proposes to its Gale–Shapley assigned firm, each firm accepts its assigned worker and rejects other proposals, and matched pairs do not voluntarily dissolve during the exploitation block\. The exploitation block does not perform interviews or collect additional preference information\. At the end of the exploitation block, the next epoch begins with a new exploration block, after which empirical preferences and the Gale–Shapley matching are recomputed\.

Thus, the adaptation preserves the CA\-ETC structure

interview\+tentative match⏟Exploration \(epochℓ\)⟶LCB/UCB ranking⏟Preference estimation⟶GS matching\+retention⏟Exploitation \(epochℓ\)→ℓ↦ℓ\+1⋯\\underbrace\{\\text\{interview\}\+\\text\{tentative match\}\}\_\{\\text\{Exploration \(epoch \}\\ell\\text\{\)\}\}\\;\\longrightarrow\\;\\underbrace\{\\text\{LCB/UCB ranking\}\}\_\{\\text\{Preference estimation\}\}\\;\\longrightarrow\\;\\underbrace\{\\text\{GS matching\}\+\\text\{retention\}\}\_\{\\text\{Exploitation \(epoch \}\\ell\\text\{\)\}\}\\;\\xrightarrow\{\\ell\\mapsto\\ell\+1\}\\;\\cdotsrepeated across epochs\. The only substantive change is that each CA\-ETC exploration pull of pair\(i,j\)\(i,j\)must be received through the validLearn2Matchinteraction protocol: interview, match proposal, match acceptance but without retention, and reward recording\. Exploitation only computes Gale–Shapley from the resulting empirical preference lists and executes the selected matching through standardLearn2Matchmatch proposals and acceptances\.

#### CA\-ETC under the low\-noise setting\.

We run CA\-ETC insideLearn2Matchunder\-\-Nw 5 \-\-Nf 5 \-\-d 3 \-\-horizon 200 \-\-sigma\_interview 0\.001 \-\-sigma\_match 0 \-\-lambda\_reveal 100\. Figures[7](https://arxiv.org/html/2606.06744#A1.F7)and[8](https://arxiv.org/html/2606.06744#A1.F8)show that cumulative per\-worker and per\-firm regret rise during the initial exploration block and then flatten as the algorithm commits to its empirical Gale–Shapley matching, reproducing the structure reported in\[[42](https://arxiv.org/html/2606.06744#bib.bib33)\]\. We prove that we are able to show that we can cover exactly the same CA\-ETC results under some parameter setting\.

![Refer to caption](https://arxiv.org/html/2606.06744v1/ca-etc.png)Figure 7:Cumulative per\-worker regret of CA\-ETC insideLearn2Matchunder the low\-noise setting\. Each line is one worker, seed\-mean across runs\. Regret plateaus once the algorithm commits to its empirical Gale–Shapley matching![Refer to caption](https://arxiv.org/html/2606.06744v1/ca-etc-firm.png)Figure 8:Cumulative per\-firm regret of CA\-ETC insideLearn2Matchunder the low\-noise setting, with the same plateau behavior as Figure[7](https://arxiv.org/html/2606.06744#A1.F7)\.![Refer to caption](https://arxiv.org/html/2606.06744v1/coverage_heatmap.png)Figure 9:Left:interview coverage\.Right:mean cumulative tenure of each pair at the final period\.

## Appendix BBroader Impact and Limitations

Learn2Matchprovides a shared testbed for the matching and MARL communities to develop algorithms closer to the structure of real markets such as labor, residency, school choice, dating, and ride\-hailing, with the potential to reduce search frictions and information\-friction loss for participants on both sides\. As with any matching system operating in high\-stakes domains, learned policies may inherit biases in observed preferences and create incentives for strategic behavior, which is why we view the benchmark as a controlled research environment rather than a deployable system\. Compared to real\-world matching markets with millions of users, our experiments are limited to small markets with one\-to\-one matching, leaving the scaling behavior of PPO and CA\-ETC in larger and many\-to\-many markets open\. The benchmark also relies on specific functional\-form choices—Gaussian latent profiles, linear rewards, and sigmoidal revelation—that capture temporally extended feedback tractably but do not reflect the non\-Gaussian preferences and heterogeneous learning rates of real markets\. Finally, we evaluate only one representative MARL method against one bandit baseline on synthetic data, and view broader algorithmic comparisons, alternative interaction protocols, and validation on real matching data as important directions for future work\.
Learn to Match: Two-Sided Matching with Temporally Extended Feedback

Similar Articles

Self-Play Reinforcement Learning under Imperfect Information in Big 2

Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching

TEMPO: Temporal Enforcement via Mode-Separated Policy Optimization for Trustworthy LLM Backtesting

Reinforcement Learning with Pairwise Preferences in Long-Term Decision Problems

Multi-Objective Multi-Agent Bandits: From Learning Efficiency to Fairness Optimization

Submit Feedback

Similar Articles

Self-Play Reinforcement Learning under Imperfect Information in Big 2
Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching
TEMPO: Temporal Enforcement via Mode-Separated Policy Optimization for Trustworthy LLM Backtesting
Reinforcement Learning with Pairwise Preferences in Long-Term Decision Problems
Multi-Objective Multi-Agent Bandits: From Learning Efficiency to Fairness Optimization