Robustness of Refugee-Matching Gains to Off-Policy Evaluation Choices

arXiv cs.LG 05/11/26, 04:00 AM Papers
Summary
This paper demonstrates the robustness of refugee matching impact evaluations using off-policy methods like IPW and AIPW, confirming previous findings on algorithmic refugee assignment.
arXiv:2605.06686v1 Announce Type: new
# Robustness of Refugee-Matching Gains to Off-Policy Evaluation Choices
Source: [https://arxiv.org/html/2605.06686](https://arxiv.org/html/2605.06686)
Kirk Bansak (a, b), Elisabeth Paulson (a, c), Dominik Rothenhäusler (d), Jeremy Ferwerda (a, e), Jens Hainmueller (a, f), Michael Hotard (a)

(a) Immigration Policy Lab, Stanford University; (b) Department of Political Science, University of California, Berkeley; (c) Technology and Operations Management Unit, Harvard Business School; (d) Department of Statistics, Stanford University; (e) Department of Government, Dartmouth College; (f) Department of Political Science, Stanford University

###### Abstract

Previous research has investigated the potential of refugee matching for boosting refugee outcomes, first considered by Bansak et al. ([2018](https://arxiv.org/html/2605.06686#bib.bib5)). This paper demonstrates the stability of counterfactual impact evaluation results in the context of refugee matching in the United States using a range of off-policy evaluation methods. In order to estimate counterfactual impact and test the robustness of our results, we employ several evaluation methods, including inverse probability weighting (IPW) and multiple variants of augmented inverse probability weighting (AIPW). We also consider various modifications, including alternative modeling architectures and different assignment procedures. The impact estimates remain consistent in magnitude in all scenarios as well as statistically significant in most cases. Furthermore, the estimates are also consistent with the results originally presented in Bansak et al. ([2018](https://arxiv.org/html/2605.06686#bib.bib5)).

## 1 Introduction

The idea of algorithmic refugee assignment to improve refugee outcomes within their host countries was originally proposed by Bansak et al. ([2018](https://arxiv.org/html/2605.06686#bib.bib5)), which also presented preliminary counterfactual impact evaluations. Since then, various studies have considered a range of extensions (e.g., Gölz and Procaccia, [2019](https://arxiv.org/html/2605.06686#bib.bib13); Ahani et al., [2021](https://arxiv.org/html/2605.06686#bib.bib2); Acharya et al., [2022](https://arxiv.org/html/2605.06686#bib.bib1); Freund et al., [2023](https://arxiv.org/html/2605.06686#bib.bib12); Ahani et al., [2024](https://arxiv.org/html/2605.06686#bib.bib3); Bansak and Paulson, [2024](https://arxiv.org/html/2605.06686#bib.bib7); Bansak et al., [2024](https://arxiv.org/html/2605.06686#bib.bib8); Jain et al., [2025](https://arxiv.org/html/2605.06686#bib.bib14); Rodriguez-Diaz et al., [2025](https://arxiv.org/html/2605.06686#bib.bib15); Bansak et al., [2026](https://arxiv.org/html/2605.06686#bib.bib6)). This paper extends the impact evaluations in Bansak et al. ([2018](https://arxiv.org/html/2605.06686#bib.bib5)) of the potential offered by algorithmic refugee assignment in the United States, demonstrating the stability of the results using a range of off-policy evaluation methods.

In their original analyses, Bansak et al. ([2018](https://arxiv.org/html/2605.06686#bib.bib5)) employed common model-based policy evaluation procedures. A number of studies have discussed issues with model-based counterfactual estimation and policy evaluation, such as the possibility of "winner's curse" bias (e.g., Andrews et al., [2024](https://arxiv.org/html/2605.06686#bib.bib4); Zrnic and Fithian, [2025](https://arxiv.org/html/2605.06686#bib.bib16); Bastani et al., [2026](https://arxiv.org/html/2605.06686#bib.bib9)). For example, Bastani et al. ([2026](https://arxiv.org/html/2605.06686#bib.bib9)) show that the winner's curse bias can be large on a synthetic dataset.

In this paper, we employ more robust evaluation methods to estimate the counterfactual impact of refugee matching in the United States, including inverse probability weighting (IPW) and multiple variants of augmented inverse probability weighting (AIPW). We also consider various modifications, including alternative modeling architectures and different assignment procedures. In all scenarios, the impact estimates remain consistent in magnitude, and in most cases they are statistically significant. Furthermore, the estimates are also consistent with the results originally presented in Bansak et al. ([2018](https://arxiv.org/html/2605.06686#bib.bib5)). As one of our evaluation scenarios, we use the exact same data and models originally used in Bansak et al. ([2018](https://arxiv.org/html/2605.06686#bib.bib5)). Our robust evaluation methods yield impact estimates similar to those from the model-based methodology originally used there, demonstrating that those original results were not driven by winner's curse bias.

In sum, we find stable impact evaluation results using a range of robust evaluation methods\. The results underscore the potential of data\-driven refugee matching to meaningfully improve employment outcomes\.

## 2 Setup

We observe a finite population of individuals in a test/evaluation set indexed by $i = 1, \dots, N$. Each individual was historically assigned to one of $K$ locations, denoted $A_i \in \{1, \dots, K\}$, and experienced employment outcome $Y_i \in \{0, 1\}$. A proposed algorithmic matching assigns each individual to a new location $g_i \in \{1, \dots, K\}$. We wish to estimate the average employment rate that would have been obtained if the algorithmic assignment $g = (g_1, \dots, g_N)$ had been used instead of the historical assignment $A$.

We use data on historical refugee assignments and outcomes from one of the largest resettlement agencies in the United States. Historical assignment was quasi-random (see Footnote 1). Therefore, we consider two possibilities. The first is that local-level assignment probabilities $\pi_{a,i}$, denoting the probability that an individual $i$ was assigned to location $a$, are homogeneous across individuals such that $\pi_{a,i} = \pi(a)$, which is known based upon the historical empirical counts. Second, as a robustness check, we also consider the possibility that $\pi_{a,i}$ are conditional on covariates, $\pi_{a,i} = \pi(a \mid X_i)$, in which case we estimate those propensities. We assume no interference (SUTVA), and we let $Y_i(a)$ denote the potential employment outcome for individual $i$ under location $a$. We observe $Y_i = Y_i(A_i)$.

Footnote 1: Based on empirical assessments and conversations with representatives from the resettlement agency, we know that the historical assignments of free cases were done in a manner unrelated to expected outcomes, though some case officers may have used covariates that we also have access to. Therefore, the historical assignment satisfies ignorability.

We are interested in estimating the policy value

$$V(g) = \frac{1}{N} \sum_{i=1}^{N} Y_i(g_i).$$

To predict counterfactual rewards used in the algorithm, we train supervised machine learning models on a set of training data (separate from the evaluation set). Let $\mu_i(a)$ denote an estimate of the conditional expected outcome $\mathbb{E}[Y_i(a) \mid X_i]$. For each individual, define

$$\mu_{A,i} = \mu_i(A_i), \qquad \mu_{g,i} = \mu_i(g_i), \qquad \pi_{A,i} = \pi_{A_i,i},$$

where, as noted above, we consider both possibilities that $\pi_{A,i} = \pi(A_i)$ and $\pi_{A,i} = \pi(A_i \mid X_i)$.

We perform counterfactual algorithmic assignment on two data setups. The first is the exact data used in Bansak et al. ([2018](https://arxiv.org/html/2605.06686#bib.bib5)), in which case we form our evaluation set of assignments by pooling their evaluation sets from 2015 Q4 to 2016 Q3, and we use their exact estimates of $\mathbb{E}[Y_i(a)]$, which were obtained from a predictive model where prior data are used to train Stochastic Gradient Boosted Tree (SGBT) ensembles. The exact models, predictions, and assignments used in Bansak et al. ([2018](https://arxiv.org/html/2605.06686#bib.bib5)) are used here, constituting a replication of that study.

The second data setup uses updated data, where our evaluation set of assignments includes all historical assignments made by our resettlement partner over the entire year of 2016. In this case, estimates of $\mathbb{E}[Y_i(a)]$ are obtained from a predictive model where prior data are used to train Bayesian Additive Regression Tree (BART) models. Furthermore, in this setup, while we employ location-specific BART models for locations that are sufficiently large, we use a pooled BART model for small locations (with location-specific indicators as predictors). Using these predictions, we formulate new counterfactual assignments for our evaluation set. In doing so, we consider both offline and online assignments, where for the latter we implement the online assignment algorithm presented in Bansak et al. ([2026](https://arxiv.org/html/2605.06686#bib.bib6)).

In both data setups, when we consider the possibility that $\pi_{A,i} = \pi(A_i)$, we use the historical empirical counts to calculate the known propensities (i.e., "empirical propensities"). When we consider the possibility that $\pi_{A,i} = \pi(A_i \mid X_i)$, we estimate the propensities using random forest probability models (i.e., "estimated propensities"). In addition, to address the impact of small propensity scores on the variance of our estimators (described below), we also consider a modification to our algorithmic design that pools small locations (those with $\pi(a) < 0.01$) into a single pseudo-location. When using this modification, the pooling applies to both the matching and evaluation procedure. Effectively, this means that anyone assigned to a "small location" is randomly (proportionally) assigned to one of the locations in the pseudo-location pool. This strategy can theoretically introduce a downward bias in the point estimate but should also reduce variance. It is also a valid procedure to use in practice.
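As a rough illustration of the empirical propensities and the small-location pooling described above, the following Python sketch computes $\pi(a)$ from historical counts and relabels every location below the 0.01 threshold as one pseudo-location. The function names, the `POOLED` label, and the toy assignment history are hypothetical and are not taken from the paper; in a full pipeline the same relabeling would presumably also be applied to the algorithmic assignments $g_i$.

```python
from collections import Counter


def empirical_propensities(assignments):
    """Share of historical assignments going to each location, pi(a)."""
    counts = Counter(assignments)
    n = len(assignments)
    return {loc: c / n for loc, c in counts.items()}


def pool_small_locations(assignments, threshold=0.01, pooled_label="POOLED"):
    """Relabel locations with pi(a) < threshold as a single pseudo-location."""
    pi = empirical_propensities(assignments)
    small = {loc for loc, p in pi.items() if p < threshold}
    relabeled = [pooled_label if a in small else a for a in assignments]
    return relabeled, small


# Hypothetical toy history: two large locations and two very small ones.
history = ["A"] * 120 + ["B"] * 78 + ["C"] * 1 + ["D"] * 1

pi = empirical_propensities(history)
pooled_history, small = pool_small_locations(history)
pi_pooled = empirical_propensities(pooled_history)
```

In this toy history, locations `C` and `D` each have propensity 0.005, so they are merged into a pseudo-location whose propensity is their sum.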

In both data setups, we make the counterfactual assignments on the level of cases (i.e., families) and evaluate impact on the basis of individuals, just as in Bansak et al. ([2018](https://arxiv.org/html/2605.06686#bib.bib5)).

## 3 Estimators

We estimate the counterfactual impact of algorithmic assignment via three different off-policy estimators, described below, and compare them against a model-based benchmark.

### 3.1 Inverse Probability Weighting (IPW)

The IPW estimator uses only the randomized historical assignment and does *not* use model predictions. As noted above, let $\pi_{a,i}$ denote the historical probability of individual $i$ being assigned to location $a$. Define the indicator $\mathbf{1}(A_i = g_i)$, which equals one if individual $i$ happens to have been assigned historically to the same location as the algorithmic policy recommends. The IPW estimator is

$$\widehat{V}_{\mathrm{IPW}} = \frac{\sum_{i=1}^{N} \mathbf{1}(A_i = g_i)\, \dfrac{Y_i}{\pi_{A,i}}}{\sum_{i=1}^{N} \mathbf{1}(A_i = g_i)\, \dfrac{1}{\pi_{A,i}}}. \tag{1}$$

Identification of IPW requires the following conditions:

1. Ignorability: $Y_i(a) \perp A_i \mid X_i$ for all $a$.
2. Positivity: If the policy assigns some individuals to $g_i = a$, then $\pi_{a,i} > 0$.
3. SUTVA: No interference across individuals.

IPW uses only observed outcomes $Y_i$ and does not involve any predictive model. However, small propensity scores are known to sometimes "blow up" the variance of the IPW estimator. To address this, we also consider the pooling modification described earlier, whereby small locations (those with $\pi(a) < 0.01$) are pooled into a single pseudo-location.
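Equation (1) can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function name `ipw_value`, the toy data, and the propensity values are all hypothetical.

```python
def ipw_value(A, g, Y, pi_A):
    """Self-normalized IPW estimate of the policy value, as in Eq. (1).

    A    : historical locations A_i
    g    : algorithmic locations g_i
    Y    : observed outcomes Y_i
    pi_A : propensity of each individual's historical location, pi_{A,i}
    """
    num = 0.0
    den = 0.0
    for a_i, g_i, y_i, p_i in zip(A, g, Y, pi_A):
        if a_i == g_i:  # only individuals whose history matches the policy
            num += y_i / p_i
            den += 1.0 / p_i
    if den == 0.0:
        raise ValueError("no individual has A_i == g_i; estimate undefined")
    return num / den


# Hypothetical toy data with two locations and assumed known propensities.
A = [1, 2, 1, 2, 1, 2]
g = [1, 1, 2, 2, 1, 2]
Y = [1, 0, 1, 0, 0, 1]
pi = {1: 0.25, 2: 0.75}
pi_A = [pi[a] for a in A]

v_ipw = ipw_value(A, g, Y, pi_A)
```

Only the four individuals with $A_i = g_i$ contribute; the other two are dropped, which is why the estimator can become noisy when few historical assignments agree with the policy.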

### 3.2 Augmented IPW (AIPW)

AIPW combines inverse probability weighting with an outcome regression model. The AIPW estimator is

$$\widehat{V}_{\mathrm{AIPW}} = \frac{1}{N} \sum_{i=1}^{N} \left[ \mu_{g,i} + \mathbf{1}(A_i = g_i)\, \frac{Y_i - \mu_{A,i}}{\pi_{A,i}} \right]. \tag{2}$$

AIPW is doubly robust. It is consistent if either:

1. The historical assignment probabilities $\pi_{A,i}$ are correctly specified, or
2. The outcome regression model $\mu_i(a)$ is correctly specified.

AIPW also offers greater statistical efficiency relative to IPW. For AIPW, we also consider both the standard matching procedure and pooled matching procedure described above.
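A minimal sketch of Equation (2), under the assumption that the predictions $\mu_{A,i}$, $\mu_{g,i}$ and the propensities $\pi_{A,i}$ are already available as lists. The function name and the toy numbers are hypothetical. For comparison, the sketch also computes the plain average of the predictions $\mu_{g,i}$, which uses no observed outcomes.

```python
def aipw_value(A, g, Y, mu_A, mu_g, pi_A):
    """AIPW estimate of the policy value, as in Eq. (2)."""
    total = 0.0
    for a_i, g_i, y_i, m_a, m_g, p_i in zip(A, g, Y, mu_A, mu_g, pi_A):
        # Residual correction applies only when history matches the policy.
        correction = (y_i - m_a) / p_i if a_i == g_i else 0.0
        total += m_g + correction
    return total / len(Y)


# Hypothetical toy data: predictions are 0.5 at location 1 and 0.3 at location 2.
mu = {1: 0.5, 2: 0.3}
A = [1, 2, 1, 2]
g = [1, 1, 2, 2]
Y = [1, 0, 1, 0]
mu_A = [mu[a] for a in A]
mu_g = [mu[x] for x in g]
pi_A = [0.5] * 4  # assumed known, homogeneous propensities

v_aipw = aipw_value(A, g, Y, mu_A, mu_g, pi_A)
v_pred_only = sum(mu_g) / len(mu_g)  # average prediction, no residual correction
```

In this toy case the prediction-only average is 0.4, and the weighted residuals of the two matched individuals move the AIPW estimate to 0.5.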

### 3.3 AIPW-local (AIPWl)

We also consider a variant of AIPW that we call AIPW-local, or AIPWl. For the AIPWl estimator, the augmentation term replaces the marginal propensity $\pi_{A,i}$ with a location-specific estimate of

$$\pi_{\mathrm{L}}(a) = \Pr(A_i = a \mid g_i = a) \approx \frac{\#\{i : A_i = g_i = a\}}{\#\{i : g_i = a\}}.$$

Define

$$\psi_i^{\mathrm{L}} = \mu_{g,i} + \mathbf{1}(A_i = g_i)\, \frac{Y_i - \mu_{A,i}}{\pi_{\mathrm{L}}(g_i)}.$$

The AIPWl estimator is

$$\widehat{V}_{\mathrm{AIPWl}} = \frac{1}{N} \sum_{i=1}^{N} \psi_i^{\mathrm{L}}.$$
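The AIPWl definitions above can be sketched as follows. The function names and toy data are hypothetical, and the local propensity is computed directly from the counting formula for $\pi_{\mathrm{L}}(a)$.

```python
from collections import Counter


def local_propensities(A, g):
    """pi_L(a) = #{i: A_i = g_i = a} / #{i: g_i = a}, for each recommended a."""
    recommended = Counter(g)
    matched = Counter(a_i for a_i, g_i in zip(A, g) if a_i == g_i)
    return {loc: matched[loc] / n for loc, n in recommended.items()}


def aipwl_value(A, g, Y, mu_A, mu_g):
    """AIPW-local estimate: average of psi_i^L over the evaluation set."""
    pi_L = local_propensities(A, g)
    total = 0.0
    for a_i, g_i, y_i, m_a, m_g in zip(A, g, Y, mu_A, mu_g):
        # A match at g_i guarantees pi_L[g_i] > 0, so the division is safe.
        correction = (y_i - m_a) / pi_L[g_i] if a_i == g_i else 0.0
        total += m_g + correction
    return total / len(Y)


# Hypothetical toy data: predictions are 0.5 at location 1 and 0.3 at location 2.
mu = {1: 0.5, 2: 0.3}
A = [1, 2, 1, 2, 1, 1]
g = [1, 1, 2, 2, 1, 2]
Y = [1, 0, 1, 0, 1, 0]
mu_A = [mu[a] for a in A]
mu_g = [mu[x] for x in g]

pi_L = local_propensities(A, g)
v_aipwl = aipwl_value(A, g, Y, mu_A, mu_g)
```

One consequence of this weighting is that, for each location $a$, the summed correction equals the number of individuals recommended to $a$ times the mean residual among the matched individuals at $a$.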

### 3.4 Model-Based

As a benchmark comparison, we also consider the model-based evaluation approach employed in Bansak et al. ([2018](https://arxiv.org/html/2605.06686#bib.bib5)) that calculates counterfactual employment and gains solely by using the estimates of $\mathbb{E}[Y_i(a)]$ that are also used for the assignment decisions. In the Results section later, we refer to the results using this method simply as "model-based."

## 4 Design-based Uncertainty Quantification

We are interested in estimating, for a given population and a given matching/assignment, the overall employment rate that this population and assignment would result in. Therefore, we treat the population, the historical refugee arrivals that were used for training $\mu$, and the policy assignment $g$ as fixed. We use the quasi-randomness of the historical assignments used for evaluation for uncertainty quantification. That is, we use variation of $A_i$, $i = 1, \ldots, N$. In the following, we describe uncertainty quantification for the quasi-random case ($\pi_{A,i} = \pi(A_i)$).

### 4.1 Variance of AIPW and AIPWl

For AIPW, the estimation error can be written as

$$V - \widehat{V}_{\mathrm{AIPW}} = \frac{1}{N} \sum_{i=1}^{N} \left[ Y_i(g_i) - \mu_{g,i} - \mathbf{1}(A_i = g_i)\, \frac{Y_i - \mu_{A,i}}{\pi_{A,i}} \right] = \frac{1}{N} \sum_{i=1}^{N} \bigl( \pi(g_i) - \mathbf{1}(A_i = g_i) \bigr)\, \frac{Y_i(g_i) - \mu_{g,i}}{\pi(g_i)}.$$

Then, treating the indicators $\mathbf{1}(A_i = g_i)$ as uncorrelated across individuals with $\Pr(A_i = g_i) = \pi(g_i)$,

$$\mathrm{Var}(V - \widehat{V}_{\mathrm{AIPW}}) = \frac{1}{N^2} \sum_{i=1}^{N} \pi(g_i) \bigl( 1 - \pi(g_i) \bigr) \left( \frac{Y_i(g_i) - \mu_{g,i}}{\pi(g_i)} \right)^2.$$

We cannot directly compute this variance since we do not observe all terms for everyone. An unbiased estimate is

$$\begin{aligned} \widehat{\mathrm{Var}}(\widehat{V}_{\mathrm{AIPW}} - V) &= \frac{1}{N^2} \sum_{i=1}^{N} \mathbf{1}(A_i = g_i) \bigl( 1 - \pi(g_i) \bigr) \left( \frac{Y_i(g_i) - \mu_{g,i}}{\pi(g_i)} \right)^2 \\ &= \frac{1}{N^2} \sum_{i : A_i = g_i} (1 - \pi_{A,i}) \left( \frac{Y_i - \mu_{A,i}}{\pi_{A,i}} \right)^2. \end{aligned}$$

The same approximate variance formula applies also to AIPWl, simply replacing $\pi(A_i)$ with $\pi_{\mathrm{L}}(A_i)$.
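The variance estimate for AIPW can be sketched as below. The function names and toy data are hypothetical. The interval helper uses a normal approximation with $z = 1.96$, which is an assumption made for illustration; the text above does not state how the confidence intervals are constructed.

```python
import math


def aipw_variance(A, g, Y, mu_A, pi_A):
    """Plug-in variance estimate: sum over matched individuals, divided by N^2."""
    total = 0.0
    for a_i, g_i, y_i, m_a, p_i in zip(A, g, Y, mu_A, pi_A):
        if a_i == g_i:
            total += (1.0 - p_i) * ((y_i - m_a) / p_i) ** 2
    return total / len(Y) ** 2


def normal_ci(estimate, variance, z=1.96):
    """Normal-approximation interval (assumed here, not specified in the paper)."""
    half_width = z * math.sqrt(variance)
    return estimate - half_width, estimate + half_width


# Hypothetical toy data: predictions are 0.5 at location 1 and 0.3 at location 2.
mu = {1: 0.5, 2: 0.3}
A = [1, 2, 1, 2]
g = [1, 1, 2, 2]
Y = [1, 0, 1, 0]
mu_A = [mu[a] for a in A]
pi_A = [0.5] * 4  # assumed known, homogeneous propensities

var_aipw = aipw_variance(A, g, Y, mu_A, pi_A)
ci_low, ci_high = normal_ci(0.5, var_aipw)  # 0.5 stands in for an AIPW point estimate
```

For AIPWl, the same function could be called with $\pi_{\mathrm{L}}(A_i)$ passed in place of `pi_A`.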

### 4.2 Variance of IPW

First,

$$\widehat{V}_{\mathrm{IPW}} - V = \frac{\frac{1}{N} \sum_{i=1}^{N} \frac{\mathbf{1}(A_i = g_i)}{\pi_{A,i}} (Y_i - V)}{\frac{1}{N} \sum_{i=1}^{N} \frac{\mathbf{1}(A_i = g_i)}{\pi_{A,i}}} = \frac{1}{N} \sum_{i=1}^{N} \frac{\mathbf{1}(A_i = g_i)}{\pi_{A,i}} \bigl( Y_i(g_i) - V \bigr) + o_P(1/\sqrt{N}),$$

where we used that the numerator is $O_P(1/\sqrt{N})$ and the denominator converges to $1$ (since potential outcomes are bounded). Then the variance of the right-hand side (ignoring lower-order terms) is

$$\frac{1}{N^2} \sum_{i=1}^{N} \pi(g_i) \bigl( 1 - \pi(g_i) \bigr) \left( \frac{Y_i(g_i) - V}{\pi(g_i)} \right)^2,$$

which can be estimated via

$$\widehat{\mathrm{Var}}(V - \widehat{V}_{\mathrm{IPW}}) = \frac{1}{N^2} \sum_{i : A_i = g_i} (1 - \pi_{A,i}) \left( \frac{Y_i - \widehat{V}_{\mathrm{IPW}}}{\pi_{A,i}} \right)^2.$$
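A minimal sketch of the IPW variance estimate, which plugs $\widehat{V}_{\mathrm{IPW}}$ in for the unknown $V$. The function names, toy data, and propensity values are hypothetical.

```python
def ipw_value(A, g, Y, pi_A):
    """Self-normalized IPW point estimate (Eq. 1), needed as the plug-in for V."""
    matched = [(y_i, p_i) for a_i, g_i, y_i, p_i in zip(A, g, Y, pi_A) if a_i == g_i]
    return sum(y / p for y, p in matched) / sum(1.0 / p for _, p in matched)


def ipw_variance(A, g, Y, pi_A):
    """Plug-in variance estimate for IPW: matched individuals only, divided by N^2."""
    v_hat = ipw_value(A, g, Y, pi_A)
    total = 0.0
    for a_i, g_i, y_i, p_i in zip(A, g, Y, pi_A):
        if a_i == g_i:
            total += (1.0 - p_i) * ((y_i - v_hat) / p_i) ** 2
    return total / len(Y) ** 2


# Hypothetical toy data with assumed known propensities.
A = [1, 2, 1, 2, 1, 2]
g = [1, 1, 2, 2, 1, 2]
Y = [1, 0, 1, 0, 0, 1]
pi = {1: 0.25, 2: 0.75}
pi_A = [pi[a] for a in A]

var_ipw = ipw_variance(A, g, Y, pi_A)
```

In this toy case the two matched individuals at the low-propensity location contribute almost all of the estimated variance, which illustrates why small propensities inflate the uncertainty of IPW.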

## 5 Results

The results across the various setups described above are shown in Tables [A1](https://arxiv.org/html/2605.06686#A1.T1)–[A10](https://arxiv.org/html/2605.06686#A1.T10) in the Appendix. Each table includes point estimates for the employment rate (policy value) under algorithmic assignment according to each evaluation method. Also displayed are the point estimates, variance estimates, and 95% confidence intervals for the projected employment gains from algorithmic assignment relative to the status quo, in terms of percentage-point gains. As shown in the tables, all of the evaluation methods yield generally comparable estimates across the board. In addition, the benchmark model-based method does not appear to be overly or consistently optimistic. Figures [1](https://arxiv.org/html/2605.06686#S5.F1) and [2](https://arxiv.org/html/2605.06686#S5.F2) below present a visual summary of the results in terms of estimated gains. Figure [1](https://arxiv.org/html/2605.06686#S5.F1) displays percentage-point gains relative to the status quo baseline, while Figure [2](https://arxiv.org/html/2605.06686#S5.F2) displays percent gains over the baseline.
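To make the distinction between percentage-point gains and percent gains concrete, the sketch below uses the gains, variances, and baseline reported in Table A5 of the Appendix (Setup 2, offline assignment, empirical propensities, no pooling). The normal-approximation interval with $z = 1.96$ is an assumption of this sketch; it agrees with the intervals reported in that table to within rounding of the inputs. The function name `summarize` is hypothetical.

```python
import math

baseline = 0.342  # observed employment rate in Data and Modeling Setup 2
gains = {"AIPW": 0.085, "AIPWl": 0.086, "IPW": 0.043}       # Table A5, Gains row
var_gains = {"AIPW": 0.0019, "AIPWl": 0.001, "IPW": 0.0033}  # Table A5, Var(Gains) row


def summarize(gain, variance, base, z=1.96):
    """Percentage-point gain, percent gain over baseline, and an assumed normal CI."""
    half_width = z * math.sqrt(variance)
    return {
        "pp_gain": gain,            # difference in employment rates
        "pct_gain": gain / base,    # relative gain over the baseline rate
        "ci": (gain - half_width, gain + half_width),
    }


summary = {name: summarize(gains[name], var_gains[name], baseline) for name in gains}
```

For example, the AIPW gain of 0.085 (8.5 percentage points) corresponds to a relative gain of roughly 25 percent over the 0.342 baseline.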

![Refer to caption](https://arxiv.org/html/2605.06686v1/x1.png)

Figure 1: Percentage-Point Gains Above the Observed Baseline

![Refer to caption](https://arxiv.org/html/2605.06686v1/x2.png)

Figure 2: Percent Gains Over the Observed Baseline

## 6 Discussion

The results, which substantively hold across a range of counterfactual evaluation methods, underscore the potential of data-driven refugee matching to meaningfully improve employment outcomes. In addition, as can be clearly seen in our evaluations, the results produced by the benchmark model-based method, including the results originally reported in Bansak et al. ([2018](https://arxiv.org/html/2605.06686#bib.bib5)), are not overly optimistic. That is, we find no evidence that the model-based estimated gains are an artifact of winner's curse bias. This may be attributable to the specific ML methods and modeling we employed, specifically our SGBT and BART models.

SGBT models have several forms of regularization built in, including base learner \(i\.e\. tree\) shrinkage, constraints on tree size, and training on random subsamples of the data\. If such regularization led to predictive flattening \(or mean reversion\) across the models, that could have led to systematic under\-prediction of rewards at the upper end of the distribution of reward estimates\.

With respect to BART, Bayesian methods are naturally less vulnerable to winner's curse bias because of their Bayesian structure (Efron, [2011](https://arxiv.org/html/2605.06686#bib.bib10); Ferguson et al., [2013](https://arxiv.org/html/2605.06686#bib.bib11); Andrews et al., [2024](https://arxiv.org/html/2605.06686#bib.bib4)). More specifically, posterior means are Bayes-unbiased (see Footnote 2), i.e., they hit the target on average, conditional on all observed data. Therefore, if we make decisions based on the data (such as assigning someone), then summing the corresponding posterior means is also Bayes-unbiased for the counterfactual reward. Of course, the reliability of these estimates depends on whether the prior is reasonable. Furthermore, in our Data and Modeling Setup 2, we employ location-specific BART models for locations that are sufficiently large, but we use a pooled BART model for small locations (with location-specific indicators as predictors). Therefore, in general the predictions are likely being pushed towards the mean.

Footnote 2: Bayes-unbiased is similar to unbiasedness, but unbiased would be the wrong word here; unbiased means that the parameter is fixed and we average over uncertainty of the data. In the Bayesian world, the data is fixed and we average over the uncertainty in the parameters.
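The Bayes-unbiasedness argument can be illustrated with a small simulation. This is a toy normal-normal model, not BART and not the paper's data: true effects are drawn from an assumed prior $N(0, \tau^2)$, each is observed with noise $N(0, \sigma^2)$, and the option with the largest noisy estimate is selected. All names and parameter values are hypothetical.

```python
import random


def simulate(k=10, reps=5000, tau=1.0, sigma=1.0, seed=0):
    """Average error of the raw estimate and of the posterior mean for the selected option."""
    rng = random.Random(seed)
    # Posterior-mean weight under a N(0, tau^2) prior with N(theta, sigma^2) noise.
    shrink = tau ** 2 / (tau ** 2 + sigma ** 2)
    raw_error = 0.0
    posterior_error = 0.0
    for _ in range(reps):
        theta = [rng.gauss(0.0, tau) for _ in range(k)]  # true effects
        x = [rng.gauss(t, sigma) for t in theta]         # noisy estimates
        best = max(range(k), key=lambda j: x[j])         # pick the apparent winner
        raw_error += x[best] - theta[best]
        posterior_error += shrink * x[best] - theta[best]
    return raw_error / reps, posterior_error / reps


raw_bias, posterior_bias = simulate()
```

Under these assumptions the raw estimate of the selected option overstates its true effect on average, while the posterior mean of the selected option has average error close to zero. The result depends on the prior matching the distribution that generated the true effects.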

In sum, if our ML models pick up on *enough* interactions (i.e., heterogeneity across locations) to find a good assignment, but they still *under-predict* those interactions, it is quite possible that the true reward of that assignment is actually larger than what the models themselves would predict. The adjustments of IPW and AIPW correct for this.

## Acknowledgments

The authors thank Global Refuge for access to data and guidance\. The data used in this study were provided under a collaboration research agreement with Global Refuge\. This work is associated with the GeoMatch project within the Immigration Policy Lab \(IPL\) at Stanford University and ETH Zurich\. The GeoMatch project is supported by funding from the Charles Koch Foundation, Google\.org, Open Society Foundations, and Stanford Impact Labs\. Rothenhäusler gratefully acknowledges support as a David Huntington Faculty Scholar, Chamber Fellow, and from the Dieter Schwarz Foundation\.

## References

- Acharya, A., Bansak, K., and Hainmueller, J. (2022). Combining outcome-based and preference-based matching: A constrained priority mechanism. *Political Analysis*, 30(1):89–112.
- Ahani, N., Andersson, T., Martinello, A., Teytelboym, A., and Trapp, A. C. (2021). Placement optimization in refugee resettlement. *Operations Research*, 69(5):1468–1486.
- Ahani, N., Gölz, P., Procaccia, A. D., Teytelboym, A., and Trapp, A. C. (2024). Dynamic placement in refugee resettlement. *Operations Research*, 72(3):1087–1104.
- Andrews, I., Kitagawa, T., and McCloskey, A. (2024). Inference on winners. *The Quarterly Journal of Economics*, 139(1):305–358.
- Bansak, K., Ferwerda, J., Hainmueller, J., Dillon, A., Hangartner, D., Lawrence, D., and Weinstein, J. (2018). Improving refugee integration through data-driven algorithmic assignment. *Science*, 359(6373):325–329.
- Bansak, K., Lee, S., Manshadi, V., Niazadeh, R., and Paulson, E. (2026). Dynamic matching with post-allocation service and its application to refugee resettlement. *Management Science* (forthcoming).
- Bansak, K. and Paulson, E. (2024). Outcome-driven dynamic refugee assignment with allocation balancing. *Operations Research*, 72(6):2375–2390.
- Bansak, K., Paulson, E., and Rothenhäusler, D. (2024). Learning under random distributional shifts. In *International Conference on Artificial Intelligence and Statistics*, pages 3943–3951. PMLR.
- Bastani, H., Bastani, O., and McLaughlin, B. (2026). Winner's curse drives false promises in data-driven decisions: A case study in refugee matching. arXiv preprint arXiv:2602.08892.
- Efron, B. (2011). Tweedie's formula and selection bias. *Journal of the American Statistical Association*, 106(496):1602–1614.
- Ferguson, J. P., Cho, J. H., Yang, C., and Zhao, H. (2013). Empirical Bayes correction for the winner's curse in genetic association studies. *Genetic Epidemiology*, 37(1):60–68.
- Freund, D., Lykouris, T., Paulson, E., Sturt, B., and Weng, W. (2023). Group fairness in dynamic refugee assignment. arXiv preprint arXiv:2301.10642.
- Gölz, P. and Procaccia, A. D. (2019). Migration as submodular optimization. In *Proceedings of the AAAI Conference on Artificial Intelligence*, volume 33, pages 549–556.
- Jain, G., Rothenhäusler, D., Bansak, K., and Paulson, E. (2025). CTRL your shift: Clustered transfer residual learning for many small datasets. arXiv preprint arXiv:2508.11144.
- Rodriguez-Diaz, P., Bansak, K., and Paulson, E. (2025). A dual perspective on decision-focused learning: Scalable training via dual-guided surrogates. arXiv preprint arXiv:2511.04909.
- Zrnic, T. and Fithian, W. (2025). A flexible defense against the winner's curse. *The Annals of Statistics*, 53(6):2516–2535.

## Appendix A

### A.1 Results from Data and Modeling Setup 1

This Data and Modeling Setup 1 constitutes the original setup and assignments from Bansak et al. ([2018](https://arxiv.org/html/2605.06686#bib.bib5)). Note that the Gains below are relative to the empirically observed employment rate of 0.337. Confidence intervals (CI) are 95%.

Table A1: Using offline assignment and empirical propensity scores. No pooling.

Table A2: Using offline assignment and empirical propensity scores. Pooling % = 0.01.

Table A3: Using offline assignment and estimated propensity scores. No pooling.

Table A4: Using offline assignment and estimated propensity scores. Pooling % = 0.01.
### A.2 Results from Data and Modeling Setup 2

Note that the Gains below are relative to the empirically observed employment rate of 0.342. Confidence intervals (CI) are 95%.

Table A5: Using offline assignment and empirical propensity scores. No pooling.

| | AIPW | AIPWl | IPW | Model-Based |
|---|---|---|---|---|
| Point Estimate | 0.427 | 0.427 | 0.385 | 0.425 |
| Gains | 0.085 | 0.086 | 0.043 | 0.084 |
| Var(Gains) | (0.0019) | (0.001) | (0.0033) | NA |
| CI of Gains | [0.000, 0.170] | [0.024, 0.148] | [-0.070, 0.157] | NA |

Table A6: Using online assignment and empirical propensity scores. No pooling.

| | AIPW | AIPWl | IPW | Model-Based |
|---|---|---|---|---|
| Point Estimate | 0.455 | 0.422 | 0.431 | 0.41 |
| Gains | 0.113 | 0.08 | 0.089 | 0.068 |
| Var(Gains) | (0.002) | (0.0013) | (0.0032) | NA |
| CI of Gains | [0.025, 0.202] | [0.010, 0.150] | [-0.021, 0.200] | NA |

Table A7: Using offline assignment and empirical propensity scores. Pooling % = 0.01.

| | AIPW | AIPWl | IPW | Model-Based |
|---|---|---|---|---|
| Point Estimate | 0.412 | 0.442 | 0.357 | 0.419 |
| Gains | 0.071 | 0.101 | 0.016 | 0.078 |
| Var(Gains) | (0.0012) | (0.0014) | (0.0023) | NA |
| CI of Gains | [0.002, 0.139] | [0.027, 0.175] | [-0.077, 0.109] | NA |

Table A8: Using online assignment and empirical propensity scores. Pooling % = 0.01.

| | AIPW | AIPWl | IPW | Model-Based |
|---|---|---|---|---|
| Point Estimate | 0.487 | 0.465 | 0.444 | 0.406 |
| Gains | 0.145 | 0.124 | 0.103 | 0.065 |
| Var(Gains) | (0.0018) | (0.0014) | (0.0026) | NA |
| CI of Gains | [0.062, 0.229] | [0.050, 0.198] | [0.003, 0.203] | NA |

Table A9: Using offline assignment and estimated propensity scores. No pooling.

| | AIPW | AIPWl | IPW | Model-Based |
|---|---|---|---|---|
| Point Estimate | 0.427 | 0.427 | 0.424 | 0.425 |
| Gains | 0.086 | 0.086 | 0.083 | 0.084 |
| Var(Gains) | (0.0008) | (0.001) | (0.0013) | NA |
| CI of Gains | [0.031, 0.140] | [0.024, 0.148] | [0.011, 0.155] | NA |

Table A10: Using online assignment and estimated propensity scores. No pooling.

| | AIPW | AIPWl | IPW | Model-Based |
|---|---|---|---|---|
| Point Estimate | 0.538 | 0.422 | 0.519 | 0.41 |
| Gains | 0.197 | 0.08 | 0.178 | 0.068 |
| Var(Gains) | (0.0097) | (0.0013) | (0.0072) | NA |
| CI of Gains | [0.004, 0.390] | [0.010, 0.150] | [0.012, 0.344] | NA |