Healthcare Mechanisms from Policy-as-Code Search under Strategic Provider Response

arXiv cs.AI Papers

Summary

This paper reformulates hospital mechanism design as program synthesis for language models, using a multi-agent simulator (Medi-Sim) to evaluate policy rules under strategic provider responses. It demonstrates pressure migration across provider channels and synthesizes an inspectable mixed-objective program that reduces up-coding and rejection while retaining funds.

arXiv:2605.30680v1 Announce Type: new Abstract: Healthcare mechanisms are inseparable from the strategic provider response they induce: existing healthcare AI benchmarks hold this response fixed and so cannot evaluate mechanisms by the equilibrium they produce. We recast hospital mechanism design as program synthesis for language models: typed, inspectable rule programs are executed and scored by Medi-Sim, a multi-agent simulator with five strategic provider channels (coding, selection, delay, effort, triage). An incentive sweep recovers classical health-economics findings as adjacent regimes -- up-coding and low-complexity-patient selection under profit pressure, and Goodhart-style drift where measured performance becomes anti-correlated with true outcomes -- and a single audit lever exposes pressure migration: closing the coding channel more than doubles low-complexity selection. LLM-guided evolutionary code search over the same rule-program space then synthesizes an inspectable mixed-objective program that eliminates up-coding, halves rejection, and retains most of the profit-oriented baseline's funds.

# Healthcare Mechanisms from Policy-as-Code Search under Strategic Provider Response
Source: [https://arxiv.org/html/2605.30680](https://arxiv.org/html/2605.30680)
Zihan Wang¹, Xiang Xu¹, Hongyuan Zha¹, Wenhao Li² (¹The Chinese University of Hong Kong, Shenzhen; ²Tongji University)

###### Abstract

Healthcare mechanisms are inseparable from the strategic provider response they induce: existing healthcare AI benchmarks hold this response fixed and so cannot evaluate mechanisms by the equilibrium they produce. We recast hospital mechanism design as program synthesis for language models: typed, inspectable rule programs are executed and scored by Medi-Sim, a multi-agent simulator with five strategic provider channels (coding, selection, delay, effort, triage). An incentive sweep recovers classical health-economics findings as adjacent regimes (up-coding and low-complexity-patient selection under profit pressure, and Goodhart-style drift where measured performance becomes anti-correlated with true outcomes), and a single audit lever exposes *pressure migration*: closing the coding channel more than doubles low-complexity selection. LLM-guided evolutionary code search over the same rule-program space then synthesizes an inspectable mixed-objective program that eliminates up-coding, halves rejection, and retains most of the profit-oriented baseline’s funds.


## 1 Introduction

A case-based payment becomes a coding rule, an audit reshapes patient selection, and a quality bonus redirects effort toward the measured score: in each case, a hospital mechanism is realized as the *composition* of an administrator’s instruction with a provider’s best response, and the composition, not the text, determines billing, access, and outcome.¹ The dynamic we focus on is *pressure migration*: a multi-channel feature of strategic best response in which, when a rule closes one provider channel, the same incentive resurfaces in an adjacent one, so a benchmark that scores rules against a fixed provider systematically over-rewards mechanisms whose effect is to relocate rather than remove distortion. We therefore evaluate hospital mechanisms inside a closed-loop strategic-response simulator, and because regulated deployment additionally requires every rule to remain auditable line by line, we restrict the administrator’s policy class to inspectable, typed rule programs, reframing mechanism design as *program synthesis* over a constrained administrative interface.

¹ Healthcare-specific terminology used throughout the paper is collected in the glossary of Appendix [L](https://arxiv.org/html/2605.30680#A12). It includes hospital DRG and DRG-style arrivals, hospital CMI, hospital KPI and KPI steering, the five provider-response channels (coding, selection, delay, effort, triage), the coding and measurement wedges, gold-plating, skimping, cream-skimming, the Identify–Produce–Settle (IPS) loop, and the hospital policy DSL.

Pressure migration is visible across three decades of healthcare reform. Hospitals respond to Medicare diagnosis-pricing changes by re-coding rather than by treating more patients (Dafny, [2005](https://arxiv.org/html/2605.30680#bib.bib9)); Medicare Advantage risk scores grow faster than fee-for-service scores through coding intensity (Kronick and Welch, [2014](https://arxiv.org/html/2605.30680#bib.bib2)); and English NHS waiting-time targets change both reported waits and the operations that produce them (Bevan and Hood, [2006](https://arxiv.org/html/2605.30680#bib.bib15); Propper et al., [2010](https://arxiv.org/html/2605.30680#bib.bib5)). The machine-learning reading is direct: each is *Goodhart-style drift* (Manheim and Garrabrant, [2018](https://arxiv.org/html/2605.30680#bib.bib34)) delivered through a strategic-response shift in the data-generating process, of the kind formalized by strategic classification and performative prediction (Hardt et al., [2016](https://arxiv.org/html/2605.30680#bib.bib30); Perdomo et al., [2020](https://arxiv.org/html/2605.30680#bib.bib31)).

Existing benchmarks cannot see this dynamic. Healthcare AI environments train clinician-level policies on a fixed environment with passive providers (Komorowski et al., [2018](https://arxiv.org/html/2605.30680#bib.bib27); Yu et al., [2021](https://arxiv.org/html/2605.30680#bib.bib42); Gottesman et al., [2019](https://arxiv.org/html/2605.30680#bib.bib26)), treating provider behavior as exogenous noise. Automated mechanism-design systems do model strategic response, but instantiate taxes, auctions, or generic allocation rather than healthcare primitives (reimbursement, audits, care-team queues, measured quality), and their searched controllers are black-box neural networks that fail line-by-line auditability (Zheng et al., [2022](https://arxiv.org/html/2605.30680#bib.bib23); Dütting et al., [2024](https://arxiv.org/html/2605.30680#bib.bib29); Sandholm, [2003](https://arxiv.org/html/2605.30680#bib.bib6)). Neither camp scores administrator rules and strategic provider responses through realized access, reimbursement, and performance in the same rollout.

We instantiate the missing loop in Medi-Sim. Administrator rules are written as *policy-as-code*: typed, executable expressions over a fixed set of approved levers (incentive coefficients, audit intensity, bonus pool, performance-score weights) that are auditable line by line (Rudin, [2019](https://arxiv.org/html/2605.30680#bib.bib4)), and expose the kind of clear, context-relevant information emphasized in good machine learning practice for medical devices (GMLP) (U.S. Food and Drug Administration et al., [2021](https://arxiv.org/html/2605.30680#bib.bib3)). Providers respond through five named channels (coding, selection, delay, effort, and triage) drawn from health economics (Ellis, [1998](https://arxiv.org/html/2605.30680#bib.bib8); Ma, [1994](https://arxiv.org/html/2605.30680#bib.bib7); Kuhn and Siciliani, [2008](https://arxiv.org/html/2605.30680#bib.bib11); Holmstrom and Milgrom, [1991](https://arxiv.org/html/2605.30680#bib.bib13)), and an Identify–Produce–Settle (IPS) loop keeps rules, responses, and outcomes in the same rollout, treating settlement (reimbursement, scores, bonuses) as part of the mechanism rather than a reporting layer. The same loop is the search interface: because candidates are typed code expressions over a small state-feature set, useful mutations are *semantic* edits over code rather than gradient steps or random rewrites, a regime in which LLM-guided code search outperforms unguided genetic operators (Romera-Paredes et al., [2024](https://arxiv.org/html/2605.30680#bib.bib21); Lehman et al., [2022](https://arxiv.org/html/2605.30680#bib.bib40); Novikov et al., [2025](https://arxiv.org/html/2605.30680#bib.bib22)), while line-by-line auditability rules out neural controllers. The language model therefore acts as a code-editing operator over the rule program under a safety-penalized closed-loop fitness; provider agents are parameterized response classes, not LLMs.

Three experiments close the loop. An incentive sweep recovers classical findings as adjacent regimes of one phase diagram (up-coding and low-complexity selection under profit pressure, balanced-interior Goodhart drift), and administrative lever sweeps expose pressure migration: audits shift pressure from coding to selection, while bonus pools and KPI-steered flex capacity reveal proxy and waiting-time failures. LLM-guided code search over the same rule interface refines a diverse warm-start library into an inspectable mixed-objective program that eliminates up-coding, halves rejection, and retains most of the profit-oriented baseline’s funds; ablations show warm-start priors and LLM-guided refinement to be jointly necessary.

#### Contributions\.

(1) *An LLM-program-synthesis testbed for high-stakes mechanism design.* We recast provider-side mechanism design as LLM-guided program synthesis over a typed administrative DSL, in which neural controllers are excluded by audit requirements and the LLM acts as a code-editing operator on inspectable rule programs under safety-penalized multi-agent rollouts.

(2) *A closed-loop strategic-response benchmark.* We release Medi-Sim, an Identify–Produce–Settle simulator that keeps administrator rules, five strategic provider channels, and realized access/reimbursement/performance in the same rollout, exposing the channel-level diagnostics needed to detect strategic-response distortion.

(3) *Pressure migration as a benchmark phenomenon addressable by LLM-guided code search.* Classical healthcare failures occupy adjacent regimes of one mechanism space, and LLM-guided refinement of a diverse warm-start library can reduce targeted manipulation while monitoring whether pressure reappears on adjacent channels; in the main held-out mixed-policy comparison, the searched program closes the coding channel without increasing rejection. Ablations attribute the effect jointly to priors and LLM code editing.

## 2 Problem Formulation

We model the hospital as a finite-horizon stochastic Stackelberg game over $T$ periods. The *hospital administrator* is the leader, committing to a mechanism action $u_t$ at each step; the *provider population* is the follower, drawn from a tractable response class $\Pi_P$ described below. Throughout, $J$ indexes care teams.

#### State and leader action\.

The hospital state $X_t$ collects funds $F_t$, congestion $Q_t$, per-team queues $\{\mathcal{Q}_{j,t}\}$, the previous-period KPI vector, and reputation $\mathrm{Rep}_t$ (Eq. ([6](https://arxiv.org/html/2605.30680#A2.E6)), App. [B](https://arxiv.org/html/2605.30680#A2)). The leader action $u_t$ collects incentive coefficients $(\alpha_t, \beta_t)$ for provider financial and quality sensitivity, total and flexible capacities $(B^{\mathrm{tot}}_t, B^{\mathrm{flex}}_t)$, the bonus pool $B^{\mathrm{pool}}_t$ and softmax sharpness $\kappa$, the KPI weights $(w_H, w_W, w_{\mathrm{rej}}, w_C)$ on health/waiting/rejection/cost, audit intensity $q_t$, and an optional KPI-steering switch $\xi_t$ (Eq. ([7](https://arxiv.org/html/2605.30680#A2.E7)), App. [B](https://arxiv.org/html/2605.30680#A2)).

#### Follower action: five distortion channels\.

Each team $j$ observes its queue, capacity signals, fatigue, and incentives, and chooses an action that decomposes into five channels:

$$a_{j,t} = \big(\underbrace{\hat{g}_{i,t}}_{\textit{coding}},\ \underbrace{d^{\mathrm{acc}}_{ij,t}}_{\textit{selection}},\ \underbrace{d^{\mathrm{def}}_{ij,t}}_{\textit{delay}},\ \underbrace{E_{ij,t}}_{\textit{effort}},\ \underbrace{R_{ij,t}}_{\textit{triage/resource}}\big), \tag{1}$$

indexed by candidate patient $i$. These are exactly the five channels through which providers strategically respond to medical mechanisms in the health-economics literature (Ellis, [1998](https://arxiv.org/html/2605.30680#bib.bib8); Ma, [1994](https://arxiv.org/html/2605.30680#bib.bib7); Dafny, [2005](https://arxiv.org/html/2605.30680#bib.bib9); Kuhn and Siciliani, [2008](https://arxiv.org/html/2605.30680#bib.bib11); Holmstrom and Milgrom, [1991](https://arxiv.org/html/2605.30680#bib.bib13)), and they map one-to-one to the distortion measurements reported in §[5](https://arxiv.org/html/2605.30680#S5). The five-channel choice covers the main margins exposed by the Identify–Produce–Settle loop without making the response model too broad to diagnose channel-by-channel behavior. The provider-side team utility that drives these five channels is

$$U_{j,t} = \alpha_t\,(\mathrm{Rev}_{j,t} - C_{j,t}) + \beta_t H_{j,t} + \theta\,\mathrm{Bonus}_{j,t} - \nu\,\big[\max(0, \mathrm{Load}_{j,t} - \bar{E}_j)\big]^2, \tag{2}$$

where $\theta > 0$ is the fixed weight on the realized bonus and $\nu > 0$ is the convex fatigue penalty above per-team load capacity $\bar{E}_j$.
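As a minimal sketch of Eq. (2), the team utility can be evaluated from its five per-period inputs. The container `TeamPeriod` and the function name `team_utility` are illustrative names introduced here, not identifiers from the Medi-Sim code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TeamPeriod:
    """Per-team, per-period quantities entering Eq. (2) (hypothetical container)."""
    revenue: float  # Rev_{j,t}
    cost: float     # C_{j,t}
    health: float   # H_{j,t}
    bonus: float    # Bonus_{j,t}
    load: float     # Load_{j,t}


def team_utility(x, alpha, beta, theta, nu, load_capacity):
    """U = alpha*(Rev - C) + beta*H + theta*Bonus - nu*max(0, Load - E_bar)^2."""
    overload = max(0.0, x.load - load_capacity)  # fatigue applies only above capacity
    return (alpha * (x.revenue - x.cost)
            + beta * x.health
            + theta * x.bonus
            - nu * overload ** 2)
```

The fatigue term is zero whenever the load stays at or below $\bar{E}_j$, so raising `load_capacity` above the realized load removes the penalty entirely.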

#### Bounded\-rationality response class\.

We restrict the follower to a tractable response class $\Pi_P = \{\pi_P^{\phi} : \phi \in \Phi\}$ parameterized by interpretable behavioral coefficients $\phi$ that govern, per channel, how aggressively the team’s action moves with the local gradient of Eq. ([2](https://arxiv.org/html/2605.30680#S2.E2)); the functional forms are given in §[3](https://arxiv.org/html/2605.30680#S3). This is a deliberate design choice rather than an equilibrium claim: it preserves per-channel identifiability, keeps each channel inspectable, and matches the comparative-static predictions used to validate the simulator in §[5](https://arxiv.org/html/2605.30680#S5).
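To make "moves with the local gradient" concrete, the sketch below takes a single clipped gradient step on one scalar channel. It is an illustration only: the paper's actual functional forms are in §3 and the appendix, and the finite-difference gradient, the clipping bounds, and the name `gradient_response` are assumptions made here. The `step` argument plays the role of one behavioral coefficient in $\phi$.

```python
def gradient_response(utility, action, step, lower=0.0, upper=1.0, eps=1e-4):
    """Move one channel's action along a central-difference utility gradient, then clip.

    `utility` maps a scalar action to the team's utility with all else held fixed.
    `step` is the (hypothetical) behavioral coefficient: 0 means no response.
    """
    grad = (utility(action + eps) - utility(action - eps)) / (2.0 * eps)
    return min(upper, max(lower, action + step * grad))
```

A larger `step` gives a more aggressive provider on that channel, which is the sense in which the response class stays bounded-rational rather than fully optimizing.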

#### Stackelberg objective\.

A mechanism is evaluated only through the rollout distribution induced by the follower’s response. The leader optimizes a chosen social objective $o \in \{\mathrm{welfare}, \mathrm{profit}, \mathrm{mixed}\}$ with per-objective discounted return on seed $s$,

$$G^{o}(\pi_A, \pi_P^{*}; s) = \sum_{t=1}^{T} \gamma_d^{\,t-1}\, r_A^{o}(X_t, u_t), \tag{3}$$

and solves

$$\pi_A^{o,*} \in \arg\max_{\pi_A \in \Pi_A} \Big\{ \mathbb{E}_{s,\pi_P^{*}}[G^{o}] - \lambda_{\mathrm{unsafe}}\,\mathbb{E}[V] - \lambda_{\mathrm{var}}\,\mathrm{Var}_s[G^{o}] \Big\}, \tag{4}$$

where $\pi_P^{*} \in \Pi_P(\pi_A)$ is the bounded-rationality best response, $V$ aggregates safety/distortion diagnostics (unsafe waiting, high-complexity deferral, up-coding, rejection, insolvency), and the variance term is a seed-reliability regularizer (App. [B](https://arxiv.org/html/2605.30680#A2)). A mechanism is successful only if the induced provider behavior remains acceptable on every diagnostic, on the average seed *and* reliably across seeds. We defer the choice of policy class $\Pi_A$ to §[4](https://arxiv.org/html/2605.30680#S4), which discharges auditability constraints by instantiating $\Pi_A$ as a typed inspectable program class and solves Eq. ([4](https://arxiv.org/html/2605.30680#S2.E4)) by AlphaEvolve-style code search.
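One way to read Eqs. (3) and (4) is as a per-seed discounted sum followed by a penalized average over seeds. The sketch below assumes that one return and one aggregated violation score $V$ are available per seed and that the variance is the population variance across seeds; the paper's exact estimator is in App. B, and the function names are hypothetical.

```python
from statistics import fmean, pvariance


def discounted_return(rewards, gamma_d):
    """Eq. (3): sum over t = 1..T of gamma_d^(t-1) * r_t."""
    return sum(gamma_d ** k * r for k, r in enumerate(rewards))


def penalized_fitness(returns, violations, lam_unsafe, lam_var):
    """Empirical bracket of Eq. (4): mean return minus safety and seed-variance penalties.

    Assumption: population variance across seeds; a sample variance is equally plausible.
    """
    if not returns or len(returns) != len(violations):
        raise ValueError("need one return and one violation score per seed")
    return (fmean(returns)
            - lam_unsafe * fmean(violations)
            - lam_var * pvariance(returns))
```

Under this reading, a candidate with a high mean return can still rank below a safer one if its violation score or its spread across seeds is large.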

## 3 The Medi-Sim Environment

![Refer to caption](https://arxiv.org/html/2605.30680v1/figures/overview_ips_v2.png)

Figure 1: Medi-Sim IPS and policy-as-code overview. Top: the hospital administrator writes episode-level front-desk, ward, and billing/review rules; stars mark levers refined by AlphaEvolve. Middle: clinician programs respond within locked rules through the Identify–Produce–Settle loop. Bottom: the dashboard reports channel-level diagnostics that guide policy search.

Medi-Sim implements the Identify–Produce–Settle (IPS) decomposition of §[2](https://arxiv.org/html/2605.30680#S2) with the policy interface shown in Figure [1](https://arxiv.org/html/2605.30680#S3.F1): hospital administrative rules are fixed for an episode, clinician-side programs respond within those rules, and the resulting rollout dashboard exposes the same distortion channels used by L1–L3 (Algorithm [A.1](https://arxiv.org/html/2605.30680#alg1) in App. [A](https://arxiv.org/html/2605.30680#A1) gives the period-by-period loop). We describe each primitive in turn.

### 3.1 Identify: arrivals, classification, and the coding wedge

At each step, a default hospital DRG-style arrival process² draws a Poisson batch of patients $\mathcal{P}_t$ with rate $\lambda_{\mathrm{arr}}$ (Eq. ([10](https://arxiv.org/html/2605.30680#A3.E10)), App. [C](https://arxiv.org/html/2605.30680#A3)); non-Poisson kernels are admissible (App. [L](https://arxiv.org/html/2605.30680#A12)). Each patient $i$ carries a true group $g_i^{\star} \in \mathcal{G}$, a normalized hospital case-mix index (CMI)³ value $\mathrm{CMI}_i^{\star} \in [0,1]$, urgency $\mathrm{Urg}_i \in [0,1]$, and waiting tolerance. The default simulator uses an aggregated macro-DRG distribution inspired by common inpatient categories and DRG relative weights (Centers for Medicare & Medicaid Services, [2026](https://arxiv.org/html/2605.30680#bib.bib24); Healthcare Cost and Utilization Project, [2025](https://arxiv.org/html/2605.30680#bib.bib25)). True complexity determines clinical need; coded complexity determines settlement. This split is the *hospital coding wedge*⁴: it creates the rent that the coding channel arbitrates.

² Hospital DRG-style arrivals carry clinical type, urgency, tolerance, and reimbursement-relevant case weight; see Appendix [L](https://arxiv.org/html/2605.30680#A12).

³ CMI is the normalized hospital case-mix index used as patient complexity and as the payment-relevant weight that coding can distort; see Appendix [L](https://arxiv.org/html/2605.30680#A12).

⁴ The hospital coding wedge is the gap between true clinical complexity and the reported billing group used for settlement; see Appendix [L](https://arxiv.org/html/2605.30680#A12).
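The arrival step can be sketched as a Poisson draw followed by sampling the per-patient attributes listed above. Everything below is illustrative: the uniform attribute distributions, the integer tolerance range, and the names `Patient`, `poisson_draw`, and `draw_arrivals` are assumptions, whereas the simulator itself uses a macro-DRG distribution (Eq. (10), App. C).

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Patient:
    true_group: int  # g_i^* as an index into the group set G
    cmi: float       # CMI_i^* in [0, 1]
    urgency: float   # Urg_i in [0, 1]
    tolerance: int   # waiting tolerance in periods (hypothetical unit)


def poisson_draw(rate, rng):
    """Knuth's multiplication method for a Poisson variate; suitable for modest rates."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def draw_arrivals(rate, n_groups, rng):
    """Sample one period's batch P_t with placeholder (uniform) attribute distributions."""
    return [
        Patient(rng.randrange(n_groups), rng.random(), rng.random(), rng.randint(1, 5))
        for _ in range(poisson_draw(rate, rng))
    ]
```

Passing a seeded `random.Random` instance keeps each rollout reproducible, which matters when results are averaged over a fixed set of seeds.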

#### Coding action\.

A hospital coder chooses a reported group $\hat{g}_{i,t}$ from a finite candidate set around the true group by applying a score-based choice rule to candidate groups. For each candidate group, the host simulator computes candidate-level signals: incremental reimbursement $\Delta R_i(g)$, audit-expected penalty under the configured audit schedule, ethics pressure from coded-complexity inflation, and the resulting coding gap. The coding rule maps these signals to a scalar candidate score; the baseline closed form is Eq. ([11](https://arxiv.org/html/2605.30680#A3.E11)) in App. [C](https://arxiv.org/html/2605.30680#A3).

The same score can be instantiated either as a stochastic softmax choice or as its deterministic zero-temperature limit, $\hat{g}_{i,t} = \arg\max_{g} s_i(g)$. The AlphaEvolve L3 implementation uses the deterministic variant to reduce evaluation noise, while keeping the candidate set, score features, audit schedule, and settlement routine fixed. As incentives vary, this score-based channel produces intermediate aggregate up-coding rates across patients and rollouts rather than making coding an all-or-nothing administrative switch.

In L3, `candidate_score` is only the exposed behavioral scoring map for this channel. Edits to it reweight the coder’s sensitivity to fixed host-computed features such as `upcode_pressure`, `audit_penalty`, `ethics_pressure`, and `coding_gap`; candidate construction, audit, penalties, and clawback remain host-side settlement routines.
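The two instantiations of the coding rule, softmax sampling and its zero-temperature argmax limit, can be sketched as follows. The linear form and the signs inside `candidate_score` are assumptions chosen for illustration (the baseline closed form is Eq. (11) in App. C); only the four feature names come from the text, and `softmax_probs` and `choose_group` are hypothetical helpers.

```python
import math
import random


def candidate_score(features, weights):
    """Hypothetical linear score: reimbursement pull minus three deterrent terms."""
    return (weights["upcode_pressure"] * features["upcode_pressure"]
            - weights["audit_penalty"] * features["audit_penalty"]
            - weights["ethics_pressure"] * features["ethics_pressure"]
            - weights["coding_gap"] * features["coding_gap"])


def softmax_probs(scores, temperature):
    """Numerically stable softmax over candidate scores."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def choose_group(groups, scores, temperature=0.0, rng=None):
    """Return the reported group; temperature <= 0 selects the deterministic argmax."""
    if temperature <= 0.0:
        return max(zip(groups, scores), key=lambda pair: pair[1])[0]
    rng = rng or random.Random()
    return rng.choices(groups, weights=softmax_probs(scores, temperature), k=1)[0]
```

In this sketch an L3 edit would change only the entries of `weights`; the features and the candidate list are supplied by the host and stay fixed, mirroring the split described above.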

#### Patient routing\.

Patients register to service units rather than being reassigned by the hospital. In L1/L2 active strategic routing is disabled and KPI-aware steering acts only on the flexible capacity pool (§[3.3](https://arxiv.org/html/2605.30680#S3.SS3)), so selection and delay arise predominantly through the provider triage channel (§[3.2](https://arxiv.org/html/2605.30680#S3.SS2)); the external-validity implications are discussed in Limitations.

### 3.2 Produce: capacity-constrained treatment

A service unit observes its queue, capacity signals, fatigue, and incentives. Its action realizes four of the five distortion channels at once: *triage/resource* (accept, reject, defer, request a constrained resource), *selection* (the conditional distribution of acceptance over patient types), *delay* (deferral as a cost-relief lever), and *effort* (intensity per accepted case). All four channels are implemented as closed-form behavioral rules that operate on the local gradient of the team utility $U_{j,t}$ defined in Eq. ([2](https://arxiv.org/html/2605.30680#S2.E2)); collectively, these rules are the concrete instantiation of the response class $\Pi_P$ used in §[2](https://arxiv.org/html/2605.30680#S2).

#### Treatment production\.

For an accepted patient $i$ treated by unit $j$, true health output follows a diminishing-return production in effort $E_{ij,t} \geq 0$, gated by a constrained resource $R_{ij,t} \in \{0,1\}$ and scaled by team skill and inverse CMI; cost is convex in effort with exponent $\phi > 1$ (Eqs. ([12](https://arxiv.org/html/2605.30680#A3.E12))–([13](https://arxiv.org/html/2605.30680#A3.E13)), App. [C](https://arxiv.org/html/2605.30680#A3)). Effort therefore arbitrates the intensive margin (gold-plating versus skimping) while triage and selection arbitrate the extensive margin (who is treated and when), exactly as in the two-margin view of provider response (Ellis, [1998](https://arxiv.org/html/2605.30680#bib.bib8); Ma, [1994](https://arxiv.org/html/2605.30680#bib.bib7)).

#### Triage and delay\.

Triage gates accept/reject/defer through a scalar score combining urgency, waiting, predicted margin, capacity, fatigue, and KPI-targeting signals; strategic delay is the cost-relief channel that parks high-cost cases under profit pressure, producing the L1 delay signatures of §[5.1](https://arxiv.org/html/2605.30680#S5.SS1).

### 3.3 Settle: Hospital KPI, bonuses, and the measurement wedge

Team\-level measured performance combines health with operational and financial penalties:

$$\mathrm{KPI}_{j,t} = w_H \bar{H}_{j,t} - w_W \bar{W}_{j,t} - w_{\mathrm{rej}}\,\mathrm{Reject}_{j,t} - w_C \bar{C}_{j,t}. \tag{5}$$

Bonuses are allocated through a softmax tournament over $\mathrm{KPI}_{j,t}$ with sharpness $\kappa$ and pool $B^{\mathrm{pool}}_t$, yielding a local marginal bonus pressure $B^{\mathrm{pool}}_t\,\kappa\, s_{j,t}(1 - s_{j,t})$ that feeds back into effort, triage, and KPI-targeting behavior (Eqs. ([14](https://arxiv.org/html/2605.30680#A3.E14))–([16](https://arxiv.org/html/2605.30680#A3.E16)), App. [C](https://arxiv.org/html/2605.30680#A3)). Because the measured hospital KPI is a weighted aggregate of true health, waiting, rejection, and cost, it is generically misaligned with the principal’s objective whenever complex care raises true health but worsens the proxy score; this is the *hospital measurement wedge*⁵ that hosts Goodhart-style gaming on triage signals (Holmstrom and Milgrom, [1991](https://arxiv.org/html/2605.30680#bib.bib13); Baker, [1992](https://arxiv.org/html/2605.30680#bib.bib14); Bevan and Hood, [2006](https://arxiv.org/html/2605.30680#bib.bib15)). The L1 incentive sweep in §[5.1](https://arxiv.org/html/2605.30680#S5.SS1) shows that this wedge becomes large and negative in exactly the intermediate-incentive interior region predicted by multitask theory.

⁵ The hospital measurement wedge is the gap between true clinical value and the measured hospital KPI used for bonuses or steering; see Appendix [L](https://arxiv.org/html/2605.30680#A12).
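The marginal bonus pressure follows from differentiating a softmax share: if team $j$ receives $B^{\mathrm{pool}}_t s_{j,t}$ with $s_{j,t} \propto \exp(\kappa\,\mathrm{KPI}_{j,t})$, then $\partial(B^{\mathrm{pool}}_t s_{j,t})/\partial \mathrm{KPI}_{j,t} = B^{\mathrm{pool}}_t \kappa\, s_{j,t}(1 - s_{j,t})$. The sketch below assumes that proportional-share allocation (the exact settlement rule is in Eqs. (14)–(16), App. C); the function names are illustrative.

```python
import math


def team_kpi(health, waiting, reject, cost, w_h, w_w, w_rej, w_c):
    """Eq. (5): weighted health minus waiting, rejection, and cost penalties."""
    return w_h * health - w_w * waiting - w_rej * reject - w_c * cost


def bonus_shares(kpis, kappa):
    """Softmax tournament shares s_j over team KPIs with sharpness kappa."""
    peak = max(kpis)
    exps = [math.exp(kappa * (k - peak)) for k in kpis]
    total = sum(exps)
    return [e / total for e in exps]


def marginal_bonus_pressure(kpis, kappa, pool, j):
    """d(pool * s_j) / d KPI_j = pool * kappa * s_j * (1 - s_j)."""
    s = bonus_shares(kpis, kappa)[j]
    return pool * kappa * s * (1.0 - s)
```

The factor $s(1-s)$ peaks at $s = 1/2$, so under this reading the pressure is strongest for teams in close contention and weakest for clear leaders or laggards.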

## 4 Strategic Policy-as-Code

#### Design desiderata\.

For regulated healthcare deployment, an admissible policy class $\Pi_A$ must be (i) *inspectable* line-by-line for compliance review; (ii) *regulable* over a fixed lever set whose semantics match real-world payer and hospital contracts, so that search cannot smuggle in new state variables, measurements, or distortion channels; (iii) *sufficiently expressive* to admit state-conditional rules on observable aggregates (waiting, rejection, profit, utilization); and (iv) *stress-testable* against the response class $\Pi_P$ of §[2](https://arxiv.org/html/2605.30680#S2). Black-box neural controllers (Zheng et al., [2022](https://arxiv.org/html/2605.30680#bib.bib23); Dütting et al., [2024](https://arxiv.org/html/2605.30680#bib.bib29)) satisfy (iii) but neither (i) nor (ii).

#### Policy class: typed executable programs\.

We instantiate $\Pi_A$ as a typed assignment-only DSL. A candidate policy bundle exposes the search-writable administrative expressions: $(\alpha, \beta)$, total and flexible capacities, the bonus pool and sharpness $\kappa$, the KPI weight vector $(w_H, w_W, w_{\mathrm{rej}}, w_C)$, and an optional KPI-steering switch. The audit schedule is not a DSL field: the audit intensity $q_t$, the audit-hit function $p_{\mathrm{audit}}$, penalty multipliers, and clawback logic are fixed host-side configuration for a given rollout/evaluation setting. The bundle also includes selected provider-response expressions for effort, triage/resource requests, and coding candidate scoring, which instantiate the bounded-rationality response class $\Pi_P$.
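A rough idea of what "typed assignment-only" can mean in practice is a syntax-tree whitelist: only assignments to approved fields, built from numeric constants, arithmetic, comparisons, and conditional expressions over observable aggregates. The sketch below uses Python's standard `ast` module as a stand-in for the paper's DSL checker; the field names, feature names, and the function `validate_policy` are hypothetical, and the real allowed keys are listed in App. K.

```python
import ast

# Hypothetical names for the search-writable levers and the observable aggregates.
ALLOWED_FIELDS = {"alpha", "beta", "total_capacity", "flex_capacity", "bonus_pool",
                  "kappa", "w_h", "w_w", "w_rej", "w_c", "kpi_steering"}
ALLOWED_FEATURES = {"waiting", "rejection", "profit", "utilization"}
ALLOWED_NODES = (ast.Module, ast.Assign, ast.Name, ast.Store, ast.Load, ast.Constant,
                 ast.BinOp, ast.UnaryOp, ast.Add, ast.Sub, ast.Mult, ast.Div, ast.USub,
                 ast.IfExp, ast.Compare, ast.Gt, ast.Lt, ast.GtE, ast.LtE)


def validate_policy(source):
    """Return a list of violations; an empty list means the program is admissible."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]
    problems = []
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            # Rejects imports, calls, attribute access, loops, and so on.
            problems.append(f"disallowed construct: {type(node).__name__}")
        elif isinstance(node, ast.Name):
            allowed = ALLOWED_FIELDS if isinstance(node.ctx, ast.Store) else ALLOWED_FEATURES
            if node.id not in allowed:
                problems.append(f"unknown name: {node.id}")
        elif isinstance(node, ast.Constant) and not isinstance(node.value, (bool, int, float)):
            problems.append(f"non-numeric constant: {node.value!r}")
    return problems
```

For example, `validate_policy("beta = 0.6 if waiting > 2.0 else 0.5")` returns an empty list, while `validate_policy("audit_intensity = 0.9")` reports an unknown name, consistent with the audit schedule being host-side configuration rather than a writable field.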

L1 and L2 keep the provider-response rules and administrative rule programs fixed at the baseline of §[3.2](https://arxiv.org/html/2605.30680#S3.SS2), except for the designated one-at-a-time diagnostic sweeps. L3 allows search over selected provider-response and administrative expression constants, but the simulator dynamics, patient generation, exposed features, metric computation, host-side clipping, feasibility projection, audit schedule, and settlement/audit routines remain fixed. Thus $\Pi_P$ remains a stress-test reference class for the simulator’s response channels: L3 reweights bounded-rationality responses inside the same provider-response class rather than introducing a new provider model, new measurements, or a co-evolved opponent.

#### Search and evaluation\.

We instantiate this policy class with AlphaEvolve-style evolutionary code search (Novikov et al., [2025](https://arxiv.org/html/2605.30680#bib.bib22); Romera-Paredes et al., [2024](https://arxiv.org/html/2605.30680#bib.bib21)), implemented through OpenEvolve. A candidate program is first checked for syntactic and type validity, restricted to assignment-only edits over fixed policy fields, and then evaluated on short and full stochastic rollouts; Algorithm [A.2](https://arxiv.org/html/2605.30680#alg2) (App. [A](https://arxiv.org/html/2605.30680#A1)) summarizes the outer loop. The scalar fitness is the empirical estimator of the Stackelberg objective Eq. ([4](https://arxiv.org/html/2605.30680#S2.E4)) with $\pi_P^{*}$ taken as the bounded-rationality best response, and the mixed objective uses a log-scaled funds-plus-reputation reward (Eqs. ([B](https://arxiv.org/html/2605.30680#A2.Ex1))–([9](https://arxiv.org/html/2605.30680#A2.E9)), App. [B](https://arxiv.org/html/2605.30680#A2)). Programs that achieve high $G$ through up-coding or unsafe deferral incur a large $V$ and are demoted, so the search trajectory implements the safety constraint rather than relying on a post-hoc filter.
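The outer loop can be pictured as propose, validate, score, and keep the best. The sketch below is a generic elitist loop written for illustration; it is not the OpenEvolve API and not Algorithm A.2. The callables `propose` (standing in for the LLM edit operator), `is_valid` (the syntax and type check), and `fitness` (the penalized rollout score) are placeholders supplied by the caller.

```python
import random


def evolve(seed_programs, propose, is_valid, fitness, generations, pool_size, rng):
    """Elitist search: keep the `pool_size` best-scoring programs seen so far.

    Returns the (score, program) pair with the highest score.
    """
    pool = sorted(((fitness(p), p) for p in seed_programs),
                  key=lambda pair: pair[0], reverse=True)[:pool_size]
    for _ in range(generations):
        _, parent = rng.choice(pool)      # pick a parent from the warm-start pool
        child = propose(parent, rng)      # placeholder for an LLM-proposed edit
        if not is_valid(child):           # invalid candidates are never scored
            continue
        pool.append((fitness(child), child))
        pool.sort(key=lambda pair: pair[0], reverse=True)
        del pool[pool_size:]
    return pool[0]
```

Because the safety penalty sits inside `fitness`, an unsafe candidate loses its place in the pool during the search itself, which is the sense in which the trajectory, rather than a post-hoc filter, carries the constraint.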

## 5 Experiments

The experiments follow a two-stage validation-to-discovery logic. We first verified that nine canonical healthcare stylized facts (DRG coding rent, profit-driven case-mix distortion, audit-induced channel substitution, target gaming, queueing capacity response, flexible-capacity allocation effects, quality-side multitasking, risk-adjustment up-coding, and bonus-pool misalignment) reproduce *directionally* as the simulator’s incentive, audit, and capacity levers are moved. The full validation map, with anchor citations and per-row signatures, is in Appendix Table [4](https://arxiv.org/html/2605.30680#A9.T4). With this external check in place, we use Medi-Sim as a mechanism diagnostic across three layers: L1 maps provider responses over the incentive surface; L2 perturbs administrative levers and traces where provider pressure moves; L3 tests policy-as-code search under welfare, profit, and safety-penalized mixed objectives. All L1/L2 summaries report 30-seed means over horizon $T = 200$ unless otherwise stated; Appendix [H](https://arxiv.org/html/2605.30680#A8) reports the incentive grid, L2 lever sweeps, defaults, and routing/steering switches.

#### Empirical findings\.

The experiments highlight four mechanism-level findings. First, classical healthcare failures occupy neighboring regions of an incentive phase diagram. Second, the balanced interior hides risk on less-visible margins: up-coding and rejection recede while delay and KPI targeting intensify. Third, administrative controls move pressure across channels: audit closes coding but raises selection, bonus expansion worsens proxy alignment, and KPI-steered flexible capacity raises waiting. Fourth, policy-as-code search follows its objective: pure-profit search amplifies coding, whereas the mixed objective reshapes incentive geometry and removes up-coding while preserving much of the return.

### 5.1 L1: failure regimes form one incentive phase diagram

![Refer to caption](https://arxiv.org/html/2605.30680v1/x1.png)

Figure 2: L1 incentive phase diagram. Each panel reports 30-seed means over the $11 \times 11$ $(\alpha, \beta)$ grid. The grid separates low-incentive access rationing, profit-driven coding and selection, quality-driven effort and budget pressure, and balanced-interior delay and KPI targeting.

L1 sweeps financial and quality sensitivities over an $11 \times 11$ grid. The response surface is structured but non-monotone: weak incentives ration access; profit pressure activates coding and selection (up-coding $0.226$, high-CMI rejection gap $0.182$); quality pressure suppresses these channels while increasing effort and solvency stress; and the balanced interior shifts pressure to delay and KPI targeting (representative-regime metrics in Appendix Table 2). The profit- and quality-driven regions recover familiar incentive-theory predictions, but the balanced interior is most diagnostic: visible metrics improve while pressure moves into selective deferral and proxy targeting. There, high-CMI delay is $0.290$ versus $0.010$ for low-CMI patients, and measured KPI becomes negatively aligned with true health ($-0.659$). L1 therefore exposes an interior failure hidden by single-channel diagnostics.

### 5\.2 L2: administrative levers move pressure across channels

![Refer to caption](https://arxiv.org/html/2605.30680v1/x2.png)

Figure 3: L2 one\-at\-a\-time policy ablations\. Curves report 30\-seed means over horizon $T=200$ for balanced, quality\-driven, and profit\-driven regimes; shaded bands are 95% confidence intervals\. Audits suppress up\-coding, capacity lowers waiting, and KPI/bonus/flex levers induce nonlinear responses\.

L2 reads administrative levers as pressure\-tracing interventions: each lever is evaluated by its full response vector across coding, selection, delay, effort, and triage\.

The audit sweep gives the cleanest substitution pattern\. Raising audit probability lowers balanced\-regime up\-coding, while cherry\-picking rises\. Audit reallocates pressure from billing to selection, making access and delay diagnostics necessary alongside billing accuracy\.

The bonus\-pool sweep shows the measurement version of the same mechanism\. Larger bonuses strengthen the reward attached to the measured KPI\. When that proxy is misaligned with true health, stronger incentives widen the wedge: at the endpoints of the balanced\-regime sweep, the KPI–true\-health correlation falls from $-0.447$ at $B^{\mathrm{pool}}=0$ to $-0.839$ at $B^{\mathrm{pool}}=15$\. This is the main Goodhart\-style L2 result\.
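A toy calculation can make the widening wedge concrete. The sketch below is not the paper's model and does not reproduce its numbers: it assumes each team has a fixed quality level and a fixed propensity to game the proxy, that gaming scales linearly with the bonus pool, and that gaming raises the measured KPI while lowering true health by the same amount. All values in `QUALITY` and `GAMING` are hypothetical.

```python
from math import sqrt
from typing import Sequence

QUALITY = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical intrinsic team quality
GAMING = [0.5, 0.1, 0.4, 0.0, 0.3]    # hypothetical proxy-gaming propensity


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def kpi_health_correlation(bonus_pool: float) -> float:
    """Correlation between measured KPI and true health across teams."""
    kpi = [q + bonus_pool * g for q, g in zip(QUALITY, GAMING)]
    health = [q - bonus_pool * g for q, g in zip(QUALITY, GAMING)]
    return pearson(kpi, health)


c0 = kpi_health_correlation(0.0)    # proxy and health coincide
c5 = kpi_health_correlation(5.0)    # alignment weakens
c15 = kpi_health_correlation(15.0)  # proxy becomes anti-correlated with health
```

In this toy the correlation starts at one and turns negative as the bonus pool grows, which is the direction of the reported effect; the paper's balanced regime already starts from a negative correlation at $B^{\mathrm{pool}}=0$.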

The flexible\-capacity sweep gives the operational version: under KPI steering, balanced waiting rises from $1.88$ to $2.23$ as the flexible pool grows because additional capacity follows bonus\-sensitive teams rather than the longest queues, while a steering\-off diagnostic flattens the slope to $1.88\rightarrow 1.88$ \(App\. [G](https://arxiv.org/html/2605.30680#A7)\)\. Flexible capacity acts through its allocation rule, not through capacity volume alone\.
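The role of the allocation rule can be illustrated with a small sketch. It assumes a proportional split of the flexible pool (largest-remainder rounding) and a one-period backlog as a crude waiting proxy; neither is claimed to be the rule used in Medi-Sim, and the queue lengths, KPI scores, and capacities are hypothetical.

```python
from typing import List, Sequence


def allocate_flex(pool: int, weights: Sequence[float]) -> List[int]:
    """Split `pool` whole units across teams in proportion to `weights`."""
    total = sum(weights)
    if total <= 0:
        return [0] * len(weights)
    raw = [pool * w / total for w in weights]
    alloc = [int(r) for r in raw]
    # Hand leftover units to the teams with the largest fractional remainders.
    leftover = pool - sum(alloc)
    order = sorted(range(len(weights)), key=lambda i: raw[i] - alloc[i], reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


def residual_backlog(queues, base_capacity, flex_alloc) -> float:
    """Mean number of patients left unserved after one period."""
    left = [max(q - c - f, 0) for q, c, f in zip(queues, base_capacity, flex_alloc)]
    return sum(left) / len(left)


queues = [12, 4, 3]        # hypothetical queue lengths per team
kpi = [0.2, 0.9, 0.9]      # hypothetical measured KPI scores per team
base = [3, 3, 3]           # hypothetical fixed capacity per team

kpi_alloc = allocate_flex(6, kpi)       # capacity follows high-KPI teams
queue_alloc = allocate_flex(6, queues)  # capacity follows the longest queue

kpi_backlog = residual_backlog(queues, base, kpi_alloc)
queue_backlog = residual_backlog(queues, base, queue_alloc)
```

With the same six flexible units, the KPI-weighted split sends nothing to the congested team, so its backlog stays high; the queue-weighted split clears more of it. Volume is identical in both cases, only the rule differs.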

### 5\.3 L3: search follows the incentives

L3 uses policy\-as\-code search to test how objective choice shapes the discovered mechanism family\. Candidates are typed edits over the allowed DSL keys only \(App\. [K](https://arxiv.org/html/2605.30680#A11)\); implementation details and seed splits are in App\. [J](https://arxiv.org/html/2605.30680#A10)\.

| Objective | Method | Fitness ↑ | Wait ↓ | Reject ↓ | Upcoding | Funds |
| --- | --- | --- | --- | --- | --- | --- |
| Welfare | Greedy\-Quality | 16\.634 | 1\.693 | 0\.032 | 0\.000 | 10\.8 |
| Welfare | AlphaEvolve | 16\.932 | 1\.529 | 0\.011 | 0\.000 | 420\.6 |
| Profit | Greedy\-Profit | 121\.846 | 1\.784 | 0\.068 | 0\.758 | 7288\.3 |
| Profit | AlphaEvolve | 122\.046 | 1\.786 | 0\.068 | 0\.807 | 7353\.7 |
| Mixed | Greedy\-Profit | 13\.580 | 1\.784 | 0\.068 | 0\.758 | 7288\.3 |
| Mixed | Best warm start | 13\.607 | 1\.674 | 0\.058 | 0\.000 | 5445\.9 |
| Mixed | AlphaEvolve | 13\.876 | 1\.727 | 0\.033 | 0\.000 | 5480\.7 |

Table 1: L3 held\-out performance\. Entries are means over held\-out test seeds\. Blue cells identify searched policies; green cells mark key comparisons; red flags the profit\-only up\-coding risk\.

The welfare objective is the positive\-control case: AlphaEvolve improves the welfare\-family policy along every dimension in Table [1](https://arxiv.org/html/2605.30680#S5.T1) while keeping up\-coding at zero, and lifts doctor margin: search trims excessive effort cost while preserving the gains rewarded by the welfare objective \(App\. [J\.2](https://arxiv.org/html/2605.30680#A10.SS2)\)\.

The profit objective is the reward\-hacking diagnostic: AlphaEvolve makes only a small refinement over Greedy\-Profit and spends it through the coding channel \(up\-coding $0.758\to 0.807$ alongside higher funds\), showing that coding remains an available optimization channel when safety violations carry no penalty\.

The mixed objective produces the central L3 result\. AlphaEvolve keeps return close to the profit\-oriented baseline, reduces the violation score from 8\.170 to 3\.002, halves rejection, and drives up\-coding to zero\. The mechanism\-level change visible in App\. [J\.2](https://arxiv.org/html/2605.30680#A10.SS2) is that the searched policy lowers local bonus pressure while retaining aggregate performance: search therefore changes *which channels* earn the return, rather than improving a scalar score\.

The warm\-start ablation defines the scope of this result\. With a neutral\-only library, search fails to recover the mixed family\. This scopes the policy\-as\-code interface as a structured refinement tool over meaningful policy priors\.
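The qualitative claims about the mixed objective ("halves rejection", "retains much of the funds") can be checked directly against the Table 1 entries. The short calculation below uses only the reported values; treating the Greedy-Profit row as the profit-oriented baseline is our reading of the table, and the helper name is illustrative.

```python
# Mixed-objective rows of Table 1 (held-out means as reported).
TABLE_1_MIXED = {
    "Greedy-Profit":   {"reject": 0.068, "upcoding": 0.758, "funds": 7288.3},
    "Best warm start": {"reject": 0.058, "upcoding": 0.000, "funds": 5445.9},
    "AlphaEvolve":     {"reject": 0.033, "upcoding": 0.000, "funds": 5480.7},
}


def relative_to_baseline(method: str, baseline: str = "Greedy-Profit") -> dict:
    """Express a method's rejection and funds as fractions of the baseline."""
    row, base = TABLE_1_MIXED[method], TABLE_1_MIXED[baseline]
    return {
        "reject_ratio": row["reject"] / base["reject"],
        "funds_retained": row["funds"] / base["funds"],
    }


summary = relative_to_baseline("AlphaEvolve")
# reject_ratio   is about 0.485 (rejection roughly halved)
# funds_retained is about 0.752 (about three quarters of baseline funds)
```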

## 6 Discussion and Conclusion

#### Three classical episodes, one simulator\.

The episodes motivating §[1](https://arxiv.org/html/2605.30680#S1) map onto distinct $(\alpha,\beta)$ regions of L1’s phase diagram: Dafny\-style up\-coding at the high\-$\alpha$/low\-$\beta$ corner, Silverman ownership\-conditional case\-mix distortion along the $\alpha$ axis, and Bevan target gaming in the balanced interior\. Recovering these patterns under one set of dynamics gives Medi\-Sim its benchmark role: a common diagnostic environment for evaluating mechanisms under strategic provider response, rather than a collection of separately tuned stylized examples\.

#### Channel substitution is the structural obstacle\.

L2 shows that closing one distortion channel can reopen another: audits suppress up\-coding, but provider response shifts toward selection or delay, because the five channels of $\Pi_P$ are coupled through Eq\. \([2](https://arxiv.org/html/2605.30680#S2.E2)\)\. Evaluation must therefore track the whole response, not the single metric a mechanism was designed to improve\.

#### Implications and closing\.

Medi\-Sim makes three evaluation requirements explicit for healthcare policy ML: provider response should be endogenous, searched mechanisms should remain inspectable as code, and scalar reward should be decomposed into channel\-level diagnostics so that reward gains through distortion are counted as failures\. The released closed\-loop environment instantiates these requirements with an Identify–Produce–Settle simulator and an AlphaEvolve\-style search interface over a typed DSL\. In the held\-out mixed\-objective evaluation, the searched policy eliminates measured up\-coding while retaining much of the profit\-oriented baseline’s funds, showing that policy\-as\-code search can reshape the channel through which return is earned\. We hope Medi\-Sim makes provider\-side strategic response a standard benchmark target for ML on healthcare policy\.

## Limitations

Medi\-Sim is a mechanistic simulator, not a calibrated clinical deployment model, and the design trades external validity for per\-channel inspectability in several deliberate ways\.

Bounded\-rationality response, not equilibrium\. The response class $\Pi_P$ of §[2](https://arxiv.org/html/2605.30680#S2) is implemented as closed\-form behavioral rules driven by the local gradient of the team utility \([2](https://arxiv.org/html/2605.30680#S2.E2)\) rather than as solved equilibria of an inner game\. This is a tractable but strict approximation; how well it tracks a fully strategic provider population is an empirical question we do not resolve here\.

Aggregated DRG and patient\-choice routing\. The default simulator uses an aggregated macro\-DRG distribution and patient\-choice registration\. Active strategic hospital routing is disabled in L1/L2, and hospital KPI steering acts only on the flexible capacity pool\. These choices keep the selection and delay channels attributable to provider triage, which is essential for the L1/L2 attribution claims, but they limit the realism of routing\-heavy interventions\.

L3 depends on the warm\-start library\. As reported in §[5\.3](https://arxiv.org/html/2605.30680#S5.SS3) and Appendix [J](https://arxiv.org/html/2605.30680#A10), the L3 mixed\-objective result is a refinement of a diverse warm\-start library; with a neutral\-only library, a $K=200$ search does not recover the same family\. We therefore present AlphaEvolve over Medi\-Sim as a feasibility demonstration of program search over the Mechanism\-as\-Code policy class rather than as a benchmark\-winning algorithm\.

Synthetic rollouts\. Held\-out evaluation uses fixed seed splits over synthetic rollouts; we do not establish real\-world effectiveness, and any deployment use would require domain validation, calibration to local case mix, and safety, equity, and legal review\.

## Ethical Considerations

The simulator studies high\-stakes healthcare operations\. Its policies must not be interpreted as clinical recommendations or administrative guidance for real hospitals without domain validation, safety review, and fairness analysis\. Modeling behaviors such as up\-coding, cherry\-picking, and strategic delay is intended for detection and stress testing, not for operationalizing manipulation\. Any future deployment would require safeguards for patient access, equity across case complexity, clinical safety, privacy, and legal compliance\.

## References

- G\. Baker \(1992\) Incentive contracts and performance measurement\. Journal of Political Economy 100\(3\), pp\. 598–614\. External Links: [Document](https://dx.doi.org/10.1086/261831)\.
- R\. Bekker, G\. Koole, and D\. Roubos \(2017\) Flexible bed allocations for hospital wards\. Health Care Management Science 20\(4\), pp\. 453–466\. External Links: [Document](https://dx.doi.org/10.1007/s10729-016-9364-4)\.
- G\. Bevan and C\. Hood \(2006\) What’s measured is what matters: targets and gaming in the English public health care system\. Public Administration 84\(3\), pp\. 517–538\. External Links: [Document](https://dx.doi.org/10.1111/j.1467-9299.2006.00600.x)\.
- S\. C\. Brailsford, P\. R\. Harper, B\. Patel, and M\. Pitt \(2009\) An analysis of the academic literature on simulation and modelling in health care\. Journal of Simulation 3\(3\), pp\. 130–140\. External Links: [Document](https://dx.doi.org/10.1057/jos.2009.10)\.
- D\. T\. Campbell \(1979\) Assessing the impact of planned social change\. Evaluation and Program Planning 2\(1\), pp\. 67–90\. External Links: [Document](https://dx.doi.org/10.1016/0149-7189%2879%2990048-X), [Link](https://doi.org/10.1016/0149-7189(79)90048-X)\.
- T\. Cayirli and E\. Veral \(2003\) Outpatient scheduling in health care: a review of literature\. Production and Operations Management 12\(4\), pp\. 519–549\. External Links: [Document](https://dx.doi.org/10.1111/j.1937-5956.2003.tb00218.x)\.
- Centers for Medicare & Medicaid Services \(2026\) Acute inpatient prospective payment system\. Note: CMS technical documentation and public use files\. External Links: [Link](https://www.cms.gov/medicare/payment/prospective-payment-systems/acute-inpatient-pps/fy-2026-ipps-final-rule-home-page#CMS-1833-F)\.
- L\. S\. Dafny \(2005\) How do hospitals respond to price changes?\. American Economic Review 95\(5\), pp\. 1525–1547\. External Links: [Document](https://dx.doi.org/10.1257/000282805775014236), [Link](https://www.aeaweb.org/articles?id=10.1257/000282805775014236)\.
- J\. Dong, A\. Roth, Z\. Schutzman, B\. Waggoner, and Z\. S\. Wu \(2018\) Strategic classification from revealed preferences\. In Proceedings of the 2018 ACM Conference on Economics and Computation \(EC\), pp\. 55–70\. External Links: [Document](https://dx.doi.org/10.1145/3219166.3219193)\.
- P\. Dütting, Z\. Feng, H\. Narasimhan, D\. C\. Parkes, and S\. S\. Ravindranath \(2024\) Optimal auctions through deep learning: advances in differentiable economics\. Journal of the ACM 71\(1\)\. External Links: [Document](https://dx.doi.org/10.1145/3630749), [Link](https://doi.org/10.1145/3630749)\.
- K\. Eggleston \(2005\) Multitasking and mixed systems for provider payment\. Journal of Health Economics 24\(1\), pp\. 211–223\. External Links: [Document](https://dx.doi.org/10.1016/j.jhealeco.2004.09.001)\.
- F\. Eijkenaar, M\. Emmert, M\. Scheppach, and O\. Schöffski \(2013\) Effects of pay for performance in health care: a systematic review of systematic reviews\. Health Policy 110\(2–3\), pp\. 115–130\. External Links: [Document](https://dx.doi.org/10.1016/j.healthpol.2013.01.008)\.
- R\. P\. Ellis \(1998\) Creaming, skimping and dumping: provider competition on the intensive and extensive margins\. Journal of Health Economics 17\(5\), pp\. 537–555\. External Links: [Document](https://dx.doi.org/10.1016/S0167-6296%2897%2900042-8)\.
- M\. Geruso and T\. Layton \(2020\) Upcoding: evidence from Medicare on squishy risk adjustment\. Journal of Political Economy 128\(3\), pp\. 984–1026\. External Links: [Document](https://dx.doi.org/10.1086/704756)\.
- O\. Gottesman, F\. Johansson, M\. Komorowski, A\. Faisal, D\. Sontag, F\. Doshi\-Velez, and L\. A\. Celi \(2019\) Guidelines for reinforcement learning in healthcare\. Nature Medicine 25\(1\), pp\. 16–18\. External Links: [Document](https://dx.doi.org/10.1038/s41591-018-0310-5), [Link](https://pubmed.ncbi.nlm.nih.gov/30617332/)\.
- L\. V\. Green \(2002\) How many hospital beds?\. Inquiry 39\(4\), pp\. 400–412\. External Links: [Document](https://dx.doi.org/10.5034/inquiryjrnl%5F39.4.400)\.
- L\. V\. Green \(2006\) Queueing analysis in healthcare\. In Patient Flow: Reducing Delay in Healthcare Delivery, R\. W\. Hall \(Ed\.\), pp\. 281–307\. External Links: [Document](https://dx.doi.org/10.1007/978-0-387-33636-7%5F10), [Link](https://doi.org/10.1007/978-0-387-33636-7_10)\.
- M\. Hardt, N\. Megiddo, C\. H\. Papadimitriou, and M\. Wootters \(2016\) Strategic classification\. In Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science \(ITCS\), pp\. 111–122\. External Links: [Document](https://dx.doi.org/10.1145/2840728.2840730)\.
- Healthcare Cost and Utilization Project \(2025\) National inpatient sample overview\. Note: Agency for Healthcare Research and Quality\. External Links: [Link](https://hcup-us.ahrq.gov/nisoverview.jsp)\.
- B\. Holmstrom and P\. Milgrom \(1991\) Multitask principal\-agent analyses: incentive contracts, asset ownership, and job design\. The Journal of Law, Economics, and Organization 7\(Special Issue\), pp\. 24–52\. External Links: [Document](https://dx.doi.org/10.1093/jleo/7.special%5Fissue.24)\.
- S\. Karten, W\. Li, Z\. Ding, S\. Kleiner, Y\. Bai, and C\. Jin \(2025\) LLM Economist: large population models and mechanism design in multi\-agent generative simulacra\. External Links: 2507\.15815, [Link](https://arxiv.org/abs/2507.15815)\.
- M\. Komorowski, L\. A\. Celi, O\. Badawi, A\. C\. Gordon, and A\. A\. Faisal \(2018\) The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care\. Nature Medicine 24\(11\), pp\. 1716–1720\. External Links: [Document](https://dx.doi.org/10.1038/s41591-018-0213-5), [Link](https://pubmed.ncbi.nlm.nih.gov/30349085/)\.
- R\. Kronick and W\. P\. Welch \(2014\) Measuring coding intensity in the Medicare Advantage program\. Medicare & Medicaid Research Review 4\(2\), pp\. E1–E19\. External Links: [Document](https://dx.doi.org/10.5600/mmrr.004.02.a06), [Link](https://www.cms.gov/mmrr/Downloads/MMRR2014_004_02_sa06.pdf)\.
- M\. Kuhn and L\. Siciliani \(2008\) Upcoding and optimal auditing in health care \(or the economics of DRG creep\)\. CEPR Discussion Paper, Technical Report 6689, Centre for Economic Policy Research\. External Links: [Link](https://cepr.org/publications/dp6689)\.
- J\. Lehman, J\. Gordon, S\. Jain, K\. Ndousse, C\. Yeh, and K\. O\. Stanley \(2022\) Evolution through large models\. External Links: 2206\.08896, [Link](https://arxiv.org/abs/2206.08896)\.
- S\. Levanon and N\. Rosenfeld \(2021\) Strategic classification made practical\. In International Conference on Machine Learning \(ICML\), PMLR, Vol\. 139, pp\. 6243–6253\. External Links: [Link](https://proceedings.mlr.press/v139/levanon21a.html)\.
- C\. A\. Ma \(1994\) Health care payment systems: cost and quality incentives\. Journal of Economics & Management Strategy 3\(1\), pp\. 93–112\. External Links: [Document](https://dx.doi.org/10.1111/j.1430-9134.1994.00093.x)\.
- Y\. J\. Ma, W\. Liang, G\. Wang, D\. Huang, O\. Bastani, D\. Jayaraman, Y\. Zhu, L\. Fan, and A\. Anandkumar \(2024\) Eureka: human\-level reward design via coding large language models\. In International Conference on Learning Representations \(ICLR\)\. External Links: [Link](https://arxiv.org/abs/2310.12931)\.
- D\. Manheim and S\. Garrabrant \(2018\) Categorizing variants of Goodhart’s law\. arXiv preprint arXiv:1803\.04585\. External Links: [Link](https://arxiv.org/abs/1803.04585)\.
- A\. Novikov, N\. Vu, M\. Eisenberger, E\. Dupont, P\. Huang, A\. Z\. Wagner, S\. Shirobokov, B\. Kozlovskii, F\. J\. R\. Ruiz, A\. Mehrabian, M\. P\. Kumar, A\. See, S\. Chaudhuri, G\. Holland, A\. Davies, S\. Nowozin, P\. Kohli, and M\. Balog \(2025\) AlphaEvolve: a coding agent for scientific and algorithmic discovery\. External Links: 2506\.13131, [Document](https://dx.doi.org/10.48550/arXiv.2506.13131), [Link](https://arxiv.org/abs/2506.13131)\.
- J\. S\. Park, J\. C\. O’Brien, C\. J\. Cai, M\. R\. Morris, P\. Liang, and M\. S\. Bernstein \(2023\) Generative agents: interactive simulacra of human behavior\. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology \(UIST\)\. External Links: [Document](https://dx.doi.org/10.1145/3586183.3606763)\.
- J\. C\. Perdomo, T\. Zrnic, C\. Mendler\-Dünner, and M\. Hardt \(2020\) Performative prediction\. In International Conference on Machine Learning \(ICML\), PMLR, Vol\. 119, pp\. 7599–7609\. External Links: [Link](https://proceedings.mlr.press/v119/perdomo20a.html)\.
- C\. Propper, M\. Sutton, C\. Whitnall, and F\. Windmeijer \(2010\) Incentives and targets in hospital care: evidence from a natural experiment\. Journal of Public Economics 94\(3–4\), pp\. 318–335\. External Links: [Document](https://dx.doi.org/10.1016/j.jpubeco.2010.01.002), [Link](https://doi.org/10.1016/j.jpubeco.2010.01.002)\.
- B\. Romera\-Paredes, M\. Barekatain, A\. Novikov, M\. Balog, M\. P\. Kumar, E\. Dupont, F\. J\. R\. Ruiz, J\. S\. Ellenberg, P\. Wang, O\. Fawzi, P\. Kohli, and A\. Fawzi \(2024\) Mathematical discoveries from program search with large language models\. Nature 625, pp\. 468–475\. External Links: [Document](https://dx.doi.org/10.1038/s41586-023-06924-6), [Link](https://www.nature.com/articles/s41586-023-06924-6)\.
- C\. Rudin \(2019\) Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead\. Nature Machine Intelligence 1\(5\), pp\. 206–215\. External Links: [Document](https://dx.doi.org/10.1038/s42256-019-0048-x), [Link](https://doi.org/10.1038/s42256-019-0048-x)\.
- T\. Sandholm \(2003\) Automated mechanism design: a new application area for search algorithms\. In Principles and Practice of Constraint Programming – CP 2003, Lecture Notes in Computer Science, Vol\. 2833, pp\. 19–36\. External Links: [Document](https://dx.doi.org/10.1007/978-3-540-45193-8%5F2), [Link](https://doi.org/10.1007/978-3-540-45193-8_2)\.
- E\. Silverman and J\. Skinner \(2004\) Medicare upcoding and hospital ownership\. Journal of Health Economics 23\(2\), pp\. 369–389\. External Links: [Document](https://dx.doi.org/10.1016/j.jhealeco.2003.09.007)\.
- J\. Skalse, N\. H\. R\. Howe, D\. Krasheninnikov, and D\. Krueger \(2022\) Defining and characterizing reward hacking\. In Advances in Neural Information Processing Systems \(NeurIPS\)\. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2022/hash/3d719fee332caa23d5038b8a90e81796-Abstract-Conference.html)\.
- U\.S\. Food and Drug Administration, Health Canada, and Medicines and Healthcare products Regulatory Agency \(2021\) Good machine learning practice for medical device development: guiding principles\. Note: Guidance document\. External Links: [Link](https://www.fda.gov/media/153486/download)\.
- P\. Van Herck, D\. De Smedt, L\. Annemans, R\. Remmen, M\. B\. Rosenthal, and W\. Sermeus \(2010\) Systematic review: effects, design choices, and context of pay\-for\-performance in health care\. BMC Health Services Research 10, pp\. 247\. External Links: [Document](https://dx.doi.org/10.1186/1472-6963-10-247)\.
- A\. S\. Vezhnevets, J\. P\. Agapiou, A\. Aharon, R\. Ziv, J\. Matyas, E\. A\. Duéñez\-Guzmán, W\. A\. Cunningham, S\. Osindero, D\. Karmon, and J\. Z\. Leibo \(2023\) Generative agent\-based modeling with actions grounded in physical, social, or digital space using Concordia\. External Links: 2312\.03664, [Link](https://arxiv.org/abs/2312.03664)\.
- C\. Yang, X\. Wang, Y\. Lu, H\. Liu, Q\. V\. Le, D\. Zhou, and X\. Chen \(2023\) Large language models as optimizers\. arXiv preprint arXiv:2309\.03409\. External Links: 2309\.03409, [Document](https://dx.doi.org/10.48550/arXiv.2309.03409), [Link](https://arxiv.org/abs/2309.03409)\.
- C\. Yu, J\. Liu, S\. Nemati, and G\. Yin \(2021\) Reinforcement learning in healthcare: a survey\. ACM Computing Surveys 55\(1\), pp\. 1–36\. External Links: [Document](https://dx.doi.org/10.1145/3477600)\.
- S\. Zheng, A\. Trott, S\. Srinivasa, N\. Naik, M\. Gruesbeck, D\. C\. Parkes, and R\. Socher \(2022\) The AI economist: taxation policy design via two\-level deep multiagent reinforcement learning\. Science Advances 8\(18\), pp\. eabk2607\. External Links: [Document](https://dx.doi.org/10.1126/sciadv.abk2607)\.

## Appendix Contents

Appendix guide\.

- Appendix [A](https://arxiv.org/html/2605.30680#A1): execution and search algorithms\.
- Appendix [B](https://arxiv.org/html/2605.30680#A2): formal Stackelberg definitions\.
- Appendix [C](https://arxiv.org/html/2605.30680#A3): environment functional forms\.
- Appendix [D](https://arxiv.org/html/2605.30680#A4): related work\.
- Appendix [E](https://arxiv.org/html/2605.30680#A5): per\-channel L1 phase\-diagram anatomy\.
- Appendices [F](https://arxiv.org/html/2605.30680#A6)–[G](https://arxiv.org/html/2605.30680#A7): additional L2 diagnostics\.
- Appendix [H](https://arxiv.org/html/2605.30680#A8): L1/L2 setup and hyperparameters\.
- Appendix [I](https://arxiv.org/html/2605.30680#A9): external stylized\-fact validation\.
- Appendix [J](https://arxiv.org/html/2605.30680#A10): L3 search diagnostics and discovered policies\.
- Appendix [K](https://arxiv.org/html/2605.30680#A11): implementation details and DSL guardrails\.
- Appendix [L](https://arxiv.org/html/2605.30680#A12): terminology and arrival\-process note\.

## Appendix A Execution and Search Algorithms

Algorithm [A\.1](https://arxiv.org/html/2605.30680#alg1) expands the IPS diagram in Figure [1](https://arxiv.org/html/2605.30680#S3.F1) into the period\-by\-period simulator loop\. Algorithm [A\.2](https://arxiv.org/html/2605.30680#alg2) gives the outer optimization loop used by L3\.

Algorithm A\.1 Medi\-Sim simulator execution loop

1. Input: hospital capacity and queues, hospital provider teams, hospital administrator policy, reimbursement module, KPI/bonus rule, initial state $X_1$\.
2. for $t=1,\ldots,T$ do
3. Hospital policy step: the hospital administrator sets period levers $u_t$, including incentives, capacity, bonus pool, KPI weights, audit intensity, and optional routing or steering parameters\.
4. Patient generation: sample arrivals $\mathcal{P}_t$ from the configured arrival process and assign latent clinical attributes such as true group, urgency, tolerance, and true complexity $\mathrm{CMI}_i^{\star}$\.
5. Identify / coding: hospital coding staff map each arriving patient’s clinical presentation to a billable group $\hat{g}_{i,t}$; the gap between true and coded complexity realizes the coding wedge\.
6. Routing and queueing: register or route patients to hospital care\-team queues, carry over backlog, and increment waiting time for patients not served in the current period\.
7. for each hospital care team $j$ do
8. Observe local queue $\mathcal{Q}_{j,t}$, capacity, fatigue, incentives, and prior KPI signals\.
9. Choose provider\-side actions: accept/defer/reject, effort $E_{ij,t}$ for accepted cases, and resource request $R_{ij,t}$ under the response class $\Pi_P$\.
10. end for
11. Capacity resolution and treatment: enforce hard capacity constraints, treat selected patients, keep unserved patients queued or rejected according to the triage decision, and realize health output $H_{ij,t}$ and cost $C_{ij,t}$\.
12. Settlement: compute reimbursement from the reported billing group $\hat{g}_{i,t}$, aggregate margin and KPI scores, apply audit penalties, and allocate bonuses\.
13. State update: update funds, reputation, queues, provider fatigue, KPI history, and the next hospital state $X_{t+1}$\.
14. end for
15. Output: rollout trajectory, hospital administrator return, and channel\-level diagnostics\.
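The period structure of Algorithm A.1 can be sketched in a few lines. The toy below is an illustration under strong assumptions, not the Medi-Sim implementation: it uses a single FIFO queue, a fixed coding wedge instead of a strategic coding choice, and omits team-level accept/defer/reject decisions, effort, KPI scoring, and bonuses. `Patient`, `Levers`, the arrival distribution, and the audit penalty multiplier are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Patient:
    true_cmi: float          # latent complexity (CMI star in the algorithm)
    coded_cmi: float = 0.0   # complexity implied by the billed group
    wait: int = 0            # periods spent waiting in the queue


@dataclass
class Levers:
    """Hypothetical, heavily reduced stand-in for the period levers u_t."""
    capacity: int = 3
    audit_prob: float = 0.1
    upcode_rate: float = 0.2  # assumed fixed coding wedge, not a strategic choice


def run_rollout(T: int, levers: Levers, seed: int = 0) -> Tuple[float, float, float]:
    """Return (funds, mean coding wedge, mean wait of treated patients); needs T >= 1."""
    rng = random.Random(seed)
    funds, wedge_sum, wait_sum, n_treated = 0.0, 0.0, 0, 0
    queue: List[Patient] = []
    for _ in range(T):
        # Patient generation: 1 to 5 arrivals with latent complexity.
        arrivals = [Patient(rng.uniform(0.5, 2.0)) for _ in range(rng.randint(1, 5))]
        # Identify / coding: billed complexity may exceed true complexity.
        for p in arrivals:
            p.coded_cmi = p.true_cmi * (1.0 + levers.upcode_rate)
        # Routing and queueing: one FIFO queue that carries over backlog.
        queue.extend(arrivals)
        # Capacity resolution and treatment: serve up to `capacity` patients.
        treated, queue = queue[:levers.capacity], queue[levers.capacity:]
        for p in queue:
            p.wait += 1
        # Settlement: pay on coded complexity, claw back audited wedges.
        for p in treated:
            wedge = p.coded_cmi - p.true_cmi
            funds += p.coded_cmi
            if rng.random() < levers.audit_prob:
                funds -= 2.0 * wedge  # assumed audit penalty multiplier
            wedge_sum += wedge
            wait_sum += p.wait
            n_treated += 1
    return funds, wedge_sum / n_treated, wait_sum / n_treated


gamed = run_rollout(50, Levers(), seed=1)
honest = run_rollout(50, Levers(upcode_rate=0.0), seed=1)
```

Returning channel-level quantities (coding wedge, waiting) next to funds mirrors the algorithm's output line, which reports diagnostics alongside the administrator return.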

Algorithm A\.2 Closed\-loop policy search over Medi\-Sim rollouts

1. Input: initial policy library $\mathcal{L}_0$, program\-search operator, validation seeds, held\-out test seeds, rollout horizon $T$\.
2. for $k=1,\ldots,K$ do
3. Deploy: select or propose a typed hospital policy program $\pi_A^{(k)}$ over the allowed DSL fields\.
4. Simulate: evaluate $\pi_A^{(k)}$ by running Algorithm [A\.1](https://arxiv.org/html/2605.30680#alg1) on stochastic validation rollouts\.
5. Score: compute scalar return $G$, safety/distortion penalty $V$, variance penalty, and diagnostic metrics\.
6. Update: retain or mutate policy programs according to validation fitness and the search operator\.
7. end for
8. Select the best validation candidate and evaluate it once on held\-out seeds\.
9. Output: searched policy, held\-out rollout profile, and search diagnostics\.
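The outer loop of Algorithm A.2 can be sketched as a small evolutionary search. Everything here is a stand-in chosen for illustration: `toy_rollout` replaces Algorithm A.1, random perturbation replaces the LLM-guided program-search operator, policies are plain dictionaries rather than typed DSL programs, and the fitness form (mean of $G - \lambda V$ minus a spread penalty) is an assumption, since the exact estimator is defined elsewhere in the paper.

```python
import random
import statistics
from typing import Dict, Iterable, List, Tuple

Policy = Dict[str, float]


def toy_rollout(policy: Policy, seed: int) -> Tuple[float, float]:
    """Stand-in for Algorithm A.1: returns (return G, penalty V)."""
    noise = random.Random(seed).gauss(0.0, 0.05)
    G = 10.0 * policy["alpha"] + noise                      # return rises with incentive
    V = 8.0 * max(policy["alpha"] - policy["audit"], 0.0)   # distortion unless audited
    return G, V


def fitness(policy: Policy, seeds: Iterable[int], lam: float = 1.0, eta: float = 0.5) -> float:
    """Assumed fitness: mean penalized return minus a variance-style penalty."""
    scores = [g - lam * v for g, v in (toy_rollout(policy, s) for s in seeds)]
    return statistics.mean(scores) - eta * statistics.pstdev(scores)


def mutate(policy: Policy, rng: random.Random) -> Policy:
    """Perturb one field and clip to [0, 1]; a stand-in for an LLM-proposed edit."""
    key = rng.choice(sorted(policy))
    child = dict(policy)
    child[key] = min(1.0, max(0.0, child[key] + rng.uniform(-0.2, 0.2)))
    return child


def search(library: List[Policy], K: int, val_seeds, test_seeds, seed: int = 0):
    rng = random.Random(seed)
    population = list(library)                      # warm-start library
    for _ in range(K):
        parent = max(population, key=lambda p: fitness(p, val_seeds))
        population.append(mutate(parent, rng))      # retain parents, add mutant
    best = max(population, key=lambda p: fitness(p, val_seeds))
    return best, fitness(best, test_seeds)          # held-out seeds used once


library = [{"alpha": 0.2, "audit": 0.0}, {"alpha": 0.8, "audit": 0.1}]
best, held_out = search(library, K=30, val_seeds=range(5), test_seeds=range(100, 105))
```

Selection uses validation seeds only, and the held-out seeds are touched once at the end, matching the last two steps of the algorithm.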

The closed loop contains two feedback channels\. Within\-rollout feedback is economic and operational: hospital incentives and routing rules shape coding, triage, treatment effort, congestion, settlement, and the next hospital state\. Across\-rollout feedback is learning\-driven: the program\-search operator updates the distribution of candidate hospital policies after observing rollout scores and diagnostics, which shifts the strategic and congestion regime explored by later rollouts\.

## Appendix BFormal Definitions for the Stackelberg Formulation

This appendix expands the compact statement of §[2](https://arxiv.org/html/2605.30680#S2)\.

#### Hospital state\.

The state at time $t$ is

$$
X_{t}=\big(F_{t},\ Q_{t},\ \{\mathcal{Q}_{j,t}\}_{j=1}^{J},\ \mathrm{KPI}_{t-1},\ \mathrm{Rep}_{t}\big), \tag{6}
$$

where $F_{t}$ denotes funds, $Q_{t}$ summarizes systemwide congestion, $\mathcal{Q}_{j,t}$ is the queue registered to team $j$, $\mathrm{KPI}_{t-1}$ is the previous-period measured performance vector, and $\mathrm{Rep}_{t}$ is public reputation. The full per-patient state, the patient stream, and the audit signal are observable through these summaries and the exposed state features described in §[3](https://arxiv.org/html/2605.30680#S3).

#### Leader action\.

The hospital administrator’s mechanism action is

$$
u_{t}=\big(\alpha_{t},\ \beta_{t},\ B^{\mathrm{tot}}_{t},\ B^{\mathrm{flex}}_{t},\ B^{\mathrm{pool}}_{t},\ w_{H},\ w_{W},\ w_{\mathrm{rej}},\ w_{C},\ \kappa,\ q_{t},\ \xi_{t}\big), \tag{7}
$$

with the per-component semantics given in §[2](https://arxiv.org/html/2605.30680#S2). The flexible capacity pool $B^{\mathrm{flex}}_{t}$ is reallocatable across care teams, and KPI steering through $\xi_{t}$ assigns that capacity using measured performance scores; see Appendix [L](https://arxiv.org/html/2605.30680#A12).

#### Safety/distortion penalty\.

The penalty term $V$ in Eq. ([4](https://arxiv.org/html/2605.30680#S2.E4)) aggregates per-step penalties for unsafe waiting, excessive high-complexity deferral, up-coding, rejection, and insolvency; explicit weights are listed in App. [K](https://arxiv.org/html/2605.30680#A11).

#### Empirical fitness estimator\.

The search fitness used in §[4](https://arxiv.org/html/2605.30680#S4) is the empirical estimator of Eq. ([4](https://arxiv.org/html/2605.30680#S2.E4)),

$$
\mathrm{Fitness}(\pi)=\mathbb{E}_{s}[G(\pi;s)]-\lambda_{\mathrm{unsafe}}\,V(\pi)-\lambda_{\mathrm{var}}\,\mathrm{Var}_{s}\big(G(\pi;s)\big), \tag{8}
$$

where $G(\pi;s)\equiv G^{o}(\pi_{A},\pi_{P}^{*};s)$ instantiates the per-objective discounted return of Eq. ([3](https://arxiv.org/html/2605.30680#S2.E3)) with $\pi_{P}^{*}\in\Pi_{P}(\pi_{A})$ taken as the bounded-rationality best response.
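As a rough illustration of Eq. (8), the estimator can be computed from per-seed returns as follows. The function name, the example numbers, and the choice of population variance across seeds are assumptions made for this sketch; the text does not specify which variance estimator the search uses.

```python
import statistics


def empirical_fitness(returns, penalty, lam_unsafe, lam_var):
    """Eq. (8): mean return, minus weighted penalty, minus weighted variance."""
    mean_return = statistics.fmean(returns)
    # Population variance across seeds (an assumption of this sketch).
    var_return = statistics.pvariance(returns)
    return mean_return - lam_unsafe * penalty - lam_var * var_return


# Hypothetical per-seed returns G(pi; s) for three validation seeds.
example_returns = [1.0, 2.0, 3.0]
fitness = empirical_fitness(
    example_returns, penalty=0.5, lam_unsafe=2.0, lam_var=1.0
)
```

With these illustrative inputs the mean is 2, the penalty term contributes 1, and the variance term contributes 2/3, so a candidate with a high average return can still rank poorly if it is unsafe or unstable across seeds.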

#### Mixed\-objective per\-step reward\.

For the mixed objective, one\-step reward log\-scales funds before combining them with reputation:

$$
r_{t}^{\mathrm{mix}}=0.5\cdot\frac{\log(F_{t})}{10}+0.5\cdot\mathrm{Rep}_{t}. \tag{9}
$$

This return enters the search fitness only and is not interpreted as a standalone welfare or economic-value measure.
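A small worked sketch of Eq. (9) shows how strongly the log scaling compresses funds. The natural logarithm and the error raised for non-positive funds are assumptions of this sketch; the text states neither the log base nor the insolvency handling.

```python
import math


def mixed_reward(funds, reputation):
    """Eq. (9): 0.5 * log(F_t) / 10 + 0.5 * Rep_t (natural log assumed)."""
    if funds <= 0:
        # Handling of non-positive funds is not specified in the source.
        raise ValueError("funds must be positive for the log term")
    return 0.5 * math.log(funds) / 10.0 + 0.5 * reputation


# Funds values of very different magnitude, reputation held at zero.
funds_term_high = mixed_reward(5451.7, 0.0)
funds_term_low = mixed_reward(15.2, 0.0)
```

Under these assumptions, funds that differ by a factor of several hundred produce funds terms that differ by well under one unit, so reputation remains visible in the combined reward.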

## Appendix C Environment Functional Forms

This appendix collects the closed\-form expressions referenced in §[3](https://arxiv.org/html/2605.30680#S3)\.

#### Arrival process\.

The default arrival process is homogeneous Poisson,

$$
N_{t}=|\mathcal{P}_{t}|\sim\mathrm{Poisson}(\lambda_{\mathrm{arr}}), \tag{10}
$$

with per-patient attributes drawn as described in §[3.1](https://arxiv.org/html/2605.30680#S3.SS1). Appendix [L](https://arxiv.org/html/2605.30680#A12) states when non-Poisson kernels can be substituted.
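For readers who want to reproduce the arrival draw of Eq. (10) without external libraries, the following sketch samples per-period arrival counts using Knuth's multiplication method. The rate of 4.0 and the 200-period horizon are illustrative values chosen here, not parameters taken from the paper's configuration.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson(rate) count with Knuth's multiplication method.

    Suitable for moderate rates; exp(-rate) underflows for very large ones.
    """
    threshold = math.exp(-rate)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


rng = random.Random(0)
# One arrival count per period over an illustrative 200-period rollout.
arrivals = [sample_poisson(4.0, rng) for _ in range(200)]
```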

#### Coding candidate score\.

For candidate reported group $g$, let $\Delta R_{i}(g)$ be incremental reimbursement and let $\Delta c_{i}(g)$ be coded-complexity inflation. The candidate score is

$$
s_{i}(g)=\alpha_{t}\,\gamma_{\mathrm{code}}\,\Delta R_{i}(g)-p_{\mathrm{audit}}\big(\Delta c_{i}(g)\big)\,\lambda_{\mathrm{pen}}\,[\Delta R_{i}(g)]_{+}-r_{0}\,\eta_{\mathrm{eth}}\,[\Delta c_{i}(g)]_{+}, \tag{11}
$$

where $p_{\mathrm{audit}}(\cdot;q_{t})$ is the host-defined audit-hit function under the configured audit intensity $q_{t}$. The coded group is chosen by the configured score-based rule: a softmax over $\{s_{i}(g)\}$ in stochastic response mode, and its zero-temperature $\arg\max_{g}s_{i}(g)$ variant in deterministic L3 search/evaluation. Both variants use the same candidate score; intermediate aggregate up-coding arises from patient heterogeneity and incentive-dependent score comparisons rather than from a separate coding mechanism.
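The score and the two selection modes can be sketched as below. The linear, capped audit-hit function `min(1, q * delta_c)` and every numeric value are hypothetical choices for illustration; the paper describes $p_{\mathrm{audit}}$ only as host-defined.

```python
import math
import random


def candidate_score(alpha, gamma_code, delta_r, delta_c, p_audit,
                    lam_pen, r0, eta_eth):
    """Eq. (11) for one candidate group; p_audit is a callable of delta_c."""
    pos_r = max(delta_r, 0.0)
    pos_c = max(delta_c, 0.0)
    return (alpha * gamma_code * delta_r
            - p_audit(delta_c) * lam_pen * pos_r
            - r0 * eta_eth * pos_c)


def choose_group(scores, temperature, rng):
    """Softmax choice over candidate scores; temperature 0 gives the arg max."""
    groups = list(scores)
    if temperature == 0:
        return max(groups, key=scores.get)
    top = max(scores.values())  # subtract the max for numerical stability
    weights = [math.exp((scores[g] - top) / temperature) for g in groups]
    return rng.choices(groups, weights=weights, k=1)[0]


def scores_under_audit(q):
    """Scores for a truthful and an up-coded candidate at audit intensity q."""
    def p_audit(delta_c):
        # Hypothetical audit-hit form: linear in inflation, capped at 1.
        return min(1.0, q * max(delta_c, 0.0))

    common = dict(alpha=0.8, gamma_code=1.0, p_audit=p_audit,
                  lam_pen=2.0, r0=1.0, eta_eth=1.0)
    return {
        "truthful": candidate_score(delta_r=0.0, delta_c=0.0, **common),
        "upcoded": candidate_score(delta_r=10.0, delta_c=1.0, **common),
    }


rng = random.Random(0)
choice_low_audit = choose_group(scores_under_audit(0.1), 0, rng)
choice_high_audit = choose_group(scores_under_audit(0.5), 0, rng)
```

In this toy setting the up-coded group wins at low audit intensity and loses at high intensity, which is the single-channel effect of the audit lever; it says nothing about where the displaced pressure goes.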

#### Treatment production and cost\.

For an accepted patient $i$ treated by unit $j$, true health output follows a diminishing-return production,

$$
H_{ij,t}=\mathrm{Skill}_{j}\left(1-\exp\!\left[-\lambda\,\frac{E_{ij,t}}{\mathrm{CMI}_{i}^{\star}+\epsilon}\right]\right)R_{ij,t}, \tag{12}
$$

with effort $E_{ij,t}\geq 0$ and resource indicator $R_{ij,t}\in\{0,1\}$. Cost is convex in effort,

$$
C_{ij,t}=C_{\mathrm{fixed}}\,R_{ij,t}+\omega\,E_{ij,t}^{\phi},\qquad\phi>1. \tag{13}
$$
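A short sketch of Eqs. (12) and (13) makes the saturation visible: each extra unit of effort buys less health and costs more. All default parameter values below are placeholders chosen for illustration, not the calibrated values used in the paper.

```python
import math


def health_output(skill, effort, true_cmi, resource, lam=1.0, eps=1e-6):
    """Eq. (12): saturating health production, bounded above by skill."""
    return skill * (1.0 - math.exp(-lam * effort / (true_cmi + eps))) * resource


def treatment_cost(effort, resource, c_fixed=1.0, omega=1.0, phi=2.0):
    """Eq. (13): fixed cost when resourced, plus a convex effort cost (phi > 1)."""
    return c_fixed * resource + omega * effort ** phi


# Sweep effort for one patient with true CMI 1.0 treated by a unit-skill team.
efforts = [1.0, 2.0, 3.0]
healths = [health_output(1.0, e, true_cmi=1.0, resource=1) for e in efforts]
costs = [treatment_cost(e, resource=1) for e in efforts]
health_per_cost = [h / c for h, c in zip(healths, costs)]
```

With these placeholder values, health per unit cost falls at every step of the sweep, the same qualitative pattern the text attributes to the quality-driven corner.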

#### Bonus tournament and marginal pressure\.

Bonuses are allocated through a softmax tournament with sharpness $\kappa$ and pool $B^{\mathrm{pool}}_{t}$,

$$
s_{j,t}=\frac{\exp(\kappa\,\mathrm{KPI}_{j,t})}{\sum_{\ell=1}^{J}\exp(\kappa\,\mathrm{KPI}_{\ell,t})}, \tag{14}
$$

$$
\mathrm{Bonus}_{j,t}=B^{\mathrm{pool}}_{t}\,s_{j,t}, \tag{15}
$$

with local marginal bonus pressure

$$
\frac{\partial\mathrm{Bonus}_{j,t}}{\partial\mathrm{KPI}_{j,t}}=B^{\mathrm{pool}}_{t}\,\kappa\,s_{j,t}(1-s_{j,t}), \tag{16}
$$

which feeds back into effort, triage, and KPI-targeting behavior.
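The tournament in Eqs. (14) to (16) can be checked numerically. The sketch below computes shares and bonuses and compares the closed-form marginal pressure with a central finite difference; the KPI values, sharpness, and pool size are arbitrary illustrative numbers.

```python
import math


def tournament_shares(kpis, kappa):
    """Eq. (14): softmax shares over team KPI scores."""
    top = max(kpis)  # subtract the max for numerical stability
    weights = [math.exp(kappa * (k - top)) for k in kpis]
    total = sum(weights)
    return [w / total for w in weights]


def bonuses(kpis, kappa, pool):
    """Eq. (15): each team's bonus is its share of the pool."""
    return [pool * s for s in tournament_shares(kpis, kappa)]


def marginal_pressure(kpis, kappa, pool, j):
    """Eq. (16): derivative of team j's bonus with respect to its own KPI."""
    s = tournament_shares(kpis, kappa)[j]
    return pool * kappa * s * (1.0 - s)


def finite_difference(kpis, kappa, pool, j, h=1e-6):
    """Central difference of team j's bonus in its own KPI."""
    up = list(kpis)
    down = list(kpis)
    up[j] += h
    down[j] -= h
    return (bonuses(up, kappa, pool)[j] - bonuses(down, kappa, pool)[j]) / (2 * h)


kpis = [0.2, 0.5, 0.9]
analytic = marginal_pressure(kpis, kappa=3.0, pool=10.0, j=1)
numeric = finite_difference(kpis, kappa=3.0, pool=10.0, j=1)
```

The two values agree to numerical precision, and the bonuses always sum to the pool, so the tournament redistributes a fixed budget rather than creating new funds.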

## Appendix D Related Work

#### AI and Reinforcement Learning for Clinical Decision\-Making\.

A substantial line of work treats healthcare as a sequential decision problem at the patient level, learning treatment policies from electronic health records under offline RL formulations. The AI Clinician learns vasopressor and fluid policies for sepsis from MIMIC-III and reports lower in-hospital mortality relative to observed clinician behavior (Komorowski et al., [2018](https://arxiv.org/html/2605.30680#bib.bib27)), and broader surveys catalogue similar formulations for chronic disease management, anesthesia, and ventilation (Yu et al., [2021](https://arxiv.org/html/2605.30680#bib.bib42)). Methodological guidelines emphasize off-policy evaluation, distributional shift, and reward specification as the core obstacles to safe clinical deployment (Gottesman et al., [2019](https://arxiv.org/html/2605.30680#bib.bib26)). This literature optimizes the *clinician’s* action on a single patient and assumes a fixed reward and a fixed environment. Medi-Sim operates one level above this layer: the action is a hospital administrative rule (incentive coefficients, audit probability, capacity allocation, hospital KPI weights), the environment is a population of strategic providers, and the realized policy is the composition of the hospital administrator’s mechanism with the providers’ best response. Patient-level clinical RL is thus complementary to, not a substitute for, the mechanism-level question we study.

#### Healthcare Incentives and Operations\.

Provider behavior under payment rules has been studied for three decades through agency theory and operational flow. Cost and quality incentives under prospective payment can produce creaming, skimping, dumping, and intensity shifts (Ellis, [1998](https://arxiv.org/html/2605.30680#bib.bib8); Ma, [1994](https://arxiv.org/html/2605.30680#bib.bib7); Eggleston, [2005](https://arxiv.org/html/2605.30680#bib.bib12)). Empirical studies on Diagnosis-Related Groups confirm that administrative pricing shifts directly alter case mix and coding (Dafny, [2005](https://arxiv.org/html/2605.30680#bib.bib9); Silverman and Skinner, [2004](https://arxiv.org/html/2605.30680#bib.bib10)), and more recent evidence from Medicare Advantage shows that risk-adjusted diagnosis-based reimbursement raises reported risk scores by 6–16% without commensurate change in underlying morbidity (Geruso and Layton, [2020](https://arxiv.org/html/2605.30680#bib.bib33)). On the operational side, hospital flow is highly sensitive to capacity: queueing analyses show wait times escalate non-linearly near full utilization (Green, [2002](https://arxiv.org/html/2605.30680#bib.bib18), [2006](https://arxiv.org/html/2605.30680#bib.bib19); Cayirli and Veral, [2003](https://arxiv.org/html/2605.30680#bib.bib44)), and flexible bed allocations partially mitigate the load under realistic routing constraints (Bekker et al., [2017](https://arxiv.org/html/2605.30680#bib.bib20)). Healthcare-specific simulation has been used for decades to study these flows, but predominantly as isolated discrete-event models without strategic agents (Brailsford et al., [2009](https://arxiv.org/html/2605.30680#bib.bib43)). Medi-Sim integrates this macroeconomic payment layer and microeconomic operational layer into a single executable closed loop in which coding, capacity, and revenue interact.

#### Measurement, Auditing, and Strategic Gaming\.

Prospective payment introduces an informational wedge between ground-truth clinical complexity and reported billing groups. Pervasive up-coding has been documented across ownership structures (Silverman and Skinner, [2004](https://arxiv.org/html/2605.30680#bib.bib10); Geruso and Layton, [2020](https://arxiv.org/html/2605.30680#bib.bib33)), and economic audit theory characterizes how monitoring rates and penalties govern this “DRG creep” (Kuhn and Siciliani, [2008](https://arxiv.org/html/2605.30680#bib.bib11)). Performance measurement creates a parallel friction: when scored metrics are imperfect proxies for the planner’s objective, high-powered incentives intensify multitasking distortions and target gaming (Holmstrom and Milgrom, [1991](https://arxiv.org/html/2605.30680#bib.bib13); Baker, [1992](https://arxiv.org/html/2605.30680#bib.bib14); Bevan and Hood, [2006](https://arxiv.org/html/2605.30680#bib.bib15)). Systematic reviews of pay-for-performance in healthcare reach a consistent verdict that contract design and local context determine whether bonuses improve care or merely move metrics (Eijkenaar et al., [2013](https://arxiv.org/html/2605.30680#bib.bib16); Van Herck et al., [2010](https://arxiv.org/html/2605.30680#bib.bib17)). Formal taxonomies of Goodhart-style failures (regressional, extremal, causal, adversarial) clarify why such gaming is structural (Manheim and Garrabrant, [2018](https://arxiv.org/html/2605.30680#bib.bib34)). Medi-Sim models the hospital coding wedge and the hospital measurement wedge as dynamic, interactive processes, attributing each distortion to the agent that actually makes the choice (clinician/hospital coder for up-coding, triage for selection and delay, hospital administrator for routing and incentive design), so that channel-level distortion becomes a measurable property of the rollout.

#### Strategic Machine Learning and Algorithmic Mechanism Design\.

Our problem interfaces directly with the machine-learning literature on agents that respond to a deployed rule. *Strategic classification* formalizes Stackelberg learning against a manipulable agent (Hardt et al., [2016](https://arxiv.org/html/2605.30680#bib.bib30)) and has been extended to revealed-preference observations (Dong et al., [2018](https://arxiv.org/html/2605.30680#bib.bib36)) and to practical, end-to-end differentiable training pipelines (Levanon and Rosenfeld, [2021](https://arxiv.org/html/2605.30680#bib.bib37)). *Performative prediction* generalizes this to settings in which the deployed model itself shifts the data distribution and characterizes performative equilibria (Perdomo et al., [2020](https://arxiv.org/html/2605.30680#bib.bib31)). The reward-shaping side of this literature gives formal notions of *reward hacking*: optimizing an imperfect proxy can never be made safe by narrowing the reward function alone under mild structural assumptions (Skalse et al., [2022](https://arxiv.org/html/2605.30680#bib.bib35)). In parallel, differentiable economics has used deep architectures to construct optimal multi-dimensional auctions from data (Dütting et al., [2024](https://arxiv.org/html/2605.30680#bib.bib29)). These neural mechanisms achieve strong objective values but yield black-box controllers whose parameters resist verification. Medi-Sim adopts the strategic-learning stance (provider best response is part of the environment) and the algorithmic-mechanism-design objective (synthesize a rule that is robust to that response), but constrains the policy class to read-and-comply programs, which is what regulated healthcare deployment actually demands.

#### Multi\-Agent Simulations and Program\-Space Policy Search\.

Designing macro-policies by modeling micro-agent adaptation has recently scaled through multi-agent reinforcement learning and large-language-model agents. The AI Economist frames optimal taxation as a two-level RL problem in which agents and a planner co-adapt (Zheng et al., [2022](https://arxiv.org/html/2605.30680#bib.bib23)). LLM-based generative agents extend this paradigm to believable populations of social actors (Park et al., [2023](https://arxiv.org/html/2605.30680#bib.bib38)), and frameworks such as Concordia operationalize generative agent-based modeling with grounded actions and an explicit game-master adjudication layer (Vezhnevets et al., [2023](https://arxiv.org/html/2605.30680#bib.bib39)). The LLM-economist line scales mechanism design to large generative simulacra of taxpayers and consumers (Karten et al., [2025](https://arxiv.org/html/2605.30680#bib.bib28)). However, these systems instantiate consumption, taxation, or social interaction as their domain primitives, not capacity-constrained clinical care, and they typically train neural controllers whose decision logic is not directly auditable.
To navigate a non-differentiable and tightly regulated mechanism space without sacrificing inspectability, we build on LLM-guided evolutionary program search: FunSearch surpassed best-known constructions in extremal combinatorics and online bin packing by treating an LLM as a structured mutation operator over code (Romera-Paredes et al., [2024](https://arxiv.org/html/2605.30680#bib.bib21)); ELM showed that LLM-based mutation can drive open-ended evolution in code-defined domains (Lehman et al., [2022](https://arxiv.org/html/2605.30680#bib.bib40)); AlphaEvolve generalizes this to algorithmic discovery across mathematics, hardware, and learning systems (Novikov et al., [2025](https://arxiv.org/html/2605.30680#bib.bib22)); and Eureka demonstrates that the same recipe can author RL reward functions that outperform expert humans on a majority of robotics tasks (Ma et al., [2024](https://arxiv.org/html/2605.30680#bib.bib41)). More broadly, LLMs have been studied as black-box optimizers in their own right (Yang et al., [2023](https://arxiv.org/html/2605.30680#bib.bib32)). Medi-Sim adapts this paradigm to healthcare mechanism design: a candidate hospital policy is a typed executable program over a constrained hospital policy DSL of administrative levers, and the evaluator scores it on stochastic multi-agent rollouts under safety, access, and distortion constraints, yielding policies that are simultaneously high-performing, robust to provider best response, and human-readable.

## Appendix E A Per-Channel Anatomy of the L1 Phase Diagram

The headline phase diagram of §[5.1](https://arxiv.org/html/2605.30680#S5.SS1) reports six top-line outcomes on the $11\times 11$ $(\alpha,\beta)$ grid. Each of those outcomes, however, is a slice through the joint response of five behavioral channels, and reading any single slice in isolation can mislead. The detailed panels in this appendix, including access decomposition, strategic-delay decomposition, clinical budget panels, and team-level KPI panels, decompose each headline outcome into its constituent primitives. Two structural facts emerge that are not visible at the headline resolution.

First, the four\-regime story is preserved at higher resolution: every additional panel either localizes a known regime more sharply or reveals a previously hidden margin without contradicting the headline\.

Second, the channels are arranged in a *substitution lattice*: in regions where one distortion saturates against a structural floor or ceiling, an adjacent channel becomes active to relieve the same shadow price. This lattice is the structural reason the headline regimes are separable and is the strongest piece of evidence we have that the IPS decomposition of §[3](https://arxiv.org/html/2605.30680#S3) is the right factorization for Medi-Sim.

| Regime | $\alpha$ | $\beta$ | $\Delta$rej. | High rej. | Low rej. | Upcode | Effort | Funds |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Low-incentive | 0.0 | 0.0 | 0.241 | 0.542 | 0.301 | 0.014 | 2.128 | 23.4 |
| Profit-driven | 0.8 | 0.2 | 0.182 | 0.182 | 0.000 | 0.226 | 1.688 | 5451.7 |
| Quality-driven | 0.2 | 0.8 | -0.051 | 0.026 | 0.078 | 0.027 | 2.435 | 15.2 |
| Balanced | 0.5 | 0.5 | 0.155 | 0.157 | 0.002 | 0.080 | 1.790 | 3486.6 |

Table 2: Representative L1 regimes. Entries are 30-seed means over horizon $T=200$; $\Delta$rej. denotes high-CMI minus low-CMI rejection. Colored cells mark regime-defining extrema among the four rows.

![Refer to caption](https://arxiv.org/html/2605.30680v1/x3.png)

Figure 4: Cell-annotated L1 phase diagram. Each cell reports the 30-seed mean of the indicated outcome on the $11\times 11$ $(\alpha,\beta)$ grid at horizon $T=200$; $\times$ marks the balanced baseline and $\triangle$/$\triangledown$ mark the per-panel argmax/argmin. The numerical annotations make it possible to verify the regime boundaries discussed in §[E.2](https://arxiv.org/html/2605.30680#A5.SS2)–§[E.5](https://arxiv.org/html/2605.30680#A5.SS5) cell by cell.

### E.1 Reading the annotated phase diagram

This subsection provides cell-level annotations (Figure [4](https://arxiv.org/html/2605.30680#A5.F4)) for the L1 grid (Figure [2](https://arxiv.org/html/2605.30680#S5.F2)), serving as the reference for the detailed decompositions that follow. Three patterns are immediately visible. (i) The cherry-picking index $\Delta_{\mathrm{rej}}$ (panel a) attains its maximum at $(\alpha,\beta)=(0.1,0.0)$ with value $0.31$ and inverts in the low-$\alpha$/high-$\beta$ band, reaching $-0.20$ at $(0.0,0.5)$. The transition generally weakens as quality weight rises, but it is not strictly monotone cell by cell. (ii) The strategic-delay gap $\Delta^{*}_{\mathrm{def}}$ (panel b) is *not* monotone in either variable: its maximum is $0.29$ at $(0.6,0.7)$, well inside the mixed-incentive corridor, and the panel contains a negative stripe at high $\beta$, low $\alpha$ where deferral protects *low*-complexity cases. This pattern is decomposed in the dedicated delay analysis of §[E.3](https://arxiv.org/html/2605.30680#A5.SS3). (iii) The up-coding rate $u$ (panel c) is monotone in $\alpha$ at every $\beta$: values run from about $0.01$ at low $\alpha$ to $0.38$ at $\alpha=1.0$. Variation along $\beta$ at fixed $\alpha$ is much smaller but not zero, reaching about $0.06$ in the highest-$\alpha$ row. Coding is therefore the most nearly one-dimensional channel in the parameter pair $(\alpha,\beta)$, a fact that matters when we discuss the audit-coding-selection substitution in §[E.5](https://arxiv.org/html/2605.30680#A5.SS5). The remaining panels, $\mathrm{Corr}(\mathrm{KPI},H)$, mean effort, and mean waiting, are each interpreted in the dedicated subsections below.

### E.2 The access channel: rejection is only half the story

![Refer to caption](https://arxiv.org/html/2605.30680v1/x4.png)

Figure 5: Access decomposition on the L1 grid. Panels (a) and (b) split rejection by true complexity into high-CMI and low-CMI components, panel (c) is the resulting cherry-picking gap, panels (d) and (e) do the same decomposition for deferral, and panel (f) reproduces the realized mean waiting time. The contrast between (a,b) and (d,e) is the central content of this appendix: as $\alpha$ rises, the access burden migrates from rejection to deferral, but only for high-complexity patients.

Cherry-picking is conventionally measured as the gap between high-complexity and low-complexity rejection rates, and on that summary statistic alone (Figure [4](https://arxiv.org/html/2605.30680#A5.F4)(a)) the profit-driven corner looks unambiguous. The decomposition in Figure [5](https://arxiv.org/html/2605.30680#A5.F5) reveals that this headline hides a subtler structural fact.

At low $\alpha$, both rejection rates are high in absolute terms. Providers lack a positive margin and treat acceptance as a pure fatigue cost, so the gap between the two rates is small (Figure [5](https://arxiv.org/html/2605.30680#A5.F5), panels a–b). As $\alpha$ rises into the profit corner, both rates fall because throughput becomes valuable, but low-CMI rejection falls faster: cheap cases are unconditionally profitable. The gap therefore widens mainly because providers accept simple cases more eagerly. The conventional cherry-picking index captures the *size* of the gap but misattributes its *source*.

The deferral panels (d) and (e) tell the complementary story. Overall deferral is not monotone in $\alpha$; the more stable signal is the high-CMI minus low-CMI deferral gap. That gap becomes positive across the moderate- and high-$\alpha$ band, reaching 0.316 at $(0.6,0.7)$, while low-$\alpha$/high-$\beta$ cells reverse sign. This is a substitution pattern: when outright rejection becomes costly, providers shift cost-shedding into the delay channel, targeting the same patient subpopulation. Panel (f) confirms that this substitution is not free: realized waiting peaks at $(\alpha,\beta)=(0.3,0.5)$, in the same moderate-incentive corridor where access pressure is redistributed across rejection and deferral.

The headline rejection gap heatmap suggests that the profit corner is the single worst access regime. Figure [5](https://arxiv.org/html/2605.30680#A5.F5) reveals two distinct worst-case access regimes. Between them, a moderate-incentive corridor produces queue-mediated congestion as rejection and deferral trade off against each other. This third failure mode is invisible to audit-only interventions. This micro-level structure underlies the channel substitution results of §[5.2](https://arxiv.org/html/2605.30680#S5.SS2).

### E.3 Strategic delay as a cost-relief substitute

![Refer to caption](https://arxiv.org/html/2605.30680v1/x5.png)

Figure 6: Strategic-delay decomposition on the L1 grid. Panels (a) and (b) report deferral rates separately for high-CMI and low-CMI patients; panel (c) is the deferral gap. Panels (d)–(f) restrict the deferral sample to delays that improve next-period team utility under Eq. ([2](https://arxiv.org/html/2605.30680#S2.E2)) (“strategic delay”); panel (g) is the strategic-delay gap; panel (h) is the easy-case CMI tilt of KPI-lagging teams.

Figure [6](https://arxiv.org/html/2605.30680#A5.F6) isolates the delay channel from queue-mediated waiting and from involuntary deferral. The four-panel block (d)–(g) restricts the deferral sample to cases where the doctor chooses to defer *despite available local bed capacity*; this is the operationalization of “strategic delay” used in the main text. Three findings emerge.

First, strategic delay is not concentrated on high-complexity patients everywhere. At low $\alpha$, low-CMI strategic delay can exceed high-CMI strategic delay, producing the negative gaps in panel (g). In the mixed-incentive corridor the sign flips: high-CMI strategic delay dominates, consistent with the cost-relief motive in the microfoundation.

Second, the strategic-delay gap (g) is the most clearly interior pattern on the grid. Its maximum $0.29$ sits at $(0.6,0.7)$, where neither profit-corner rejection nor quality-corner effort fully dominates and delay becomes the cheapest cost-relief lever. Its minimum $-0.33$ sits at $(0.0,0.6)$, where teams protect low-complexity throughput against the cost weight in $\mathrm{KPI}$. Neither pattern is observable from the headline rejection heatmap alone.

Third, the lagging-team easy-case tilt (h) ranges from about $-0.35$ to $+0.42$ and captures a different mechanism from the strategic-delay gap. The tilt is a *team-level* Goodhart signature: teams trailing in the KPI tournament shift toward easier cases to recover bonus share. This mode becomes salient when bonus pressure $B^{\mathrm{pool}}\kappa\,s_{j}(1-s_{j})$ is large (see Figure [9](https://arxiv.org/html/2605.30680#A6.F9) for the corresponding sensitivity).

Taken together, panels \(g\) and \(h\) furnish quantitative evidence for the multitasking claim in §[5\.1](https://arxiv.org/html/2605.30680#S5.SS1): where the headline KPI–health correlation is most negative, the responsible channels are not effort distortion but patient\-level delay and team\-level case\-mix tilt acting in concert\.

### E.4 Clinical-budget frontier: bounded health, unbounded cost

![Refer to caption](https://arxiv.org/html/2605.30680v1/x6.png)

Figure 7: Clinical and budgetary outcomes on the L1 grid. Top row: mean clinical effort, effort per unit true CMI, and effort per unit realized health. Middle row: mean realized health, mean treatment cost, and health per cost. Bottom row: terminal funds $F_{T}$ and mean waiting time. The contrast between the narrow range of health (0.95–1.02) and the multi-order-of-magnitude range of terminal funds (0.4–6490) is the structural reason why interpreting health in isolation is misleading on this grid.

The dominant qualitative feature of Figure [7](https://arxiv.org/html/2605.30680#A5.F7) is a scale asymmetry between clinical gains and financial exposure. Mean realized health (panel d) varies only in the narrow band $[0.95,1.02]$, while terminal funds (panel g) span more than four orders of magnitude. Mean effort (panel a) ranges from $1.56$ to $2.74$, mean cost (panel e) from $8.23$ to $16.41$, and health-per-cost efficiency (panel f) collapses from about $0.12$ to $0.06$ as $\beta$ rises. The diminishing-return treatment production function of §[3.2](https://arxiv.org/html/2605.30680#S3.SS2) guarantees that health saturates while cost is convex in effort. The quality-driven corner therefore becomes insolvent because it keeps buying effort on a saturated-health, high-cost segment of the production curve.

Two secondary patterns are also important for interpretation. The effort-per-CMI panel (b) reveals that gold-plating intensifies along $\beta$ even when the true clinical need is held constant in the denominator, ruling out a pure case-mix explanation. The effort-per-health panel (c) is essentially a mirror image of the health-per-cost panel (f), with the small but real implication that the marginal product of effort drops by a factor of $1.5\text{--}2\times$ as we cross the saturating region; a mechanism designer who uses health as a target without dividing by cost will keep paying for an effort margin that produces almost no measurable health benefit. The waiting panel (h) reproduces the U-shape already seen in Figure [4](https://arxiv.org/html/2605.30680#A5.F4)(f) but at higher resolution: waiting peaks at $2.21$ near $(\alpha,\beta)=(0.3,0.5)$, where selection has not thinned the queue fully.

### E.5 KPI proxy management at the team level

![Refer to caption](https://arxiv.org/html/2605.30680v1/x7.png)

Figure 8: Team-level and KPI-level diagnostics on the L1 grid. Panels (a) and (b) report Pearson correlations of measured KPI and realized bonus against true health, on the cross-team panel. Panels (c)–(e) report the up-coding rate, mean accepted CMI by lagging teams, and the easy-case tilt of lagging teams. Panels (f)–(h) report cross-team coefficient of variation of load, of accepted CMI, and of clinical outcome.

Figure [8](https://arxiv.org/html/2605.30680#A5.F8) is the team-level counterpart to the patient-level decomposition above. Two panels carry most of the interpretive weight.

Panel (a) plots $\mathrm{Corr}(\mathrm{KPI}_{j,t},\,H_{j,t})$ across teams and time on each $(\alpha,\beta)$ cell. The correlation is sharply negative in the low-$\beta$ band with low or moderate $\alpha$ (reaching $-0.80$ at $(0.2,0.0)$), is close to zero along a diagonal corridor through the interior, and becomes positive only in a high-$\beta$ corridor (peaking at $+0.24$ at $(0.4,1.0)$). The bonus-health correlation in panel (b) repeats this pattern with sharper amplitude because the softmax tournament amplifies any cross-team ordering present in $\mathrm{KPI}$.

Panel (d), the mean accepted CMI of KPI-lagging teams, can be interpreted as conditional workload level. It ranges from about $0.20$ to $0.62$ and records the case mix that lagging teams actually end up carrying. Panel (e) reports the lagging-team easy-case tilt, an acceptance-rate gap between low- and high-CMI patients that spans roughly $[-0.35,+0.42]$. This distinction matters in the low-$\alpha$/high-$\beta$ corner: lagging teams have high accepted CMI there because quality pressure protects clinically complex patients while weak profit pressure removes the incentive to shed them. The negative tilt in the same region confirms that these teams are not padding KPI with easy cases, but carrying a heavier high-CMI clinical burden. Conversely, positive values in panel (e) identify KPI-driven easy-case selection among lagging teams, a team-level pattern that complements the aggregate rejection-gap view in Figure [4](https://arxiv.org/html/2605.30680#A5.F4)(a).

Panels (f)–(h) are auxiliary diagnostics. We report cross-team dispersion in load, accepted case mix, and outcomes to check whether the patient-level distortions documented above also leave a team-level footprint. The clearest signal is the accepted-CMI variance, which rises in regions where KPI-lagging teams tilt toward easier cases. The bottom row also separates two qualitatively different dispersion regimes. In the low-$\beta$ band, weak quality pressure makes case-mix sorting the dominant source of cross-team heterogeneity: teams differ mainly in which patients they accept. In the low-$\alpha$/high-$\beta$ corner, by contrast, profit-driven shedding is weak and quality pressure protects high-CMI patients, so cross-team dispersion reflects uneven clinical burden, effort intensity, and outcome variation under a saturated production function.

## Appendix F L2 Additional Lever Diagnostics

The main text reports the three L2 mechanism findings. The remaining one-at-a-time sweeps serve as sanity checks and mechanism decompositions. Increasing total capacity from 6 to 16 reduces mean waiting in every regime, consistent with high-utilization queueing predictions. Raising the KPI health-to-cost weight ratio increases clinical effort while weakening budget sustainability. The bonus-sharpness sweep is non-monotone because the local derivative $B^{\mathrm{pool}}\kappa\,s_{j}(1-s_{j})$ is weak when the tournament is flat and also weak when shares saturate.
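The non-monotone shape of the sharpness sweep follows directly from the form of the local derivative. The sketch below evaluates it for a hypothetical two-team tournament with a fixed KPI gap, where the leading team's share reduces to a logistic function of $\kappa$ times the gap; the gap of 0.5, the unit pool, and the grid of $\kappa$ values are illustrative assumptions.

```python
import math


def leader_pressure(kappa, kpi_gap, pool):
    """Local bonus derivative for the leading team in a two-team tournament.

    With two teams, the leader's softmax share is logistic in kappa * kpi_gap.
    """
    share = 1.0 / (1.0 + math.exp(-kappa * kpi_gap))
    return pool * kappa * share * (1.0 - share)


kappas = [0.0, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
pressures = [leader_pressure(k, kpi_gap=0.5, pool=1.0) for k in kappas]
peak_kappa = kappas[pressures.index(max(pressures))]
```

On this grid the pressure is zero at $\kappa=0$, rises to an interior peak, and decays toward zero again as the leader's share approaches one, so both very flat and very sharp tournaments exert little local incentive.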

![Refer to caption](https://arxiv.org/html/2605.30680v1/x8.png)

Figure 9: L2 bonus-pool ablation. Points are 30-seed means over horizon $T=200$; shaded bands are 95% confidence intervals and the dashed line marks the baseline pool. Larger pools reduce funds and can weaken KPI-health alignment.

Figure [9](https://arxiv.org/html/2605.30680#A6.F9) isolates the bonus-pool channel behind the main-text Goodhart result. Larger pools reduce final funds in all three regimes. In the balanced regime, KPI–true-health correlation falls from $-0.447$ at $B^{\mathrm{pool}}=0$ to $-0.839$ at $B^{\mathrm{pool}}=15$, showing that stronger measured incentives can amplify a misaligned proxy rather than repair it.

## Appendix G L2 Additional Diagnostic: Flexible Capacity as a Coordination Test

The main-text L2 ablation turns on KPI capacity steering and shows that adding flexible capacity can raise waiting. This appendix repeats the flex-pool sweep with capacity steering turned off ($\xi=0$, kpi\_steering\_mode=none), so that the flexible subpool is allocated by the static base-capacity shares rather than by last-period KPI scores.

![Refer to caption](https://arxiv.org/html/2605.30680v1/x9.png)

Figure 10: L2 steering-off flexible-pool diagnostic. Mean waiting (30 seeds, $T=200$) as a function of the flexible-pool size $B^{\mathrm{flex}}$, holding routing, team specialization, and total capacity fixed while disabling KPI capacity steering. Three regimes are shown; the dashed line marks the L2 baseline.

Figure [10](https://arxiv.org/html/2605.30680#A7.F10) sweeps $B^{\mathrm{flex}} \in \{0, 2, 4\}$ under the three representative regimes of §[5.1](https://arxiv.org/html/2605.30680#S5.SS1). With steering off, the adverse main-text slope disappears: mean waiting is essentially unchanged from $B^{\mathrm{flex}}=0$ to $4$ in the balanced regime ($1.88 \rightarrow 1.88$) and the profit-driven regime ($1.84 \rightarrow 1.84$), and falls modestly in the quality-driven regime ($2.14 \rightarrow 2.09$). The intermediate $B^{\mathrm{flex}}=2$ point rises in the balanced and profit-driven regimes, but we interpret this as a finite-team integer-allocation effect: total capacity is held fixed, so increasing $B^{\mathrm{flex}}$ reallocates capacity between dedicated and flexible slots, and a small flexible subpool can temporarily worsen queue–team mismatch before a larger subpool covers more teams. Therefore, flexible capacity by itself does not reproduce the adverse waiting increase seen in the main L2 panel; the increase comes from the KPI-steering rule that consumes the flexible subpool.

When capacity steering is active, the flex pool is reallocated toward teams with high KPI-steering scores, which in these rollouts are teams already adjusting triage and effort to chase bonus share. When steering is disabled, the flexible subpool is no longer systematically allocated to those teams, so it stops reinforcing the bonus-driven behavior that raised waiting in the main L2 ablation. The result shows that flexible capacity is not automatically beneficial: its effect depends on the allocation rule. In this configuration, $B^{\mathrm{flex}}$ helps only when the rule assigning it addresses queue imbalance rather than amplifying the coordination problem created by the KPI tournament.
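
The dependence on the allocation rule can be made concrete with a toy integer allocation. The sketch below is not the simulator's allocation code: the largest-remainder rounding, the function name `allocate_flex`, and all team numbers are hypothetical. It only shows how the same flexible pool of 4 slots is spread differently when the weights are static base-capacity shares versus KPI-steering scores.

```python
def allocate_flex(flex_pool, weights):
    """Split an integer flex pool across teams in proportion to `weights`,
    using largest-remainder rounding (an ASSUMED rounding rule)."""
    total = sum(weights)
    quotas = [flex_pool * w / total for w in weights]
    slots = [int(q) for q in quotas]  # integer part of each quota
    leftovers = flex_pool - sum(slots)
    # Hand the remaining slots to the teams with the largest fractional parts.
    order = sorted(range(len(weights)), key=lambda i: quotas[i] - slots[i], reverse=True)
    for i in order[:leftovers]:
        slots[i] += 1
    return slots

base_capacity = [4, 3, 3]          # hypothetical dedicated capacity per team
steering_scores = [7.0, 2.0, 1.0]  # hypothetical KPI-steering scores (arbitrary scale)

static_alloc = allocate_flex(4, base_capacity)     # steering off: base-capacity shares
steered_alloc = allocate_flex(4, steering_scores)  # steering on: follows KPI scores
print("steering off:", static_alloc)
print("steering on: ", steered_alloc)
```

In this toy case the static rule spreads the pool across all three teams, while the score-based rule concentrates it on the top-scoring team and leaves the lowest-scoring team with no flexible slot.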

## Appendix H L1/L2 Experimental Setup

Table [3](https://arxiv.org/html/2605.30680#A8.T3) centralizes the run settings used by the L1 phase diagram and the L2 administrative-lever ablations. These settings are reported here for reproducibility; the main text uses only the mechanism-level findings.

| Layer | Scope | Values | Notes |
| --- | --- | --- | --- |
| L1/L2 | Shared rollout protocol | Horizon $T=200$; seeds $\{1,\ldots,30\}$; reported values are seed means unless a confidence interval is shown. | Both layers use the native hospital administrator and provider-response rules of §[3](https://arxiv.org/html/2605.30680#S3); no AlphaEvolve policy is injected. |
| L1 | Incentive phase diagram | $\alpha,\beta \in \{0.0, 0.1, \ldots, 1.0\}$, giving an $11 \times 11$ grid. | The sweep varies only provider financial sensitivity $\alpha$ and quality sensitivity $\beta$; all other administrative settings stay fixed. |
| L1 | Default administrative and environment settings | $B^{\mathrm{tot}}=10$, $B^{\mathrm{flex}}=0$, $B^{\mathrm{pool}}=5.0$, $\kappa=2.0$, $(w_H, w_W, w_{\mathrm{rej}}, w_C)=(1.0, 0.5, 0.5, 0.3)$, $\xi=0$. | Strategic routing is disabled; kpi\_steering\_mode=none; probabilistic up-coding, KPI-gaming tilt, and KPI-weighted effort feedback are enabled. |
| L2 | Representative regimes | Profit-driven $(\alpha,\beta)=(0.8, 0.2)$; quality-driven $(0.2, 0.8)$; balanced $(0.5, 0.5)$. | Each L2 axis is swept one at a time around these three L1 anchor regimes. |
| L2 | Main one-at-a-time lever sweeps | $q \in \{0, 0.05, 0.10, 0.20, 0.35, 0.50\}$; $B^{\mathrm{pool}} \in \{0, 2.5, 5, 7.5, 10, 15\}$; $\kappa \in \{0, 1, 2, 3, 4, 5\}$; $B^{\mathrm{tot}} \in \{6, 8, 10, 12, 14, 16\}$; $w_H/w_C \in \{0.5, 1, 2, 3.333, 4, 5\}$; $B^{\mathrm{flex}} \in \{0, 2, 4\}$. | The audit sweep sets audit\_base\_prob $=q$ and audit\_slope $=0$; the $w_H/w_C$ sweep holds $w_H+w_C$ fixed at the L1 baseline. |
| L2 | Main baseline and steering settings | $B^{\mathrm{tot}}=10$, $B^{\mathrm{flex}}=0$, $B^{\mathrm{pool}}=5.0$, $\kappa=2.0$, $(w_H, w_W, w_{\mathrm{rej}}, w_C)=(1.0, 0.5, 0.5, 0.3)$, audit baseline $q=0.10$, $\xi=1.0$. | Strategic routing is disabled; kpi\_steering\_mode=none; probabilistic up-coding, KPI-targeting tilt, and KPI-weighted effort feedback are enabled. |
| L2 | Steering-off flexible-capacity diagnostic | $B^{\mathrm{flex}} \in \{0, 2, 4\}$ under the same three regimes, with $B^{\mathrm{tot}}=10$, $\xi=0$, and kpi\_steering\_mode=none. | This diagnostic isolates whether the adverse waiting slope comes from flexible capacity itself or from the KPI-based allocation rule. |

Table 3: L1/L2 experimental setup and hyperparameters. $B^{\mathrm{tot}}$ is total capacity, $B^{\mathrm{flex}}$ is the flexible-capacity pool, and $B^{\mathrm{pool}}$ is the KPI bonus pool.

## Appendix I External Stylized-Fact Validation

Table [4](https://arxiv.org/html/2605.30680#A9.T4) summarizes the external stylized facts used to validate Medi-Sim’s provider-response dynamics. The table is not intended as a quantitative calibration exercise. It checks whether simulator interventions move the same provider-response channels in the same qualitative direction as the healthcare incentive and operations literature predicts.

| External stylized fact | Anchor | Simulation intervention | Expected qualitative signature | Medi-Sim observed signature | Match strength and caveat |
| --- | --- | --- | --- | --- | --- |
| Diagnosis-based payment creates coding rent | Dafny ([2005](https://arxiv.org/html/2605.30680#bib.bib9)); Kronick and Welch ([2014](https://arxiv.org/html/2605.30680#bib.bib2)) | Increase profit sensitivity or payment spread; high-$\alpha$/low-$\beta$ region | Reported complexity or up-coding rises; funds rise; true clinical need need not rise equally | Profit-driven representative regime: up-coding $=0.226$; funds $=5451.7$ | Strong directional match. We claim qualitative coding-rent reproduction, not real-system magnitude calibration. |
| Risk-adjusted reimbursement raises reported risk or coded complexity | Kronick and Welch ([2014](https://arxiv.org/html/2605.30680#bib.bib2)); Geruso and Layton ([2020](https://arxiv.org/html/2605.30680#bib.bib33)) | Strengthen reimbursement value of the coded group while the coding wedge is active | Coded risk rises relative to true patient complexity | High-profit cells show elevated up-coding; the main tables report the up-coding rate rather than coded-minus-true CMI decomposition | Partial match. Stronger if future reports add explicit coded-CMI inflation. |
| Profit or ownership-like incentives shift case mix | Silverman and Skinner ([2004](https://arxiv.org/html/2605.30680#bib.bib10)); Ellis ([1998](https://arxiv.org/html/2605.30680#bib.bib8)); Ma ([1994](https://arxiv.org/html/2605.30680#bib.bib7)) | Move toward high $\alpha$ at low $\beta$; compare profit-driven cells with quality and balanced cells | More profitable or easier patients are favored; high-cost/high-CMI patients face worse access | Profit-driven representative regime: high-CMI rejection gap $=0.182$ alongside elevated coding and funds | Strong directional match. In L1/L2, active hospital routing is disabled, so the selection signature comes from provider triage. |
| Audit suppresses coding but can shift pressure elsewhere | Kuhn and Siciliani ([2008](https://arxiv.org/html/2605.30680#bib.bib11)) | Increase audit probability $q$ from 0 to 0.5 | Up-coding falls; selection or delay pressure may rise when the billing channel closes | Profit-driven up-coding $0.851 \rightarrow 0.003$; balanced up-coding $0.636 \rightarrow 0.001$; balanced cherry-picking $0.100 \rightarrow 0.233$ | Strong match. The result validates channel substitution, not a calibrated audit schedule. |
| Measured performance targets induce gaming | Bevan and Hood ([2006](https://arxiv.org/html/2605.30680#bib.bib15)); Propper et al. ([2010](https://arxiv.org/html/2605.30680#bib.bib5)); Campbell ([1979](https://arxiv.org/html/2605.30680#bib.bib1)); Manheim and Garrabrant ([2018](https://arxiv.org/html/2605.30680#bib.bib34)) | Increase KPI salience through balanced-interior incentives, bonus pressure, or KPI steering | Measured score becomes behaviorally salient while true-health alignment can weaken | Balanced interior: lagging teams tilt accepted case mix toward easier cases (tilt $=0.341$); $\mathrm{corr}(\mathrm{KPI},\mathrm{health})=-0.659$ | Strong match. The simulator does not reproduce a specific NHS target; it reproduces proxy-objective decoupling. |
| Stronger pay-for-performance can worsen true alignment when the proxy is misaligned | Holmstrom and Milgrom ([1991](https://arxiv.org/html/2605.30680#bib.bib13)); Baker ([1992](https://arxiv.org/html/2605.30680#bib.bib14)); Eijkenaar et al. ([2013](https://arxiv.org/html/2605.30680#bib.bib16)); Van Herck et al. ([2010](https://arxiv.org/html/2605.30680#bib.bib17)) | Increase the bonus pool $B^{\mathrm{pool}}$ | The measured proxy becomes more consequential; KPI–health alignment may worsen | Endpoint comparison: balanced $\mathrm{corr}(\mathrm{KPI},\mathrm{health})$ is $-0.447$ at $B^{\mathrm{pool}}=0$ and $-0.839$ at $B^{\mathrm{pool}}=15$; the smallest positive pool gives a small non-monotone uptick | Strong match. The caveat is conditional: stronger incentives are harmful here because the measured proxy is misaligned. |
| Quality-oriented incentives suppress coding/selection but raise effort and budget pressure | Ma ([1994](https://arxiv.org/html/2605.30680#bib.bib7)); Ellis ([1998](https://arxiv.org/html/2605.30680#bib.bib8)); Eggleston ([2005](https://arxiv.org/html/2605.30680#bib.bib12)) | Low $\alpha$, high $\beta$ regime | Lower up-coding and less cream-skimming; higher effort and weaker solvency | Quality-driven regime: up-coding $=0.027$; rejection gap $=-0.051$; effort $=2.435$; funds $=15.2$ | Strong match. This is a multitasking trade-off in the simulator’s simplified clinical-production function. |
| Total capacity expansion reduces waiting under queueing pressure | Green ([2002](https://arxiv.org/html/2605.30680#bib.bib18), [2006](https://arxiv.org/html/2605.30680#bib.bib19)) | Increase total capacity from 6 to 16 | Mean waiting should fall across regimes | Mean waiting falls in all three L2 regimes: profit $2.313 \rightarrow 1.735$, quality $2.325 \rightarrow 1.896$, balanced $2.303 \rightarrow 1.720$ | Strong operations sanity check. This row validates the queueing layer rather than a strategic distortion channel. |
| Flexible capacity helps only under an appropriate allocation rule | Bekker et al. ([2017](https://arxiv.org/html/2605.30680#bib.bib20)) and hospital-flow allocation logic | Compare KPI-steered flexible pool with steering-off flexible pool | Flex capacity may fail if allocated by the wrong proxy; the adverse effect should disappear when steering no longer follows the proxy | With KPI steering, balanced waiting rises $1.88 \rightarrow 2.23$ as $B^{\mathrm{flex}}$ grows $0 \rightarrow 4$. With steering off, balanced and profit-driven waiting remain $1.88 \rightarrow 1.88$ and $1.84 \rightarrow 1.84$; quality falls $2.14 \rightarrow 2.09$ | Strong special finding. Total capacity is fixed here; the experiment tests the allocation rule for the flexible subpool. |

Table 4: Full external stylized-fact validation map. Rows report qualitative matches between external healthcare incentive/operations facts and Medi-Sim signatures. Match strength refers to directional reproduction under fixed simulator rules, not quantitative calibration to any real hospital system.

## Appendix J Strategic Policy-as-Code: Search Diagnostics

This part provides four pieces of structural evidence behind the L3 results of §[5.3](https://arxiv.org/html/2605.30680#S5.SS3): validation search curves, a warm-start ablation that pinpoints the role of the curated library, method comparisons under each social objective that contextualize the headline numbers in Table [1](https://arxiv.org/html/2605.30680#S5.T1), and a $K=300$ diagnostic that characterizes the shape of the search trajectory.

The main L3 runs use ChatGPT-5.4 as the code-mutation model with $K=200$ iterations, 3 islands, 30 individuals per island, LLM temperature 0.4, migration size 2, evolution seeds $\{101, 202, 303\}$, validation seeds $\{404, 505, 606, 707, 808\}$, and held-out test seeds $\{909, 1001, 1103, 1207, 1301\}$. The warm-start library and its diversity ablations are described in §[J.1](https://arxiv.org/html/2605.30680#A10.SS1).
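
For readers who want to reproduce the protocol, the settings above can be collected into a single configuration object. The sketch below is only an illustration: the class name `SearchConfig` and its field names are hypothetical and do not come from the experiment pipeline; the values are the ones stated in the text. The helper checks that the evolution, validation, and held-out seed sets do not overlap.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SearchConfig:
    """Hypothetical container for the L3 search settings quoted in the text."""
    iterations: int = 200                 # K
    islands: int = 3
    individuals_per_island: int = 30
    llm_temperature: float = 0.4
    migration_size: int = 2
    evolution_seeds: tuple = (101, 202, 303)
    validation_seeds: tuple = (404, 505, 606, 707, 808)
    test_seeds: tuple = (909, 1001, 1103, 1207, 1301)

    def seed_sets_disjoint(self) -> bool:
        """True if no seed is shared between evolution, validation, and test."""
        groups = [self.evolution_seeds, self.validation_seeds, self.test_seeds]
        all_seeds = [s for group in groups for s in group]
        return len(all_seeds) == len(set(all_seeds))

config = SearchConfig()
print("seed sets disjoint:", config.seed_sets_disjoint())
print("individuals across islands:", config.islands * config.individuals_per_island)
```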

![Refer to caption](https://arxiv.org/html/2605.30680v1/x10.png)

Figure 11: L3 AlphaEvolve validation search curves. Each panel shows generation-best and running-best fitness under a different evaluator with $K=200$. These curves are diagnostic: the main-text L3 claim rests on held-out rollout profiles in Table [1](https://arxiv.org/html/2605.30680#S5.T1), while the curves show how each objective shapes the search trajectory.

Figure [11](https://arxiv.org/html/2605.30680#A10.F11) records the validation dynamics for the $K=200$ runs. Welfare and profit objectives show short early improvements followed by plateaus, while the mixed objective reaches its best validation candidate quickly and then mostly explores lower-fitness variants. We use these curves as search-process diagnostics; the policy comparison itself is based on the held-out rollout profiles in Table [1](https://arxiv.org/html/2605.30680#S5.T1).

### J.1 Warm-start library and the diversity ablation

The L3 warm-start library contains nine policies: a *neutral* template (constants equal to balanced defaults), a *fixed* heuristic that mirrors the L1 baseline, a *greedy-profit* policy with high $\alpha$ and aggressive coding, a *greedy-quality* policy with high $\beta$ and conservative coding, three access-oriented variants (high-$\alpha$ acceptance, high-capacity, high-flex), and two coding-aggression variants (aggressive and conservative). Library diversity is the only configuration parameter that meaningfully changes the mixed-objective outcome at $K=200$.

The ablation is reported as three nested libraries. With the *neutral-only* library, selection on validation fitness returns the seed itself at $13.545$ and the best evolved candidate reaches only $13.351$; search cannot improve over its starting point. After the profit-side seeds are removed, so that the library still contains the welfare-leaning policies but no aggressive-coding warm start, the best evolved candidate reaches $13.401$. The *full* nine-policy library produces a selected aggressive-coding warm start at $13.607$ and then refines to $13.876$ on held-out seeds, the value reported in Table [1](https://arxiv.org/html/2605.30680#S5.T1). The reading we take from this nested set is clear: the gain from search over the best warm start ($+0.27$ fitness) is comparable to the gain from library expansion alone ($+0.21$), and search produces no improvement above the seed when the seed is not diverse. We therefore present AlphaEvolve over Medi-Sim as a feasibility demonstration of program search over the policy class of §[4](https://arxiv.org/html/2605.30680#S4) and do not claim that current search procedures can rediscover the mixed family from scratch.
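
The two gains quoted above follow from the fitness values in this paragraph. The short check below uses only those reported numbers; the variable names are ours, and reading the library-expansion gain as the difference between the full-library warm start and the best candidate without profit-side seeds is our interpretation of which pair of values yields $+0.21$.

```python
# Fitness values as reported in the warm-start ablation.
neutral_seed = 13.545            # neutral-only library: selected seed
neutral_best_evolved = 13.351    # neutral-only library: best evolved candidate
no_profit_best_evolved = 13.401  # library without profit-side seeds
full_warm_start = 13.607         # full library: selected warm start
full_refined = 13.876            # full library: refined policy

# Gain from search on top of the best warm start.
search_gain = round(full_refined - full_warm_start, 2)
# Gain attributed to expanding the library (our reading of the +0.21 figure).
library_gain = round(full_warm_start - no_profit_best_evolved, 2)
# With a neutral-only library, the best evolved candidate is below the seed.
neutral_only_change = round(neutral_best_evolved - neutral_seed, 2)

print(search_gain, library_gain, neutral_only_change)
```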

### J.2 Method comparison under each social objective

![Refer to caption](https://arxiv.org/html/2605.30680v1/figures/fig_app_alphaevolve_mixed_k200.png)

Figure 12: $K=200$ mixed-objective four-method comparison. Bars are means over the five held-out test seeds; whiskers are 95% bootstrap intervals. The AlphaEvolve column is the central qualitative finding: it achieves profit-comparable discounted return while reducing mean bonus and bonus pressure by about 60% relative to Fixed, with doctor margin recovered to within 25% of the Profit baseline.

Figure [12](https://arxiv.org/html/2605.30680#A10.F12) is the main supporting figure for the mixed-objective L3 result. The discounted-return panel shows AlphaEvolve essentially tied with the profit baseline; the doctor-utility, mean-effort, and doctor-margin panels show that this tie is achieved with a substantially different operational profile: lower mean effort, higher doctor utility, and higher doctor margin than Fixed. The two bonus-related panels are the most informative. Mean bonus and bonus pressure both fall by about 60% relative to the Fixed baseline, even though Fixed and AlphaEvolve produce comparable discounted returns. The searched mixed policy therefore retains the macroeconomic return while structurally lowering the local bonus pressure $B^{\mathrm{pool}} \kappa s_j (1 - s_j)$ that drives the KPI-targeting residual identified in §[E.5](https://arxiv.org/html/2605.30680#A5.SS5).

![Refer to caption](https://arxiv.org/html/2605.30680v1/figures/fig_app_alphaevolve_profit_five.png)

Figure 13: $K=200$ pure-profit five-method comparison. Bars are means over the five held-out test seeds. Adding the Neutral baseline reveals that Neutral achieves a lower but still nontrivial return despite substantially higher mean effort and bonus pressure; AlphaEvolve makes a small profit-family refinement visible in return, doctor utility, and doctor margin.

![Refer to caption](https://arxiv.org/html/2605.30680v1/figures/fig_app_alphaevolve_welfare_five.png)

Figure 14: $K=200$ pure-welfare five-method comparison. Bars are means over the five held-out test seeds. The AlphaEvolve refinement of the welfare family lifts discounted return above all four baselines while keeping mean effort below Quality and lifting doctor margin from near zero to $\sim 1.4$, indicating that the search has trimmed the gold-plating slack identified in §[E.4](https://arxiv.org/html/2605.30680#A5.SS4).

The single-objective comparisons in Figures [13](https://arxiv.org/html/2605.30680#A10.F13) and [14](https://arxiv.org/html/2605.30680#A10.F14) sharpen the interpretive picture in two ways. Under the pure-profit objective, the Profit warm start is already close to a local optimum on $\mathrm{Fitness}$; AlphaEvolve captures only a small additional margin, visible as minor gains in discounted return, doctor utility, and doctor margin, while pushing the coding-heavy risk profile slightly further. Adding the Neutral baseline shows that the return gap between Neutral and Profit is moderate, but the operational gap is much larger: Neutral uses substantially more effort, bonus, and bonus pressure while still remaining below the profit-oriented return. Under the pure-welfare objective, AlphaEvolve also improves the operational profile relative to the Quality baseline. The doctor-margin panel rises from near zero to about $1.4$, indicating that the searched welfare policy reduces the excessive effort cost of the quality-driven baseline. At the same time, it preserves the access improvements rewarded by the welfare objective, rather than gaining margin by rejecting or delaying patients.

### J.3 $K=300$ comparison panels

![Refer to caption](https://arxiv.org/html/2605.30680v1/figures/fig_app_alphaevolve_mixed_comp.png)

Figure 15: $K=300$ mixed-objective four-method comparison, the longer-budget counterpart of Figure [12](https://arxiv.org/html/2605.30680#A10.F12). Bars are means over the five held-out test seeds. The qualitative picture is unchanged from $K=200$: AlphaEvolve achieves discounted return on par with Profit while leaving mean bonus and bonus pressure suppressed, consistent with the trajectory in Figure [18](https://arxiv.org/html/2605.30680#A10.F18).

![Refer to caption](https://arxiv.org/html/2605.30680v1/figures/fig_app_alphaevolve_profit_comp.png)

Figure 16: $K=300$ pure-profit four-method comparison. Bars are means over the five held-out test seeds. The longer budget gives AlphaEvolve a small gain over Profit in discounted return, doctor utility, and doctor margin, while preserving the same profit-oriented risk profile.

![Refer to caption](https://arxiv.org/html/2605.30680v1/figures/fig_app_alphaevolve_welfare_comp.png)

Figure 17: $K=300$ pure-welfare four-method comparison. Bars are means over the five held-out test seeds. The welfare AlphaEvolve refinement extends the discounted-return lift seen at $K=200$ and continues to suppress Quality-style gold-plating on the effort and bonus panels.

The four-method comparisons in Figures [15](https://arxiv.org/html/2605.30680#A10.F15)–[17](https://arxiv.org/html/2605.30680#A10.F17) reproduce the same qualitative policy families at the longer search budget. The Fixed, Profit, and Quality bars are unchanged because they are evaluations of fixed warm starts. The AlphaEvolve bars move in objective-specific directions: the pure-profit and pure-welfare runs push further along their respective objectives, while the mixed run improves waiting and high-complexity deferral but does not dominate the $K=200$ mixed policy on held-out fitness or aggregate violations. We therefore treat $K=300$ as a diagnostic budget check rather than as a replacement for the main $K=200$ result.

### J.4 $K=300$ diagnostic: trajectory shape

![Refer to caption](https://arxiv.org/html/2605.30680v1/figures/fig_app_alphaevolve_mixed_steps.png)

Figure 18: $K=300$ mixed-objective running-best trace. Each step is a search iteration; the first marker is the neutral seed and later markers indicate iterations at which AlphaEvolve discovered a policy that improved the incumbent search fitness. The trajectory is piecewise constant with three update events at iterations $198$, $213$, and $273$; Table [5](https://arxiv.org/html/2605.30680#A10.T5) records the code-level edits associated with each event.

Extending the search budget from $K=200$ to $K=300$ does not change the held-out conclusion: the $K=300$ mixed run lowers high-complexity deferral but does not beat the $K=200$ policy on held-out fitness or overall violations, and we therefore keep $K=200$ as the main mixed result. The $K=300$ trace is nonetheless diagnostic about the *shape* of program-space search over $\Pi_A$. Figure [18](https://arxiv.org/html/2605.30680#A10.F18) shows that improvement is concentrated in three discrete events at iterations $198$, $213$, and $273$, separated by long intervals of zero improvement. The fitness increments are small ($+0.047$, $+0.002$, and $+0.066$), and only two of the three events touch the administrative side of the policy bundle. This piecewise-constant trajectory shape is a feature of the regulated DSL of §[4](https://arxiv.org/html/2605.30680#S4): the program space contains many syntactically valid local perturbations whose simulated rollouts produce indistinguishable evaluator returns, and meaningful improvements arrive only at the rare iterations where a coordinated multi-field edit moves the mechanism across one of the regime boundaries identified in §[E](https://arxiv.org/html/2605.30680#A5).

Table 5: $K=300$ mixed running-best trace. Update codes summarize the policy fields changed relative to the previous running-best step. “Gap” is the strategic-delay gap.

| Step | Iter. | Update | Fitness | Wait | Reject | Gap |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 0 | neutral | 13.735 | 1.797 | 0.022 | 0.328 |
| 2 | 198 | admin + doctor | 13.782 | 1.628 | 0.050 | 0.233 |
| 3 | 213 | doctor only | 13.784 | 1.606 | 0.050 | 0.236 |
| 4 | 273 | admin + doctor | 13.850 | 1.602 | 0.049 | 0.238 |

Table [5](https://arxiv.org/html/2605.30680#A10.T5) decomposes these update events at the metric level. Each improvement event corresponds to a clear trade: rejection rises from $0.022$ to $\sim 0.05$ as the policy stops absorbing patients it cannot treat profitably, the strategic-delay gap falls from $0.328$ to $\sim 0.24$ as the rejection channel takes over the cost-shedding role the delay channel had carried, and mean waiting falls in lockstep. The qualitative pattern is the same channel substitution documented in §[E.2](https://arxiv.org/html/2605.30680#A5.SS2), now observed inside a single search trajectory. The complete diagnostic policy sketch is given in Listing [4](https://arxiv.org/html/2605.30680#LST4) below.
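
The increments and the rejection-for-delay trade can be read directly off Table 5. The snippet below re-enters the four table rows (the tuple layout and variable names are ours) and recomputes the per-event fitness increments, the number of events that touch the administrative side, and the direction of each metric.

```python
# Rows of Table 5: (iteration, update, fitness, wait, reject, gap).
trace = [
    (0, "neutral", 13.735, 1.797, 0.022, 0.328),
    (198, "admin + doctor", 13.782, 1.628, 0.050, 0.233),
    (213, "doctor only", 13.784, 1.606, 0.050, 0.236),
    (273, "admin + doctor", 13.850, 1.602, 0.049, 0.238),
]

# Fitness gained at each running-best update.
increments = [round(b[2] - a[2], 3) for a, b in zip(trace, trace[1:])]
# Number of update events whose code includes an administrative edit.
admin_events = sum("admin" in row[1] for row in trace[1:])
# Mean waiting falls at every update; rejection ends higher, the delay gap lower.
waits = [row[3] for row in trace]
waiting_falls = all(later < earlier for earlier, later in zip(waits, waits[1:]))
reject_change = round(trace[-1][4] - trace[0][4], 3)
gap_change = round(trace[-1][5] - trace[0][5], 3)

print(increments, admin_events, waiting_falls, reject_change, gap_change)
```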

### J.5 Discovered policy structures

The first three listings below summarize the final code structures produced by the $K=200$ search under each social objective. Each listing reproduces the final field-level policy code; auxiliary dictionary syntax is omitted for readability and the full executable policies are saved by the experiment pipeline. The structures should be read against the warm-start ablation of §[J.1](https://arxiv.org/html/2605.30680#A10.SS1): search refines, but does not invent, the families inherited from the library.

#### Welfare family (Listing [1](https://arxiv.org/html/2605.30680#LST1)).

The welfare family pushes $\alpha$ down to $0.20$ and $\beta$ to $1.00$, then adds state-conditional capacity expansion ($+1$ if any of queue, waiting, or utilization stress is detected) and a flex pool that grows under the same conditions. The hospital KPI weight vector tilts toward health, waiting, and rejection ($w_H=1.80$, $w_W=0.90$, $w_{\mathrm{rej}}=1.70$) while suppressing the cost weight to $w_C=0.05$. The coding-side expression is dominated by the ethics term ($-2.50$), which keeps up-coding at zero across all rollouts.

Listing 1: Pure-welfare policy sketch.

    alpha = 0.20
    beta = 1.00
    total_capacity = max(9, min(13, current_total_capacity
        + I((queue_total > 6.0) or (wbar > 1.4) or (utilization > 0.88))
        - I((queue_total < 2.0) and (utilization < 0.50))))
    bonus_pool = 5.0
    flex_pool = min(max(1, 3 + I((wbar > 1.2) or (queue_total > 5.0)
        or (queue_max > 2.0)) + I(utilization > 0.85)),
        max(1, current_total_capacity))
    xi = 0.0
    wH = 1.80; wW = 0.90; wrej = 1.70; wC = 0.05; kappa = 1.7
    effort = base_effort + effort_quality_pressure
    triage_accept = (clinical_triage_score + margin_pressure + quality_pressure
        + bed_availability_signal - triage_fatigue_pressure + kpi_gaming_pressure
        + 0.15*urgency + 0.05*wait) >= accept_threshold
    triage_defer = (not ((clinical_triage_score + margin_pressure + quality_pressure
        + bed_availability_signal - triage_fatigue_pressure + kpi_gaming_pressure
        + 0.15*urgency + 0.05*wait) >= accept_threshold)) and
        ((clinical_triage_score + margin_pressure + quality_pressure
        + bed_availability_signal - triage_fatigue_pressure
        + 0.5*kpi_gaming_pressure + 0.10*urgency) + defer_pressure
        >= defer_threshold)
    request_bed = bed_available and ((clinical_bed_score + 0.20*quality_pressure
        + bed_quality_pressure - bed_cost_pressure + bed_availability_signal
        - bed_fatigue_pressure + 0.05*urgency) >= bed_threshold)
    candidate_score = 0.20*upcode_pressure - 1.50*audit_penalty
        - 2.50*ethics_pressure - 0.45*coding_gap
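
The state-conditional capacity and flex-pool rules of the welfare listing can be transcribed into plain Python to make the clamping behavior easy to probe. This is a sketch under two assumptions: that `I(...)` is a 0/1 indicator, and that the state variables (`queue_total`, `queue_max`, `wbar`, `utilization`) are scalars supplied by the simulator. The function names are ours and are not part of the policy DSL.

```python
def indicator(condition: bool) -> int:
    """Assumed semantics of I(...): 1 if the condition holds, else 0."""
    return 1 if condition else 0

def welfare_total_capacity(current_total_capacity, queue_total, wbar, utilization):
    """Total-capacity rule of the welfare listing, clamped to [9, 13]."""
    stressed = (queue_total > 6.0) or (wbar > 1.4) or (utilization > 0.88)
    slack = (queue_total < 2.0) and (utilization < 0.50)
    proposed = current_total_capacity + indicator(stressed) - indicator(slack)
    return max(9, min(13, proposed))

def welfare_flex_pool(current_total_capacity, queue_total, queue_max, wbar, utilization):
    """Flex-pool rule of the welfare listing, capped by current total capacity."""
    access_stress = (wbar > 1.2) or (queue_total > 5.0) or (queue_max > 2.0)
    proposed = max(1, 3 + indicator(access_stress) + indicator(utilization > 0.85))
    return min(proposed, max(1, current_total_capacity))

# Hypothetical states: a stressed period and a slack period.
print(welfare_total_capacity(10, queue_total=7.0, wbar=1.0, utilization=0.70))
print(welfare_total_capacity(10, queue_total=1.0, wbar=0.5, utilization=0.40))
print(welfare_flex_pool(10, queue_total=6.0, queue_max=1.0, wbar=0.5, utilization=0.90))
```

Under these assumptions, capacity moves by at most one unit per period and never leaves the interval from 9 to 13, while the flex pool takes values between 3 and 5 unless total capacity is smaller.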

#### Profit family (Listing [2](https://arxiv.org/html/2605.30680#LST2)).

The profit family inverts every one of these moves. The hospital administrator commits to $\alpha \approx 0.95$ except when waiting or rejection signals a near-failure mode (in which case $\alpha$ falls by $0.15$ and $\beta$ rises by $0.20$ as a safety reflex), keeps KPI steering active at $\xi=1.5$, and lets a small flex pool expand under queue and rejection stress. The bonus pool scales linearly with last-period profit, and the hospital KPI weight on cost ($w_C=1.70$) dominates all other weights. The coding-side expression doubles its weight on up-coding pressure and gain and halves its ethics-pressure coefficient: the searched profit policy actively manages coding aggressiveness as a continuous variable.

Listing 2: Pure-profit policy sketch.

    alpha = clip(0.95 - 0.15*I((wbar > 4.0) or (rbar > 0.15)), 0.0, 1.0)
    beta = clip(0.05 + 0.20*I((wbar > 4.0) or (rbar > 0.15)), 0.0, 1.0)
    total_capacity = max(10, min(12, current_total_capacity
        + I((((wbar > 5.0) or (rbar > 0.18) or (queue_total > 24.0))
        and (utilization > 0.92)) and (funds > 70.0)
        and (profit_last > 0.0))
        - I((utilization < 0.62) and (wbar < 0.5)
        and (current_total_capacity > 10))))
    bonus_pool = clip(1.20 + 0.032*max(0.0, profit_last)
        - 0.70*I((profit_last < 0.0) or (rbar > 0.14)), 0.25, 4.1)
    flex_pool = min(2, min(max(1, current_total_capacity),
        max(0, current_flex_pool
        + I(((queue_total > 9.0) or (queue_max > 4.0)) and (rbar > 0.05))
        - I((not (((queue_total > 9.0) or (queue_max > 4.0))
        and (rbar > 0.05))) and (current_flex_pool > 0)))))
    xi = 1.5
    wH = 0.15; wW = 0.05; wrej = 0.15; wC = 1.70; kappa = 3.5
    effort = base_effort + effort_quality_pressure
    triage_accept = (clinical_triage_score + margin_pressure + quality_pressure
        + bed_availability_signal - triage_fatigue_pressure + kpi_gaming_pressure
        + 0.22*urgency + 0.06*wait) >= accept_threshold
    triage_defer = (not ((clinical_triage_score + margin_pressure + quality_pressure
        + bed_availability_signal - triage_fatigue_pressure + kpi_gaming_pressure)
        >= accept_threshold)) and ((clinical_triage_score + margin_pressure
        + quality_pressure + bed_availability_signal - triage_fatigue_pressure
        + kpi_gaming_pressure) + defer_pressure >= defer_threshold)
    request_bed = bed_available and ((clinical_bed_score + bed_quality_pressure
        - bed_cost_pressure + bed_availability_signal - bed_fatigue_pressure)
        >= bed_threshold)
    candidate_score = 0.22*upcode_gain + 0.78*upcode_pressure
        - 1.14*audit_penalty - 1.16*ethics_pressure - 0.26*coding_gap
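
The safety reflex and the profit-linked bonus pool of the profit listing can be probed with a direct Python transcription. As before, this is a sketch that assumes `I(...)` is a 0/1 indicator and `clip(x, lo, hi)` bounds `x` to the interval; the reading of `wbar` and `rbar` as recent waiting and rejection averages is also an assumption, and the function names are ours.

```python
def clip(x, lo, hi):
    """Assumed semantics of clip: bound x to the interval [lo, hi]."""
    return max(lo, min(hi, x))

def profit_alpha_beta(wbar, rbar):
    """alpha/beta rule of the profit listing with its near-failure reflex."""
    near_failure = 1 if (wbar > 4.0) or (rbar > 0.15) else 0
    alpha = clip(0.95 - 0.15 * near_failure, 0.0, 1.0)
    beta = clip(0.05 + 0.20 * near_failure, 0.0, 1.0)
    return alpha, beta

def profit_bonus_pool(profit_last, rbar):
    """Bonus-pool rule of the profit listing, bounded to [0.25, 4.1]."""
    distress = 1 if (profit_last < 0.0) or (rbar > 0.14) else 0
    return clip(1.20 + 0.032 * max(0.0, profit_last) - 0.70 * distress, 0.25, 4.1)

# Hypothetical states: normal operation, near failure, and a very profitable period.
print(profit_alpha_beta(wbar=1.0, rbar=0.05))
print(profit_alpha_beta(wbar=5.0, rbar=0.05))
print(profit_bonus_pool(profit_last=50.0, rbar=0.05))
print(profit_bonus_pool(profit_last=200.0, rbar=0.05))
```

The bonus pool grows linearly in positive last-period profit until it reaches the upper bound of 4.1, and drops by 0.70 when last-period profit is negative or rejection exceeds 0.14.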

#### Mixed family (Listing [3](https://arxiv.org/html/2605.30680#LST3)).

The calibrated mixed family is the cleanest qualitative result of the L3 layer and the only policy in our search that simultaneously achieves three properties: profit-comparable discounted return, zero up-coding on held-out rollouts, and a halving of rejection relative to the profit baseline. Its structure is informative. The hospital administrator fixes $\alpha=0.5$, adapts $\beta$ upward only when access stress is detected, and leaves total capacity, the flex pool, and KPI steering unchanged; the bonus pool grows with both profit and access stress but contracts sharply under insolvency; hospital KPI weights are balanced with a primary tilt toward health and rejection. The candidate-coding score is the structural innovation: it carries a moderate positive weight on $\mathtt{upcode\_pressure}$ as a continuous lever, balanced against an explicit $-100$ indicator penalty that snaps to a regime that strictly forbids coding deviations above $0.20$. This indicator term is what drives up-coding to exactly zero on held-out seeds while leaving the gradient signal usable in early-training rollouts, a hand-engineered version of the same hard-threshold-plus-smooth-shaping idiom that has emerged in several published reward-design studies.

Listing 3: $K=200$ calibrated mixed policy sketch\.

alpha = 0\.5

beta = clip\(0\.35 \+ 0\.20 \* I\(\(wbar \> 4\.0\) or \(rbar \> 0\.15\)\), 0\.0, 1\.0\)

bonus\_pool = clip\(1\.45 \+ 0\.012 \* max\(0\.0, profit\_last\)

\+ 0\.25 \* I\(\(wbar \> 1\.8\) or \(rbar \> 0\.06\)\)

\- 0\.55 \* I\(profit\_last < 0\.0\), 0\.75, 4\.25\)

total\_capacity = current\_total\_capacity

flex\_pool = current\_flex\_pool

xi = 0\.0

wH = 1\.0; wW = 0\.20; wrej = 1\.10; wC = 0\.10; kappa = 2\.0

effort = base\_effort \+ effort\_quality\_pressure

triage\_accept = \(clinical\_triage\_score \+ margin\_pressure \+ quality\_pressure

\+ bed\_availability\_signal \- triage\_fatigue\_pressure

\+ kpi\_gaming\_pressure\) \>= accept\_threshold

triage\_defer = \(not \(\(clinical\_triage\_score \+ margin\_pressure \+ quality\_pressure

\+ bed\_availability\_signal \- triage\_fatigue\_pressure

\+ kpi\_gaming\_pressure\) \>= accept\_threshold\)\) and \(\(clinical\_triage\_score

\+ margin\_pressure \+ quality\_pressure \+ bed\_availability\_signal

\- triage\_fatigue\_pressure \+ kpi\_gaming\_pressure\) \+ defer\_pressure

\>= defer\_threshold\)

request\_bed = bed\_available and \(\(clinical\_bed\_score \+ bed\_quality\_pressure

\- bed\_cost\_pressure \+ bed\_availability\_signal \- bed\_fatigue\_pressure\)

\>= bed\_threshold\)

candidate\_score = 0\.85 \* upcode\_pressure \+ 0\.08 \* upcode\_gain

\- 1\.00 \* audit\_penalty \- 0\.85 \* ethics\_pressure

\- 0\.15 \* coding\_gap \- 100\.0 \* I\(coding\_gap \> 0\.20\)
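The hard-threshold-plus-smooth-shaping structure of Listing 3 can be checked numerically. The following is a minimal Python sketch, not the released policy code: it transcribes the administrator levers and the coding score from the listing, with `I` and `clip` written as plain helper functions whose host-side definitions are assumed. `wbar` and `rbar` are read here as running waiting and rejection averages, which is an interpretation of the listing's variable names rather than something the listing states.

```python
def I(cond: bool) -> float:
    """Indicator helper (assumed semantics): 1.0 if cond holds, else 0.0."""
    return 1.0 if cond else 0.0


def clip(x: float, lo: float, hi: float) -> float:
    """Clamp x into [lo, hi] (assumed semantics of the DSL call)."""
    return max(lo, min(hi, x))


def admin_levers(wbar: float, rbar: float, profit_last: float) -> dict:
    """Administrator levers transcribed from Listing 3."""
    # beta rises by 0.20 only when access stress is detected.
    beta = clip(0.35 + 0.20 * I(wbar > 4.0 or rbar > 0.15), 0.0, 1.0)
    # The bonus pool grows with profit and access stress, shrinks after a loss.
    bonus_pool = clip(
        1.45
        + 0.012 * max(0.0, profit_last)
        + 0.25 * I(wbar > 1.8 or rbar > 0.06)
        - 0.55 * I(profit_last < 0.0),
        0.75,
        4.25,
    )
    return {"alpha": 0.5, "beta": beta, "bonus_pool": bonus_pool}


def candidate_score(upcode_pressure: float, upcode_gain: float,
                    audit_penalty: float, ethics_pressure: float,
                    coding_gap: float) -> float:
    """Coding score from Listing 3: smooth terms plus a hard indicator."""
    smooth = (0.85 * upcode_pressure + 0.08 * upcode_gain
              - 1.00 * audit_penalty - 0.85 * ethics_pressure
              - 0.15 * coding_gap)
    # The -100 indicator dominates every smooth term once the gap exceeds 0.20.
    return smooth - 100.0 * I(coding_gap > 0.20)
```

With all other features held equal, moving `coding_gap` from 0.20 to 0.21 lowers the score by roughly 100, while below the threshold the score still varies smoothly with `upcode_pressure`.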

#### $K=300$ diagnostic trace \(Listing [4](https://arxiv.org/html/2605.30680#LST4)\)\.

The $K=300$ trace produces a closely related but distinct mixed family\. Listing [4](https://arxiv.org/html/2605.30680#LST4) gives the neutral seed and then records only the code\-level deltas at each running\-best update in Table [5](https://arxiv.org/html/2605.30680#A10.T5); the complete executable policies are stored in the experiment artifacts\. The main edits occur in two stages: iteration 198 moves the administrative policy into an access\-stress rule with active KPI steering, and iterations 213 and 273 refine the bonus, rejection\-weight, triage, and coding expressions\. The $K=300$ family improves waiting and high\-complexity deferral, but it does not dominate the $K=200$ family on held\-out fitness or the aggregate violation score; we therefore keep $K=200$ as the main result, while using the trace to show that both budgets converge on the same coding\-gap idiom\.

Listing 4: $K=300$ mixed running\-best trace deltas\.

\# Iteration 0: neutral seed\.

alpha = 0\.5; beta = 0\.5

total\_capacity = current\_total\_capacity; flex\_pool = current\_flex\_pool

bonus\_pool = 5\.0; xi = 0\.0

wH = 1\.0; wW = 0\.20; wrej = 1\.0; wC = 0\.10; kappa = 2\.0

triage\_accept = \(clinical\_triage\_score \+ margin\_pressure \+ quality\_pressure

\+ bed\_availability\_signal \- triage\_fatigue\_pressure

\+ kpi\_gaming\_pressure\) \>= accept\_threshold

triage\_defer = \(not \(\(clinical\_triage\_score \+ margin\_pressure \+ quality\_pressure

\+ bed\_availability\_signal \- triage\_fatigue\_pressure

\+ kpi\_gaming\_pressure\) \>= accept\_threshold\)\) and \(\(clinical\_triage\_score

\+ margin\_pressure \+ quality\_pressure \+ bed\_availability\_signal

\- triage\_fatigue\_pressure \+ kpi\_gaming\_pressure\) \+ defer\_pressure

\>= defer\_threshold\)

request\_bed = bed\_available and \(\(clinical\_bed\_score \+ bed\_quality\_pressure

\- bed\_cost\_pressure \+ bed\_availability\_signal \- bed\_fatigue\_pressure\)

\>= bed\_threshold\)

candidate\_score = upcode\_pressure \- audit\_penalty \- ethics\_pressure

\# Iteration 198: delta relative to neutral seed\.

alpha \-\> clip\(0\.65 \- 0\.08 \* I\(\(wbar \> 4\.0\) or \(rbar \> 0\.15\)\), 0\.0, 1\.0\)

beta \-\> clip\(0\.30 \+ 0\.25 \* I\(\(wbar \> 4\.0\) or \(rbar \> 0\.15\)\), 0\.0, 1\.0\)

total\_capacity \-\> max\(10, min\(12, current\_total\_capacity

\+ I\(\(\(\(wbar \> 5\.0\) or \(rbar \> 0\.18\) or \(queue\_total \> 24\.0\)\)

and \(utilization \> 0\.92\)\) and \(funds \> 70\.0\)

and \(profit\_last \> 0\.0\)\)

\- I\(\(utilization < 0\.62\) and \(wbar < 0\.5\)

and \(current\_total\_capacity \> 10\)\)\)\)

bonus\_pool \-\> clip\(1\.50 \+ 0\.015 \* max\(0\.0, profit\_last\)

\- 0\.50 \* I\(profit\_last < 0\.0\), 0\.75, 4\.25\)

flex\_pool \-\> min\(2, min\(max\(1, current\_total\_capacity\),

max\(0, current\_flex\_pool

\+ I\(\(queue\_total \> 10\.0\) and \(\(rbar \> 0\.06\) or \(wbar \> 2\.0\)\)\)

\- I\(\(not \(\(queue\_total \> 10\.0\)

and \(\(rbar \> 0\.06\) or \(wbar \> 2\.0\)\)\)\)

and \(current\_flex\_pool \> 0\)\)\)\)\)

xi \-\> 1\.5

wH, wW, wrej, wC, kappa \-\> 0\.85, 0\.05, 1\.05, 0\.80, 2\.2

triage\_accept: add 0\.14 \* urgency \+ 0\.04 \* wait

triage\_defer: add 0\.14 \* urgency \+ 0\.04 \* wait to accept gate;

add 0\.08 \* urgency \+ 0\.03 \* wait to defer score

request\_bed: add 0\.08 \* urgency \+ 0\.04 \* wait

candidate\_score \-\> 0\.85 \* upcode\_pressure \+ 0\.08 \* upcode\_gain

\- 1\.00 \* audit\_penalty \- 0\.85 \* ethics\_pressure

\- 0\.15 \* coding\_gap \- 100\.0 \* I\(coding\_gap \> 0\.20\)

\# Iteration 213: delta relative to iteration 198\.

triage\_accept: urgency coefficient 0\.14 \-\> 0\.16

triage\_defer: accept\-gate urgency 0\.14 \-\> 0\.16;

defer\-score urgency 0\.08 \-\> 0\.09

candidate\_score \-\> 0\.80 \* upcode\_pressure \+ 0\.10 \* upcode\_gain

\- 1\.10 \* audit\_penalty \- 0\.95 \* ethics\_pressure

\- 0\.25 \* coding\_gap \- 100\.0 \* I\(coding\_gap \> 0\.20\)

\# Iteration 273: delta relative to iteration 213\.

bonus\_pool \-\> clip\(1\.35 \+ 0\.012 \* max\(0\.0, profit\_last\)

\- 0\.60 \* I\(\(profit\_last < 0\.0\) or \(rbar \> 0\.08\)\), 0\.75, 3\.75\)

wrej \-\> 1\.15

triage\_accept: urgency 0\.16 \-\> 0\.22; wait 0\.04 \-\> 0\.05;

add 0\.03 \* complexity

candidate\_score \-\> 0\.70 \* upcode\_pressure \+ 0\.12 \* upcode\_gain

\- 1\.15 \* audit\_penalty \- 1\.00 \* ethics\_pressure

\- 0\.35 \* coding\_gap \- 100\.0 \* I\(coding\_gap \> 0\.20\)
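Because Listing 4 records deltas rather than full policies, reading it means replaying the edits in order. The sketch below does this in Python for a subset of the fields only: the scalar KPI weights, `xi`, and the `candidate_score` coefficients, all copied from the listing. The dictionary layout, the `replay` function, and the coefficient key names are illustrative assumptions, and the expression-valued fields (capacity, pools, triage) are left out for brevity.

```python
import copy

# Coefficients of candidate_score, keyed by the term they multiply.
# "indicator" stands for I(coding_gap > 0.20); this key name is hypothetical.
SEED = {
    "xi": 0.0, "wH": 1.0, "wW": 0.20, "wrej": 1.0, "wC": 0.10, "kappa": 2.0,
    "candidate_score": {"upcode_pressure": 1.0, "upcode_gain": 0.0,
                        "audit_penalty": -1.0, "ethics_pressure": -1.0,
                        "coding_gap": 0.0, "indicator": 0.0},
}

# Each entry: (iteration, fields overwritten at that running-best update).
DELTAS = [
    (198, {"xi": 1.5, "wH": 0.85, "wW": 0.05, "wrej": 1.05, "wC": 0.80,
           "kappa": 2.2,
           "candidate_score": {"upcode_pressure": 0.85, "upcode_gain": 0.08,
                               "audit_penalty": -1.00, "ethics_pressure": -0.85,
                               "coding_gap": -0.15, "indicator": -100.0}}),
    (213, {"candidate_score": {"upcode_pressure": 0.80, "upcode_gain": 0.10,
                               "audit_penalty": -1.10, "ethics_pressure": -0.95,
                               "coding_gap": -0.25, "indicator": -100.0}}),
    (273, {"wrej": 1.15,
           "candidate_score": {"upcode_pressure": 0.70, "upcode_gain": 0.12,
                               "audit_penalty": -1.15, "ethics_pressure": -1.00,
                               "coding_gap": -0.35, "indicator": -100.0}}),
]


def replay(seed: dict, deltas: list, upto: int) -> dict:
    """Apply every delta with iteration <= upto on top of the seed."""
    policy = copy.deepcopy(seed)
    for iteration, delta in deltas:
        if iteration > upto:
            break
        policy.update(copy.deepcopy(delta))
    return policy


final_policy = replay(SEED, DELTAS, upto=273)
```

Replaying the trace makes the drift visible: the weight on `upcode_pressure` falls from 0.85 to 0.70 and the smooth `coding_gap` penalty grows from 0.15 to 0.35 in magnitude, while the indicator coefficient stays at the same value from iteration 198 onward.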

## Appendix K Implementation Details and DSL Guardrails

L1 and L2 use the native hospital administrator and provider rules of §[3](https://arxiv.org/html/2605.30680#S3); no AlphaEvolve policy is injected into those layers\. The main L2 ablation sweeps audit probability, bonus pool, bonus sharpness, total capacity, the health\-to\-cost hospital KPI weight ratio, and flexible capacity; the static\-flex diagnostic of §[G](https://arxiv.org/html/2605.30680#A7) isolates the flexible pool under fixed routing and specialization\. L3 selects candidates on validation seeds and reports held\-out test results on disjoint seed splits\. Unlike L1/L2, L3 may edit selected doctor\-side expression constants, but only through the typed fields in Listing [5](https://arxiv.org/html/2605.30680#LST5); host\-side dynamics, audit application, and metric computation remain fixed\. The mixed\-objective scalarizer is the safety\-penalized fitness of §[4](https://arxiv.org/html/2605.30680#S4) with $\lambda_{\mathrm{unsafe}}=0.06$ and $\lambda_{\mathrm{var}}=0.25$, with explicit per\-step penalties for high\-urgency rejection, insolvency, unsafe up\-coding, excessive waiting, and high\-complexity deferral\.
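To make the role of the two penalty weights concrete, the Python sketch below scores candidates on validation rollouts. Only the values 0.06 and 0.25 come from the text. The functional form is an assumption made for illustration (mean return, minus a linear penalty on the mean count of unsafe events, minus a penalty on the standard deviation of returns); the actual scalarizer is the one defined in §4, and the function and argument names here are hypothetical.

```python
from statistics import mean, pstdev

LAMBDA_UNSAFE = 0.06  # value stated in the text
LAMBDA_VAR = 0.25     # value stated in the text


def penalized_fitness(returns: list, unsafe_counts: list) -> float:
    """ASSUMED form: mean return - 0.06 * mean unsafe events - 0.25 * std.

    `returns` and `unsafe_counts` hold one entry per validation seed.
    """
    return (mean(returns)
            - LAMBDA_UNSAFE * mean(unsafe_counts)
            - LAMBDA_VAR * pstdev(returns))


def select_candidate(validation_results: dict) -> str:
    """Pick the candidate with the highest penalized validation fitness.

    `validation_results` maps a candidate name to (returns, unsafe_counts).
    Held-out test seeds are never consulted here.
    """
    return max(validation_results,
               key=lambda name: penalized_fitness(*validation_results[name]))
```

Under this assumed form, two candidates with the same mean return are separated by their spread across seeds, so a stable candidate is preferred over a volatile one.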

#### DSL guardrails\.

The released L3 prompt files use a small replayable interface\. Listing [5](https://arxiv.org/html/2605.30680#LST5) summarizes the shared guardrails\. The full artifacts include the system prompt, the full\-rewrite template, and the diff template; the medisim\_dsl\_v1 tag identifies the DSL schema and is not itself a simulator mechanism\. Candidates can edit only fixed policy expression fields and can read only exposed DSL features plus safe scalar helper calls\. Simulator dynamics, patient generation, metric computation, host\-side clipping, and feasibility projection are held fixed across candidates, which is what closes the audit\-evasion gap discussed under desideratum \(ii\) of §[4](https://arxiv.org/html/2605.30680#S4)\. For coding, the host computes audit\_penalty from $p_{\mathrm{audit}}(\Delta\mathrm{CMI})$ and the fixed penalty schedule before the candidate expression is evaluated; candidate\_score may change how strongly the coder responds to that feature, but cannot change the audit probability, penalty application, or reported metrics\.

Listing 5: L3 DSL guardrail summary\.

One executable/testable assignment\-only medisim\_dsl\_v1

module, or exact SEARCH/REPLACE blocks\.

Top\-level: kind, version, admin\_policy, doctor\_policy\.

Admin expr: alpha, beta, total\_capacity, bonus\_pool,

flex\_pool, xi, wH, wW, wrej, wC, kappa\.

Doctor expr: effort, triage\_accept, triage\_defer,

request\_bed, candidate\_score\.

Calls: I, clip, min, max, abs, round, soft\_gt, soft\_lt\.

No new state/schema fields/mechanisms/hidden calls\.

Host clips/projects outputs; dynamics and metrics fixed\.
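One way such guardrails could be enforced is a static check on each candidate expression before it is executed. The Python sketch below is not the released validator: it uses the standard `ast` module, takes the field and call whitelists from Listing 5, and assumes for illustration that DSL expressions are parseable as Python expressions. The `violations` function and its `features` argument (the set of exposed feature names) are hypothetical.

```python
import ast

ADMIN_FIELDS = {"alpha", "beta", "total_capacity", "bonus_pool",
                "flex_pool", "xi", "wH", "wW", "wrej", "wC", "kappa"}
DOCTOR_FIELDS = {"effort", "triage_accept", "triage_defer",
                 "request_bed", "candidate_score"}
ALLOWED_CALLS = {"I", "clip", "min", "max", "abs", "round",
                 "soft_gt", "soft_lt"}

# Only arithmetic, comparison, and boolean syntax is accepted; attribute
# access, subscripts, lambdas, keyword arguments, etc. are rejected.
ALLOWED_NODES = (
    ast.Expression, ast.BinOp, ast.UnaryOp, ast.BoolOp, ast.Compare,
    ast.Call, ast.Name, ast.Constant, ast.Load,
    ast.operator, ast.unaryop, ast.boolop, ast.cmpop,
)


def violations(field: str, source: str, features: set) -> list:
    """Return a list of guardrail violations; an empty list means accepted."""
    if field not in ADMIN_FIELDS | DOCTOR_FIELDS:
        return [f"unknown field: {field}"]
    try:
        tree = ast.parse(source, mode="eval")  # a single expression only
    except SyntaxError as exc:
        return [f"not a single expression: {exc.msg}"]
    found = []
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            found.append(f"disallowed syntax: {type(node).__name__}")
        elif isinstance(node, ast.Call):
            # Calls must target a whitelisted helper by its bare name.
            if not isinstance(node.func, ast.Name) or node.func.id not in ALLOWED_CALLS:
                found.append("disallowed call")
        elif isinstance(node, ast.Name):
            if node.id not in ALLOWED_CALLS and node.id not in features:
                found.append(f"unknown name: {node.id}")
    return found
```

A check of this kind covers only the syntactic half of the guardrails. Host-side clipping, feasibility projection, and the fixed metric computation described above would still run outside the candidate's reach.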

## Appendix L Terminology and Arrival\-Process Note

#### Glossary\.

The main text keeps the terminology short; this appendix gives the operational meanings used in the simulator\.

Hospital DRG\. Diagnosis\-related group: a billable case category used in prospective hospital payment\. In Medi\-Sim, each patient has a true clinical group and a reported group used for settlement\.

Hospital DRG\-style arrivals\. Hospital patient arrivals with clinical type, urgency, tolerance, and reimbursement\-relevant case weight\. The phrase distinguishes the stream from identical jobs in a generic queue\.

Hospital CMI\. Case\-mix index\. We use normalized hospital CMI as the patient’s clinical complexity and as the payment\-relevant weight that coding can distort\.

Hospital administrator\. The leader in the Stackelberg formulation\. This actor sets hospital\-level levers such as incentives, audit intensity, capacity, bonus pools, KPI weights, routing, and steering\.

Hospital Stackelberg game\. A leader\-follower model in which the hospital administrator commits to a mechanism and hospital providers respond strategically to it\.

Hospital coder and hospital coding policy\. A hospital coder maps clinical evidence to a reported billable group\. The hospital coding policy is the fixed or searched rule that scores candidate reported groups before the configured coding choice rule\.

Hospital KPI\. A measured hospital performance score assembled from health, waiting, rejection, and cost\. It affects bonuses and can diverge from true clinical value\.

Hospital provider\-response channels\. Up\-coding changes the reported billing group; selection changes which patients are accepted; delay keeps patients waiting when serving them is unattractive; effort changes treatment intensity; triage is the local accept/defer/reject and resource\-request rule\.

Hospital coding wedge and hospital measurement wedge\. The coding wedge is the gap between true clinical complexity and the reported billing group\. The measurement wedge is the gap between true clinical value and the measured KPI used for bonuses or steering\.

Hospital KPI targeting \(Goodhart\-style proxy gaming\)\. Provider behavior that improves a measured hospital KPI while moving away from the hospital’s true clinical or access objective\.

Gold\-plating, skimping, and cream\-skimming\. Gold\-plating is excessive treatment intensity, skimping is under\-provision of care effort or resources, and cream\-skimming is accepting easier or more profitable patients disproportionately\.

Hospital Identify–Produce–Settle \(IPS\)\. The simulator order: identify patients and billing groups, produce care under capacity constraints, then settle reimbursement, KPI scores, and bonuses\.

Hospital policy DSL\. The restricted domain\-specific language used to write candidate hospital policies over approved administrative levers and provider\-response expressions\.

Flexible hospital capacity pool and KPI steering\. The flexible capacity pool is capacity that can be reallocated across hospital care teams\. KPI steering assigns that capacity or routing priority using measured hospital performance scores\.

#### Arrival\-process scope note\.

The experiments use the homogeneous Poisson law in Eq\. \([10](https://arxiv.org/html/2605.30680#A3.E10)\) because it is transparent and reproducible\. The simulator does not rely on Poisson arrivals for the channel definitions\. Let $A_t$ denote the full arrival batch at period $t$, including patient count and attributes, and let $h_{t-1}$ be the realized pre\-period history\. For any fixed arrival kernel $\mathcal{K}_t(dA_t \mid h_{t-1})$ sampled before provider actions, the one\-step rollout law under policy $\pi$ factors as

$$\Pr_{\pi}(dA_t, dY_t, dX_{t+1} \mid h_{t-1}) = \mathcal{K}_t(dA_t \mid h_{t-1})\,\Pr_{\pi}(dY_t, dX_{t+1} \mid h_{t-1}, A_t), \tag{17}$$

where $Y_t$ collects the identify, routing, triage, treatment, and settlement decisions\. The second factor is the strategic hospital\-response part studied in the paper\. Replacing the Poisson kernel with a nonhomogeneous Poisson process, an over\-dispersed count model, correlated service\-line arrivals, seasonal mixtures, or an empirical bootstrap stream changes the distribution of states the policy sees, but it leaves the IPS order and the definitions of coding, selection, delay, effort, and KPI targeting intact, provided the same arrival kernel is used for the policies being compared\.

This is the scope condition\. If patient demand responds directly to the hospital policy or to reputation generated by earlier policies, the arrival kernel is no longer fixed\. That feedback can be modeled as an additional demand\-response channel, but it is outside the present L1/L2/L3 experiments\.
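The factorization in Eq\. (17) can be illustrated with a toy rollout in which the arrival kernel is a swappable function sampled before any provider decision. The Python sketch below is a simplified stand-in, not Medi-Sim: the patient record (a single `urgency` attribute), the threshold policy, the rate values, and all function names are assumptions chosen for illustration. The over-dispersed kernel is a Gamma-Poisson mixture with the same mean as the homogeneous one.

```python
import math
import random


def poisson(rng: random.Random, lam: float) -> int:
    """Sample a Poisson count with Knuth's method (fine for small rates)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def homogeneous_kernel(rng, history, rate=6.0):
    """Arrival batch A_t with a constant Poisson rate."""
    return [{"urgency": rng.random()} for _ in range(poisson(rng, rate))]


def overdispersed_kernel(rng, history, rate=6.0, shape=2.0):
    """Gamma-Poisson mixture: same mean count, larger variance."""
    lam = rng.gammavariate(shape, rate / shape)
    return [{"urgency": rng.random()} for _ in range(poisson(rng, lam))]


def provider_response(threshold, history, arrivals):
    """Second factor of the factorization: decisions given (h_{t-1}, A_t)."""
    return ["accept" if p["urgency"] >= threshold else "reject"
            for p in arrivals]


def rollout(kernel, threshold, periods, seed):
    """Sample A_t from the kernel first, then let the policy respond."""
    rng = random.Random(seed)
    history = []
    for _ in range(periods):
        arrivals = kernel(rng, history)  # drawn before provider actions
        decisions = provider_response(threshold, history, arrivals)
        history.append((arrivals, decisions))
    return history


# Two policies compared under the same kernel and seed see identical arrivals.
lenient = rollout(homogeneous_kernel, 0.2, periods=20, seed=7)
strict = rollout(homogeneous_kernel, 0.8, periods=20, seed=7)
```

In this toy setting the policy never touches the random stream, so the arrival batches are identical across the two policies. A demand-response channel of the kind excluded above would break exactly this property, because the kernel would then depend on the policy through the history.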
