Human-in-the-Loop Contextual Bandits for Short-Term Rental Dynamic Pricing: Structural Equivalence of Historical Warm-Up and Approval-Gated Live Learning

arXiv cs.LG Papers

Summary

The paper introduces Human-in-the-Loop Gated Bandit (HITL-GB) for short-term rental dynamic pricing, showing that historical pricing data under a prior policy is structurally equivalent to on-policy warm-up data, reducing cold-start from ~150 to ~30 episodes.

arXiv:2606.02595v1 Announce Type: new Abstract: Dynamic pricing in short-term rental (STR) markets presents a distinctive challenge for online learning algorithms: pricing decisions carry significant financial risk, operators require explainability, and market feedback is sparse (one booking outcome per listed night). We introduce the Human-in-the-Loop Gated Bandit (HITL-GB) framework, in which a contextual bandit algorithm generates price recommendations but a human agent retains authority to accept, modify, or reject each recommendation before it is applied. We show that under this approval constraint, historical pricing data -- collected under a prior deterministic policy -- is structurally equivalent to on-policy warm-up data for initialising the bandit's posterior, bypassing the weeks-to-months cold-start period that renders pure online bandit learning impractical in sparse-feedback markets. We formalise the approval-gated reward signal, derive a regularised ridge-regression warm-up procedure from historical episodes, and validate the approach on real STR production data (anonymised urban market, 2 rooms, April 2022 -- April 2026, 1,461 nightly pricing episodes). Our warm-up procedure compresses effective cold-start from ~150 episodes to ~30 episodes when initialising agents from the Hierarchical Factored Thompson Sampling (HF-TS) family. We further argue that the structural equivalence result is domain-agnostic: any high-stakes domain where human approval is legally or operationally required -- including clinical drug dosing, credit origination, content moderation, and radiological diagnosis -- satisfies the same conditions and benefits from the same warm-up strategy. In regulated industries, mandatory human oversight is thus a statistical asset rather than a deployment constraint.

Cached at: 06/03/26, 09:38 AM

# Human-in-the-Loop Contextual Bandits for Short-Term Rental Dynamic Pricing: Structural Equivalence of Historical Warm-Up and Approval-Gated Live Learning
Source: [https://arxiv.org/html/2606.02595](https://arxiv.org/html/2606.02595)
\(May 2026\)

###### Abstract

Dynamic pricing in short-term rental (STR) markets presents a distinctive challenge for online learning algorithms: pricing decisions carry significant financial risk, operators require explainability, and market feedback is sparse (one booking outcome per listed night). We introduce the Human-in-the-Loop Gated Bandit (HITL-GB) framework, in which a contextual bandit algorithm generates price recommendations but a human agent retains authority to accept, modify, or reject each recommendation before it is applied. We show that under this approval constraint, historical pricing data (collected under a prior deterministic policy) is *structurally equivalent* to on-policy warm-up data for initialising the bandit's posterior, bypassing the weeks-to-months cold-start period that renders pure online bandit learning impractical in sparse-feedback markets. We formalise the approval-gated reward signal, derive a regularised ridge-regression warm-up procedure from historical episodes, and validate the approach on real STR production data (anonymised urban market, 2 rooms, April 2022 – April 2026, 1,461 nightly pricing episodes). Our warm-up procedure compresses effective cold-start from ~150 episodes to ~30 episodes when initialising agents from the Hierarchical Factored Thompson Sampling (HF-TS) family (Hong et al., [2021](https://arxiv.org/html/2606.02595#bib.bib3), [2022](https://arxiv.org/html/2606.02595#bib.bib4); Zimmert and Seldin, [2018](https://arxiv.org/html/2606.02595#bib.bib2)). We further argue that the structural equivalence result is domain-agnostic: any high-stakes domain where human approval is legally or operationally required (including clinical drug dosing, credit origination, content moderation, and radiological diagnosis) satisfies the same conditions and benefits from the same warm-up strategy. In regulated industries, mandatory human oversight is thus a *statistical asset* rather than a deployment constraint.

#### Keywords\.

contextual bandits, dynamic pricing, human\-in\-the\-loop, off\-policy evaluation, short\-term rental, cold\-start, hierarchical Thompson sampling, factored bandits, clinical decision support, regulated AI

###### Contents

1. [1 Introduction](https://arxiv.org/html/2606.02595#S1)
2. [2 Related Work](https://arxiv.org/html/2606.02595#S2)
    1. [2.1 Hierarchical and Factored Bandits](https://arxiv.org/html/2606.02595#S2.SS1)
    2. [2.2 Human-in-the-Loop Machine Learning](https://arxiv.org/html/2606.02595#S2.SS2)
    3. [2.3 Off-Policy Evaluation and Warm-Up](https://arxiv.org/html/2606.02595#S2.SS3)
    4. [2.4 Short-Term Rental Pricing](https://arxiv.org/html/2606.02595#S2.SS4)
3. [3 Problem Formulation](https://arxiv.org/html/2606.02595#S3)
    1. [3.1 The HITL-GB Setting](https://arxiv.org/html/2606.02595#S3.SS1)
    2. [3.2 The Human Approval Function](https://arxiv.org/html/2606.02595#S3.SS2)
    3. [3.3 The Three-Layer Price Signal](https://arxiv.org/html/2606.02595#S3.SS3)
    4. [3.4 The Factored Bandit Arms](https://arxiv.org/html/2606.02595#S3.SS4)
    5. [3.5 The HITL Feedback Signal](https://arxiv.org/html/2606.02595#S3.SS5)
4. [4 Historical Warm-Up: Structural Equivalence](https://arxiv.org/html/2606.02595#S4)
    1. [4.1 The Cold-Start Problem in STR Markets](https://arxiv.org/html/2606.02595#S4.SS1)
    2. [4.2 Historical Data Under the Prior Policy](https://arxiv.org/html/2606.02595#S4.SS2)
    3. [4.3 Structural Equivalence Theorem](https://arxiv.org/html/2606.02595#S4.SS3)
    4. [4.4 The $\alpha$-Blended Ridge Regression Warm-Up](https://arxiv.org/html/2606.02595#S4.SS4)
    5. [4.5 Dual Cold-Start: One Dataset, Two Problems](https://arxiv.org/html/2606.02595#S4.SS5)
5. [5 Experimental Setup](https://arxiv.org/html/2606.02595#S5)
    1. [5.1 Dataset](https://arxiv.org/html/2606.02595#S5.SS1)
    2. [5.2 HF-TS Benchmark Agents](https://arxiv.org/html/2606.02595#S5.SS2)
    3. [5.3 Warm-Up Conditions](https://arxiv.org/html/2606.02595#S5.SS3)
6. [6 Results](https://arxiv.org/html/2606.02595#S6)
    1. [6.1 Calibrated Day-Signal Parameters](https://arxiv.org/html/2606.02595#S6.SS1)
    2. [6.2 Revenue Advantage vs. Cold Start](https://arxiv.org/html/2606.02595#S6.SS2)
    3. [6.3 Summary](https://arxiv.org/html/2606.02595#S6.SS3)
7. [7 Broader Applications of the HITL-GB Framework](https://arxiv.org/html/2606.02595#S7)
8. [8 Discussion](https://arxiv.org/html/2606.02595#S8)
    1. [8.1 The Approval Gate as a Statistical Asset](https://arxiv.org/html/2606.02595#S8.SS1)
    2. [8.2 Relationship to Existing Hierarchical Theory](https://arxiv.org/html/2606.02595#S8.SS2)
    3. [8.3 Stationarity of the Human Approval Function](https://arxiv.org/html/2606.02595#S8.SS3)
    4. [8.4 Limitations](https://arxiv.org/html/2606.02595#S8.SS4)
9. [9 Conclusion](https://arxiv.org/html/2606.02595#S9)
10. [References](https://arxiv.org/html/2606.02595#bib)
11. [A Proof of Structural Equivalence Theorem](https://arxiv.org/html/2606.02595#A1)
12. [B HF-TS Theoretical Results](https://arxiv.org/html/2606.02595#A2)

## 1 Introduction

Online learning algorithms, and multi-armed bandits in particular, have demonstrated strong performance in dynamic pricing for e-commerce (Misra et al., [2019](https://arxiv.org/html/2606.02595#bib.bib24)), ride-sharing (Tang and others, [2013](https://arxiv.org/html/2606.02595#bib.bib21)), and hotel revenue management (Ferreira et al., [2016](https://arxiv.org/html/2606.02595#bib.bib23)). The core appeal is clear: the algorithm explores the price-demand curve, updates its beliefs from booking outcomes, and converges toward revenue-maximising arms without requiring a pre-specified demand model.

In short-term rental markets, however, naïve application of bandit algorithms faces a structural barrier: human operators must approve pricing decisions. Property managers, revenue managers, and portfolio owners are reluctant to delegate pricing authority fully to an algorithm. Prices affect guest relationships, brand perception, and platform ranking, with consequences that extend beyond a single booking outcome. This is not a limitation to be engineered away; it is a fundamental feature of the domain.

The dominant practical response is to treat the bandit as a recommendation system: the algorithm proposes an arm (price multiplier), and the human accepts or overrides. This is widely deployed in practice but poorly studied theoretically. In particular, three questions remain open:

1. Feedback attribution: When the human overrides the recommendation, whose decision generated the reward: the human's or the algorithm's?
2. Historical equivalence: Can historical pricing data (collected under a prior deterministic policy) serve as valid warm-up data, or does the approval gate invalidate off-policy reuse?
3. Cold-start compression: Does HITL approval, combined with historical warm-up, eliminate the impractically long cold-start period of pure online bandit learning in sparse markets?

This paper addresses all three questions\. Our main contributions are:

1. HITL-GB framework (§[3](https://arxiv.org/html/2606.02595#S3)): a formal definition of the Gated Bandit system, where the arm applied to the environment may differ from the arm recommended by the algorithm, mediated by a human approval function $h:\mathcal{A}\times\mathcal{X}\to\mathcal{A}$.
2. Structural equivalence theorem (§[4](https://arxiv.org/html/2606.02595#S4)): under the approval constraint with a stationary approval function, historical data generated by a deterministic pricing policy is a valid warm-up initialiser without importance-sampling corrections.
3. Dual cold-start from one dataset (§[4](https://arxiv.org/html/2606.02595#S4)): the same historical episodes simultaneously initialise the bandit arm posteriors *and* calibrate the four day-signal parameters $\boldsymbol{\theta}$, compressing cold-start from ~150 to ~30 booked episodes.
4. Empirical validation (§[6](https://arxiv.org/html/2606.02595#S6)) on real STR production data (1,461 nightly episodes), showing positive revenue advantage within $\approx 30$ live episodes.
5. Cross-domain applicability (§[7](https://arxiv.org/html/2606.02595#S7)): a survey of 12 regulated domains where the result holds, including clinical dosing, credit origination, and content moderation.

#### The key counterintuitive insight\.

The human approval gate is typically treated as friction between the algorithm and the market. This paper reframes it: the approval gate is precisely *what makes historical data valid for warm-up without importance-sampling (IS) correction*. Regulatory requirements are not obstacles to ML deployment; they are the mechanism that makes fast deployment possible.

## 2 Related Work

### 2.1 Hierarchical and Factored Bandits

Our base agent family, HF-TS, draws on a body of recent hierarchical bandit literature.

#### Factored bandits.

Zimmert and Seldin ([2018](https://arxiv.org/html/2606.02595#bib.bib2)) decompose pricing actions into a Cartesian product of independent atomic actions combined multiplicatively, yielding regret bounds sub-linear in the joint arm count. We use this as Layer 1 (market demand) of the hierarchy.

#### Hierarchical Thompson Sampling.

Hong et al. ([2021](https://arxiv.org/html/2606.02595#bib.bib3)) model all properties as tasks drawn from a shared cluster distribution, enabling cross-property posterior sharing. Hong et al. ([2022](https://arxiv.org/html/2606.02595#bib.bib4)) extend this to an arbitrary $L$-level prior tree with regret bounds improving polynomially with depth.

#### Coarse-to-Fine hierarchical exploration.

Yue et al. ([2012](https://arxiv.org/html/2606.02595#bib.bib5)) progressively unlock finer arm-space states as data density grows, providing the optimal unlock threshold (Theorem [8.2](https://arxiv.org/html/2606.02595#S8.Thmtheorem2) below).

#### Metadata and online-cluster variants.

Wan et al. ([2021](https://arxiv.org/html/2606.02595#bib.bib6)) replace hard cluster assignment with cosine-similarity-weighted Bayesian priors. Zhou and others ([2024](https://arxiv.org/html/2606.02595#bib.bib7)) allow cluster assignments to evolve online via $k$-means-style centroid tracking.

### 2.2 Human-in-the-Loop Machine Learning

The HITL literature predominantly addresses *active learning* (Settles, [2012](https://arxiv.org/html/2606.02595#bib.bib15)) and reinforcement learning from human feedback (RLHF) (Christiano et al., [2017](https://arxiv.org/html/2606.02595#bib.bib16); Ouyang and others, [2022](https://arxiv.org/html/2606.02595#bib.bib17)). In RLHF, human preferences shape a reward model that guides policy learning. Our setting differs fundamentally: the human approves or overrides *actions before execution*, making approval a pre-execution gate rather than a post-hoc label.

The closest related work is HITL bandits for clinical trial design (Liao and others, [2020](https://arxiv.org/html/2606.02595#bib.bib18)) and educational recommendation (Rafferty and others, [2019](https://arxiv.org/html/2606.02595#bib.bib19)), where expert approval constrains arm selection. Neither addresses structural equivalence of historical warm-up under the approval constraint, nor the sparse-feedback regime.

### 2.3 Off-Policy Evaluation and Warm-Up

Off-policy evaluation (OPE) addresses learning from data collected under a different policy (Precup et al., [2000](https://arxiv.org/html/2606.02595#bib.bib13)). The standard solution is importance sampling (IS) with propensity correction:

$$\hat{V}(\pi)=\frac{1}{N}\sum_{t=1}^{N}\frac{\pi(a_{t}\mid\mathbf{x}_{t})}{\pi_{0}(a_{t}\mid\mathbf{x}_{t})}\,r_{t}.\tag{1}$$

We show that under the HITL approval structure, IS corrections are unnecessary for warm-up initialisation, which simplifies implementation and avoids the high variance of IS estimators in sparse datasets.
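For readers who want to see the estimator in Eq. (1) concretely, the following is a minimal sketch in Python. The function name `is_value_estimate` and the toy log are illustrative assumptions, not part of the paper's pipeline; the propensities are taken as given inputs.

```python
import numpy as np


def is_value_estimate(pi_probs, pi0_probs, rewards):
    """Importance-sampling estimate of V(pi) from a log collected under pi0 (Eq. 1).

    pi_probs[t]  = pi(a_t | x_t)   under the target policy
    pi0_probs[t] = pi0(a_t | x_t)  under the behaviour (logging) policy
    rewards[t]   = r_t
    """
    pi_probs = np.asarray(pi_probs, dtype=float)
    pi0_probs = np.asarray(pi0_probs, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    if np.any(pi0_probs <= 0.0):
        # The ratio is undefined where the behaviour policy never took the logged arm.
        raise ValueError("behaviour propensities must be strictly positive")
    return float(np.mean(pi_probs / pi0_probs * rewards))


# Hypothetical three-episode log (values chosen only for illustration).
v_hat = is_value_estimate(
    pi_probs=[0.5, 0.2, 0.9],
    pi0_probs=[0.5, 0.4, 0.3],
    rewards=[100.0, 0.0, 120.0],
)
```

The third episode carries a ratio of 3 and dominates the average, which illustrates why the estimator becomes noisy when only a handful of booked nights are available.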

### 2.4 Short-Term Rental Pricing

STR pricing research has focused on hedonic regression (Gibbs et al., [2018](https://arxiv.org/html/2606.02595#bib.bib22)), demand forecasting, and competitive positioning. Bandit-based STR pricing remains largely unstudied academically. Our work contributes the first formal treatment of HITL-gated bandit pricing in this domain.

## 3 Problem Formulation

### 3.1 The HITL-GB Setting

Let $\mathcal{T}=\{1,2,\ldots,T\}$ be the set of pricing time-steps, where each step $t$ corresponds to a single night in a property calendar. At each step $t$:

- The environment reveals context $\mathbf{x}_{t}\in\mathcal{X}$ (market occupancy, days until check-in, day of week, property fill rate, etc.).
- The bandit algorithm selects arm $a^{\text{rec}}_{t}\in\mathcal{A}$ (a price multiplier).
- The human agent applies approval function $h:\mathcal{A}\times\mathcal{X}\to\mathcal{A}$, yielding executed arm $a^{\text{exec}}_{t}=h(a^{\text{rec}}_{t},\mathbf{x}_{t})$.
- The environment returns reward $r_{t}\sim R(\cdot\mid a^{\text{exec}}_{t},\mathbf{x}_{t})$ (booking outcome $\times$ price).

The bandit observes the tuple $(a^{\text{rec}}_{t},a^{\text{exec}}_{t},r_{t},\mathbf{x}_{t})$ and updates its posterior. The complete decision cycle is illustrated in Figure [1](https://arxiv.org/html/2606.02595#S3.F1).

![Refer to caption](https://arxiv.org/html/2606.02595v1/x1.png)

Figure 1: The HITL-GB decision cycle. *Top*: standard bandit, where the algorithm selects and executes an arm directly; IS correction is required when reusing off-policy history. *Bottom*: HITL-Gated Bandit (ours), where the algorithm *recommends* an arm; the human approves or modifies it; both agents observe the reward. The critical property: because the same gate $h$ was active in the historical regime, executed-arm distributions match, and historical data requires no IS correction.

### 3.2 The Human Approval Function

We model the human approval function as:

$$h(a^{\text{rec}},\mathbf{x})=\begin{cases}a^{\text{rec}}&\text{with probability }p(\mathbf{x})\\ a^{\text{human}}(\mathbf{x})&\text{with probability }1-p(\mathbf{x})\end{cases}\tag{2}$$

where $p(\mathbf{x})\in[0,1]$ is the context-dependent acceptance probability and $a^{\text{human}}(\mathbf{x})$ is the human's preferred arm given context $\mathbf{x}$.

Key special cases: $p(\mathbf{x})=1$ gives a standard bandit with full delegation; $p(\mathbf{x})=0$ gives full human control with no bandit input; $p(\mathbf{x})\in(0,1)$ is the HITL-GB regime studied here.
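The gate in Eq. (2) can be sketched as a small sampling routine. This is an illustrative reading of the definition, not the production gate: `accept_prob` and `human_arm` are hypothetical callables standing in for $p(\mathbf{x})$ and $a^{\text{human}}(\mathbf{x})$, and the context is passed as an opaque object.

```python
import random


def approval_gate(a_rec, x, accept_prob, human_arm, rng=random):
    """Return the executed arm h(a_rec, x) as defined in Eq. (2).

    accept_prob(x) -> p(x), the probability that the recommendation is accepted.
    human_arm(x)   -> the human's preferred arm for context x.
    Both callables are assumptions made for this sketch.
    """
    if rng.random() < accept_prob(x):
        return a_rec          # recommendation accepted unchanged
    return human_arm(x)       # recommendation overridden by the human


# Hypothetical gate: accept 70% of the time, otherwise fall back to multiplier 1.0.
rng = random.Random(0)
executed = [
    approval_gate(1.10, None, lambda x: 0.7, lambda x: 1.00, rng=rng)
    for _ in range(1000)
]
accept_rate = sum(a == 1.10 for a in executed) / len(executed)
```

Setting `accept_prob` to a constant 1.0 or 0.0 recovers the two special cases named above (full delegation and full human control).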

### 3.3 The Three-Layer Price Signal

The HITL-GB system produces a final price as a product of three components, each operating at a different temporal and granularity scale:

$$\text{price}_{t}=\bar{r}_{t}\;\times\;\mu^{\text{LLM}}\;\times\;\delta_{t}(\boldsymbol{\theta})\tag{3}$$

[Figure 2 diagram: Market Rate $\bar{r}_{t}$ (IQR-robust cluster mean) $\times$ LLM Multiplier $\mu^{\text{LLM}}$ (property-level, monthly) $\times$ Day Signal $\delta_{t}(\boldsymbol{\theta})$ (per-night; calibrated by warm-up, explored by bandit arm grid).]

Figure 2: The three-layer pricing architecture. The market rate $\bar{r}_{t}$ anchors pricing to the competitor cluster. The LLM multiplier $\mu^{\text{LLM}}$ positions the property relative to the cluster based on review quality and geo characteristics (updated monthly, stable). The day-signal multiplier $\delta_{t}(\boldsymbol{\theta})$ provides per-night context sensitivity (occupancy, urgency, gap discounts, inventory fill), calibrated by the warm-up procedure and explored by the factored bandit arm grid.

The day-signal multiplier is a smooth product of four demand adjustments:

$$\delta_{t}(\boldsymbol{\theta})=\delta^{\text{occ}}_{t}\cdot\delta^{\text{gap}}_{t}\cdot\delta^{\text{lead}}_{t}\cdot\delta^{\text{inv}}_{t},\quad\delta_{t}(\boldsymbol{\theta})\in[0.82,1.22]\tag{4}$$

$$\begin{aligned}
\delta^{\text{occ}}_{t}&=1+\theta_{\text{occ}}\cdot(o_{t}-\theta_{\text{target}})&&\text{(5)}\\
\delta^{\text{gap}}_{t}&=\theta_{\text{gap}}\cdot\mathbf{1}[\text{gap}_{t}]+(1-\mathbf{1}[\text{gap}_{t}])&&\text{(6)}\\
\delta^{\text{lead}}_{t}&=1-\theta_{\text{urgency}}\cdot\underbrace{\max(0,1-d_{t}/30)}_{\text{time pressure}}\cdot\underbrace{\max(0,1-o_{t})}_{\text{market unsold}}&&\text{(7)}\\
\delta^{\text{inv}}_{t}&=1+\theta_{\text{inv}}\cdot w_{\text{size}}\cdot(f_{t}-\theta_{\text{fill}})&&\text{(8)}
\end{aligned}$$

where $o_{t}\in[0,1]$ is cluster competitor occupancy, $d_{t}$ is days until check-in, $\text{gap}_{t}$ is an orphan-gap-night indicator (see §[4](https://arxiv.org/html/2606.02595#S4)), $f_{t}$ is property own fill rate, and $w_{\text{size}}=\sqrt{n_{\text{rooms}}/10}$ (clamped to $[0.32,1.0]$) dampens the inventory signal for small listings (Signal 4 activates only when $n_{\text{rooms}}\geq 4$).

The calibration target is:

$$\boldsymbol{\theta}=\Bigl(\underbrace{\theta_{\text{target}}}_{\in[0.40,0.90]},\ \underbrace{\theta_{\text{occ}}}_{\in[0,0.50]},\ \underbrace{\theta_{\text{urgency}}}_{\in[0,0.30]},\ \underbrace{\theta_{\text{gap}}}_{\in[0.60,1.00]},\ \underbrace{\theta_{\text{inv}}}_{\in[0,0.50]},\ \underbrace{\theta_{\text{fill}}}_{\in[0,1.00]}\Bigr)\tag{9}$$
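The day-signal definition in Eqs. (4) to (8) translates fairly directly into code. The sketch below is one possible reading under stated assumptions: the default values in `Theta` are placeholders (only the 0.20 default for $\theta_{\text{occ}}$ is mentioned in the text), the inventory term is set to 1 when $n_{\text{rooms}}<4$, and the stated range $[0.82, 1.22]$ is enforced by clamping, which the text does not spell out.

```python
import math
from dataclasses import dataclass


@dataclass
class Theta:
    # Placeholder values inside the ranges of Eq. (9); not the paper's calibrated values.
    target: float = 0.65
    occ: float = 0.20
    urgency: float = 0.10
    gap: float = 0.90
    inv: float = 0.10
    fill: float = 0.50


def clamp(value, lo, hi):
    return max(lo, min(hi, value))


def day_signal(theta, occ, days_out, is_gap, fill, n_rooms):
    """Day-signal multiplier delta_t(theta), following Eqs. (4)-(8)."""
    d_occ = 1.0 + theta.occ * (occ - theta.target)                 # Eq. (5)
    d_gap = theta.gap if is_gap else 1.0                           # Eq. (6)
    time_pressure = max(0.0, 1.0 - days_out / 30.0)
    market_unsold = max(0.0, 1.0 - occ)
    d_lead = 1.0 - theta.urgency * time_pressure * market_unsold   # Eq. (7)
    if n_rooms >= 4:
        w_size = clamp(math.sqrt(n_rooms / 10.0), 0.32, 1.0)
        d_inv = 1.0 + theta.inv * w_size * (fill - theta.fill)     # Eq. (8)
    else:
        d_inv = 1.0  # assumption: signal 4 is inactive for small listings
    # Assumption: the range in Eq. (4) is enforced by clamping the product.
    return clamp(d_occ * d_gap * d_lead * d_inv, 0.82, 1.22)


# A 2-room listing at target occupancy, 30 days out, not a gap night.
neutral = day_signal(Theta(), occ=0.65, days_out=30, is_gap=False, fill=0.5, n_rooms=2)
```

With occupancy at target, no time pressure, and no gap, every factor equals 1 and the multiplier leaves the price unchanged.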

### 3.4 The Factored Bandit Arms

Following Zimmert and Seldin ([2018](https://arxiv.org/html/2606.02595#bib.bib2)), the bandit decomposes the pricing arm into two independent factors:

$$\begin{aligned}
&\text{Layer A (property context):}\quad 5\text{ discrete multiplier levels}\\
&\text{Layer B (market demand):}\quad 5\text{ discrete multiplier levels}\\
&\text{effective\_multiplier}=a^{A}_{t}\cdot a^{B}_{t}
\end{aligned}$$

This yields the Scaffold Effect: strictly sub-linear regret improvement of $\Omega(\sqrt{K_{1}})$ over a flat joint bandit with $K_{1}K_{2}$ arms (Theorem [8.1](https://arxiv.org/html/2606.02595#S8.Thmtheorem1)).
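To make the factorisation concrete, the sketch below enumerates a 5 × 5 arm grid and compares the number of per-factor posteriors with the number of joint arms. The multiplier levels are hypothetical; the text specifies only that each layer has five discrete levels.

```python
from itertools import product

# Hypothetical multiplier levels; the paper states only "5 discrete levels" per layer.
LAYER_A = (0.90, 0.95, 1.00, 1.05, 1.10)  # property context
LAYER_B = (0.90, 0.95, 1.00, 1.05, 1.10)  # market demand

# Every joint arm is a pair (a_A, a_B) whose effective multiplier is the product.
joint_arms = {(a, b): a * b for a, b in product(LAYER_A, LAYER_B)}

n_joint_arms = len(joint_arms)                   # arms a flat bandit would track
n_factored_params = len(LAYER_A) + len(LAYER_B)  # posteriors a factored bandit tracks
```

Here the factored representation tracks 10 posteriors instead of 25, which is the intuition behind the regret improvement cited above.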

### 3.5 The HITL Feedback Signal

When the human accepts ($a^{\text{exec}}_{t}=a^{\text{rec}}_{t}$), the bandit receives a clean reward signal. When the human overrides, the bandit receives a potentially misattributed signal. We handle this by: (1) recording both $a^{\text{rec}}_{t}$ and $a^{\text{exec}}_{t}$; (2) using $a^{\text{exec}}_{t}$ for calibration; (3) weighting override episodes with reduced weight $w^{\text{override}}=-0.5$ in the regression to reflect selection-bias uncertainty.

## 4 Historical Warm-Up: Structural Equivalence

### 4.1 The Cold-Start Problem in STR Markets

A typical STR property lists 1–3 rooms. At 15–25% occupancy, a property sees 5–8 booked nights per month. Standard bandit convergence requires $O(|\mathcal{A}|\cdot K/\Delta^{2})$ samples to distinguish arms separated by gap $\Delta$; in practice, this means 200+ booked nights. This represents 2–3 years of live data, rendering pure online learning impractical at deployment.
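As a rough check of these figures, taking the quoted 200 booked nights and 5–8 booked nights per month at face value, the implied waiting time is:

$$\frac{200\ \text{booked nights}}{8\ \text{nights/month}}=25\ \text{months}\approx 2.1\ \text{years},\qquad\frac{200\ \text{booked nights}}{5\ \text{nights/month}}=40\ \text{months}\approx 3.3\ \text{years}.$$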

### 4.2 Historical Data Under the Prior Policy

The prior pricing system operated as a deterministic policy $\pi_{0}$. The historical dataset $\mathcal{D}_{\text{hist}}=\{(\mathbf{x}_{t},a_{t}^{\pi_{0}},r_{t})\}_{t=1}^{N}$ is logged under a known behaviour policy. In standard OPE (Precup et al., [2000](https://arxiv.org/html/2606.02595#bib.bib13)), reusing $\mathcal{D}_{\text{hist}}$ requires IS correction with high variance in the sparse-feedback regime.

![Refer to caption](https://arxiv.org/html/2606.02595v1/x2.png)

Figure 3: Structural equivalence: real STR data. *Left*: recommended arm distributions under the prior (Phoenix deterministic) policy and the live bandit differ substantially, so IS correction would be required in standard OPE. *Right*: *executed* arm distributions (post-gate approval) are statistically indistinguishable (KS test $p=0.000$ indicating strong match), confirming structural equivalence and eliminating the need for IS correction during HITL warm-up.

### 4.3 Structural Equivalence Theorem

###### Theorem 4.1 (Structural Equivalence).

Let the HITL-GB system operate with human approval function $h$ satisfying $p(\mathbf{x})>0$ for all $\mathbf{x}$. Suppose the prior policy $\pi_{0}$ operated under the same approval function $h$ (stationarity assumption). Then the marginal distribution of the executed arm $a^{\text{exec}}$ given context $\mathbf{x}$ is identical in the historical and live regimes:

$$P^{\text{hist}}(a^{\text{exec}}\mid\mathbf{x})=P^{\text{live}}(a^{\text{exec}}\mid\mathbf{x})\quad\forall\,\mathbf{x}\in\mathcal{X}.\tag{10}$$

Consequently, $\mathcal{D}_{\text{hist}}$ is a valid on-policy sample for any posterior update over $(\mathbf{x},a^{\text{exec}},r)$ tuples, without importance-sampling correction.

###### Proof sketch.

In both regimes, the executed arm is $a^{\text{exec}}=h(\hat{a},\mathbf{x})$ where $\hat{a}$ is the proposed arm (from $\pi_{0}$ or $\pi$, respectively). The marginal distribution of $a^{\text{exec}}$ given $\mathbf{x}$ is:

$$P(a^{\text{exec}}\mid\mathbf{x})=p(\mathbf{x})\cdot P(\hat{a}=a^{\text{exec}}\mid\mathbf{x})+(1-p(\mathbf{x}))\cdot\mathbf{1}[a^{\text{exec}}=a^{\text{human}}(\mathbf{x})].\tag{11}$$

Under gate stationarity, $p(\mathbf{x})$ and $a^{\text{human}}(\mathbf{x})$ are the same in both regimes. The right-hand side therefore depends only on $P(\hat{a}=a^{\text{exec}}\mid\mathbf{x})$ scaled by $p(\mathbf{x})$. When $p(\mathbf{x})\in(0,1)$, the override component $(1-p(\mathbf{x}))\cdot\mathbf{1}[a^{\text{exec}}=a^{\text{human}}]$ dominates the marginal at any arm equal to $a^{\text{human}}(\mathbf{x})$, and equality follows. A formal proof via the Radon-Nikodym derivative with respect to the gate-marginalised measure is given in Appendix [A](https://arxiv.org/html/2606.02595#A1). ∎

### 4.4 The $\alpha$-Blended Ridge Regression Warm-Up

Given $N$ historical episodes, we calibrate $\boldsymbol{\theta}$ by solving a weighted ridge regression:

$$\hat{\boldsymbol{\beta}}=\arg\min_{\boldsymbol{\beta}}\sum_{t=1}^{N}w_{t}\left(\rho_{t}-\boldsymbol{\beta}^{\top}\mathbf{f}(\mathbf{x}_{t})\right)^{2}+\lambda\|\boldsymbol{\beta}\|^{2},\quad\lambda=1.0\tag{12}$$

where:

- $\rho_{t}=a^{\text{exec}}_{t}/\mu^{\text{LLM}}$ is the *premium ratio* (executed arm relative to LLM anchor)
- $\mathbf{f}(\mathbf{x}_{t})=[1,\ o_{t},\ \text{urgency}_{t},\ \mathbf{1}[\text{gap}_{t}],\ f_{t}]$ is the feature vector
- $w_{t}=+1.0$ if night $t$ was booked, $w_{t}=-0.5$ if not booked (rejection signal encodes price-too-high evidence)

Fitted coefficients map to parameters via clamp $\to$ $\alpha$-blend:

$$\begin{aligned}
\hat{\beta}_{1}&\to\theta_{\text{occ}}=\text{clamp}(\hat{\beta}_{1},0,0.50)\;\xrightarrow{\alpha\text{-blend}}\;\hat{\theta}_{\text{occ}},&&\text{(13)}\\
\hat{\beta}_{2}&\to\theta_{\text{urgency}}=\text{clamp}(\hat{\beta}_{2},0,0.30)\;\xrightarrow{\alpha\text{-blend}}\;\hat{\theta}_{\text{urgency}},&&\text{(14)}\\
\hat{\beta}_{3}&\to\theta_{\text{gap}}=\text{clamp}(1+\hat{\beta}_{3},0.60,1.00)\;\xrightarrow{\alpha\text{-blend}}\;\hat{\theta}_{\text{gap}},&&\text{(15)}\\
\hat{\beta}_{4}&\to\theta_{\text{inv}}=\text{clamp}(\hat{\beta}_{4},0,0.50)\;\xrightarrow{\alpha\text{-blend}}\;\hat{\theta}_{\text{inv}}.&&\text{(16)}
\end{aligned}$$

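The fit in Eq. (12) and the clamp step of Eqs. (13) to (16) can be sketched with NumPy via the normal equations. This is a minimal sketch, not the authors' pipeline: the function names are hypothetical, the intercept is penalised along with the other coefficients (as the $\lambda\|\boldsymbol{\beta}\|^{2}$ term is written), and the $\alpha$-blend step is left out.

```python
import numpy as np


def weighted_ridge(F, rho, w, lam=1.0):
    """Solve (F^T W F + lam * I) beta = F^T W rho, the stationarity condition of Eq. (12).

    F   : (N, 5) feature matrix with rows [1, o_t, urgency_t, 1[gap_t], f_t]
    rho : (N,) premium ratios
    w   : (N,) per-episode weights (+1.0 booked, -0.5 not booked in the text)
    Note: with negative weights the objective need not be convex, so this
    returns the stationary point of the quadratic rather than a guaranteed minimum.
    """
    F = np.asarray(F, dtype=float)
    rho = np.asarray(rho, dtype=float)
    w = np.asarray(w, dtype=float)
    A = F.T @ (w[:, None] * F) + lam * np.eye(F.shape[1])
    b = F.T @ (w * rho)
    return np.linalg.solve(A, b)


def clamp(value, lo, hi):
    return float(min(hi, max(lo, value)))


def coefficients_to_theta(beta):
    """Clamp step of Eqs. (13)-(16); beta[0] is the intercept and is not mapped."""
    return {
        "occ": clamp(beta[1], 0.0, 0.50),
        "urgency": clamp(beta[2], 0.0, 0.30),
        "gap": clamp(1.0 + beta[3], 0.60, 1.00),
        "inv": clamp(beta[4], 0.0, 0.50),
    }
```

A typical call would build `F` row by row from the historical episodes, set `w` from the booking flags, and pass the result of `weighted_ridge` to `coefficients_to_theta` before blending with the global defaults.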
#### Gap night detection\.

We detect orphan gap nights directly from the booking calendar:

$$\text{gap}(t)=\neg b_{t}\;\wedge\;b_{t-1}\;\wedge\;b_{t+1}\tag{17}$$

where $b_{t}\in\{0,1\}$ indicates whether night $t$ was booked.
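Eq. (17) amounts to a three-night sliding check over the booking calendar. The sketch below is one way to write it; treating the first and last nights as non-gaps (because one neighbour is unknown) is an assumption of this example.

```python
def orphan_gaps(booked):
    """Flag orphan gap nights: unbooked nights with booked nights on both sides (Eq. 17).

    booked: sequence of 0/1 (or bool) flags, one per calendar night.
    Boundary nights are reported as non-gaps because a neighbour is missing.
    """
    b = [bool(v) for v in booked]
    gaps = [False] * len(b)
    for t in range(1, len(b) - 1):
        gaps[t] = (not b[t]) and b[t - 1] and b[t + 1]
    return gaps


# Night 1 is an orphan gap; nights 3 and 4 form a two-night hole and are not flagged.
calendar = [1, 0, 1, 0, 0, 1]
flags = orphan_gaps(calendar)
```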

#### Target occupancy\.

$\theta_{\text{target}}$ is derived from the 60-day rolling median of cluster competitor occupancy (not a regression coefficient):

$$\hat{\theta}_{\text{target}}=\text{clamp}\!\left(\underset{t\in[-60,0]}{\mathrm{median}}\{o_{t}^{\text{cluster}}\},\;0.40,\;0.90\right).\tag{18}$$
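A minimal sketch of Eq. (18) follows. The function name and the choice to take the last 60 entries of a daily occupancy series are assumptions; the text does not say whether the window is 60 or 61 observations wide.

```python
from statistics import median


def target_occupancy(cluster_occ_history, window=60, lo=0.40, hi=0.90):
    """Clamped rolling median of cluster occupancy, in the spirit of Eq. (18).

    cluster_occ_history: daily cluster occupancy values in [0, 1], oldest first.
    window: number of most recent days to use (assumed to be 60 here).
    """
    recent = list(cluster_occ_history)[-window:]
    if not recent:
        raise ValueError("need at least one occupancy observation")
    return min(hi, max(lo, median(recent)))


# Hypothetical quiet market: the raw median of 0.2 is lifted to the lower bound 0.40.
theta_target = target_occupancy([0.2] * 60)
```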

#### Cold-start $\alpha$-blending.

Following the empirical Bayes shrinkage framework (Morris, [1983](https://arxiv.org/html/2606.02595#bib.bib12)):

$$\hat{\boldsymbol{\theta}}=\alpha\hat{\boldsymbol{\theta}}_{\text{fit}}+(1-\alpha)\boldsymbol{\theta}_{0},\quad\alpha=\min\!\left(1.0,\ \frac{N_{\text{booked}}}{N^{*}}\right),\quad N^{*}=200.\tag{19}$$

Below $N_{\text{booked}}=30$, the fit is discarded ($\alpha=0$, pure global defaults). The $\alpha$-blend trajectory is illustrated in Figure [4](https://arxiv.org/html/2606.02595#S4.F4).

![Refer to caption](https://arxiv.org/html/2606.02595v1/x3.png)

Figure 4: $\alpha$-blend convergence: from global defaults to full calibration. *Left*: parameter values (occ sensitivity, urgency sensitivity, gap discount) converge from global defaults toward property-specific estimates as booked-night history accumulates. Vertical markers show minimum threshold (30), full trust (200), and the study property (85 booked nights, $\alpha=0.425$). *Right*: blend weight $\alpha$ grows from 0 (pure prior) to 1 (fully data-driven); the study property sits comfortably in the blended regime. Warm-up compresses effective cold-start from ~150 to ~30 booked nights by providing calibrated starting values for bandit posteriors and $\boldsymbol{\theta}$.
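The blending rule in Eq. (19), including the 30-night cut-off, can be sketched as follows. Function names and the example parameter values are illustrative assumptions; only $N^{*}=200$, the threshold of 30, and the study property's 85 booked nights come from the text.

```python
def blend_weight(n_booked, n_star=200, n_min=30):
    """Blend weight alpha from Eq. (19); the fit is discarded below n_min booked nights."""
    if n_booked < n_min:
        return 0.0
    return min(1.0, n_booked / n_star)


def alpha_blend(theta_fit, theta_default, n_booked):
    """Shrink fitted parameters toward the global defaults theta_0."""
    alpha = blend_weight(n_booked)
    return {
        name: alpha * theta_fit[name] + (1.0 - alpha) * theta_default[name]
        for name in theta_default
    }


# Study property: 85 booked nights, so alpha = 85 / 200.
alpha_study = blend_weight(85)

# Hypothetical fitted value 0.10 blended with the global default 0.20 for theta_occ.
blended = alpha_blend({"occ": 0.10}, {"occ": 0.20}, n_booked=85)
```

For the study property this reproduces the blend weight of 0.425 quoted in the Figure 4 caption.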

### 4.5 Dual Cold-Start: One Dataset, Two Problems

The same historical episode set simultaneously solves two cold-start problems:

1. Bandit arm posteriors: each historical night advances $\mathrm{Beta}(\alpha_{a},\beta_{a})$ of the corresponding arm via the agent's `warmup()` method, exactly as a live booking would.
2. Day-signal parameters $\boldsymbol{\theta}$: the same nights are fed to the ridge regression to calibrate $\theta_{\text{occ}}$, $\theta_{\text{urgency}}$, $\theta_{\text{gap}}$, and $\theta_{\text{inv}}$.

Both are justified by the same structural equivalence result (Theorem [4.1](https://arxiv.org/html/2606.02595#S4.Thmtheorem1)). The result: cold-start compressed from ~150 episodes to ~30 episodes.
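The first of the two warm-up paths can be illustrated with a toy Beta-Bernoulli posterior store. Everything here is an assumption made for illustration: the class name, the uniform $\mathrm{Beta}(1,1)$ prior, and the use of a binary booked/not-booked outcome as the update signal. The text names a `warmup()` method but does not give its signature.

```python
from collections import defaultdict


class BetaArmPosteriors:
    """Toy per-arm Beta posteriors over booking probability (illustrative only)."""

    def __init__(self, prior_alpha=1.0, prior_beta=1.0):
        # Each arm starts from the same prior; parameters are stored as [alpha, beta].
        self.params = defaultdict(lambda: [prior_alpha, prior_beta])

    def update(self, arm, booked):
        """Advance the arm's posterior by one observed night."""
        if booked:
            self.params[arm][0] += 1.0
        else:
            self.params[arm][1] += 1.0

    def warmup(self, episodes):
        """Replay historical (executed_arm, booked) pairs exactly like live updates."""
        for arm, booked in episodes:
            self.update(arm, booked)

    def mean(self, arm):
        alpha, beta = self.params[arm]
        return alpha / (alpha + beta)


# Hypothetical history: executed multipliers and whether the night was booked.
history = [(1.00, True), (1.00, False), (1.00, True), (1.10, False)]
posteriors = BetaArmPosteriors()
posteriors.warmup(history)
```

Because `warmup()` simply calls `update()`, historical and live nights are treated identically, which is the behaviour the list above describes.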

## 5 Experimental Setup

### 5.1 Dataset

All experiments use live production data from a short-term rental platform. The study property is referred to as Property X throughout this paper; its internal platform identifier is withheld for security and commercial confidentiality reasons.¹

¹ The property identifier is assigned by a proprietary booking-management system and, if disclosed, could be used to re-identify the operator. Following standard anonymisation practice for industry-partnered research, the ID is replaced with the placeholder X.

Table 1: Dataset statistics for the anonymised STR property (urban market, 2 rooms).

Note: The Hosteeva production dataset cannot be released due to commercial confidentiality agreements. The KeyData component is publicly available. A synthetic data generator calibrated from 38,648 weekly KeyData OTA KPI observations across 1,000 Vail listings (`keydata_dgp_params.json`) is provided as the reproducibility artifact; the synthetic occupancy context is drawn from $\mathrm{Beta}(2.01,\,1.74)$ fitted to real market data (mean $=0.537$, replacing a prior hand-tuned $\mathrm{Beta}(2.5,3.5)$ with mean $=0.42$).

### 5.2 HF-TS Benchmark Agents

Table 2: HF-TS benchmark agents, all centred on $\mu^{\text{LLM}}$ from the cluster record.

### 5.3 Warm-Up Conditions

Table 3: Four initialisation conditions compared across all agents. Synthetic simulation occupancy context drawn from $\mathrm{Beta}(2.01, 1.74)$, calibrated from 38 648 weekly KeyData OTA KPI observations (1 000 Vail listings).

## 6 Results

### 6.1 Calibrated Day-Signal Parameters

Table [4](https://arxiv.org/html/2606.02595#S6.T4) shows day-signal parameters calibrated from real production history.

Table 4: Calibrated day-signal parameters from real STR production data. Exact values depend on the data snapshot; see the companion notebook for the live pipeline output. Parameter ranges: $\theta_{\text{target}}\in[0.40,0.90]$; $\theta_{\text{occ}}\in[0,0.50]$; $\theta_{\text{urgency}}\in[0,0.30]$; $\theta_{\text{gap}}\in[0.60,1.00]$; $\theta_{\text{inv}}\in[0,0.50]$; $\theta_{\text{fill}}\in[0,1.00]$.

#### Key insight.

In urban micro-markets, $\theta_{\text{occ}}$ typically falls *below* the global default of 0.20: bookings are relatively insensitive to cluster-wide occupancy fluctuations. Using the global default over-reacts to occupancy signals, causing needless discounting at moderate occupancy levels. The warm-up detects and corrects this automatically.

![Refer to caption](https://arxiv.org/html/2606.02595v1/x4.png)

Figure 5: Ridge regression calibration of day-signal parameters. *Left*: scatter of observed booking outcomes vs. ridge-fitted booking probability for each historical episode; well-calibrated points cluster on the diagonal. *Centre*: learned coefficient values ($\hat{\beta}$) with 95% bootstrap confidence intervals for $\theta_{\text{occ}}$, $\theta_{\text{urgency}}$, $\theta_{\text{gap}}$, and $\theta_{\text{inv}}$. *Right*: $\alpha$-blend weight trajectory: the blend weight grows from $\alpha=0$ (pure global prior) to $\alpha=1$ (fully data-driven) as the booked-night count increases, reaching the study property's operating point ($\alpha\approx 0.43$, 85 booked nights). This calibration is performed once on historical data and then frozen for live deployment, providing warm-started parameters for both the ridge signal and the bandit posteriors.

![Refer to caption](https://arxiv.org/html/2606.02595v1/x5.png)

Figure 6: Day-signal multiplier surface: default vs. calibrated parameters. Four panels show the practical pricing effect of calibration. *Signal 1* (occupancy adjustment): calibrated neutral point shifts from 0.65 to 0.42, reducing over-discounting at moderate occupancy levels. *Signal 2* (urgency heatmap): 2-D difference surface showing where calibrated urgency sensitivity diverges from the global default. *Signal 3* (gap discount): calibrated gap-night discount 0.9337 vs. default 0.90. *Composite multiplier* at $\text{occ}=0.42$: the full day-signal output under both parameter sets. This figure has no equivalent table: it shows the functional shape of the pricing response, not just parameter values.

### 6.2 Revenue Advantage vs. Cold Start

![Refer to caption](https://arxiv.org/html/2606.02595v1/x6.png)

Figure 7: Cold start vs. HITL warm-up: regret comparison (mean $\pm$ 1 SEM, 15 seeds). *Left*: per-episode regret (10-episode rolling mean); HITL warm-up (orange) is below cold start (blue) and standard OPE (grey) from episode 1. *Right*: cumulative regret; the HITL advantage is present from episode 1 and persists throughout the 200-episode window. Standard OPE provides marginal improvement over cold start due to the IS correction overhead.

The HITL warm-up produces positive cumulative revenue advantage over cold-start initialisation from the first 30 live episodes, with the advantage maintained and widened through the 200-episode convergence window. Full quantitative results are produced by the companion notebook (`ml/paper/hitl_warmup_paper.ipynb`) against the live production backend; a synthetic replication with identical statistical properties is provided in the public code artifact.

#### Synthetic replication with KeyData-calibrated contexts.

The synthetic simulation (§5; §1b of the companion notebook) draws occupancy contexts from $\mathrm{Beta}(2.01, 1.74)$, fitted to 38 648 weekly `guest_occupancy` KPI observations across 1 000 Vail OTA listings (`keydata_listings_calendar.json`). This replaces a prior hand-tuned $\mathrm{Beta}(2.5, 3.5)$ (mean $=0.42$) with a real-market distribution (mean $=0.537$). Table [5](https://arxiv.org/html/2606.02595#S6.T5) reports the resulting regret outcomes.
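The quoted means follow from the Beta mean formula $a/(a+b)$, which the short sketch below checks before drawing synthetic occupancy contexts. The seed, the sample size, and the variable names are illustrative assumptions. With the rounded shape parameters the calibrated mean evaluates to about 0.536, consistent with the reported 0.537 if the fitted parameters carry more digits than are printed.

```python
import numpy as np


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


calibrated_mean = beta_mean(2.01, 1.74)   # KeyData-calibrated context
hand_tuned_mean = beta_mean(2.5, 3.5)     # earlier hand-tuned context

# One occupancy context per simulated episode (200 episodes, as in Table 5).
rng = np.random.default_rng(42)           # seed chosen for illustration only
occupancy = rng.beta(2.01, 1.74, size=200)
```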

Table 5: Cumulative regret under three conditions (synthetic, KeyData-calibrated occupancy, 15 seeds, 200 episodes). HITL warm-up strictly dominates. HITL saves $11.7\%$ regret vs. cold start at episode 50 and $6.4\%$ at episode 200. Standard OPE is the worst condition: IS variance inflation hurts more than no warm-up at all.

The structural equivalence theorem (Theorem [4.1](https://arxiv.org/html/2606.02595#S4.Thmtheorem1)) guarantees that HITL warm-up is statistically valid without IS correction, providing an information advantage over OPE: HITL exploits all $N$ historical episodes directly, while IS reweighting reduces the effective sample from $N\approx 1{,}097$ to $\mathrm{ESS}\approx 52$, a roughly $20\times$ information discount. The synthetic benchmark in Figure [7](https://arxiv.org/html/2606.02595#S6.F7) confirms this advantage in a controlled setting; validation in real deployment, currently limited to a single property, is deferred to future multi-property A/B evaluation (see Limitations, §[8](https://arxiv.org/html/2606.02595#S8)).
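To make the effective-sample-size comparison concrete, the sketch below computes the Kish estimate $(\sum_i w_i)^2/\sum_i w_i^2$ for a vector of importance weights. This section does not state which ESS estimator the paper uses, so the Kish form is an assumption, and the log-normal weights are synthetic stand-ins rather than the study's actual importance ratios.

```python
import numpy as np


def kish_ess(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)


n = 1097  # approximate number of historical episodes quoted in the text

# Equal weights (no reweighting) keep every episode: ESS equals n.
ess_uniform = kish_ess(np.ones(n))

# Heavy-tailed weights (illustrative) concentrate mass on a few episodes,
# so the effective sample shrinks well below n.
rng = np.random.default_rng(0)
ess_skewed = kish_ess(rng.lognormal(mean=0.0, sigma=1.5, size=n))
```

For reference, the quoted figures give $1097/52\approx 21$, in line with the roughly twentyfold discount stated above.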

Table 6: Agent performance on real STR production data (anonymised urban property). Revenue ratio vs. `BetaV1_Control`. Exact values are reproduced by the companion notebook (§0.4) against the live backend; approximate values are shown here for illustration.

![Refer to caption](https://arxiv.org/html/2606.02595v1/x7.png)

Figure 8: All HF-TS agents: cold start vs. HITL warm-up ($\alpha=0.425$, 20 seeds, 150 live episodes). *Left*: cold-start cumulative regret across all six agents; all converge slowly with no warm-up advantage. *Right*: HITL warm-up cumulative regret; all agents benefit substantially from warm-up initialisation, with CoarseToFine DeepHierTS achieving the lowest regret. The ranking is preserved across conditions, confirming that the warm-up benefit is agent-agnostic.

### 6.3 Summary

![Refer to caption](https://arxiv.org/html/2606.02595v1/x8.png)

Figure 9: Summary: HITL-GB warm-up results across all evaluation dimensions. *Top row*: regret curves (per-episode rolling mean and cumulative) for all six HF-TS agents, comparing cold start, standard OPE, and HITL warm-up. *Bottom left*: revenue ratio of HITL warm-up vs. cold start by agent class; every agent benefits, with hierarchical agents benefiting most. *Bottom right*: cold-start compression; effective warm-up reduces the number of live episodes required to reach 80% of converged performance from ${\sim}150$ (cold start) to ${\sim}30$ (HITL warm-up), a $5\times$ reduction. The panel consolidates the paper's core empirical claim: the structural equivalence of historical HITL data converts the mandatory approval gate into a deployment accelerator.

## 7 Broader Applications of the HITL-GB Framework

The structural equivalence result is domain-agnostic. The HITL-GB warm-up is applicable to any system where: (1) a prior rule-based or human-only policy generated historical decisions with recorded outcomes; (2) a new ML/bandit recommendation layer is being introduced; (3) human approval remains legally, ethically, or operationally required at deployment.

Table 7: HITL-GB warm-up application domains. The structural equivalence result applies to all listed domains whenever the stationary approval-function assumption holds.

#### The regulated-industry advantage.

The HITL-GB framework offers its greatest cold-start advantage precisely in the industries where full automation is most restricted. Healthcare, finance, and law all mandate human approval of consequential decisions, and all have extensive historical decision logs. In regulated industries, *regulatory requirements are the mechanism that makes fast deployment possible*.

## 8 Discussion

### 8.1 The Approval Gate as a Statistical Asset

The HITL approval structure is typically treated as friction between the algorithm and the market\. Our analysis reframes it: the approval gate is precisely what makes historical data valid for warm\-up without IS correction\. A system that fully delegates pricing to the algorithm loses this statistical property — it must apply OPE corrections with their associated variance costs\.

### 8.2 Relationship to Existing Hierarchical Theory

The warm-up procedure complements the two formal theorems of the HF-TS design:

###### Theorem 8.1 (Scaffold Effect).

Under assumptions (A1)–(A5), the expected cumulative regret of HF-TS satisfies:

$$\mathbb{E}[R_{\text{HF-TS}}(T)]\leq\frac{C_{1}K_{1}\log T}{\Delta_{1}}+C_{2}\sqrt{K_{2}T\log(K_{2}T)},\tag{20}$$

strictly smaller than the flat joint bandit $O(\sqrt{K_{1}K_{2}T\log(K_{1}K_{2}T)})$ by a factor $\Omega(\sqrt{K_{1}})$.

For $K_{1}=5$ market arms, $K_{2}=5$ property arms, $T=500$: HF-TS has $\sqrt{5}$-fold fewer effective arms than the flat joint-arm alternative.
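As a rough numerical check of the worked example, the sketch below compares only the square-root terms of the two rates. It is an illustration under simplifying assumptions: the constants $C_{1}$, $C_{2}$ and the logarithmic term $C_{1}K_{1}\log T/\Delta_{1}$ are ignored because their values are not given here, and the function names are hypothetical.

```python
import math


def flat_rate(k1, k2, t):
    """Leading term of the flat joint bandit: sqrt(K1 K2 T log(K1 K2 T))."""
    return math.sqrt(k1 * k2 * t * math.log(k1 * k2 * t))


def hier_rate(k2, t):
    """Square-root term of the HF-TS bound: sqrt(K2 T log(K2 T))."""
    return math.sqrt(k2 * t * math.log(k2 * t))


# Worked example from the text: K1 = 5, K2 = 5, T = 500.
ratio = flat_rate(5, 5, 500) / hier_rate(5, 500)
```

The ratio comes out slightly above $\sqrt{5}\approx 2.24$, because the flat bandit also pays a larger logarithmic factor.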

###### Theorem 8.2 (Optimal Unlock Threshold).

In the coarse-to-fine cascade with $K_{1}$ Level-1 arms and $K_{2}>K_{1}$ Level-2 arms over horizon $T$, the optimal unlock threshold is:

$$n^{*}=\frac{K_{1}}{K_{1}+K_{2}}\cdot T.\tag{21}$$

The warm-up's primary benefit overlaps exactly with the coarse-level phase: calibrated parameters at Level 1 propagate immediately to Level 2 on unlock.
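Equation (21) is straightforward to evaluate. The sketch below does so for an illustrative configuration; the function name and the choice $K_{2}=20$ are assumptions made for the example and are not values reported in the paper.

```python
def unlock_threshold(k1, k2, horizon):
    """Unlock threshold n* = K1 / (K1 + K2) * T (Theorem 8.2, stated for K2 > K1)."""
    return k1 / (k1 + k2) * horizon


# Illustrative values: 5 coarse arms, 20 fine arms, 500-episode horizon.
n_star = unlock_threshold(k1=5, k2=20, horizon=500)  # about 100 episodes
```

A larger fine-level arm count shrinks the share of the horizon spent at the coarse level, which is the phase the warm-up shortens.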

### 8.3 Stationarity of the Human Approval Function

The equivalence result depends on the stationarity of $h$. In practice this is approximately satisfied for single-operator portfolios. For multi-operator systems or operator turnover, a domain-adaptation step would be needed.

### 8.4 Limitations

- Single-property evaluation. Multi-property, multi-market validation is needed to establish external validity.
- Simulated reward. The demand model used as the evaluation environment is fitted on warm-up data, creating potential circularity. Prospective A/B testing would remove this.
- Approximated gap signal. Gap detection from the booking calendar is a proxy for per-room gap structure; multi-room properties require room-level analysis.

## 9 Conclusion

We introduced the Human-in-the-Loop Gated Bandit (HITL-GB) framework for dynamic pricing in short-term rental markets and proved that the approval-gate structure renders historical pricing data structurally equivalent to on-policy warm-up data without importance-sampling correction. Combined with a dual cold-start procedure ($\alpha$-blended ridge regression calibrating six day-signal parameters while simultaneously seeding bandit arm posteriors), the warm-up compresses effective cold-start from $\sim$150 to $\sim$30 booked episodes on real STR production data.

The key insight: in regulated, high-stakes domains, the structural constraints typically treated as deployment frictions (human approval gates, compliance rules, safety shields) are not obstacles to learning but rather the mechanism that makes fast deployment possible. The HITL-GB framework generalises directly to any domain where approval gates are legally or operationally required.

The companion paper *Gated Decoupled Compositional Bandits: A Unified Theory* (Miroshnichenko, [2026](https://arxiv.org/html/2606.02595#bib.bib1)) formalises this insight at full generality, proving four structural theorems that apply to any system in the GDCB family. HITL-GB is instance #1; five further industrial instantiations (clinical dosing, credit origination, grid demand response, content moderation, and LLM tool use) are presented as future empirical work.

## References

- P. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei (2017). Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems (NeurIPS).
- K. J. Ferreira, B. H. Liu, and D. Simchi-Levi (2016). Analytics for an online retailer: demand forecasting and price optimization. Vol. 18, pp. 69–88.
- C. Gibbs, D. Guttentag, U. Gretzel, J. Morton, and A. Goodwin (2018). Pricing in the sharing economy: a hedonic pricing model applied to Airbnb listings. Journal of Travel & Tourism Marketing 35(1), pp. 46–56.
- J. Hong, B. Kveton, M. Zaheer, and M. Ghavamzadeh (2021). Hierarchical Thompson sampling for contextual bandits. In Advances in Neural Information Processing Systems (NeurIPS). [arXiv:2111.06929](https://arxiv.org/abs/2111.06929).
- J. Hong, B. Kveton, M. Zaheer, Y. Yang, and M. Ghavamzadeh (2022). Deep hierarchy in bandits. In Proceedings of the 39th International Conference on Machine Learning (ICML). [arXiv:2202.01454](https://arxiv.org/abs/2202.01454).
- P. Liao et al. (2020). Personalized HeartSteps: a reinforcement learning algorithm for optimizing physical activity. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT) 4(1).
- O. Miroshnichenko (2026). Gated decoupled compositional bandits: a unified theory of contextual bandits with supervised-calibrated action scaling and pre-execution gating. Companion paper, arXiv preprint.
- K. Misra, E. M. Schwartz, and J. Abernethy (2019). Dynamic online pricing with incomplete information using multi-armed bandit experiments. Vol. 38, pp. 226–252.
- C. N. Morris (1983). Parametric empirical Bayes inference: theory and applications. Journal of the American Statistical Association 78(381), pp. 47–55.
- L. Ouyang et al. (2022). Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems (NeurIPS).
- D. Precup, R. S. Sutton, and S. Dasgupta (2000). Eligibility traces for off-policy policy evaluation. In Proceedings of the 17th International Conference on Machine Learning (ICML).
- A. N. Rafferty et al. (2019). Bandit approaches to human-in-the-loop educational recommendation. In Educational Data Mining (EDM).
- B. Settles (2012). Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, Morgan & Claypool.
- L. Tang et al. (2013). Automatic ad format selection via contextual bandits. In Proceedings of the 22nd ACM International Conference on Information and Knowledge Management (CIKM).
- R. Wan, L. Ge, and R. Song (2021). Metadata-based multi-task bandits with Bayesian hierarchical models. In Advances in Neural Information Processing Systems (NeurIPS). [arXiv:2108.06422](https://arxiv.org/abs/2108.06422).
- Y. Yue, J. Hong, and C. Guestrin (2012). Hierarchical exploration for accelerating contextual bandits. In Proceedings of the 29th International Conference on Machine Learning (ICML). [arXiv:1206.6454](https://arxiv.org/abs/1206.6454).
- X. Zhou et al. (2024). Expert with clustering: hierarchical online preference learning framework. In Learning for Dynamics and Control (L4DC). [arXiv:2408.05586](https://arxiv.org/abs/2408.05586).
- J. Zimmert and Y. Seldin (2018). Factored bandits. In Proceedings of the Conference on Learning Theory (COLT). [arXiv:1807.01488](https://arxiv.org/abs/1807.01488).

## Appendix A Proof of Structural Equivalence Theorem

We give a formal proof of Theorem [4.1](https://arxiv.org/html/2606.02595#S4.Thmtheorem1).

###### Proof.

Let $h$ denote the approval gate operator, which maps a proposed arm $a_{\text{prop}}\in\mathcal{A}$ and a context $\mathbf{x}$ to the executed arm $h(a_{\text{prop}},\mathbf{x})\in\mathcal{A}$. We assume gate stationarity: the conditional distribution $P(h(a_{\text{prop}},\mathbf{x})=a\mid\mathbf{x})$ is the same in both the historical and live regimes.

The executed-arm distribution under prior policy $\pi_{0}$ is:

$$P^{\pi_{0}}(a^{\text{exec}}=a\mid\mathbf{x})=\int_{\mathcal{A}}\mathbf{1}[h(\hat{a},\mathbf{x})=a]\,d\pi_{0}(\hat{a}\mid\mathbf{x}).\tag{22}$$

The executed-arm distribution under live bandit policy $\pi$ is:

$$P^{\pi}(a^{\text{exec}}=a\mid\mathbf{x})=\int_{\mathcal{A}}\mathbf{1}[h(\hat{a},\mathbf{x})=a]\,d\pi(\hat{a}\mid\mathbf{x}).\tag{23}$$

Decomposing via the approval structure ([2](https://arxiv.org/html/2606.02595#S3.E2)):

$$P^{\pi_{0}}(a^{\text{exec}}=a\mid\mathbf{x})=p(\mathbf{x})\cdot\pi_{0}(a\mid\mathbf{x})+(1-p(\mathbf{x}))\cdot\mathbf{1}[a=a^{\text{human}}(\mathbf{x})],\tag{24}$$

$$P^{\pi}(a^{\text{exec}}=a\mid\mathbf{x})=p(\mathbf{x})\cdot\pi(a\mid\mathbf{x})+(1-p(\mathbf{x}))\cdot\mathbf{1}[a=a^{\text{human}}(\mathbf{x})].\tag{25}$$

The override component $(1-p(\mathbf{x}))\cdot\mathbf{1}[a=a^{\text{human}}(\mathbf{x})]$ is identical in both. When $p(\mathbf{x})=0$ (full override), the two distributions are equal trivially. When $p(\mathbf{x})>0$, equality holds if and only if $\pi_{0}(a\mid\mathbf{x})=\pi(a\mid\mathbf{x})$ for all $a$, which need not hold in general.

However, for warm-up *initialisation* (not arm-value estimation), we use only $(\mathbf{x},a^{\text{exec}},r)$ tuples. The posterior update rule for Beta-Bernoulli is:

$$(\alpha_{a},\beta_{a})\leftarrow(\alpha_{a}+r\cdot\mathbf{1}[a^{\text{exec}}=a],\;\beta_{a}+(1-r)\cdot\mathbf{1}[a^{\text{exec}}=a]).\tag{26}$$

This update is unbiased with respect to the joint distribution $(\mathbf{x},a^{\text{exec}},r)\sim P(\cdot)$ as long as $P(a^{\text{exec}}\mid\mathbf{x})$ is supported on the same arm grid in both regimes (guaranteed when $\pi_{0}$ and $\pi$ share the same arm space $\mathcal{A}$) and the gate $h$ is stationary. The resulting initialised posterior $(\hat{\alpha}_{a},\hat{\beta}_{a})$ is therefore a valid initialiser, completing the proof. ∎
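The decomposition in Equations (24) and (25) can be checked numerically on a small discrete arm grid. The sketch below is illustrative only: the five-arm grid, the two proposal distributions, the approval probability of 0.7, and the human override arm are all hypothetical values, and the function name is an assumption.

```python
import numpy as np


def executed_arm_distribution(proposal, approve_prob, human_arm):
    """Mixture p(x) * pi(a|x) + (1 - p(x)) * 1[a = a_human(x)] for one context."""
    proposal = np.asarray(proposal, dtype=float)
    override = np.zeros_like(proposal)
    override[human_arm] = 1.0          # point mass on the human's arm
    return approve_prob * proposal + (1.0 - approve_prob) * override


pi0 = np.array([0.0, 0.0, 1.0, 0.0, 0.0])   # deterministic prior policy
pi = np.array([0.1, 0.2, 0.4, 0.2, 0.1])    # stochastic live bandit policy
p, a_human = 0.7, 1                          # approval probability, override arm

hist = executed_arm_distribution(pi0, p, a_human)
live = executed_arm_distribution(pi, p, a_human)

# The override term cancels, so the L1 gap is p times the gap of the proposals.
gap = np.abs(hist - live).sum()
# With full override (p = 0) the two executed-arm distributions coincide.
same = np.allclose(executed_arm_distribution(pi0, 0.0, a_human),
                   executed_arm_distribution(pi, 0.0, a_human))
```

This mirrors the observation in the proof: the two regimes differ only through the approved component, and the difference vanishes as the approval probability goes to zero.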

## Appendix B HF-TS Theoretical Results

Full proofs of Theorems [8.1](https://arxiv.org/html/2606.02595#S8.Thmtheorem1) and [8.2](https://arxiv.org/html/2606.02595#S8.Thmtheorem2) appear in the HF-TS companion design document. The posterior-based practical unlock rule converts Theorem [8.2](https://arxiv.org/html/2606.02595#S8.Thmtheorem2) into a parameter-free data-adaptive criterion: unlock when $\alpha_{a^{*}}+\beta_{a^{*}}>\tfrac{1}{4\varepsilon}$ for $\varepsilon=0.01$, corresponding to $n^{*}\approx 25$ arm observations.
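The practical unlock rule can be written as a one-line check, sketched below with a hypothetical function name. One plausible reading of the threshold, offered here as an interpretation rather than the authors' stated derivation, is that the variance of $\mathrm{Beta}(\alpha,\beta)$ is at most $1/(4(\alpha+\beta+1))$, so $\alpha+\beta>1/(4\varepsilon)$ forces the leading arm's posterior variance below $\varepsilon$.

```python
def should_unlock(alpha_best, beta_best, eps=0.01):
    """Unlock the fine level once alpha + beta of the leading arm exceeds 1 / (4 * eps)."""
    return alpha_best + beta_best > 1.0 / (4.0 * eps)


def beta_variance(a, b):
    """Variance of a Beta(a, b) distribution."""
    return a * b / ((a + b) ** 2 * (a + b + 1.0))


# With eps = 0.01 the threshold is 25 pseudo-observations.
locked = should_unlock(10, 10)     # 20 pseudo-observations: stay at the coarse level
unlocked = should_unlock(20, 6)    # 26 pseudo-observations: unlock
```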
