Continuous-time Optimal Stopping through Deep Reinforcement Learning

arXiv cs.LG 06/17/26, 04:00 AM Papers
Summary
This paper introduces CARLOS, a deep reinforcement learning algorithm that learns continuous-time optimal stopping rules for American-style options using an aggregate deep neural network, effectively closing the Bermudan-American value gap with high computational efficiency.
arXiv:2606.17545v1 Announce Type: new Abstract: Simulation based solvers for optimal stopping problems must discretize the stopping decision. Under classical dynamic programming, a coarse exercise grid with only a few stopping opportunities can materially undervalue the optimal expected reward, whereas on a very fine grid, approximation errors accumulate through the backward recursion. To remove this limitation, we develop a new reinforcement-learning inspired algorithm that enables us to learn the exercise rule at arbitrarily fine time resolution. Our CARLOS (Continuous-time Adaptive Reinforcement Learning for Optimal Stopping) algorithm utilizes an aggregate deep neural network (ADNN) to learn a joint space-time decision boundary. Starting from a coarse time grid, we progressively increase the frequency of stopping opportunities, while in parallel training the ADNN to refine its timing-value estimates. We moreover design an adaptive sampling strategy that gradually concentrates training effort near the stopping boundary. Benchmarked results show that CARLOS delivers higher prices than existing Bermudan solvers, approaching the American upper bound, and achieves high computational efficiency relative to non-RL comparators.
Cached at: 06/17/26, 05:41 AM
# Continuous-time Optimal Stopping through Deep Reinforcement Learning
Source: [https://arxiv.org/html/2606.17545](https://arxiv.org/html/2606.17545)
Cosmin Borsa and Mike Ludkovski

(First announced: June 15, 2026)

###### Abstract

Simulation-based solvers for optimal stopping problems must discretize the stopping decision. Under classical dynamic programming, a coarse exercise grid with only a few stopping opportunities can materially undervalue the optimal expected reward, whereas on a very fine grid, approximation errors accumulate through the backward recursion. To remove this limitation, we develop a new reinforcement-learning inspired algorithm that enables us to learn the exercise rule at arbitrarily fine time resolution. Our CARLOS algorithm utilizes an aggregate deep neural network (ADNN) to learn a joint space-time decision boundary. Starting from a coarse time grid, we progressively increase the frequency of stopping opportunities, while in parallel training the ADNN to refine its timing-value estimates. We moreover design an adaptive sampling strategy that gradually concentrates training effort near the stopping boundary. Benchmarked results show that CARLOS delivers higher prices than existing Bermudan solvers, approaching the American upper bound, and achieves high computational efficiency relative to non-RL comparators.

## 1 Introduction

Simulation-based solvers for Optimal Stopping problems (OSP), exemplified by the Longstaff-Schwartz (LSMC) framework [[24](https://arxiv.org/html/2606.17545#bib.bib5)], have been a core part of the quantitative finance toolbox over the past 25 years. Indeed, OSPs are ubiquitous, arising for example in the pricing of all single-name Puts in U.S. markets. Since these solvers operate with simulated trajectories of the underlying stochastic processes, time discretization is a necessary step in their implementation. Consequently, the convention is to focus on the Bermudan formulation, where a pre-specified exercise frequency $\Delta t$ is given.

Real-life contracts are American-style and can be exercised at any point. Mathematically, the role of the time step $\Delta t$ is well understood [[12](https://arxiv.org/html/2606.17545#bib.bib43)] and the LSMC-inspired methods can be implemented at any frequency, so that in theory one can recover the American-style solution at arbitrary precision. In practice, however, LSMC methods have an intrinsic issue of error back-propagation, which tends to be severe. As a result, one is usually limited to quite coarse $\Delta t$'s, creating a material gap to the exercise-anytime value. Despite the vast extant literature on machine learning for OSPs [[26](https://arxiv.org/html/2606.17545#bib.bib8)], to our knowledge this gap between feasible Bermudan solvers and the original American specification has never been adequately addressed.

In this paper we propose a new algorithm that explicitly targets American-style contracts. To this end, we employ Reinforcement Learning (RL) to learn the continuous-exercise optimal stopping strategy. We start from a coarse LSMC solver and gradually refine the exercise frequency, employing neural network (NN) surrogates to approximate the underlying timing values. This refinement can be done up to any specified frequency $\Delta t^{ex}$ and, because no backward iteration is involved, it essentially eliminates error accumulation. We demonstrate that our method (i) effectively closes the Bermudan-American value gap; (ii) is much more efficient (i.e., faster) than a brute force approach running LSMC at a high exercise frequency.

The starting idea of our solver is to aggregate the collection of time-discretized stopping rules, indexed by the time step, into a single deep NN surrogate that approximates the entire timing value hypersurface, continuously across time $t$ and input state $x$. This NN provides a stopping decision for any $t \in [0,T]$ (rather than on a discrete grid as in traditional solvers), and the main task becomes to train it to learn the continuous-time stopping rule. This shift from the original coarse-grained exercise rule to the desired continuous exercise frequency raises two novel challenges whose resolution is a key part of our methodological contribution. The first challenge is *concept drift*: the distribution of the training data changes as training proceeds. This is because the pathwise rewards that underlie the LSMC paradigm are intrinsically tied to the exercise frequency. Hence, even on the same simulated trajectory, the stopping time and the collected reward will shift as the exercise frequency is refined. To address this challenge, in parallel with the RL training, we gradually traverse a collection of exercise grids. In particular, we show that a good rule of thumb is to iteratively *halve* the time steps $\Delta t^{(b)}$ every few RL iterations.

The second challenge relates to the fact that the stopping region intrinsically shrinks as $\Delta t \to 0$. This implies that some inputs might be in the stopping region for a given $\Delta t$ but end up in the continuation region for a smaller $\Delta t'$. However, the base approach immediately stops any trajectory that is in the stopping region, which effectively prevents re-training the NN to expand the continuation region. To solve this issue, we introduce a novel “delayed stopping” technique that adds an exploratory aspect to the training.

In all, our algorithm, dubbed CARLOS (Continuous-time Adaptive Reinforcement Learning for Optimal Stopping), is initialized via a regular (but coarse) LSMC step and then trains the neural network surrogate via about a dozen RL loops while traversing 3-6 time-discretization levels. The final decision rule treats $t$ as a continuous input and can be evaluated at arbitrary exercise frequency.

### 1.1 Reinforcement Learning for Optimal Stopping

The dominant approach to pricing Bermudan options relies on Dynamic Programming (DP), which consists of backward recursion. The LSMC strategy [[24](https://arxiv.org/html/2606.17545#bib.bib5)] moves backward from option maturity $t_K = T$ to $t_0 = 0$, using (linear) regression to learn the continuation value, which corresponds to a conditional expectation of the future payoff. Thus, a new regression is done at each time step $t_k$ and is coupled to previous regressions (for $t_\ell > t_k$) that determine those future payoffs, causing error back-propagation.

The alternative to DP borrows from the Markov Decision Process literature, namely policy- and value-iteration techniques. RL dispenses with the recursive logic and aims to learn the global stopping policy from forward samples. Training data across different time steps is used jointly to improve the approximation in space and time, resembling transfer learning across the $t_k$'s. For optimal stopping, this learning is “reinforcing” because the training samples, i.e., the pathwise rewards, are simulated based on the current stopping rule and the stochastic environment. Unlike LSMC, RL therefore seeks a single “aggregate” state-action emulator $Q(s,a)$, where the state $s \equiv (t,x)$ now refers both to the stochastic state, such as underlying asset values, and to time $t$. Early versions of such approaches appeared in [[33](https://arxiv.org/html/2606.17545#bib.bib49), [38](https://arxiv.org/html/2606.17545#bib.bib46), [23](https://arxiv.org/html/2606.17545#bib.bib48)] using linear approximation (least squares regression against a fixed set of basis functions). Li et al. [[23](https://arxiv.org/html/2606.17545#bib.bib48)] derived bounds on least-squares policy iteration, which is the implementation of RL with a *linear* representation of the action-state map in terms of basis functions. The linear structure allows one to express RL policy errors in terms of the underlying (finite-sample) projection error. Such Q-learning is the basis for our RL framework.

The special feature of optimal stopping is that the action space is particularly simple, being binary. Denoting by $a=0$ stopping and $a=1$ continuation, the reward from $a=0$ is explicit, so one only needs to model $Q(s,1)$. Herrera et al. [[17](https://arxiv.org/html/2606.17545#bib.bib21)] proposed the RRLM variant which uses randomized Q-fitting iterations to learn the state-action emulator. Another deep-learning inspired implementation of RL for discrete-time optimal stopping is in [[22](https://arxiv.org/html/2606.17545#bib.bib35)].

One attraction of RL is the ability to handle fully data-driven setups where no model of the stochastic dynamics is available. For American option pricing this corresponds to directly training on past stock trajectories without specifying stochastic dynamics. This “model free” idea was explored in [[10](https://arxiv.org/html/2606.17545#bib.bib32)] and most recently in [[9](https://arxiv.org/html/2606.17545#bib.bib47), [7](https://arxiv.org/html/2606.17545#bib.bib55)]. Related control settings where RL is applied include [[8](https://arxiv.org/html/2606.17545#bib.bib64), [37](https://arxiv.org/html/2606.17545#bib.bib61)]. In our setup, the RL is model-based: the stochastic environment is fully specified and hence arbitrarily many samples can be generated. In particular, we are able to employ adaptive sampling during our training, to preferentially explore regions of interest.

### 1.2 Deep Learning for Optimal Stopping

RL is naturally intertwined with deep learning as policy or value learning go hand in hand with the iterative training of a NN surrogate\. Deep learning has been applied extensively to optimal stopping and in this section we summarize the relevant literature\.

To our knowledge, the first application of neural networks in LSMC was in Kohler et al. [[19](https://arxiv.org/html/2606.17545#bib.bib12)], who employed shallow single-layered NNs to approximate the conditional expectation underlying Snell envelopes. More recently, Lapeyre and Lelong [[20](https://arxiv.org/html/2606.17545#bib.bib23)] and Becker et al. [[3](https://arxiv.org/html/2606.17545#bib.bib24)] considered more advanced deep learning methods, in particular to deal with high-dimensional settings. A key motivation for NNs is to circumvent the well-known challenge of basis function selection in traditional LSMC. Thus, deep NN emulators are employed as flexible approximators that provably converge [[20](https://arxiv.org/html/2606.17545#bib.bib23), [15](https://arxiv.org/html/2606.17545#bib.bib65)] (in the regime of increasing the network size) by a suitable variant of the universal approximation theorem. While offering high expressivity, training a NN involves a non-convex objective and requires gradient descent iterations. As an intermediate approach between classical least squares regression and NN training, Herrera et al. [[17](https://arxiv.org/html/2606.17545#bib.bib21), Section 2] proposed the RLSM algorithm, where the inner layer weights of the NN are randomly sampled and only the last layer is optimized. This randomized approach allows one to retain a convex objective, solved via classical linear regression equations, and can be understood as picking expressive random bases.

All the above works maintain the DP logic of backward recursion, constructing a separate NN emulator at each $t_k$. In practice, these emulators are very similar, since the stopping policies at two adjacent time steps are similar too. This observation is not new; for example both [[20](https://arxiv.org/html/2606.17545#bib.bib23)] and [[3](https://arxiv.org/html/2606.17545#bib.bib24)] exploit it during backward recursion, by re-using the same NN object and gradually updating it. Such warm starts leverage the gradient descent paradigm of deep learning and substantially reduce training time. Taking this logic a step further, Guo et al. [[16](https://arxiv.org/html/2606.17545#bib.bib31)] proposed a single NN for $Q(s,1)$ that takes time $t$ and location $x$ as inputs to approximate continuation values across space-time. This variation increases the prediction accuracy while decreasing the computational time. To train their NN, [[16](https://arxiv.org/html/2606.17545#bib.bib31)] initially set the stopping time to maturity and perform training-updating loops in a basic reinforcement learning setting until a predefined criterion is met.

Additional NN-based approaches were proposed by [[4](https://arxiv.org/html/2606.17545#bib.bib2)], who targeted learning the 0/1 stopping decision rule, and by [[30](https://arxiv.org/html/2606.17545#bib.bib22)], who approximated the epigraph of the stopping set. Finally, we mention the body of work that utilizes NNs to solve a free-boundary partial differential equation for the option price derived via Feynman-Kac formulas, see e.g. deep Galerkin methods [[31](https://arxiv.org/html/2606.17545#bib.bib52)] and backward stochastic differential equations [[6](https://arxiv.org/html/2606.17545#bib.bib53), [14](https://arxiv.org/html/2606.17545#bib.bib60), [35](https://arxiv.org/html/2606.17545#bib.bib57)], both of which specifically tackle American option pricing.

From the implementation side, the precise architecture of the NN can make a significant difference. Dense feed-forward networks have been used in [[19](https://arxiv.org/html/2606.17545#bib.bib12)] (a shallow 1-layer version), as well as [[17](https://arxiv.org/html/2606.17545#bib.bib21), [20](https://arxiv.org/html/2606.17545#bib.bib23), [3](https://arxiv.org/html/2606.17545#bib.bib24)]. [[32](https://arxiv.org/html/2606.17545#bib.bib34)] proposed convolutional NNs, while [[9](https://arxiv.org/html/2606.17545#bib.bib47)] proposed recurrent NNs. For RL-type methods, [[10](https://arxiv.org/html/2606.17545#bib.bib32)] applied customized Double Deep Q-Network (DDQN), Categorical Distributional RL, and Implicit Quantile Networks, deploying an LSTM architecture with a dynamic layer and a dropout wrapper to capture long-term dependencies in sequential data. [[13](https://arxiv.org/html/2606.17545#bib.bib33)] employ a fully connected feed-forward NN for a Q-learning approach to recover the optimal stopping times.

One of the motivations for our study is to control error back-propagation in LSMC. In that vein, we also mention various modified LSMC methods [[39](https://arxiv.org/html/2606.17545#bib.bib26), [2](https://arxiv.org/html/2606.17545#bib.bib25)] that correct the errors accumulated during the backward iteration steps.

The rest of the paper is organized as follows. Section [2](https://arxiv.org/html/2606.17545#S2) sets up the simulation-based framework for solving optimal stopping problems. Section [3](https://arxiv.org/html/2606.17545#S3) presents our new CARLOS algorithm. After a couple of illustrations with a classical 1-dim Put and a 2-dim Max Call option, in Section [3.3](https://arxiv.org/html/2606.17545#S3.SS3) we present the benchmarked results over a collection of American options that have been considered in the literature. Section [4](https://arxiv.org/html/2606.17545#S4) provides the full methodology of our method, including input selection, output generation, and neural network construction details. Along the way, we discuss the key tuning parameters of CARLOS through several comparative statics experiments and give guidance for users. Section [5](https://arxiv.org/html/2606.17545#S5) concludes.

## 2 Optimal Stopping via Dynamic Programming

We adopt a state-space framework: let $(X_t)$ be the $d$-dimensional Markov stochastic state process on a probability space $(\Omega, \mathbb{P}, \mathcal{F})$, taking values in $\mathcal{X} \subset \mathbb{R}^d$ and adapted to a filtration $\mathbb{F} := (\mathcal{F}_t)_{t \in [0,T]}$. The reward at time $t$ is given by $h(X_t)$, where $h: \mathcal{X} \mapsto \mathbb{R}$ satisfies

$$\mathbb{E}\big[\sup_{0 \leq t \leq T} |h(X_t)|\big] < \infty. \tag{1}$$

We interpret $h(x)$ as the (time-stationary) payoff obtained if the state is $x \in \mathcal{X}$, and $\mathbb{P}$ as the pricing (*risk-neutral*) measure. For instance, for a Put option $h(x) = (K - x)_+$. Let $r$ represent the discount rate, assumed to be constant; stochastic discounting or time-dependent payoffs can be embedded as a coordinate of $X$ and then subsumed in $h(\cdot)$.

Consider a contract maturity $T$. We use $\mathbb{T}$ to denote a *time grid*; in what follows, for simplicity, we restrict attention to uniform grids with a constant step. Indexing grids by their size $N$, we write $\mathbb{T}^{(N)} = \{t_0 = 0, t_1, \ldots, t_N = T\}$ with

$$\Delta t^{(N)} = t_{n+1} - t_n = \frac{T}{N}, \qquad n = 0, \ldots, N-1. \tag{2}$$

Let $\mathcal{T}_t$ denote the set of all $\mathbb{F}$-stopping times taking values in $[t,T]$, and similarly $\mathcal{T}^{(N)}_n$ the set of $\mathbb{T}^{(N)}$-valued stopping times $\tau$ that take values in $\{t_n, t_{n+1}, \ldots, t_N\}$. When $N$ is clear we drop the superscript and just write $\mathcal{T}_n$.
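To make the grid notation in (2) concrete, the following minimal sketch builds uniform grids and a sequence of successively halved time steps, in the spirit of the halving rule of thumb mentioned in the introduction. The function names `uniform_grid` and `halving_schedule`, the starting size `N0 = 4`, and the number of levels are illustrative assumptions, not taken from the paper.

```python
def uniform_grid(T, N):
    """Return the uniform grid {t_0 = 0, t_1, ..., t_N = T} with step T / N."""
    return [T * n / N for n in range(N + 1)]


def halving_schedule(T, N0, levels):
    """Grids with N0, 2*N0, 4*N0, ... steps (hypothetical helper).

    Each grid halves the time step of the previous one, so every coarse
    exercise date is also an exercise date on all finer grids.
    """
    return [uniform_grid(T, N0 * 2 ** b) for b in range(levels)]


grids = halving_schedule(T=1.0, N0=4, levels=3)
steps = [g[1] - g[0] for g in grids]  # time step of each level
```

Because the grids are nested, a stopping time that is admissible on a coarse grid remains admissible on every finer one, which is why the Bermudan value can only increase under refinement.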

The Bermudan formulation of the OSP for a given frequency $\Delta t^{(N)} = T/N$ is to compute the value function $V^{(N)}: \mathbb{T}^{(N)} \times \mathcal{X} \mapsto \mathbb{R}$,

$$V^{(N)}(t_n, x) := \sup_{\tau \in \mathcal{T}^{(N)}_n} \mathbb{E}\left[ e^{-r(\tau - t_n)} h(X_\tau) \,\middle|\, X_{t_n} = x \right], \qquad x \in \mathcal{X},\ n \in \{0, \ldots, N-1\}, \tag{3}$$

by finding an optimal stopping time $\tau_n^\star \in \mathcal{T}^{(N)}_n$ at which the supremum of the discounted reward process $\{e^{-r t_k} h(X_{t_k})\}_{k=n}^{N}$ is attained [[29](https://arxiv.org/html/2606.17545#bib.bib6), p. 12]. The American formulation is to solve for

$$V(t, x) := \sup_{\tau \in \mathcal{T}_t} \mathbb{E}\left[ e^{-r(\tau - t)} h(X_\tau) \,\middle|\, X_t = x \right]. \tag{4}$$

Classical results [[12](https://arxiv.org/html/2606.17545#bib.bib43)] imply that $V^{(N)}(t,x) \uparrow V(t,x)$ as $N \to \infty$, with a difference on the order of $\mathcal{O}(N^{-1})$. Hence, a sufficiently fine-grained Bermudan option can be used to approximate the value of the American contract arbitrarily well.

### 2.1 Backward Learning

In this section we fix the grid $\mathbb{T}^{(N)}$ and review the sequential LSMC approach for approximating $V^{(N)}(0, x_0)$. This is both to contrast with RL and because LSMC is a way to initialize our overall algorithm. The idea of LSMC is to approximate the optimal stopping strategy on $\mathbb{T}^{(N)}$ by pathwise dynamic programming: deriving a sequence of stopping times $\tau_n^\star$, indexed by the time step $t_n$, by backward induction from $t_N = T$. We focus on the timing values $\mathscr{T}(t_n, x)$, defined as the difference between the continuation value $\tilde{q}(t_n, x)$ and the immediate reward $h(x)$:

$$\mathscr{T}(t_n, x) := \tilde{q}(t_n, x) - h(x), \tag{5}$$

where

$$\tilde{q}(t_n, X_{t_n}) := \mathbb{E}\left[ e^{-r(\tau_n^\star - t_n)} h(X_{\tau_n^\star}) \mid \mathcal{F}_{t_n} \right], \tag{6}$$

and $\tau_n^\star$ is an optimal stopping time in $\mathcal{T}_n$. The theory of Snell envelopes identifies $\tau_n^\star$ with the first instance when the timing value becomes non-positive,

$$\tau_n^\star = \min\left\{ t_k \in \mathcal{T}_n : \tilde{q}(t_k, X_{t_k}) \leq h(X_{t_k}) \right\} = \min\left\{ t_k \in \mathcal{T}_n : \mathscr{T}(t_k, X_{t_k}) \leq 0 \right\}. \tag{7}$$
Using

$$\tilde{q}(t_n, X_{t_n}) = e^{-r\Delta t}\, \mathbb{E}\left[ \tilde{q}(t_{n+1}, X_{t_{n+1}}) \mid \mathcal{F}_{t_n} \right], \tag{8}$$

one-step-ahead pathwise samples of $\mathscr{T}(t_n, \cdot)$ are obtained via

$$w(t_n, x_n^k) := e^{-r\Delta t}\, \tilde{q}(t_{n+1}, x_{n+1}^k) - h(x_n^k), \tag{9}$$

where $x_{n+1}^k$ is a sample of $X_{t_{n+1}}$ conditional on $X_{t_n} = x_n^k$, for a collection $k \in \{1, \ldots, K\}$. The typical construction is to create $K$ trajectories of the process $X$, labeled cross-sectionally as $\mathbf{x}_n = \{x_n^k, k = 1, \ldots, K\}$, for $n = 0, \ldots, N$.

One may lift from these samples to the overall $L^2$-prediction by regressing the $K$ pathwise timing values $w(t_n, \mathbf{x}_n)$ against $\mathbf{x}_n$. The least-squares regression yields a surrogate $R_n$ whose predicted timing values $\hat{\mathscr{T}}(t_n, \mathbf{x}_n)$ define the pathwise stopping times $\{\tau_n^k\}$ and pathwise continuation values $q_n^k \equiv \tilde{q}(t_n, x_n^k)$ as follows:

$$(\tau_n^k, q_n^k) = \begin{cases} (t_n, h(x_n^k)) & \text{if } \hat{\mathscr{T}}(t_n, x_n^k) \leq 0 \text{ and } h(x_n^k) > 0; \\ (\tau_{n+1}^k, e^{-r\Delta t} q_{n+1}^k) & \text{otherwise}. \end{cases} \tag{10}$$

The recursion for $q_n^k$ is based upon $\tilde{q}(t_n, x_{t_n}^k) \simeq e^{-r(\tau_n^k - t_n)} h(x_{\tau_n^k}^k)$, where the latter expression is an unbiased estimator of ([6](https://arxiv.org/html/2606.17545#S2.E6)). Another familiar functional representation is

$$\tilde{q}(t_n, x) = \max\left( \mathbb{E}\left[ e^{-r\Delta t}\, \tilde{q}(t_{n+1}, X_{t_{n+1}}) \mid X_{t_n} = x \right],\ h(x) \right).$$

For a given trajectory $\{x_n^k\}_{n=0}^{N}$, if the immediate payoff at a time step $t_n$ is positive and the estimated timing value is non-positive, then we should stop, $\tau_n^k = t_n$. Otherwise, we should continue, and the pathwise stopping time is the one computed at the next time step, $\tau_{n+1}^k$. For a fixed step $n$, the set $(\tau_n^k)_{k=1}^{K}$ can be interpreted as a sample of $\tau_n^\star$. Finally, the backward recursion is run from $n = N$ down to $n = 0$ to yield a sequence of regression emulators $\{R_n\}_{n=0}^{N-1}$.
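The backward recursion in (9) and (10) can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' implementation. It assumes a 1-dimensional Put on a geometric Brownian motion, a quadratic polynomial basis, and regression on in-the-money paths only (the usual Longstaff-Schwartz convention). The names `solve_linear` and `lsmc_put` and all parameter values are hypothetical.

```python
import math
import random


def solve_linear(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(M[i][c]))
        M[c], M[p] = M[p], M[c]
        for i in range(c + 1, n):
            f = M[i][c] / M[c][c]
            for j in range(c, n + 1):
                M[i][j] -= f * M[c][j]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))) / M[i][i]
    return z


def lsmc_put(x0=36.0, strike=40.0, r=0.06, sigma=0.2, T=1.0,
             N=10, n_paths=2000, seed=0):
    """Backward LSMC on timing values; returns (in-sample price, coefficients)."""
    rng = random.Random(seed)
    dt = T / N
    disc = math.exp(-r * dt)
    h = lambda x: max(strike - x, 0.0)                    # Put payoff
    basis = lambda x: [1.0, x / strike, (x / strike) ** 2]

    # Simulate n_paths GBM trajectories on the grid t_0, ..., t_N (assumed dynamics).
    paths = []
    for _ in range(n_paths):
        x, path = x0, [x0]
        for _ in range(N):
            z = rng.gauss(0.0, 1.0)
            x *= math.exp((r - 0.5 * sigma ** 2) * dt + sigma * math.sqrt(dt) * z)
            path.append(x)
        paths.append(path)

    q = [h(p[N]) for p in paths]                          # q_N^k = h(x_N^k)
    coeffs = {}
    for n in range(N - 1, 0, -1):
        itm = [k for k in range(n_paths) if h(paths[k][n]) > 0.0]
        beta = None
        if len(itm) >= 3:
            # Pathwise timing values w = e^{-r dt} q_{n+1} - h(x_n), cf. (9).
            w = [disc * q[k] - h(paths[k][n]) for k in itm]
            B = [basis(paths[k][n]) for k in itm]
            A = [[sum(b[i] * b[j] for b in B) for j in range(3)] for i in range(3)]
            rhs = [sum(b[i] * wk for b, wk in zip(B, w)) for i in range(3)]
            beta = solve_linear(A, rhs)                   # least-squares surrogate R_n
        coeffs[n] = beta
        for k in range(n_paths):
            x = paths[k][n]
            stop = False
            if beta is not None and h(x) > 0.0:
                stop = sum(c * f for c, f in zip(beta, basis(x))) <= 0.0
            q[k] = h(x) if stop else disc * q[k]          # update rule (10)

    # At t_0 all paths share x0, so no regression is needed there.
    cont0 = disc * sum(q) / n_paths
    return max(h(x0), cont0), coeffs


price, coeffs = lsmc_put()
```

The returned price is an in-sample estimate; as the text goes on to explain, an unbiased price requires a separate forward evaluation on fresh paths. Each `coeffs[n]` depends on the stopping decisions already made at later steps, which is exactly the coupling that causes errors to back-propagate.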

The timing values are converted into a stopping rule $\phi$ according to

$$\phi(t, x) = \begin{cases} 1 & \text{if } \hat{\mathscr{T}}(t, x) \leq 0 \text{ and } h(x) \geq 0 \text{, or if } t = T, \\ 0 & \text{otherwise}; \end{cases} \tag{11}$$

where $\phi(t,x) = 1$ indicates stopping and $\phi(t,x) = 0$ continuation. Globally, the stopping strategy is summarized by the stopping region $\mathcal{S} := \{(t,x) : \phi(t,x) = 1\}$. To obtain the ultimate option price, a separate forward evaluation Monte Carlo is needed. For a path $\{x_t\}$ starting from arbitrary $(t_{init}, x_{init})$ and progressing along the exercise grid $\mathbb{T}$, the payoff is $h(x_\tau)$, where $\tau$ is the first time it leaves the continuation region $\mathcal{C} := \{(t,x) : \phi(t,x) = 0\}$,

$$\tau := \min\{ t \in \{t_{\text{init}}\} \cup \mathbb{T} : \phi(t, x_t) = 1 \} \wedge T. \tag{12}$$

The expected payoff, i.e., the option price with initial state $x_0$, is obtained as an empirical average over $M$ paths originating at $(0, x_0)$.
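The forward evaluation step can be sketched as below, using the convention of (11) that $\phi = 1$ signals stopping. This is a minimal sketch under stated assumptions: the dynamics are taken to be geometric Brownian motion, the payoff is discounted back to time 0, and the function name `evaluate_rule`, the three example rules, and all parameter values are hypothetical.

```python
import math
import random


def evaluate_rule(phi, payoff, x0, r, sigma, T, n_ex, n_paths, seed=0):
    """Monte Carlo value of stopping rule phi on an exercise grid with n_ex steps."""
    rng = random.Random(seed)
    dt = T / n_ex
    total = 0.0
    for _ in range(n_paths):
        x = x0
        for n in range(n_ex + 1):
            t = n * dt
            # Stop at the first grid time where phi = 1, or at maturity.
            if n == n_ex or phi(t, x) == 1:
                total += math.exp(-r * t) * payoff(x)
                break
            z = rng.gauss(0.0, 1.0)
            x *= math.exp((r - 0.5 * sigma ** 2) * dt + sigma * math.sqrt(dt) * z)
    return total / n_paths


put = lambda x: max(40.0 - x, 0.0)
always_stop = lambda t, x: 1                          # exercise immediately
never_stop = lambda t, x: 0                           # hold to maturity (European)
threshold = lambda t, x: 1 if x <= 34.0 else 0        # made-up constant boundary

args = dict(payoff=put, x0=36.0, r=0.06, sigma=0.2, T=1.0, n_ex=50, n_paths=4000)
v_stop = evaluate_rule(always_stop, **args)
v_euro = evaluate_rule(never_stop, **args)
v_thr = evaluate_rule(threshold, **args)
```

Any fixed rule evaluated this way gives a low-biased estimate of the Bermudan value on that exercise grid, since a suboptimal rule can only lose value relative to the optimal one.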

### 2.2 Time Discretization Effects

Before explaining CARLOS, we discuss the dual role of time stepping, namely for decision optimization and for forward evaluation.

The LSMC paradigm generates two fundamental sources of error. First, it works with stopping rules taking values in $\mathbb{T}^{(N)}$ rather than in continuous time. Second, the training of the surrogates $R_n$ is iterative in $n = N, \ldots, 0$ and errors back-propagate. Further numerical errors, such as function approximation error (how well the surrogate can approximate the true $\mathscr{T}$) and Monte Carlo noise (how much the particular samples affect the fitted network), can in principle be removed, since sufficiently wide/deep NNs are arbitrarily expressive and sufficiently many stochastic gradient descent steps make the Monte Carlo noise negligible.

In contrast, reducing the discretization error directly conflicts with controlling back-propagation: if we make $N$ larger ($\Delta t$ smaller), then there is more error accumulation. This issue is well known in the folklore, and nearly all benchmarks include just a few time periods. Indeed, the typical recommendation is to keep the number of steps $N$ well below 100, which may be reasonable for $T < 1/2$, but otherwise becomes quite restrictive. For example, the oft-cited benchmarks in [[1](https://arxiv.org/html/2606.17545#bib.bib40)] use $T = 3$, $N = 9$, meaning there are just 9 exercise opportunities, with stopping allowed once every 4 months ($\Delta t^{(N)} = 1/3$), an extreme example of Bermudization.

There are in fact two different discretization parameters when using a discrete-time solver for OSP. The exercise frequency $\Delta t^{ex}$ controls when the reward may be collected during forward evaluation. Large $\Delta t^{ex}$ values correspond to fewer stopping opportunities. The solver or “training” frequency $\Delta t^{tr}$ determines how often the maximum operator is applied when computing the Snell envelope. This parameter controls the resulting stopping policy, i.e., the exercise rule. While classically $\Delta t^{ex} = \Delta t^{tr}$, they can be different: for instance, one may allow stopping very frequently even if the stopping policy is obtained from a coarse time grid; or conversely, one might have a fine-grained policy but exercise only rarely.
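One way to decouple the two parameters is to extend a rule computed on the coarse training grid to arbitrary times and then query it on a finer exercise grid. The sketch below assumes a Put-type rule given by one exercise threshold per training date, extended in time as a right-continuous step function. The function name `coarse_rule_on_fine_grid` and the threshold values are made up for illustration.

```python
import math


def coarse_rule_on_fine_grid(boundary, dt_tr):
    """Extend thresholds known at times i * dt_tr to any t in [0, T].

    boundary[i] is the (hypothetical) exercise threshold at t = i * dt_tr;
    between training dates the most recent threshold is reused.
    """
    def phi(t, x):
        # Small offset guards against floating-point error in t / dt_tr.
        i = min(int(math.floor(t / dt_tr + 1e-9)), len(boundary) - 1)
        return 1 if x <= boundary[i] else 0   # 1 = stop, 0 = continue
    return phi


# Training grid with dt_tr = 1/4 on [0, 1]; thresholds are illustrative only.
phi = coarse_rule_on_fine_grid([30.0, 32.0, 34.0, 36.0, 40.0], dt_tr=0.25)

# Query the same rule on a finer exercise grid with dt_ex = 1/16.
decisions = [phi(n / 16, 33.0) for n in range(17)]
```

Here the policy is fixed by the five training dates, while the 17 exercise dates only determine how often that policy is consulted.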

Literally taking $\Delta t^{ex} \to 0$ is unnecessary. First, the loss when using $\Delta t^{ex} = T/N$ compared to continuous stopping when computing timing values is theoretically $\mathcal{O}(N^{-1})$ [[12](https://arxiv.org/html/2606.17545#bib.bib43)], so there is limited upside to refining ad infinitum. Second, evaluating a rule with exercise frequency $\Delta t^{ex}$ takes effectively $\mathcal{O}((\Delta t^{ex})^{-1}) = \mathcal{O}(N)$ effort, since we must evaluate the machine learning surrogates $R_n(\cdot)$ for every $n = 1, \ldots, N$ (doubling the number of steps requires double the evaluations of the $R_n$'s). Third, the learning becomes more challenging as $\Delta t^{ex} \to 0$ due to the smooth fit principle: in discrete time, there is (theoretically) a crisp classification between a timing value being positive (and hence not stopping) or negative. However, in continuous time, we have $\partial_x \tilde{\mathscr{T}}(t,x) = 0$ and $\min_x \tilde{\mathscr{T}}(t,x) = 0$, so that the value function smooth-pastes to the payoff function and the timing value is at least zero, never negative. This makes it harder to numerically determine whether to stop or continue at an input $(t,x)$ close to the boundary.

To illustrate, we price a 1\-dimensional Bermudan Put option \(B1from Table[2](https://arxiv.org/html/2606.17545#S3.T2)below\) using a variety of exercise frequenciesΔtex\\Delta t^\{ex\}and solver frequenciesΔttr\\Delta t^\{tr\}\. For the latter, we use the Crank\-Nicolson partial differential equation solver \(CN, which is second\-order accurate in both time and space and is unconditionally stable\[[34](https://arxiv.org/html/2606.17545#bib.bib44), p\. 156\]\)111Both the CN and explicit finite\-difference schemes use a time stepΔtPDE≤Δttr\\Delta t^\{PDE\}\\leq\\Delta t^\{tr\}to approximate time derivatives\. The time stepΔtPDE\\Delta t^\{PDE\}is inherent to the finite\-difference method itself and chosen independently\.\. Figure[1](https://arxiv.org/html/2606.17545#S2.F1)displays the optimal exercise boundaries under variousΔttr\\Delta t^\{tr\}’s\. Each boundaryB\[I\]B^\{\[I\]\}is a right\-continuous function with steps at each\{iΔttr:i=0,1,…,I\}\\\{i\\Delta t^\{tr\}:i=0,1,\\ldots,I\\\}, whereI=T/ΔttrI=T/\\Delta t^\{tr\}\. A larger solver frequencyΔttr\\Delta t^\{tr\}lowers potential future gains \(as there are fewer opportunities to stop\) and therefore makes the controller more impatient about collecting the reward early\. Accordingly, in Figure[1](https://arxiv.org/html/2606.17545#S2.F1)asΔttr\\Delta t^\{tr\}increases, the boundaries shift upwards\. We note that asΔttr\\Delta t^\{tr\}decreases, the boundariesB\[I\]B^\{\[I\]\}rapidly converge \(to the boundaryB\[∞\]B^\{\[\\infty\]\}of the American formulation\)\. At the other extreme, we plot theEuropean boundaryB\[0\]B^\{\[0\]\}that delineates the region

$$\big\{(t,x)\in[0,T]\times\mathcal{X}:\ \mathbb{E}\big[e^{-r(T-t)}h(X_T)\,\big|\,X_t=x\big]=h(x)\big\}, \tag{13}$$

where the immediate payoff $h(x)$ exceeds the price of the European-style contract, effectively corresponding to the Bermudan formulation with $\Delta t^{tr}=T$. The European boundary delineates the region where the stopping decision must be evaluated; above it, continuation is a priori optimal.
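As an illustration of the European boundary in (13), the following minimal sketch locates the threshold for a 1-dimensional put by bisection. It assumes Black-Scholes dynamics with zero dividend yield and uses the B1 parameters ($\mathcal{K}=40$, $r=0.05$, $\sigma=0.2$); the function names `european_put` and `european_boundary` are illustrative and do not come from the paper.

```python
import math


def norm_cdf(z: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def european_put(x: float, strike: float, r: float, sigma: float, tau: float) -> float:
    """Black-Scholes European put price (no dividends) with time to maturity tau > 0."""
    vol = sigma * math.sqrt(tau)
    d1 = (math.log(x / strike) + (r + 0.5 * sigma**2) * tau) / vol
    d2 = d1 - vol
    return strike * math.exp(-r * tau) * norm_cdf(-d2) - x * norm_cdf(-d1)


def european_boundary(strike: float, r: float, sigma: float, tau: float,
                      tol: float = 1e-10) -> float:
    """Solve european_put(x) = strike - x for x by bisection.

    The gap put(x) - (strike - x) is increasing in x (its derivative is N(d1) > 0),
    negative near x = 0 when r > 0, and positive at x = strike, so the root is unique.
    """
    def gap(x: float) -> float:
        return european_put(x, strike, r, sigma, tau) - (strike - x)

    lo, hi = 1e-8 * strike, strike
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gap(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Threshold one year before maturity and shortly before maturity (B1 parameters).
b_far = european_boundary(40.0, 0.05, 0.2, 1.0)
b_near = european_boundary(40.0, 0.05, 0.2, 0.1)
```

Under these assumptions the threshold moves toward the strike as the time to maturity shrinks, which matches the qualitative shape of the dashed contour described for Figure 1.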

![Refer to caption](https://arxiv.org/html/2606.17545v1/x1.png)

Figure 1: Optimal stopping boundaries of the 1-dimensional Bermudan Put option B1 from Table [2](https://arxiv.org/html/2606.17545#S3.T2) computed via the CN method at various solver frequencies $\Delta t^{tr}$. These boundaries were obtained by converting $\phi(t,x)$ into contour lines. The CN method used $\Delta t^{\text{PDE}}=1/192$ and $\Delta x=0.02$. The dashed contour shows the thresholds $B^{[0]}$ where the European Put becomes less valuable than its intrinsic value, cf. ([13](https://arxiv.org/html/2606.17545#S2.E13)).

Table [1](https://arxiv.org/html/2606.17545#S2.T1) shows how the interplay between the exercise frequency $\Delta t^{ex}$ and the solver frequency $\Delta t^{tr}$ influences the expected reward. The classical case of $\Delta t^{ex}=\Delta t^{tr}$ is on the diagonal and shows that, as expected, more frequent stopping opportunities capture greater payoffs. The more interesting behavior is off-diagonal. When $\Delta t^{ex}>\Delta t^{tr}$, infrequent stopping “leaves money on the table” by skipping exercise opportunities. This results in lower rewards compared to $\Delta t^{tr}=\Delta t^{ex}$. Conversely, when $\Delta t^{ex}<\Delta t^{tr}$, the stopping rule is too cautious (e.g., using the blue boundary in Figure [1](https://arxiv.org/html/2606.17545#S2.F1) when the red one is optimal), yet the ability to stop more frequently still contributes to higher rewards. However, making $\Delta t^{ex}$ extremely small may lead to lower rewards. Testing the statistical significance of price differences between adjacent $\Delta t^{ex}$ values in Table [1](https://arxiv.org/html/2606.17545#S2.T1) suggests that for a given $\Delta t^{tr}$ the highest rewards are obtained when $\Delta t^{ex}\simeq 1/4\times\Delta t^{tr}$. (Footnote 2: These findings are consistent across contracts, see Table [A.3](https://arxiv.org/html/2606.17545#A2.T3) in the Appendix.) Finally, we note that the MC-based payoffs are slightly different from those reported by the PDE solver due to the finite-difference and finite-domain boundary errors.

| $\Delta t^{tr}$ (rows) / $\Delta t^{ex}$ (columns) | $\frac{1}{12}$ | $\frac{1}{24}$ | $\frac{1}{48}$ | $\frac{1}{96}$ | $\frac{1}{192}$ | PDE |
|---|---|---|---|---|---|---|
| $\frac{1}{12}$ | 4.571 | 4.584 | 4.587 | 4.585 | 4.582 | 4.567 |
| $\frac{1}{24}$ | 4.568 | 4.586 | 4.593 | $\mathbf{4.594}^{**}$ | $4.593^{**}$ | 4.582 |
| $\frac{1}{48}$ | 4.564 | 4.585 | 4.594 | 4.597 | $4.597^{**}$ | 4.589 |
| $\frac{1}{96}$ | 4.559 | 4.583 | 4.594 | 4.598 | 4.600 | 4.593 |
| $\frac{1}{192}$ | 4.553 | 4.579 | 4.592 | 4.599 | 4.601 | 4.595 |

Table 1: Average payoffs for the B1 contract from Table [2](https://arxiv.org/html/2606.17545#S3.T2), estimated using $1.6\times 10^{6}$ Monte Carlo simulated paths and PDE-based stopping rules. Highest value in each row is bolded. Row-wise price differences between adjacent $\Delta t^{ex}$ values are evaluated for statistical significance; standard deviations are approx. 0.0025 throughout. If the difference between a price and its right neighbor is not statistically significant, the right-side price is marked with ${}^{*}$ at the 0.05 level and ${}^{**}$ at the 0.01 level. For reference, the PDE-based prices for each $\Delta t^{tr}$ are also provided in the last column; the respective CN parameters are $\Delta t^{\text{PDE}}=1/192$, $\Delta x=0.02$.

Table [1](https://arxiv.org/html/2606.17545#S2.T1) is one of the motivations for our RL approach. Indeed, we wish to “traverse” the table from the top-left (coarse Bermudan-style optionality) to the bottom-right (fine-grained American-style), by gradually refining $\Delta t^{ex}$ and $\Delta t^{tr}$. The table shows that we need not proceed strictly along the diagonal and crystallizes the distinction between training and forward-evaluation discretizations. Part of our objective is then to analyze how much $\Delta t^{tr}$ must be reduced to obtain a sufficiently accurate American option price estimate and how to progressively maximize the extracted payoff as exercise grids are refined.

## 3 New Algorithm

### 3.1 Reinforcement Learning

To overcome the limitations of the classical LSMC method, we develop a new RL-driven two-stage approach:

1. Stage 1a: Run the classical LSMC algorithm to construct a sequence of regression emulators.
2. Stage 1b: Train an aggregate deep neural network using the data from Stage 1a.
3. Stage 2: Run an RL algorithm that gradually refines the timing values by retraining the ADNN.

In Stage 1, we use a coarse time grid and aim for a moderate accuracy level. The goal is to have a reasonable starting point for the RL: the primary effort and computational resources are reserved for Stage 2. To initialize the RL, we stack the training data $(\mathbf{y},(\mathbf{t},\mathbf{x}))$ generated by the classical LSMC in Stage 1a across all the time steps and use it to train a global regressor $R^{\Theta}$, henceforth referred to as the *aggregate deep neural network* (ADNN), where $\Theta$ denotes its trainable parameters (distinguished from the NN parameters $\theta_n$, $n=0,\ldots,N$, in Stage 1a). The ADNN incorporates both the time and spatial dimensions as inputs, setting the stage for learning out-of-sample timing values corresponding to smaller solver frequencies.

The core RL iterations in Stage 2, indexed by the loop counter $\ell$, serve two concurrent objectives. The first role of the loop is to refine the stopping frequency $\Delta t^{tr}$. The second role is to converge to the optimal ADNN weights through training on additional inputs, resolving inaccuracies inherited from Stage 1 while mitigating error accumulation. The two objectives must be achieved gradually and in parallel due to the underlying challenge of *concept drift*: changing the distribution of training inputs and outputs.

To formalize our grid refinement, we define a sequence of nested time grids $\mathbb{T}^{(b)}\supseteq\mathbb{T}^{(b-1)}$, $b=0,\ldots,B$, and label the exercise grid as $\mathbb{T}^{(ex,b)}$ and the solver grid as $\mathbb{T}^{(tr,b)}$. We use the coarsest $\mathbb{T}^{(tr,0)}$ for Stage 1, which also serves to initialize the RL in Stage 2.
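A minimal sketch of such a nested grid schedule is given next. It assumes, purely for illustration, that each level halves the previous step (consistent with a refinement from $\frac{1}{6}$ to $\frac{1}{96}$ over five levels); the helper names `uniform_grid` and `halving_schedule` are hypothetical. Exact rational arithmetic is used so that nestedness can be checked without floating-point noise.

```python
from fractions import Fraction


def uniform_grid(T: Fraction, step: Fraction) -> set:
    """Return the time grid {0, step, 2*step, ..., T} as exact fractions."""
    n = T / step
    if n.denominator != 1:
        raise ValueError("step must divide the horizon T")
    return {i * step for i in range(int(n) + 1)}


def halving_schedule(T: Fraction, coarsest_step: Fraction, levels: int) -> list:
    """Grids for b = 0, ..., levels, where level b uses step coarsest_step / 2**b.

    Halving is an assumption made for this sketch, not a prescription of the paper.
    """
    return [uniform_grid(T, coarsest_step / 2**b) for b in range(levels + 1)]


# Five levels: steps 1/6, 1/12, 1/24, 1/48, 1/96 on the horizon T = 1.
grids = halving_schedule(Fraction(1), Fraction(1, 6), 4)
nested = all(grids[b - 1] <= grids[b] for b in range(1, len(grids)))
```

Because every coarse time point reappears on each finer grid, targets computed at level $b-1$ remain valid inputs at level $b$.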

To update the ADNN $R^{[\ell]}$, in each loop $\ell$ we generate a set of $M$ input-output pairs $(\mathbf{y},(\mathbf{t},\mathbf{x}))^{[\ell]}$, with boldface denoting vectors $\mathbf{y}^{[\ell]}=y^{1:M}$, etc. A detailed explanation of how these input-output collections are constructed is provided in Sections [4.2](https://arxiv.org/html/2606.17545#S4.SS2) and [4.3](https://arxiv.org/html/2606.17545#S4.SS3). After each training update, algorithm logic is used to decide what should be the next exercise frequency: either keeping the same grid $\mathbb{T}^{(tr,b)}$ or shifting to the next finer grid $\mathbb{T}^{(tr,b+1)}$, see Section [4.5](https://arxiv.org/html/2606.17545#S4.SS5). The learning continues until the ADNN has been trained on the finest solver grid $\mathbb{T}^{(tr,B)}$, see the pseudocode of CARLOS in Algorithm \[[1](https://arxiv.org/html/2606.17545#alg1)\].

Algorithm 1 CARLOS: Continuous-time Adaptive Reinforcement Learning for Optimal Stopping

1. Inputs: Initial ADNN $R^{[0]}$; RL parameters, e.g. solver grids $\mathbb{T}^{(tr,B)}\supseteq\ldots\supseteq\mathbb{T}^{(tr,0)}$, learning rate $\eta^{[0]}$, training set size $M$; path and contract parameters.
2. Initialize loop count $\ell\leftarrow 0$ and grid index $b\leftarrow 0$
3. while $b\leq B$ do
4. Generate $M$ inputs $(\mathbf{t},\mathbf{x})^{[\ell]}$
5. Generate $M$ respective outputs $\mathbf{y}^{[\ell]}$ based on rewards on paths with step-size $\Delta t^{(ex),b}$
6. Update to ADNN $R^{[\ell+1]}$ by training $R^{[\ell]}$ on $(\mathbf{y},(\mathbf{t},\mathbf{x}))^{[\ell]}$ with learning rate $\eta^{[\ell]}$
7. Set $\ell\leftarrow\ell+1$
8. Update grid level $b$ and learning rate $\eta^{[\ell+1]}$
9. end while
10. return Final refined ADNN $R^{[\ell]}$
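The control flow of Algorithm 1 can be sketched as a generic loop. This is only a skeleton under stated assumptions: the callables `sample_inputs`, `compute_targets`, `train` and `should_advance` are hypothetical placeholders for the components described in Section 4, a single grid per level stands in for both the exercise and solver grids, the multiplicative learning-rate decay `alpha_dec` is applied whenever the grid level advances, and `max_loops` is a safety guard added here.

```python
from typing import Any, Callable, List, Sequence, Tuple


def carlos_loop(
    model: Any,
    grids: Sequence[Any],
    eta0: float,
    alpha_dec: float,
    sample_inputs: Callable[[Any, Any], Any],
    compute_targets: Callable[[Any, Any, Any], Any],
    train: Callable[[Any, Any, Any, float], Any],
    should_advance: Callable[[Any, Any, int], bool],
    max_loops: int = 1000,
) -> Tuple[Any, List[Tuple[int, int, float]]]:
    """Skeleton of the refinement loop: train, then keep or refine the grid."""
    ell, b, eta = 0, 0, eta0
    history: List[Tuple[int, int, float]] = []
    while b < len(grids) and ell < max_loops:
        inputs = sample_inputs(model, grids[b])             # step 4: M inputs (t, x)
        targets = compute_targets(model, inputs, grids[b])  # step 5: pathwise outputs y
        model = train(model, inputs, targets, eta)          # step 6: update the network
        history.append((ell, b, eta))
        ell += 1                                            # step 7
        if should_advance(model, grids[b], ell):            # step 8: grid level and rate
            b += 1
            eta *= alpha_dec
    return model, history
```

Passing the components as callables keeps the loop independent of any particular network library or path simulator; the returned `history` records which grid level and learning rate were used in each loop.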

For visualization purposes, we primarily track the stopping boundaries associated with each $R^{[\ell]}$. Given the ADNN $R^{[\ell]}$, the corresponding stopping boundary is defined as

$$B^{[\ell]}=\big\{(t,x)\in[0,T]\times\mathcal{X}:\ \hat{\mathscr{T}}^{[\ell]}(t,x)=0\big\}, \tag{14}$$

where $\hat{\mathscr{T}}^{[\ell]}$ denotes the timing value estimated by $R^{[\ell]}$. The boundary separates the stopping region $\mathcal{S}^{[\ell]}$ and the continuation region $\mathcal{C}^{[\ell]}$.

### 3.2 Time Grid Refinement

The ADNN $R^{[\ell]}$ is trained on input-output samples which directly depend on the underlying solver frequency $\Delta t^{tr,b}$. Consequently, if one were to initialize the ADNN with a coarse solver grid $\mathbb{T}^{(tr,0)}$ and then immediately switch to a fine time grid $\mathbb{T}^{(tr,B)}$, the underlying training samples would shift significantly, strongly affecting the next learning steps. The mismatch between what the network has learned so far and what it is shown next creates an extrapolation error that hurts learning. To give the ADNN time to adapt, we therefore gradually refine the time grids so that we simultaneously learn to stop on a finer grid and discover the best ADNN parameters $\Theta$ governing that strategy.

At each learning loop $\ell$, CARLOS adaptively decides whether to transition to a finer solver grid or keep it the same. To this end, we assess incremental gains via a set of *validation paths*. Namely, we fix a database of $V$ paths $\{\mathbf{x}\}^{1:V}$ and compute the reward vector $\Upsilon^{[\ell]}$ over these $V$ paths using the grid $\mathbb{T}^{(ex,b)}$ and the ADNN $R^{[\ell]}$. The pathwise reward differences $\mathbf{D}^{[\ell]}:=\Upsilon^{[\ell]}-\Upsilon^{[\ell-1]}$ across successive loops are used to ascertain whether the learning at level $b$ has saturated. Once the average of $\mathbf{D}^{[\ell]}$ is not statistically different from zero, the RL algorithm advances to the next grid $\mathbb{T}^{(tr,b+1)}$ in the schedule, see Section [4.5](https://arxiv.org/html/2606.17545#S4.SS5).
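One way to operationalize "not statistically different from zero" is a paired z-test on the pathwise differences, sketched here. The precise rule is the one described in Section 4.5; the two-sided test, the critical value `z_crit = 1.96` and the function name `learning_saturated` are assumptions made only for this example.

```python
import math
from statistics import mean, stdev


def learning_saturated(rewards_new, rewards_old, z_crit: float = 1.96) -> bool:
    """Paired test on D = rewards_new - rewards_old over the same validation paths.

    Returns True when the mean difference is within z_crit standard errors of zero,
    i.e. when there is no detectable change at this grid level.
    Requires at least two validation paths.
    """
    diffs = [new - old for new, old in zip(rewards_new, rewards_old)]
    d_bar = mean(diffs)
    std_err = stdev(diffs) / math.sqrt(len(diffs))
    if std_err == 0.0:
        # All pathwise differences are identical: saturated only if they are zero.
        return d_bar == 0.0
    return abs(d_bar / std_err) < z_crit
```

Pairing the rewards on a fixed set of validation paths removes the path-to-path variance that would otherwise mask small gains between successive loops.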

![Refer to caption](https://arxiv.org/html/2606.17545v1/x2.png)

![Refer to caption](https://arxiv.org/html/2606.17545v1/x3.png)

Figure 2: *Left Panel:* Average rewards $\upsilon^{[\ell]}=\mathrm{Ave}(\Upsilon^{[\ell]})$ (left y-axis) and learning rates $\eta^{[\ell]}$ (right y-axis, log scale) across RL loops $\ell$ and exercise grids $\mathbb{T}^{(ex,b)}$ for the B1 option from Table [2](https://arxiv.org/html/2606.17545#S3.T2). Stopping intervals associated with reward differences $\mathbf{D}^{[\ell]}$ are vertically shifted by $\upsilon^{[\ell]}$ and displayed as blue bands. *Right Panel:* Stopping boundaries (obtained using a marching-squares algorithm) $B^{[\ell]}$, $\ell=0,\ldots,L$, after each RL iteration as the step size is refined from $\Delta t^{tr,0}=\frac{1}{6}$ to $\Delta t^{tr,4}=\frac{1}{96}$. Distinct colors indicate different grids $b$, with progressively lighter shades marking repeated loops on the same time grid $\mathbb{T}^{(tr,b)}$. The dashed line demarcates the out-of-the-money region. The heatmap displays the final timing value $\hat{\mathscr{T}}^{[L]}(\cdot,\cdot)$ at loop $L=10$, with its zero-contour corresponding to the final stopping boundary drawn with a thicker line. Parameter configuration is reported in Table [6](https://arxiv.org/html/2606.17545#S4.T6).

When transitioning to a new solver grid, we decrease the learning rate according to $\eta^{[b+1]}=\alpha_{\text{dec}}\cdot\eta^{[b]}$. This stabilizes training and preserves recent performance gains by limiting further changes to the ADNN's learned parameters. We take $\eta^{[0]}=10^{-4}$ and $\alpha_{\text{dec}}=0.7$ as the default initial learning rate and decay factor. The RL ends when the ADNN has been fully trained on the finest grid $\mathbb{T}^{(tr,B)}$.
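For concreteness, the geometric decay rule can be unrolled over the grid levels. The short sketch below uses the stated defaults $\eta^{[0]}=10^{-4}$ and $\alpha_{\text{dec}}=0.7$; the five levels and the function name `learning_rate` are illustrative choices.

```python
def learning_rate(b: int, eta0: float = 1e-4, alpha_dec: float = 0.7) -> float:
    """Learning rate on grid level b, obtained by applying the decay b times."""
    return eta0 * alpha_dec**b


# Rates for grid levels b = 0, ..., 4:
# approximately 1.0e-4, 7.0e-5, 4.9e-5, 3.43e-5, 2.401e-5.
rates = [learning_rate(b) for b in range(5)]
```

With these defaults the rate on the fifth grid level is roughly a quarter of the initial one, so later refinements perturb the network much less than the early loops do.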

The left panel of Figure [2](https://arxiv.org/html/2606.17545#S3.F2) illustrates the progression of the average rewards $\upsilon^{[\ell]}:=\mathrm{Ave}(\Upsilon^{[\ell]})$ as a function of $\ell$ for the B1 contract in Table [2](https://arxiv.org/html/2606.17545#S3.T2). While $\upsilon^{[\ell]}$ tends to increase, the pattern is not monotone. Moreover, jumps occur when a finer exercise grid is selected. The right y-axis of the figure displays the corresponding learning rates $\eta^{[\ell]}$, which decrease as $\Delta t^{tr}$ decreases.

The right panel of Figure [2](https://arxiv.org/html/2606.17545#S3.F2) shows the evolution of the stopping boundaries $B^{[\ell]}$ of this Bermudan Put B1 across the successive $\ell=1,2,\ldots,L=10$ RL loops. As expected, the continuation region expands as the time grid is refined, so that the stopping boundaries shift “downward” and also get steeper near maturity $T$. However, this expansion is not monotone: in loop 5, the RL overshoots because the ADNN mis-estimates timing values near maturity $T$ on a solver grid with step size $\Delta t^{tr}=\tfrac{1}{48}$; once the grid is further refined in loop 6 to $\Delta t^{tr}=\tfrac{1}{96}$, the boundary shifts back upward. This demonstrates that the CARLOS algorithm can correct earlier errors and recover boundaries that yield higher rewards, as evidenced in the left panel of Figure [2](https://arxiv.org/html/2606.17545#S3.F2).

As a further illustration, Figure [3](https://arxiv.org/html/2606.17545#S3.F3) shows CARLOS solving the 2-d Max Call M2 contract from Table [2](https://arxiv.org/html/2606.17545#S3.T2). The left panel of Figure [3](https://arxiv.org/html/2606.17545#S3.F3) tracks the changes in the stopping boundary at time $t=1$ over successive learning iterations. Once again we note the expansion of the continuation region as $\Delta t^{ex}$ is decreased. The resulting ADNN boundaries resemble those reported by \[[30](https://arxiv.org/html/2606.17545#bib.bib22),[36](https://arxiv.org/html/2606.17545#bib.bib45)\]. The right panel illustrates the gradual improvements during Stage 2, tracking the average rewards across the RL loops (cf. the earlier Figure [2](https://arxiv.org/html/2606.17545#S3.F2)) across five independent runs of CARLOS. Note that due to the adaptive stopping criterion, the resulting number of RL loops $L$ varies across runs, being as low as 5 and as high as 7. Running our algorithm 5 times, we obtain final prices in the range $[14.170,14.195]$, just shy of the explicit finite-difference PDE solver value of 14.214 in Table [A.3](https://arxiv.org/html/2606.17545#A2.T3). For comparison purposes, the Bermudan version with $\Delta t=1/3$ has reference prices of 13.901 and 13.898 as reported in \[[3](https://arxiv.org/html/2606.17545#bib.bib24)\] and \[[30](https://arxiv.org/html/2606.17545#bib.bib22)\], while CARLOS yields 14.173 (see Table [6](https://arxiv.org/html/2606.17545#S4.T6)), capturing more than 90% of the materially large gap (30 cents or more than 2%) between the American and Bermudan formulations.

![Refer to caption](https://arxiv.org/html/2606.17545v1/x4.png)

![Refer to caption](https://arxiv.org/html/2606.17545v1/x5.png)

Figure 3: Pricing the M2 contract from Table [2](https://arxiv.org/html/2606.17545#S3.T2) under the baseline CARLOS configuration in Table [6](https://arxiv.org/html/2606.17545#S4.T6). *Left panel (a):* Stopping boundaries $B^{[\ell]}$ (obtained using a marching-squares algorithm from the ADNN $R^{[\ell]}$) at time $t=1$, after each RL iteration $\ell=0,\ldots,L$ as the step size is refined from $\Delta t^{tr}=\frac{1}{6}$ to $\Delta t^{tr}=\frac{1}{96}$. The final Stage 2 boundary at loop $L=7$ is drawn with a thicker line. Distinct colors indicate different grid levels $b$, with progressively lighter shades marking repeated loops on the same grid $\mathbb{T}^{(tr,b)}$. The dashed lines encompass the out-of-the-money region. The heatmap (computed via an antialiased interpolant) corresponds to the final estimated timing values $\hat{\mathscr{T}}^{[L]}(t,x)$. *Right panel (b):* Average validation rewards $\upsilon^{[\ell]}$ across loops $\ell=1,2,\ldots,L$, over five independent RL runs. Thicker lines denote finer solver grids $\mathbb{T}^{(tr,b)}$.

### 3.3 Benchmarked Results

We proceed to evaluate our RL algorithm on a test suite of contracts with a range of payoffs and number of underlying assets\. Unlike the vast majority of extant literature, our goal is not to value a Bermudan option with a pre\-specified exercise frequency, but the true American contract where exercise can occur at any time\. To provide a comprehensive assessment, we consider a collection of option contracts in dimensions 1\-5, listed in Table[2](https://arxiv.org/html/2606.17545#S3.T2)\. We do not tackle truly high\-dimensional problems \(which require additional adjustments\), but rather focus on realistic settings, emphasizing the aforementioned concepts of time\-discretization and efficient learning\.

To assess performance, for low-dimensional tests we benchmark against PDE-based solvers, which can handle very small $\Delta t^{ex}$ but are only tractable in dimensions 1-2. (Footnote 3: Implementations in 3 dimensions exist, but are cumbersome and not done here.) These provide a gold-standard deterministic comparator. Next, we compare to the price estimate based on an ADNN that is trained on a medium-grained grid $\Delta t^{tr}$ and evaluated on Monte Carlo paths at a fine exercise frequency $\Delta t^{ex}$. This setup allows us to capture a substantial chunk of the American early exercise premium as described in Section [2.2](https://arxiv.org/html/2606.17545#S2.SS2), albeit at a high computational cost. Finally, we compare our results against Bermudan solvers reported in the literature, which serve as reference lower bounds for the American value and quantify the gains from removing the time discretization.

| Case | Payoff | $d$ | $X_0$ | $\mathcal{K}$ | $T$ | $\vec{\delta}$ | $r$ | $\vec{\sigma}$ |
|---|---|---|---|---|---|---|---|---|
| B1 | Basket Put | 1 | 36 | 40 | 1 | 0 | 0.05 | 0.2 |
| B2 | Basket Put | 2 | $\{40,40\}$ | 40 | 1 | $\{0,0\}$ | 0.06 | $\{0.2,0.2\}$ |
| M2.A | Max Call | 2 | $\{100,100\}$ | 100 | 3 | $\{0.1,0.1\}$ | 0.05 | $\{0.2,0.2\}$ |
| M2.B | Max Call | 2 | $\{100,100\}$ | 100 | 3 | $\{0.05,0.15\}$ | 0.05 | $\{0.2,0.2\}$ |
| M3 | Max Call | 3 | $\{90,90,90\}$ | 100 | 3 | $\{0.1,0.1,0.1\}$ | 0.05 | $\{0.2,0.2,0.2\}$ |
| M5.A | Max Call | 5 | $\{100,100,100,100,100\}$ | 100 | 3 | $\{0.1,0.1,0.1,0.1,0.1\}$ | 0.05 | $\{0.2,0.2,0.2,0.2,0.2\}$ |
| M5.B | Max Call | 5 | $\{70,70,70,70,70\}$ | 100 | 3 | $\{0.1,0.1,0.1,0.1,0.1\}$ | 0.05 | $\{0.08,0.16,0.24,0.32,0.4\}$ |

Table 2: Specifications of the benchmarked option contracts. Basket Put and Max Call payoffs are as in ([16](https://arxiv.org/html/2606.17545#S3.E16)). For each contract, $d$ denotes the number of underlying assets, $X_0$ the $d$-vector initial condition, $\mathcal{K}$ the strike price, $T$ the maturity, $\vec{\delta}$ the vector of dividend yields, $r$ the risk-free interest rate and $\vec{\sigma}$ the vector of volatilities.

The underlying assets follow a $d$-dimensional Black–Scholes framework. Under the risk-neutral measure (still denoted as $\mathbb{P}$), the asset dynamics are given by

$$X_t^i=X_0^i\exp\left\{\left(r-\delta_i-\tfrac{\sigma_i^2}{2}\right)t+\sigma_iW_t^i\right\},\quad i=1,2,\ldots,d, \tag{15}$$

where $X_0^i>0$ is the initial value, $\delta_i\geq 0$ the dividend yield, and $\sigma_i>0$ the volatility of the $i^{\text{th}}$ asset; $r\in\mathbb{R}$ is the risk-free rate. All the Brownian motions $W^1,\ldots,W^d$ are independent. We consider two payoff families: the arithmetic basket put $h_{bskt}$ and the max call $h_{mxcl}$:

$$h_{bskt}(X_t)=\Bigl(\mathcal{K}-\frac{1}{d}\sum_{i=1}^{d}X_t^i\Bigr)_+,\qquad h_{mxcl}(X_t)=\Bigl(\max_{1\leq i\leq d}X_t^i-\mathcal{K}\Bigr)_+. \tag{16}$$

When $d=1$, these reduce to the standard put and call payoffs, respectively.
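The dynamics (15) and payoffs (16) translate directly into code. The sketch below draws $X_t$ from the exact lognormal solution with independent Brownian motions and evaluates both payoffs; it uses only the standard library, and the function names are illustrative rather than taken from the authors' implementation.

```python
import math
import random


def simulate_terminal(x0, r, delta, sigma, t, rng):
    """Sample X_t componentwise from (15); W_t^i ~ N(0, t), independent across i."""
    out = []
    for x, d, s in zip(x0, delta, sigma):
        w = rng.gauss(0.0, math.sqrt(t))
        out.append(x * math.exp((r - d - 0.5 * s * s) * t + s * w))
    return out


def basket_put(x, strike):
    """Arithmetic basket put: (K - mean(x))_+."""
    return max(strike - sum(x) / len(x), 0.0)


def max_call(x, strike):
    """Max call: (max(x) - K)_+."""
    return max(max(x) - strike, 0.0)


# One draw of the M2.A underlying at maturity T = 3 and its max-call payoff.
rng = random.Random(0)
x_T = simulate_terminal([100.0, 100.0], 0.05, [0.1, 0.1], [0.2, 0.2], 3.0, rng)
payoff = max_call(x_T, 100.0)
```

Since (15) is an exact solution, a path on any exercise grid can be generated by applying the same update over each increment without introducing time-discretization error in the state.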

To assess the CARLOS Algorithm [1](https://arxiv.org/html/2606.17545#alg1), we compare the Stage 2 price estimates with the benchmark prices summarized in Table [3](https://arxiv.org/html/2606.17545#S3.T3). For the ADNN comparator, we report the average price over five independent instantiations, each using the contract-specific parameter configuration in Table [A.4](https://arxiv.org/html/2606.17545#A3.T4), to reduce run-to-run variability. Literature benchmarks are available for all contracts in Table [2](https://arxiv.org/html/2606.17545#S3.T2) except B1; the closest setup is studied in \[[16](https://arxiv.org/html/2606.17545#bib.bib31)\] and differs only in the risk-free rate, with $r=0.06$. Consequently, for the B1 option we use a PDE-based comparator and price the \[[16](https://arxiv.org/html/2606.17545#bib.bib31)\] variant with $r=0.06$ for a direct cross-check. Our Crank–Nicolson (CN) scheme with $\Delta t^{\mathrm{PDE}}=\frac{1}{192}$ and $\Delta x=0.02$ yields a deterministic price of $4.4846$ at $r=0.06$, whereas \[[16](https://arxiv.org/html/2606.17545#bib.bib31)\] reports an upper bound of $4.4893$ for the Bermudan formulation with $\Delta t^{\mathrm{tr}}=\frac{1}{50}$.

| Contract | Literature | PDE | ADNN | CARLOS | ADNN Time | RL Time |
|---|---|---|---|---|---|---|
| B1 | − | 4.601 | 4.583 (0.010) | 4.592 (0.005) | 246.38 (1.77) | 11.70 (1.66) |
| B2 | 1.460 \[[26](https://arxiv.org/html/2606.17545#bib.bib8)\] | 1.478 | 1.468 (0.005) | 1.474 (0.001) | 140.57 (0.24) | 9.37 (0.81) |
| M2.A | 13.901 \[[3](https://arxiv.org/html/2606.17545#bib.bib24)\] | 14.214 | 14.133 (0.067) | 14.171 (0.015) | 297.15 (0.39) | 57.00 (5.11) |
| M2.B | 15.575 \[[30](https://arxiv.org/html/2606.17545#bib.bib22)\] | 15.777 | 15.615 (0.131) | 15.711 (0.022) | 302.93 (0.84) | 51.02 (0.95) |
| M3 | 11.278 \[[3](https://arxiv.org/html/2606.17545#bib.bib24)\] | − | 11.457 (0.031) | 11.510 (0.008) | 1272.05 (22.97) | 117.91 (4.63) |
| M5.A | 26.151 \[[30](https://arxiv.org/html/2606.17545#bib.bib22)\] | − | 26.401 (0.034) | 26.55 (0.032) | 2647.36 (38.86) | 451.28 (36.26) |
| M5.B | 11.810 \[[26](https://arxiv.org/html/2606.17545#bib.bib8)\] | − | 11.914 (0.073) | 12.009 (0.010) | 780.04 (8.93) | 344.69 (44.41) |

Table 3: Benchmark price comparators for the option contracts in Table [2](https://arxiv.org/html/2606.17545#S3.T2): PDE-based Monte Carlo price estimates reported in Table [A.3](https://arxiv.org/html/2606.17545#A2.T3) for contracts in dimension $d\leq 2$, ADNN comparators, and the highest prices reported in the literature. Prices are computed using $1.6\times 10^{6}$ Monte Carlo paths at exercise frequency $\Delta t^{ex}=\frac{1}{192}$. ADNN and CARLOS values are averaged over five independent runs; standard deviations are shown in parentheses. Parameter configurations are reported in Tables [A.4](https://arxiv.org/html/2606.17545#A3.T4) and [6](https://arxiv.org/html/2606.17545#S4.T6), respectively. The training times are reported in seconds.

Besides the benchmark prices in Table [3](https://arxiv.org/html/2606.17545#S3.T3), several contracts are also considered elsewhere. The M2.A contract has reported prices of $13.897$ in \[[36](https://arxiv.org/html/2606.17545#bib.bib45)\] and $13.898$ in \[[30](https://arxiv.org/html/2606.17545#bib.bib22)\]. For M3, \[[26](https://arxiv.org/html/2606.17545#bib.bib8)\] reports a price of $11.15$. For M5.A, reported prices include $25.84$ in \[[26](https://arxiv.org/html/2606.17545#bib.bib8)\], $26.0553$ in \[[36](https://arxiv.org/html/2606.17545#bib.bib45)\], and $26.147$ in \[[3](https://arxiv.org/html/2606.17545#bib.bib24)\].

Table [3](https://arxiv.org/html/2606.17545#S3.T3) presents the results for the 7 contracts. By using a large $N$ (equivalently, a small training time step $\Delta t^{tr,0}$), the ADNN comparators can approach the American-style price, but this requires a very large number of paths $K$; otherwise the regression predictions are not stable. This refinement comes at a substantial computational cost, especially in dimension $d>2$. Our method not only substantially improves on ADNN but also does so in much shorter runtime, because most of the training is done on coarser, hence cheaper, grids. Table [3](https://arxiv.org/html/2606.17545#S3.T3) shows speed-ups of 3-5 times, and sometimes up to 10x; see the M3 contract.

## 4 Algorithm Details and Ablation Studies

In this section we provide additional details of our algorithm and present a set of experiments that test the role of the different tuning parameters\. A comprehensive analysis can be found in\[[5](https://arxiv.org/html/2606.17545#bib.bib51)\]\.

All parameter tuning is performed on the 2-dimensional Max Call M2 option from Table [2](https://arxiv.org/html/2606.17545#S3.T2). Starting with the baseline configuration in Table [6](https://arxiv.org/html/2606.17545#S4.T6), we vary one parameter at a time, reporting post-RL prices, the number of RL loops, and the total training time in seconds. Specifically, we tune hyperparameters controlling the training-set size per loop, input selection, time-grid refinement, and the extent of exploration in computing timing values. We also analyze failure modes to illustrate how ill-tuned settings impede learning and lead to low rewards.

In all tables, reported values are averages over five independent runs with different random seeds, with standard deviations given in parentheses. Final prices are computed using $1.6\times 10^{6}$ Monte Carlo paths at exercise frequency $\Delta t^{ex}=\frac{1}{192}$; the standard error of these estimates is approximately $0.013$.

All numerical computations were performed on the CPU of a single Linux server equipped with an AMD Ryzen Threadripper PRO 5965WX and 251 GiB of system memory\. The system provides 48 logical CPUs \(2 threads per core\) on a single NUMA node at up to 4\.57 GHz\.

### 4.1 Neural Network Architecture

The ADNN in CARLOS is implemented as a fully-connected feedforward neural network $R^{\Theta}:\mathbb{R}^{d+1}\to\mathbb{R}$, where the additional dimension accounts for the time input. In Stage 1a, we have a similar construction where the input is just the $d$-dimensional state, $R^{\theta_n}:\mathbb{R}^{d}\to\mathbb{R}$. The architecture of the network with $I$ layers that consist of $q_i$, $i=1,\ldots,I-1$, nodes respectively is defined in standard form \[[3](https://arxiv.org/html/2606.17545#bib.bib24)\] as:

$$R^{\theta}=\varphi_{q_I}\circ a_I^{\theta}\circ\varphi_{q_{I-1}}\circ a_{I-1}^{\theta}\circ\cdots\circ\varphi_{q_1}\circ a_1^{\theta}, \tag{17}$$

where $q_I=1$ is the output dimension, $q_0\in\{d,d+1\}$ is the input dimension and $a_i^{\theta}:\mathbb{R}^{q_{i-1}}\to\mathbb{R}^{q_i}$, $i=1,\ldots,I$, are affine functions with weight matrices $\mathbf{A}_i\in\mathbb{R}^{q_i\times q_{i-1}}$ and bias vectors $b_i\in\mathbb{R}^{q_i}$:

$$a_i^{\theta}(x)=\mathbf{A}_ix+b_i,\quad i=1,\ldots,I.$$

The parameters $\Theta$ of a network $R^{\Theta}$ comprise the entries of the weight matrices $\mathbf{A}_1,\mathbf{A}_2,\ldots,\mathbf{A}_I$ and the bias vectors $b_1,b_2,\ldots,b_I$. Consequently, the dimension of $\Theta$ is given by

$$|\Theta|=1+q_1+\ldots+q_{I-1}+q_0q_1+\ldots+q_{I-2}q_{I-1}+q_{I-1}. \tag{18}$$
By default we use 3 hidden layers ($I=4$) with equal number $q_i\equiv q$ of hidden nodes, for a total of $|\Theta|=1+(4+q_0)q+2q^{2}$ trainable parameters. The number of neurons $q_i$ must be large enough to allow the ADNN to fit the timing value hypersurface. If the widths $q_i$ are too low, performance suffers; adding nodes beyond a certain level does not translate into materially higher rewards, as fit quality saturates. Theoretically, \[[15](https://arxiv.org/html/2606.17545#bib.bib65)\] shows that the number of parameters in a ReLU deep NN surrogate for LSMC ought to scale polynomially in dimension $d$ to maintain a given expressivity error. We do observe that in higher-dimensional settings wider networks are beneficial and in turn require more epochs to train. Again, an insufficient number of epochs lowers accuracy, but a plateau is quickly reached as epochs are added. Algorithm runtime scales linearly in the number of epochs, so it can be worthwhile to tune that parameter.
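The count in (18) and its default specialization can be cross-checked numerically. In the sketch below, the width $q=32$ and input dimension $q_0=3$ are hypothetical values chosen only for the check, and the function names are illustrative.

```python
def count_parameters(widths):
    """Number of weights plus biases for layer widths [q_0, q_1, ..., q_I]."""
    weights = sum(a * b for a, b in zip(widths[:-1], widths[1:]))  # entries of A_i
    biases = sum(widths[1:])                                       # entries of b_i
    return weights + biases


def default_count(q0, q):
    """Closed form for 3 hidden layers of equal width q and scalar output."""
    return 1 + (4 + q0) * q + 2 * q * q


# Example: d = 2 assets plus time gives q_0 = 3; three hidden layers of width 32.
general = count_parameters([3, 32, 32, 32, 1])
closed_form = default_count(3, 32)
```

Both expressions give 2273 parameters for this configuration: 2176 weights and 97 biases.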

A variety of choices have been proposed for the component-wise activation functions $\varphi_q:\mathbb{R}^q\to\mathbb{R}^q$, including ReLU [[4](https://arxiv.org/html/2606.17545#bib.bib2)], Leaky ReLU [[20](https://arxiv.org/html/2606.17545#bib.bib23), [17](https://arxiv.org/html/2606.17545#bib.bib21)], and Tanh [[3](https://arxiv.org/html/2606.17545#bib.bib24), [17](https://arxiv.org/html/2606.17545#bib.bib21)]; we also mention ELU and Swish. On the one hand, the choice of the activation function tends to have a second-order effect on the final price. On the other hand, different activation functions seem to do better for different contracts, and moreover $\varphi_q$ affects the smoothness of the stopping boundaries: a non-smooth activation like ReLU leads to zig-zag, piecewise linear stopping boundaries. We recommend the Swish function

$$\varphi_q(x_1,\ldots,x_q):=\left(\frac{x_1}{1+e^{-x_1}},\ldots,\frac{x_q}{1+e^{-x_q}}\right), \tag{19}$$

or ReLU, $\varphi_q(x_1,\ldots,x_q):=(x_1^{+},\ldots,x_q^{+})$, $x^{+}:=\max\{x,0\}$. The output layer uses the linear activation function $\varphi_{q_I}(x)=x$.
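For concreteness, the two recommended activations can be written out coordinate by coordinate. This is a plain-Python sketch for illustration only; the helper names are assumptions, and the branch for negative arguments is merely an algebraically equivalent rewriting of (19) that avoids overflow in the exponential.

```python
import math


def swish(x):
    """Swish activation x / (1 + exp(-x)) for a single coordinate."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    # Equivalent form x * e^x / (1 + e^x), safe for large negative x.
    ex = math.exp(x)
    return x * ex / (1.0 + ex)


def relu(x):
    """ReLU activation max(x, 0) for a single coordinate."""
    return max(x, 0.0)


def apply_componentwise(activation, vector):
    """Apply a scalar activation to every coordinate of a vector."""
    return [activation(x) for x in vector]


print(apply_componentwise(swish, [-2.0, 0.0, 2.0]))
print(apply_componentwise(relu, [-2.0, 0.0, 2.0]))
```

Swish is smooth at the origin whereas ReLU has a kink there, which is the property the text links to the smoothness of the learned stopping boundary.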

We initialize weights and biases using PyTorch's built-in uniform initialization, and train using the Adam optimizer with the mean squared error (MSE) loss function. We take $\eta^{[0]}=10^{-4}$ as the default initial learning rate. The Adam minibatch size has little impact; based on experiments in [[5](https://arxiv.org/html/2606.17545#bib.bib51)], we use a batch size of 64 throughout.

### 4\.2Selecting Inputs

Since the primary task of the surrogate is to determine the optimal stopping rule, we wish to preferentially train on inputs that are near the estimated stopping boundary. This principle of “training in regions where it matters” (that is, focusing on areas that contribute to achieving superior rewards instead of sampling inputs uniformly) corresponds to the *exploitation* aspect of our training approach. Cognizant of the respective *exploration* aspect, we also train on inputs beyond the boundary, lest the surrogate make incorrect stopping decisions in other regions. Balancing the exploitation-exploration tradeoff is essential because neural networks are prone to overexploit and find parameters that exclusively optimize performance in the training region. A further challenge is learning timing values near maturity $T$, where the stopping boundary moves rapidly (see Figure [1](https://arxiv.org/html/2606.17545#S2.F1)), which calls for additional training inputs in that region.

With the above considerations in mind, we develop an adaptive multi-pronged sampling strategy that is formalized through a sampling density $p^{[\ell]}(\cdot,\cdot)$ on $\mathbb{T}^{(tr,b)}\times\mathcal{X}$. To construct the $p^{[\ell]}$ that governs the selection of the inputs $(\mathbf{t},\mathbf{x})^{[\ell]}$ in line 3 of Algorithm [1](https://arxiv.org/html/2606.17545#alg1), we employ a weighted mixture of four components:

$$p^{[\ell]}:=\lambda_{+}^{[b]}\cdot p^{\{b,+\}}+\lambda_{-}^{[b]}\cdot p^{\{b,-\}}+\lambda_{\text{exl}}^{[b]}\cdot p^{\{b,\text{exl}\}}+\lambda_{\text{ter}}^{[b]}\cdot p^{\{b,\text{ter}\}},\qquad \lambda^{[b]}_{+}+\lambda^{[b]}_{-}+\lambda^{[b]}_{\text{exl}}+\lambda^{[b]}_{\text{ter}}=1. \tag{20}$$

In ([20](https://arxiv.org/html/2606.17545#S4.E20)), we balance the exploratory sampling via $p^{\{b,\text{exl}\}}$ with the exploitative boundary-oriented sampling via $p^{\{b,+\}}$ (samples with positive timing values near the estimated stopping boundary) and $p^{\{b,-\}}$ (samples with negative timing values close to the boundary). Moreover, we explicitly incorporate terminal samples at $t=T$ governed by $p^{\{b,\text{ter}\}}$. The weights $\lambda_{+}^{[b]}$, $\lambda_{-}^{[b]}$, $\lambda_{\text{exl}}^{[b]}$, and $\lambda_{\text{ter}}^{[b]}$ are adaptively adjusted as a function of the grid counter $b$.
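Sampling from the mixture (20) amounts to first drawing a component label with the given weights and then drawing an input from that component. The sketch below covers only the first step; the function name, the label strings and the seed are assumptions made for illustration.

```python
import random


def draw_components(weights, n_inputs, seed=0):
    """Draw mixture-component labels according to the weights in (20).

    weights maps a component label to its lambda; the lambdas must sum to one.
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to one")
    rng = random.Random(seed)
    labels = list(weights)
    return rng.choices(labels, weights=[weights[k] for k in labels], k=n_inputs)


# Hypothetical label names for the four components of (20).
lam = {"exl": 0.55, "+": 0.2, "-": 0.2, "ter": 0.05}
labels = draw_components(lam, n_inputs=1000)
print({k: labels.count(k) for k in lam})
```

The realized counts fluctuate around $\lambda\cdot M$; an implementation could instead allocate deterministic counts per component, and which variant the authors use is not stated here.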

Practically, we implement ([20](https://arxiv.org/html/2606.17545#S4.E20)) as kernel-smoothed densities based on *anchor sites*. To construct the anchor sites we generate $P$ i.i.d. pilot paths $\{\check{\mathbf{x}}\}^{1:P}$, started from the randomized initial condition

$$\check{x}_0^{p,i}=\left[\mathcal{K}\,1_{h(X_0)=0}+X_0\,1_{h(X_0)>0}\right]\bigl(1+\beta^{i}\sigma_i\varepsilon_0^{i}\bigr),\qquad i=1,\ldots,d,\qquad \varepsilon_0^{i}\sim\mathcal{N}(0,1). \tag{21}$$

Thus, when $X_0$ is out of the money ($h(X_0)=0$), the pilot paths are centered at the strike $\mathcal{K}$, i.e., they start from at-the-money. The randomized initial condition ensures more exploration, namely it provides information about the timing values for small $t$. We furthermore discard any $\check{x}^{p}_0$ that are out of the money, $h(\check{x}^{p}_0)=0$, or have a non-positive initial timing value $\hat{\mathscr{T}}^{[\ell]}(0,\check{x}^{p}_0)\leq 0$ (deep in-the-money).
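The randomized start (21) and the subsequent discard rule can be sketched as follows. All names are illustrative assumptions; in particular `timing_value` stands in for the current ADNN prediction $\hat{\mathscr{T}}^{[\ell]}$, and the one-dimensional put used in the example is hypothetical.

```python
import random


def pilot_start(x0, strike, payoff, beta, sigma, rng):
    """One draw of the randomized pilot start in (21)."""
    # An out-of-the-money X_0 (zero payoff) is replaced by the strike in every coordinate.
    base = list(x0) if payoff(x0) > 0 else [strike] * len(x0)
    return [b * (1.0 + beta_i * s_i * rng.gauss(0.0, 1.0))
            for b, beta_i, s_i in zip(base, beta, sigma)]


def keep_start(x, payoff, timing_value):
    """Keep only in-the-money starts with a positive timing value at t = 0."""
    return payoff(x) > 0 and timing_value(0.0, x) > 0


# Hypothetical one-dimensional put with strike 100 and a toy timing-value surrogate.
put = lambda x: max(100.0 - x[0], 0.0)
toy_timing_value = lambda t, x: x[0] - 90.0
rng = random.Random(0)
starts = [pilot_start([110.0], 100.0, put, beta=[1.0], sigma=[0.2], rng=rng)
          for _ in range(5)]
print([s for s in starts if keep_start(s, put, toy_timing_value)])
```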

To construct the exploitative anchor sites that target the neighborhood of the stopping boundary, we look for time steps $\check{t}^{p}$ along each pilot path $\check{\mathbf{x}}^{p}$ where the pathwise action shifts from continue to stop, i.e., where the current decision is $\phi^{p}_{\check{t}^{p}}=1$ while the previous decision was $\phi^{p}_{\check{t}^{p}-1}=0$. Writing $t_n=\check{t}^{p}$ for such a switching time, timing values are by construction negative at $(t_n,\check{x}^{p}_{t_n})$, which we use for the negative anchor sites $(\check{\mathbf{t}},\check{\mathbf{x}})^{\{b,-\}}$, and positive at $(t_{n-1},\check{x}^{p}_{t_{n-1}})$, used for $(\check{\mathbf{t}},\check{\mathbf{x}})^{\{b,+\}}$. Pilot paths that are in-the-money are used for the exploratory anchor set $(\check{\mathbf{t}},\check{\mathbf{x}})^{\{b,\text{exl}\}}$.

Finally, the terminal anchor sites $(T,\check{\mathbf{x}})^{\{b,\text{ter}\}}$ are based on the terminal values of the in-the-money pilot paths. To fence against the upward pull in the learned timing values near $T$, we assign them a negative terminal timing value $y^{b}_{\text{ter}}$ linked to the loss from delayed exercise:

$$y^{b}_{\text{ter}}(T,x^{p}_T):=c_{\text{ter},0}\cdot c_{\text{ter},1}^{b}\cdot\Delta t^{tr,b}. \tag{22}$$

Here, $c_{\text{ter},0},c_{\text{ter},1}$ are tuning parameters. The pseudocode for constructing anchor sets is provided in Algorithm [2](https://arxiv.org/html/2606.17545#alg2).

Given the set of anchor sites, the actual training inputs $(\mathbf{t},\mathbf{x})$ are obtained by jittering (kernel-smoothing) the $(\check{\mathbf{t}},\check{\mathbf{x}})$ with Gaussian noise of $0.01\sigma_i x_i$ in space and $0.5\Delta t^{tr,b}$ in time. To reduce the sample noise of the pathwise payoffs $y^{b}(t,x_t)$ in ([30](https://arxiv.org/html/2606.17545#S4.E30)), we generate $\check{P}=4$ paths emanating from each input and use the respective average for ADNN training.
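The jittering step can be illustrated with a few lines of Python. This sketch assumes that the quoted noise levels are standard deviations of independent Gaussian perturbations, which the text does not state explicitly; the function and argument names are likewise assumptions, and no clipping of the jittered time to $[0,T]$ is attempted.

```python
import random


def jitter_anchor(t, x, sigma, dt_train, rng, space_scale=0.01, time_scale=0.5):
    """Perturb one anchor site (t, x) with independent Gaussian noise.

    Assumed standard deviations: space_scale * sigma_i * x_i in space and
    time_scale * dt_train in time.
    """
    t_new = t + rng.gauss(0.0, time_scale * dt_train)
    x_new = [xi + rng.gauss(0.0, space_scale * s_i * xi) for xi, s_i in zip(x, sigma)]
    return t_new, x_new


rng = random.Random(0)
print(jitter_anchor(0.5, [100.0, 90.0], sigma=[0.2, 0.3], dt_train=1.0 / 12.0, rng=rng))
```

Because the temporal noise is proportional to $\Delta t^{tr,b}$, the jittered inputs tighten around the anchor times as the solver grid is refined.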

Algorithm 2 Anchor Set Construction

1: Input: Regression surrogate $R^{\Theta}$, number of pilot paths $P$, solver grid $\mathbb{T}^{(tr,b)}$, payoff function $h$.

2: Generate $P$ pilot paths $\{\check{\mathbf{x}}\}^{1:P}$ on $\mathbb{T}^{(tr,b)}$. Initialize all anchor components to be empty.

3: For all $n=1,\ldots,N$, $p=1,\ldots,P$ predict timing values $\hat{\mathscr{T}}_n^{p}\leftarrow\hat{\mathscr{T}}(t_n,\check{x}_{t_n}^{p})$ using $R^{\Theta}$, compute immediate payoffs $h_n^{p}\leftarrow h(\check{x}_{t_n}^{p})$ and evaluate stopping decisions $\phi_n^{p}\leftarrow\mathbf{1}_{\{\hat{\mathscr{T}}_n^{p}\leq 0\}\cap\{h_n^{p}>0\}}$

4: for $t_n\in\mathbb{T}^{(tr,b)}$ do

5: Append $(\check{\mathbf{t}},\check{\mathbf{x}})^{\{b,-\}}$ by $\{(t_n,\check{x}_{t_n}^{p})\in\{\check{\mathbf{x}}\}^{1:P}:\phi_{n-1}^{p}=0,\ \phi_n^{p}=1,\ h_n^{p}>0\}$

6: Append $(\check{\mathbf{t}},\check{\mathbf{x}})^{\{b,+\}}$ by $\{(t_{n-1},\check{x}_{t_{n-1}}^{p})\in\{\check{\mathbf{x}}\}^{1:P}:\phi_{n-1}^{p}=0,\ \phi_n^{p}=1,\ h_n^{p}>0\}$

7: Append exploratory sites $(\check{\mathbf{t}},\check{\mathbf{x}})^{\{b,\text{exl}\}}$ by $\{(t_n,\check{x}^{p}_{t_n})\in\{\check{\mathbf{x}}\}^{1:P}:h^{p}_n>0\}$

8: endfor

9: Save terminal sites $(\check{\mathbf{T}},\check{\mathbf{x}})^{\{b,\text{ter}\}}\leftarrow\{(T,\check{x}^{p}_T):h(\check{x}^{p}_T)>0\}$

10: return $(\check{\mathbf{t}},\check{\mathbf{x}})^{\{b,+\}},(\check{\mathbf{t}},\check{\mathbf{x}})^{\{b,-\}},(\check{\mathbf{t}},\check{\mathbf{x}})^{\{b,\text{exl}\}},(\check{\mathbf{T}},\check{\mathbf{x}})^{\{b,\text{ter}\}}$
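Algorithm 2 can be sketched in Python once the pilot paths are available. The version below is an illustration under stated assumptions rather than the authors' implementation: the pilot paths are passed in as lists, `timing_value` is a caller-supplied stand-in for the ADNN, the first grid entry is treated as $t_0$ and the stopping decision is also evaluated there, and the toy put at the end is hypothetical.

```python
def build_anchor_sets(times, paths, timing_value, payoff):
    """Sketch of Algorithm 2 for pilot paths sampled on the solver grid.

    times[n] is the n-th grid time and paths[p][n] the state of pilot path p
    at that time; timing_value(t, x) and payoff(x) are supplied by the caller.
    """
    anchors = {"+": [], "-": [], "exl": [], "ter": []}
    for path in paths:
        h = [payoff(x) for x in path]
        # Stop when the predicted timing value is non-positive and the payoff is positive.
        phi = [int(timing_value(t, x) <= 0 and hn > 0)
               for t, x, hn in zip(times, path, h)]
        for n in range(1, len(times)):
            if h[n] > 0:
                anchors["exl"].append((times[n], path[n]))
                if phi[n - 1] == 0 and phi[n] == 1:
                    # Continue-to-stop switch: bracket the boundary from both sides.
                    anchors["-"].append((times[n], path[n]))
                    anchors["+"].append((times[n - 1], path[n - 1]))
        if h[-1] > 0:
            anchors["ter"].append((times[-1], path[-1]))
    return anchors


# Toy example: put with strike 100 and a surrogate whose boundary sits at x = 90.
times = [0.0, 0.25, 0.5, 0.75]
paths = [[95.0, 92.0, 88.0, 85.0]]
anchors = build_anchor_sets(times, paths,
                            timing_value=lambda t, x: x - 90.0,
                            payoff=lambda x: max(100.0 - x, 0.0))
print(anchors["+"], anchors["-"])
```

In the toy example the path crosses the boundary between $t=0.25$ and $t=0.5$, so the positive anchor is placed at the earlier point and the negative anchor at the later one.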

Figure [4](https://arxiv.org/html/2606.17545#S4.F4) visualizes the training inputs $(\mathbf{t},\mathbf{x})$ across learning loops for the 1-dimensional Bermudan Put option B1 from Table [2](https://arxiv.org/html/2606.17545#S3.T2). In the left panel, $\Delta t^{tr}=1/12$ and the training set is more spread out; in the right panel we work with the finest grid, $\Delta t^{tr}=1/96$. As the time grid is refined, the exploitation inputs (with either positive or negative timing values) concentrate around the estimated stopping boundary $B^{[\ell]}$ because $\check{x}_{t_n}^{p}$ gets closer to $\check{x}_{t_{n-1}}^{p}$ in lines 5 and 6 of Algorithm [2](https://arxiv.org/html/2606.17545#alg2).

![Refer to caption](https://arxiv.org/html/2606.17545v1/x6.png)
(a) Inputs on the solver grid with $\Delta t^{tr}=\frac{1}{12}$. 413 exploratory, 269 with positive timing value, 269 with negative timing value, and 50 terminal.

![Refer to caption](https://arxiv.org/html/2606.17545v1/x7.png)
(b) Inputs on the solver grid with $\Delta t^{tr}=\frac{1}{96}$. 174 exploratory, 388 with positive timing value, 388 with negative timing value, and 50 terminal.

Figure 4: Input sets $(\mathbf{t},\mathbf{x})^{[\ell]}$ for the B1 contract using the parameter configuration in Table [6](https://arxiv.org/html/2606.17545#S4.T6). There are $1{,}000$ distinct training inputs categorized into: exploratory $(\check{\mathbf{t}},\check{\mathbf{x}})^{\{b,\text{exl}\}}$, positive timing value $(\check{\mathbf{t}},\check{\mathbf{x}})^{\{b,+\}}$, negative timing value $(\check{\mathbf{t}},\check{\mathbf{x}})^{\{b,-\}}$, and terminal $(\check{\mathbf{t}},\check{\mathbf{x}})^{\{b,\text{ter}\}}$ sampling mixture components.

Exploratory Coverage: In Algorithm [2](https://arxiv.org/html/2606.17545#alg2), anchor sites at $t_n$ are based on the pilot paths and reflect the underlying distribution of $X_{t_n}|X_0$. This can be a limiting factor for exploration; in our examples the distribution of $X_t$ is log-normal, which has a long right tail. To be concrete, the $2{,}500$ distinct exploratory inputs for the M2.A contract span $[33.55,288.36]\times[32.49,244.66]$. By contrast, the $1.6\times 10^{6}$ testing paths span a much broader range, $[14.74,423.22]\times[15.62,473.35]$. This mismatch between training and testing sets implies that the ADNN stopping rule is inaccurate for very high $X_1,X_2$ values, e.g., along the diagonal $(X_1,X_2)\in[200,400]^2$, which has a disproportionately high impact on option value as this is precisely the high-payoff region.

To address this, it may be beneficial to modify the sampling scheme for exploratory training inputs. For instance, we consider a mixture of the underlying log-normal density and a uniform sampling near the diagonal, to obtain more informative training inputs that can expand the learned boundary and sharpen its geometry. Accordingly, we let

$$p_{\text{hybrid}}^{\{b,\text{exl}\}}:=(1-c_{\text{band}})\,p^{\{b,\text{exl}\}}+c_{\text{band}}\,\mathrm{Unif}([1,3]\times\mathcal{X}_{\text{band}}), \tag{23}$$

where $c_{\text{band}}$ controls the weight of the contract-specific exploration component and $c_{\text{width}}$ controls the width of the diagonal band

$$\mathcal{X}_{\text{band}}=\left\{(x_1,x_2)\in\mathcal{X}:|x_1-x_2|\leq c_{\text{width}},\;x_1,x_2\in[x_{\min},x_{\max}]\right\}. \tag{24}$$

A simple manual calibration is to set $x_{\min}=\mathcal{K}$ to match the contract's strike price and choose $x_{\max}=300$, $c_{\text{width}}=30$. To stay on the conservative side, a uniform distribution over the time window $[1,3]$ is used, as reaching the diagonal band is unlikely early on.
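One simple way to draw from the hybrid density (23) is rejection sampling for the band (24) combined with a coin flip between the two mixture components. The sketch below is an assumption about how this could be done, not the authors' code; the strike value 100 used for $x_{\min}$ and the constant baseline sampler are placeholders.

```python
import random


def sample_band(x_min, x_max, width, t_window, rng):
    """Uniform draw from t_window x {|x1 - x2| <= width, x1, x2 in [x_min, x_max]}."""
    while True:
        x1 = rng.uniform(x_min, x_max)
        x2 = rng.uniform(x_min, x_max)
        if abs(x1 - x2) <= width:  # rejection step keeps the draw uniform on the band
            return rng.uniform(*t_window), (x1, x2)


def sample_hybrid(baseline_sampler, c_band, rng, **band_kwargs):
    """With probability c_band draw from the band, otherwise from the baseline density."""
    if rng.random() < c_band:
        return sample_band(rng=rng, **band_kwargs)
    return baseline_sampler(rng)


# Placeholder baseline sampler; in the paper this role is played by p^{b,exl}.
baseline = lambda rng: (rng.uniform(0.0, 3.0), (100.0, 100.0))
rng = random.Random(0)
band = dict(x_min=100.0, x_max=300.0, width=30.0, t_window=(1.0, 3.0))
print([sample_hybrid(baseline, 0.2, rng, **band) for _ in range(3)])
```

With $x_{\max}-x_{\min}=200$ and $c_{\text{width}}=30$ the band covers roughly 28% of the square, so the rejection loop accepts after a few tries on average.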

Tuning Training Set Geometry: The exploration-exploitation balance is controlled by the weights $\lambda^{[b]}_{+}$ and $\lambda^{[b]}_{-}$ in Equation ([20](https://arxiv.org/html/2606.17545#S4.E20)) that define the sampling density $p^{[\ell]}$. As the solver-grid step decreases, the anchor inputs in Algorithm [2](https://arxiv.org/html/2606.17545#alg2) are drawn closer to the stopping boundary, so exploitation is intrinsically amplified as $\Delta t^{tr}$ shrinks. While advantageous for improving the ADNN fit around the boundary, over-exploitation limits the ability of the ADNN to self-correct and can make the RL prone to error accumulation. Moreover, at the earlier RL iterations the boundary may shift substantially, especially during grid transitions, and selecting inputs based on the previous solver grid can then misdirect learning. At the same time, exploration on fine grids is also inefficient, since the stopping boundary stabilizes once the grid step size $\Delta t^{tr}$ becomes small.

These considerations motivate a *gradual* reallocation of training effort from exploration to exploitation. Let $c_{\text{expl}}$ denote the proportion of exploratory inputs that is reallocated to exploitation after each solver-grid transition. When advancing the grid index from $b$ to $b+1$, we update the weights used to construct $p^{[\ell]}$ in Equation ([20](https://arxiv.org/html/2606.17545#S4.E20)) as

$$\lambda_{\text{exl}}^{[b+1]}:=(1-c_{\text{expl}})\lambda_{\text{exl}}^{[b]},\quad \lambda_{+}^{[b+1]}:=\lambda_{+}^{[b]}+\frac{c_{\text{expl}}}{2}\lambda_{\text{exl}}^{[b]},\quad \lambda_{-}^{[b+1]}:=\lambda_{-}^{[b]}+\frac{c_{\text{expl}}}{2}\lambda_{\text{exl}}^{[b]}, \tag{25}$$

while keeping $\lambda_{\text{ter}}^{[b]}$ fixed. Setting $c_{\text{expl}}=0$ recovers the baseline setting in which the weights remain constant throughout the RL loops.
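The update (25) is easy to trace numerically. The sketch below (function name assumed) applies it repeatedly to the starting weights $(0.55, 0.2, 0.2, 0.05)$ with $c_{\text{expl}}=0.25$. After one transition the exploratory weight is $0.4125$, and after four transitions it is about $0.174$ with about $0.388$ on each exploitative component, which appears consistent with the input counts reported in the Figure 4 captions (413/269/269/50 and 174/388/388/50).

```python
def reallocate(weights, c_expl):
    """One grid transition of (25); weights = (lam_exl, lam_plus, lam_minus, lam_ter)."""
    lam_exl, lam_plus, lam_minus, lam_ter = weights
    moved = c_expl * lam_exl  # exploratory mass shifted to exploitation
    return (lam_exl - moved, lam_plus + moved / 2, lam_minus + moved / 2, lam_ter)


weights = (0.55, 0.2, 0.2, 0.05)
for b in range(1, 5):
    weights = reallocate(weights, c_expl=0.25)
    print(b, [round(w, 4) for w in weights])
```

The update preserves the total mass, so the weights continue to sum to one, and the exploratory weight decays geometrically as $(1-c_{\text{expl}})^{b}\lambda_{\text{exl}}^{[0]}$.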

| $(\lambda_{\text{exl}}^{[0]},\lambda_{+}^{[0]},\lambda_{-}^{[0]},\lambda_{\text{ter}}^{[0]})$ | $c_{\text{expl}}$ | Price | Loops $L$ | Runtime |
| --- | --- | --- | --- | --- |
| $(0.95,0,0,0.05)$ | 0 | $14.1438_{(0.014)}$ | $11.2_{(1.17)}$ | $64.20_{(4.76)}$ |
| $(0.55,0.2,0.2,0.05)$ | 0 | $14.1641_{(0.013)}$ | $8.6_{(1.62)}$ | $55.36_{(3.86)}$ |
| $(0.35,0.3,0.3,0.05)$ | 0 | $14.1775_{(0.007)}$ | $10.6_{(3.07)}$ | $61.05_{(10.32)}$ |
| $(0.15,0.4,0.4,0.05)$ | 0 | $14.1748_{(0.007)}$ | $8.2_{(2.04)}$ | $56.13_{(7.09)}$ |
| $(0,0.475,0.475,0.05)$ | 0 | $14.1694_{(0.015)}$ | $8.0_{(1.67)}$ | $53.65_{(4.48)}$ |
| $(0.5,0.2,0.2,0.1)$ | 0 | $14.1486_{(0.015)}$ | $8.8_{(2.32)}$ | $54.60_{(5.90)}$ |
| $(0.6,0.2,0.2,0.0)$ | 0 | $14.1526_{(0.013)}$ | $10.8_{(2.23)}$ | $60.44_{(10.42)}$ |
| $(0.35,0.2,0.4,0.05)$ | 0 | $14.1725_{(0.005)}$ | $7.8_{(1.94)}$ | $53.82_{(6.51)}$ |
| $(0.55,0.2,0.2,0.05)$ | 0.25 | $14.1777_{(0.009)}$ | $9.6_{(2.33)}$ | $59.16_{(8.11)}$ |
| $(0.55,0.2,0.2,0.05)$ | 0.5 | $14.1698_{(0.018)}$ | $9.6_{(3.38)}$ | $59.16_{(9.47)}$ |
| $(0.35,0.3,0.3,0.05)$ | 0.25 | $14.1764_{(0.007)}$ | $8.6_{(2.24)}$ | $54.51_{(5.51)}$ |
| $(0.35,0.3,0.3,0.05)$ | 0.5 | $14.1762_{(0.006)}$ | $10.0_{(1.79)}$ | $59.43_{(6.00)}$ |

Table 4: Impact of the sampling weights $(\lambda_{\text{exl}},\lambda_{+},\lambda_{-},\lambda_{\text{ter}})$ in ([20](https://arxiv.org/html/2606.17545#S4.E20)) and the reallocation factor $c_{\text{expl}}$ for balancing exploration and exploitation, for the M2 Max Call contract in Table [2](https://arxiv.org/html/2606.17545#S3.T2).

Table [4](https://arxiv.org/html/2606.17545#S4.T4) shows the impact of varying the weights $\lambda_{\text{exl}}$, $\lambda_{+}$, $\lambda_{-}$, $\lambda_{\text{ter}}$ and the tuning parameter $c_{\text{expl}}$ for the M2.A contract, yielding several take-aways. First, terminal inputs are helpful: $\lambda_{\text{ter}}^{[0]}=0$ does substantially worse than $\lambda_{\text{ter}}^{[0]}=0.05$ ($\lambda_{\text{ter}}^{[0]}=0.10$ is comparable but leaves fewer interior samples). Second, the two extremes of no exploitation, $\lambda_{+}^{[0]}=\lambda_{-}^{[0]}=0$, or no exploration, $\lambda_{\text{exl}}^{[b]}=0$, perform relatively poorly. Eliminating exploitation under-samples the region where ADNN accuracy matters most and makes RL less efficient, yielding the lowest price. Conversely, a fully exploitative training with $\lambda_{\text{exl}}=0$ confines learning to the boundary region. This inhibits self-correction, providing insufficient information to adjust timing values away from the obsolete boundary, leading to lower rewards. Third, we find no benefit from asymmetrically sampling within the stopping/continuation regions, $\lambda_{+}^{[0]}<\lambda_{-}^{[0]}$. Fourth, a gradual re-allocation towards exploitation, $c_{\text{expl}}>0$, is beneficial and moreover makes the performance less sensitive to the precise $\lambda_{\text{exl}}^{[0]},\lambda_{+}^{[0]},\lambda_{-}^{[0]}$ values.

In conclusion, we recommend a balanced allocation across exploration, exploitation (split evenly between $\lambda_{+}$ and $\lambda_{-}$), and terminal fencing, setting $\lambda_{\text{exl}}^{[0]}=0.55$, $\lambda_{+}^{[0]}=\lambda_{-}^{[0]}=0.2$, and $\lambda_{\text{ter}}^{[0]}=0.05$, and choosing $c_{\text{expl}}=0.25$ to gradually reallocate exploratory mass toward exploitation as the solver grid is refined.

### 4\.3Exploratory Stopping

Accurately evaluating timing values becomes challenging as we transition to progressively finer time grids $\mathbb{T}^{(ex,b)}$. In particular, near the stopping boundary $B^{[\ell]}$ timing values may switch signs, cf. the shifting stopping boundaries in Figure [1](https://arxiv.org/html/2606.17545#S2.F1). Recall the stopping decision map $\phi^{[\ell]}$ from ([11](https://arxiv.org/html/2606.17545#S2.E11)) (now based on the ADNN $\hat{\mathscr{T}}^{[\ell]}$) and consider a generic path $\{x_t\}$ starting from $(t_{\text{init}},x_{t_{\text{init}}})$. If $(t_{\text{init}},x_{t_{\text{init}}})$ falls within the stopping region $\mathcal{S}^{[\ell]}$, the default would be to stop immediately, leading to $\tau=t_{\text{init}}$. This implies that any input with $\hat{\mathscr{T}}^{[\ell]}(t,x)<0$ will be assigned a negative training output $y$, and hence will (almost surely) end up in the stopping region $\mathcal{S}^{[\ell+1]}$ of the next-iteration ADNN $\hat{\mathscr{T}}^{[\ell+1]}$. As a result, the naive implementation would cause the continuation region to be non-expanding across the RL rounds.

To mitigate this, we introduce “delayed” stopping, with the goal of adjusting the timing values of training inputs in the “old” stopping region. Such exploratory stopping allows a path originating in $\mathcal{S}^{[\ell]}$ to continue for a bit, in the hope of crossing into the continuation region, which then enables bona fide stopping. Let $\xi^{b}$ be the first time a path $\{x_t\}$, starting from $(t_{\text{init}},x_{t_{\text{init}}})$ and progressing along the exercise grid $\mathbb{T}^{(ex,b)}$, enters $\mathcal{C}^{[\ell]}$:

$$\xi^{b}:=\min\left\{t\in\{t_{\text{init}}\}\cup\mathbb{T}^{(ex,b)}:\phi^{[\ell]}(t,x_t)=0,\ t\geq t_{\text{init}}\right\}\wedge T. \tag{26}$$

If the path starts in $\mathcal{C}^{[\ell]}$, then $\xi^{b}=t_{\text{init}}$; otherwise, $\xi^{b}>t_{\text{init}}$, and $\xi^{b}=T$ for a path that never leaves $\mathcal{S}^{[\ell]}$. We then define $\zeta^{b}$ as the first time the path $\{x_t\}$ transitions from the continuation region into the stopping region:

$$\zeta^{b}:=\min\left\{t>\xi^{b},\ t\in\mathbb{T}^{(ex,b)}:\phi^{[\ell]}(t,x_t)=1\right\}. \tag{27}$$

For trajectories originating within $\mathcal{S}^{[\ell]}$, delayed stopping means that the path is allowed a waiting period of length $\Delta^{b}_{\text{wait}}$ to exit into $\mathcal{C}^{[\ell]}$; otherwise, it is stopped once the delay expires. Hence, the path is stopped at the minimum of $\zeta^{b}$ from ([27](https://arxiv.org/html/2606.17545#S4.E27)) and

$$\mathring{\tau}^{b}:=\min\{t>t_{\text{init}}+\Delta^{b}_{\text{wait}}:t\in\mathbb{T}^{(ex,b)}\}\wedge T, \tag{28}$$

which is the first time point in $\mathbb{T}^{(ex,b)}$ after $t_{\text{init}}+\Delta^{b}_{\text{wait}}$. The resulting payoff is set to

$$y_{\text{dpf}}^{b}(t_{\text{init}},x_{t_{\text{init}}}):=\begin{cases}e^{-r(\zeta^{b}-t_{\text{init}})}h(x_{\zeta^{b}})&\text{if }\xi^{b}\leq\mathring{\tau}^{b},\\ e^{-r(\mathring{\tau}^{b}-t_{\text{init}})}h(x_{\mathring{\tau}^{b}})&\text{otherwise}.\end{cases} \tag{29}$$

If the path $\{x_t\}$ starts in, or enters, $\mathcal{C}^{[\ell]}$ before $\mathring{\tau}^{b}$, it is stopped at $\zeta^{b}$, the time of its subsequent entrance into the stopping region; otherwise, it is stopped at $\mathring{\tau}^{b}$, after the waiting period $\Delta^{b}_{\text{wait}}$ has elapsed.

Exploratory stopping can be understood via a traffic light analogy, captured by labels $z_t\in\{0,1,2\}$, interpreted as yellow, green, and red signals, respectively. A path that starts in $\mathcal{C}^{[\ell]}$ is labeled green, $z_{t_{\text{init}}}=1$, while one that originates in $\mathcal{S}^{[\ell]}$ is initially labeled yellow, $z_{t_{\text{init}}}=0$. The label transitions follow three rules, illustrated for the B1 contract in Figure [A.1b](https://arxiv.org/html/2606.17545#A1.F1.sf2). First, when a yellow path enters the continuation region at hitting time $\xi^{b}$, it turns green: $z_{\xi^{b}}=1$, cf. the green dot on path 1. Second, when a green path enters the stopping region, it becomes red: $z_{\zeta^{b}}=2$, see the red dot marking $\zeta^{b}$ on path 2. Third, when a yellow path (i.e., $z_{t_{\text{init}}}=0$) remains in $\mathcal{S}^{[\ell]}$ beyond the waiting period $\Delta^{b}_{\text{wait}}$, it is set to red, $z_{\mathring{\tau}^{b}}=2$, and stopped. This is recorded on path 3, where the gray dashed vertical line indicates the deadline $t_{\text{init}}+\Delta^{b}_{\text{wait}}$ and $\mathring{\tau}^{b}$ is shown by a red dot. Algorithm [3](https://arxiv.org/html/2606.17545#alg3) outlines how the labels $z_t$ guide the computation of $y_{\text{dpf}}^{b}$ in Equation ([29](https://arxiv.org/html/2606.17545#S4.E29)).

Algorithm 3 Discounted Delayed Payoff Evaluation

1: Input: Regression surrogate $R^{\Theta}$, path $\{x_{t_i}\}$ started at $(t_{\text{init}},x_{t_{\text{init}}})$, exercise grid $\mathbb{T}^{(ex)}$, exploration window $\Delta_{\text{wait}}$, path and contract parameters, e.g., payoff function $h$, discount rate $r$.

2: Set initial label $z\leftarrow 1-\phi^{\Theta}(t_{\text{init}},x_{t_{\text{init}}})$

3: for $t_n\in\{t\in\mathbb{T}^{(ex)}:t\geq t_{\text{init}}\}$ do

4: Compute stopping decisions $\phi_n\leftarrow\phi^{\Theta}(t_n,x_{t_n})$ based on $R^{\Theta}$ by Equation ([11](https://arxiv.org/html/2606.17545#S2.E11))

5: Set new label $z_n\leftarrow z+\phi_n\cdot\mathbf{1}_{\{z=1\}}+(1-\phi_n)\cdot\mathbf{1}_{\{z=0\}}$

6: Get $y_{\text{dpf}}\leftarrow e^{-r(t_n-t_{\text{init}})}h(x_{t_n})\cdot\left(\mathbf{1}_{\{z_n=2,\,z=1\}}+\mathbf{1}_{\{z_n=0,\,t_n-t_{\text{init}}\geq\Delta^{b}_{\text{wait}}\}}\right)$

7: Update the label $z\leftarrow z_n$

8: endfor

9: return $y_{\text{dpf}}$
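The traffic-light logic behind the delayed payoff can be written as a short, runnable Python sketch. It is an illustration under stated assumptions rather than the authors' implementation: the function returns as soon as the path is stopped, it uses the strict inequality of (28) for the waiting deadline, it exercises at the last grid time if the path was not stopped earlier, and the toy put with a stopping boundary at $x=90$ is hypothetical.

```python
import math


def delayed_payoff(times, states, stop_decision, payoff, r, wait):
    """Discounted delayed payoff for a path starting at (times[0], states[0]).

    times[1:] are the later exercise dates, with times[-1] = T.
    stop_decision(t, x) returns 1 in the stopping region and 0 otherwise.
    Labels: 0 = yellow (waiting), 1 = green (continuing), 2 = red (stopped).
    """
    t_init = times[0]
    label = 1 - stop_decision(t_init, states[0])
    last = len(times) - 1
    for n in range(1, len(times)):
        t, x = times[n], states[n]
        phi = stop_decision(t, x)
        if label == 1 and phi == 1:
            label = 2          # green path enters the stopping region
        elif label == 0 and phi == 0:
            label = 1          # yellow path reaches the continuation region
        elif label == 0 and t - t_init > wait:
            label = 2          # waiting period expired while still yellow
        if label == 2 or n == last:
            return math.exp(-r * (t - t_init)) * payoff(x)
    return payoff(states[0])   # degenerate case: the path starts at maturity


# Toy put with strike 100; the surrogate stops whenever x <= 90.
stop = lambda t, x: int(x <= 90.0)
put = lambda x: max(100.0 - x, 0.0)
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
print(delayed_payoff(grid, [88.0, 92.0, 95.0, 89.0, 80.0], stop, put, r=0.05, wait=0.3))
```

In the example the path starts yellow, turns green at $t=0.25$, and turns red at $t=0.75$, so the payoff $11$ is discounted over $0.75$ years. Subtracting the immediate payoff $h(x_{t_{\text{init}}})$ from the returned value gives the training target.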

The timing value for each training input $(t,x_t)\sim p^{[\ell]}$ is given by the difference between the discounted delayed payoff of a path $\{x_t\}$ starting at $(t,x_t)$ that progresses along the grid $\mathbb{T}^{(ex,b)}$ and its immediate payoff:

$$y^{b}(t,x_t)=y_{\text{dpf}}^{b}(t,x_t)-h(x_t). \tag{30}$$

To regulate exploration in $\mathcal{S}^{[\ell]}$ we shorten $\Delta^{b}_{\text{wait}}$ when the time to maturity $T-t_{\text{init}}$ is small or when a finer solver grid provides more stopping opportunities, setting it according to the delayed-stopping ratio $c_{\text{dlst}}$:

$$\Delta_{\text{wait}}^{b}(t):=\Delta_{\text{wait}}(t;\Delta t^{tr,b})=(c_{\text{dlst}})^{b+1}\,\Delta t^{tr,b}\left(1-\frac{t}{T}\right). \tag{31}$$

Table [A.1](https://arxiv.org/html/2606.17545#A2.T1) shows the impact of $c_{\text{dlst}}$. Suppressing delayed stopping altogether ($c_{\text{dlst}}=0$) leads to premature stopping, as the expansion of $\mathcal{C}^{[\ell]}$ is largely removed. Taking a very large $c_{\text{dlst}}=2$ is also counter-productive, as an excessive $\Delta^{b}_{\text{wait}}$ mislearns the ultimate stopping decision. Both produce poor prices. To allow sufficient exploration within $\mathcal{S}^{[\ell]}$ without introducing overly long waiting periods $\Delta^{b}_{\text{wait}}$, we select $c_{\text{dlst}}\in[1.1,1.3]$ as a balanced choice. In [[5](https://arxiv.org/html/2606.17545#bib.bib51)] we also experimented with nonlinear dependence on time-to-maturity, such as using $(1-t/T)^2$ in ([31](https://arxiv.org/html/2606.17545#S4.E31)), but this was ultimately rejected.
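The waiting window (31) is a one-line formula, and tabulating it shows how it behaves under grid refinement. The sketch below assumes, purely for illustration, a maturity $T=3$, an initial step $\Delta t^{tr,0}=1/6$, and a step that halves at each grid transition; none of these values is taken from the text above, and the function name is likewise an assumption.

```python
def waiting_window(t, b, dt_train, maturity, c_dlst):
    """Exploration window (31): c_dlst^(b+1) * dt_train * (1 - t / T)."""
    return c_dlst ** (b + 1) * dt_train * (1.0 - t / maturity)


# Assumed setup: T = 3, initial step 1/6, step halved at every grid transition.
maturity, dt0, c_dlst = 3.0, 1.0 / 6.0, 1.2
for b in range(5):
    dt_b = dt0 / 2**b
    print(b, round(waiting_window(0.0, b, dt_b, maturity, c_dlst), 5))
```

Under the halving assumption the window at $t=0$ shrinks by a factor $c_{\text{dlst}}/2$ per level, so any $c_{\text{dlst}}<2$ still shortens the wait in calendar time while lengthening it in units of grid steps.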

The exploration window $\Delta^{b}_{\text{wait}}$ introduces discontinuities in timing values. Supplementary Figure [A.1a](https://arxiv.org/html/2606.17545#A1.F1.sf1) illustrates this artifact of exploratory stopping for the B1 option. At the boundary, identified as the zero-timing-value contour of the ADNN $R^{[0]}$, the timing values jump due to the enforced exploration on yellow-labeled paths. Paths 4 and 5 in Figure [A.1b](https://arxiv.org/html/2606.17545#A1.F1.sf2) exemplify this: they have the same underlying trajectory, shifted so that their origins lie symmetrically across the boundary. The green-labeled path 4 is stopped at $\zeta$, while the yellow-labeled path 5 remains in the stopping region and is stopped at $\mathring{\tau}$. In this case, because $\zeta<\mathring{\tau}$, Equation ([30](https://arxiv.org/html/2606.17545#S4.E30)) assigns different timing values $y$ to the two paths; if the stopping times coincided, the values would be identical. The two-piece spline in Figure [A.1a](https://arxiv.org/html/2606.17545#A1.F1.sf1) is constructed on timing values $y$ from Equation ([30](https://arxiv.org/html/2606.17545#S4.E30)) and shows that this creates a substantial gap right around $B^{[\ell]}$. By construction, the ADNN fits a continuous surrogate, smoothing the above discontinuity and producing timing values that distort the stopping boundary, shown by the dashed vertical black line in Figure [A.1a](https://arxiv.org/html/2606.17545#A1.F1.sf1). The discontinuity diminishes as $\Delta^{b}_{\text{wait}}$ shrinks, which is the motivation for reducing the exploration window as $\mathbb{T}^{(ex,b)}$ is refined.

### 4.4 Stage Allocation

The quality of Stage 1 shapes the effectiveness of the RL algorithm in Stage 2. If the initial ADNN $R^{[0]}$ is poorly trained, CARLOS (Algorithm [1](https://arxiv.org/html/2606.17545#alg1)) does not converge, so the Stage 1a data set has to be sufficiently large and the initial ADNN training in Stage 1b sufficiently comprehensive. Conversely, over-allocating computational resources to Stage 1 leaves little room for further improvement in Stage 2 and incurs long runtimes. To strike this balance, two parameters need to be tuned: the number of paths $K$, which determines the size of the data used to train the ADNN $R^{[0]}$, and the initial grid-step size $\Delta t^{ex,0}$. As discussed in Section [2.2](https://arxiv.org/html/2606.17545#S2.SS2), errors in the LSMC backpropagate, so reducing $\Delta t^{ex,0}$ does not necessarily yield better data.

As shown in supplementary Table [A.2](https://arxiv.org/html/2606.17545#A2.T2), moderately large values of $K$ improve Stage 1 data quality and yield higher final prices. Since the LSMC cost scales linearly with $K$, doubling $K$, say from $2\cdot 10^{4}$ to $4\cdot 10^{4}$, doubles that portion of the runtime but delivers only minimal improvement in the stopping rule.

The number of input–output pairs $M$ controls the size of the per-loop training set; using too few training inputs may lead to instability due to overfitting the ADNN on sparse data \[[28](https://arxiv.org/html/2606.17545#bib.bib36)\]. Conversely, an excessively large $M$ increases runtime while yielding only marginal improvements in timing-value estimates and rewards. For the M2 contract, Table [A.1](https://arxiv.org/html/2606.17545#A2.T1) indicates that $M=20{,}000$ strikes this balance, since runtime rises sharply for larger $M$ without a commensurate increase in final prices.

### 4.5 Grid Transition

To guide the learning, we monitor, assess, and adjust the ADNN in each learning loop using the rewards $\Upsilon^{[\ell]}_{v}$ over the $v=1,\ldots,V$ validation paths $\{\mathbf{x}\}^{1:V}$, evaluated on the grid $\mathbb{T}^{(ex,b)}$ with the ADNN $R^{[\ell]}$. Achieving full convergence for each solver frequency $\Delta t^{tr,b}$ is unnecessary, as the marginal benefits of further accuracy at an intermediate level do not justify the additional computational cost. To decide whether to transition to the next grid level, we check whether zero belongs to the interval

$$0\in\left[\bar{D}^{[\ell]}-z_{1-\alpha/2}\,\hat{\sigma}_{D}^{[\ell]},\;\bar{D}^{[\ell]}+z_{1-\alpha/2}\,\hat{\sigma}_{D}^{[\ell]}\right],\tag{32}$$

where $\bar{D}^{[\ell]}$ and $\hat{\sigma}_{D}^{[\ell]}$ are the mean and standard deviation of $D_{v}^{[\ell]}:=\Upsilon^{[\ell]}_{v}-\Upsilon^{[\ell-1]}_{v}$, $v=1,\ldots,V$.

A very large fraction of the $D_{v}^{[\ell]}$'s (as many as 95% of the paths) are zero. First, many validation paths never yield a positive payoff. Second, some paths receive the same reward $\Upsilon^{[\ell]}_{v}=\Upsilon^{[\ell-1]}_{v}$ in consecutive loops because ADNN parameter updates do not alter the stopping decisions in Equation ([11](https://arxiv.org/html/2606.17545#S2.E11)). Accordingly, we use a Delta method to separately account for the variance in the proportion of zeros and the variance of the non-zero $D_{v}$'s. We exploit the decomposition $D_{v}=I_{\text{nz}}D_{\text{nz}}$, where $I_{\text{nz}}$ is an indicator of $D_{v}\neq 0$, and $D_{\text{nz}}$ denotes the reward difference conditional on $D_{v}\neq 0$. Let $p=\mathbb{P}(D_{v}\neq 0)$, and denote $\mu_{\text{nz}}:=\mathbb{E}[D_{\text{nz}}]$ and $\sigma_{\text{nz}}^{2}:=\mathrm{Var}(D_{\text{nz}})$. By the law of total expectation,

$$\mathbb{E}[D_{v}]=\mathbb{E}\!\left[D_{v}\mid D_{v}\neq 0\right]\mathbb{P}(D_{v}\neq 0)+\mathbb{E}\!\left[D_{v}\mid D_{v}=0\right]\mathbb{P}(D_{v}=0)=p\,\mu_{\text{nz}};\tag{33}$$

$$\mathbb{E}[D^{2}_{v}]=\mathbb{E}\!\left[D^{2}_{v}\mid D_{v}\neq 0\right]\mathbb{P}(D_{v}\neq 0)=p\,\mathbb{E}\!\left[D_{\text{nz}}^{2}\right]=p\left(\sigma_{\text{nz}}^{2}+\mu_{\text{nz}}^{2}\right).\tag{34}$$

It follows that the variance of the i.i.d. sample average $\bar{D}^{[\ell]}=\frac{1}{V}\sum_{v}D_{v}^{[\ell]}$ is

$$\mathrm{Var}\left(\bar{D}^{[\ell]}\right)=\mathbb{E}\big[(\bar{D}^{[\ell]})^{2}\big]-\mathbb{E}\big[\bar{D}^{[\ell]}\big]^{2}=\frac{p}{V}\sigma_{\text{nz}}^{2}+\frac{p(1-p)}{V}\mu_{\text{nz}}^{2}.\tag{35}$$

Replacing $p$, $\mu_{\text{nz}}$, and $\sigma_{\text{nz}}^{2}$ in Equation ([35](https://arxiv.org/html/2606.17545#S4.E35)) with their sample counterparts $\hat{p}^{[\ell]}$, $\bar{D}_{\text{nz}}^{[\ell]}$, and $(s^{[\ell]}_{\text{nz}})^{2}$ gives the estimated standard error $\hat{\sigma}^{[\ell]}=\sqrt{\widehat{\mathrm{Var}}(\bar{D}^{[\ell]})}$. The resulting normal-approximation confidence interval at level $100(1-\alpha)\%$ is $\bar{D}^{[\ell]}\pm z_{1-\alpha/2}\hat{\sigma}^{[\ell]}$, where $z_{1-\alpha/2}$ denotes the $(1-\alpha/2)$-quantile of the standard normal distribution. Unless stated otherwise, we take $\alpha=0.05$, resulting in a $95\%$ stopping interval. If the interval in Equation ([32](https://arxiv.org/html/2606.17545#S4.E32)) contains zero, the reward change is not significant and the grid is advanced to the next level.
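As an illustration of the transition rule, the following Python sketch plugs sample estimates into Equation (35) and checks whether the interval in Equation (32) contains zero. The function name `transition_test` and its return convention are hypothetical and not taken from the authors' code. It assumes the per-path reward differences $D_v^{[\ell]}$ are supplied as a plain list, and it sets the non-zero variance to zero when fewer than two differences are non-zero.

```python
from math import sqrt
from statistics import NormalDist, fmean, variance


def transition_test(d, alpha=0.05):
    """Return (lower, upper, advance) for a list of reward differences d.

    Hypothetical helper: 'advance' is True when zero lies inside the
    normal-approximation interval, i.e. the reward change is not significant.
    """
    V = len(d)
    nz = [x for x in d if x != 0.0]          # non-zero differences only
    p_hat = len(nz) / V                      # estimate of p = P(D_v != 0)
    mean_nz = fmean(nz) if nz else 0.0       # estimate of mu_nz
    # Assumption: with fewer than two non-zero values the variance is set to 0.
    var_nz = variance(nz) if len(nz) > 1 else 0.0
    # Plug-in version of Eq. (35): p/V * sigma_nz^2 + p(1-p)/V * mu_nz^2
    var_mean = p_hat / V * var_nz + p_hat * (1.0 - p_hat) / V * mean_nz ** 2
    se = sqrt(var_mean)
    d_bar = p_hat * mean_nz                  # equals the plain sample mean of d
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    lower, upper = d_bar - z * se, d_bar + z * se
    return lower, upper, lower <= 0.0 <= upper
```

With 50 differences equal to one and 950 zeros, for example, the interval is centred at 0.05 and excludes zero, so under this sketch the algorithm would keep training on the current grid.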

### 4.6 Grid Schedule

As a last ingredient of CARLOS, we discuss the construction of the grid schedule $\mathbb{T}^{(tr,b)}$, $b=0,\ldots,B$. Recall that the coarsest grid $\mathbb{T}^{(tr,0)}$ is used in Stage 1 to generate good-enough training data for fitting the initial ADNN, while Stage 2 refines the timing value estimates on progressively denser grids. In constructing $\{\mathbb{T}^{(tr,b)}\}_{b=0}^{B}$, we must both limit concept drift and exploit opportunities to speed up the RL Algorithm [1](https://arxiv.org/html/2606.17545#alg1). In particular, we prefer maximal progress on coarse solver grids, where training is computationally inexpensive, while limiting iterations on fine grids, where each loop is costly due to the high overhead of generating training samples $\mathbf{y}$. The resulting considerations are investigated in Table [5](https://arxiv.org/html/2606.17545#S4.T5).

| Step Size Schedule $\Delta t^{tr,b}$ | Price | Loops $L$ | Runtime |
| --- | --- | --- | --- |
| $(\frac{1}{3},\frac{1}{6},\frac{1}{12},\frac{1}{24},\frac{1}{48},\frac{1}{96},\frac{1}{192})$ | $14.1552_{(0.008)}$ | $11.6_{(2.33)}$ | $68.96_{(11.58)}$ |
| $(\frac{1}{3},\frac{1}{6},\frac{1}{12},\frac{1}{24},\frac{1}{48},\frac{1}{96})$ | $14.1616_{(0.009)}$ | $10.6_{(2.33)}$ | $52.85_{(10.94)}$ |
| $(\frac{1}{3},\frac{1}{12},\frac{1}{48},\frac{1}{192})$ | $14.1482_{(0.007)}$ | $6.8_{(0.75)}$ | $54.21_{(8.45)}$ |
| $(\frac{1}{6},\frac{1}{12},\frac{1}{24},\frac{1}{48},\frac{1}{96},\frac{1}{192})$ | $14.1571_{(0.024)}$ | $11.0_{(1.90)}$ | $89.10_{(11.06)}$ |
| $\mathbf{(\frac{1}{6},\frac{1}{12},\frac{1}{24},\frac{1}{48},\frac{1}{96})}$ | $14.1733_{(0.022)}$ | $9.4_{(1.96)}$ | $68.51_{(7.92)}$ |
| $(\frac{1}{6},\frac{1}{24},\frac{1}{96})$ | $14.1692_{(0.008)}$ | $6.4_{(0.80)}$ | $56.73_{(1.86)}$ |
| $(\frac{1}{6},\frac{1}{96})$ | $14.1499_{(0.034)}$ | $5.6_{(1.02)}$ | $58.46_{(5.42)}$ |
| $(\frac{1}{12},\frac{1}{24},\frac{1}{48},\frac{1}{96},\frac{1}{192})$ | $14.1782_{(0.010)}$ | $7.6_{(0.80)}$ | $119.25_{(3.44)}$ |
| $(\frac{1}{12},\frac{1}{24},\frac{1}{48},\frac{1}{96})$ | $14.1614_{(0.012)}$ | $6.0_{(0.63)}$ | $96.11_{(2.67)}$ |
| $(\frac{1}{12},\frac{1}{48},\frac{1}{192})$ | $14.1599_{(0.007)}$ | $5.2_{(0.40)}$ | $107.43_{(4.80)}$ |
| Halve every 1 loop | $14.1528_{(0.015)}$ | $5.0_{(0.00)}$ | $55.51_{(0.37)}$ |
| Halve every 2 loops | $14.1660_{(0.017)}$ | $10.00_{(0.00)}$ | $71.41_{(1.92)}$ |
| Halve every 1 + Adaptive on $\mathbb{T}^{(ex,B)}$ | $14.1506_{(0.016)}$ | $5.2_{(0.40)}$ | $59.27_{(2.47)}$ |
| Halve every 2 + Adaptive on $\mathbb{T}^{(ex,B)}$ | $14.1636_{(0.016)}$ | $9.8_{(0.75)}$ | $72.13_{(3.98)}$ |

Table 5: Tuning of the solver grid schedule $\{\mathbb{T}^{(tr,b)}\}$ for the M2 contract in Table [2](https://arxiv.org/html/2606.17545#S3.T2). See main text for explanation of the hybrid scheme in the last two rows. The base choice is bolded.

It is not necessary to train on very fine solver grids: the stopping boundaries converge rapidly, so accurate American prices can be obtained without taking minuscule $\Delta t^{tr}$. We find that $\Delta t^{tr,B}=\tfrac{1}{150}$ is sufficient for typical stock volatilities. Table [5](https://arxiv.org/html/2606.17545#S4.T5) shows that for the M2 contract, although the ADNN is ultimately evaluated on $\Delta t^{ex}=\tfrac{1}{192}$, comparable prices are obtained by terminating the RL already at $\Delta t^{tr,B}=\tfrac{1}{96}$. Since RL loops on the finest grids are the most time-consuming (and most prone to instabilities that inflate loop counts), this also saves nearly 30% of runtime.

For the initial grid size, a good rule of thumb is to use $N=20$ steps. Using a coarser $\mathbb{T}^{(tr,0)}$ increases the risk of concept drift and ultimately leads to lower rewards, as evidenced in Table [5](https://arxiv.org/html/2606.17545#S4.T5). Taking $N\gg 20$ steps is counterproductive: LS errors back-propagate, so a finer discretization does not improve the Stage 1 data while adding computational cost.

In terms of grid refinement, our default is to halve the step size at each transition. Decreasing the step size more rapidly, e.g., taking $\Delta t^{tr,b+1}=\tfrac{1}{4}\Delta t^{tr,b}$ or $\Delta t^{tr,b+1}=\tfrac{1}{8}\Delta t^{tr,b}$, decreases the overall number of learning loops $L$. However, as shown in Table [5](https://arxiv.org/html/2606.17545#S4.T5), the runtime savings remain modest because the reduced loop count mainly shifts training away from computationally cheaper coarse grids toward more expensive fine-grid RL loops. Aggressive schedules also increase concept drift, which makes NN instabilities more likely, and amplify their impact, since the algorithm has fewer chances to self-correct. As a result, $\Delta t^{tr,b}=0.5^{b}\cdot\Delta t^{tr,0}$ is empirically the best choice.
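To make the geometric refinement concrete, the sketch below generates a solver-grid schedule from an initial step size, a refinement factor, and a finest admissible step. The helper `solver_schedule` and its arguments are hypothetical names chosen for illustration; exact fractions are used only so that the step sizes can be compared against the schedules listed in Table 5.

```python
from fractions import Fraction


def solver_schedule(dt0, dt_min, factor=Fraction(1, 2)):
    """Return step sizes dt0, factor*dt0, ... that stay at or above dt_min.

    Hypothetical helper: factor=1/2 is the halving default; 1/4 or 1/8
    correspond to the more aggressive schedules discussed in the text.
    """
    steps = [dt0]
    while steps[-1] * factor >= dt_min:
        steps.append(steps[-1] * factor)
    return steps


# Halving from 1/6 down to 1/96 gives five grid levels.
halving = solver_schedule(Fraction(1, 6), Fraction(1, 96))
# Quartering over the same range gives only three levels.
quartering = solver_schedule(Fraction(1, 6), Fraction(1, 96), Fraction(1, 4))
```

Under these inputs, `halving` has the step sizes of the bolded base row of Table 5, while `quartering` matches the row with steps 1/6, 1/24, 1/96.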

To assess the effectiveness of the stopping-interval-based transitions introduced in Section [3.2](https://arxiv.org/html/2606.17545#S3.SS2), we consider a version of our RL algorithm with a pre-specified number of learning loops per solver grid. As shown in Table [5](https://arxiv.org/html/2606.17545#S4.T5), configurations with deterministic grid transition schedules perform worse, as some of the RL flexibility is removed. As a further option, we implemented a hybrid rule: deterministic transitions with a fixed number of $\ell\in[1,2]$ loops for intermediate solver grids, while on the finest grid $\mathbb{T}^{(tr,B)}$ we only conclude learning based on the adaptive rule in ([32](https://arxiv.org/html/2606.17545#S4.E32)). This hybrid transition scheme performs on par with the fixed-loop variant. Taken together, these results indicate that the adaptive grid transitioning is effective and beats deterministic alternatives.

### 4.7 Tuning Guidelines

While CARLOS has many tuning parameters, most of them can be set to default values and need not be adjusted according to contract specifics. Based on our extensive experiments, we make the following recommendations. The overarching philosophy is to minimize computational effort without compromising learning stability. Table [2](https://arxiv.org/html/2606.17545#S3.T2) then summarizes the contract-specific settings of the CARLOS algorithm based on the guidelines below.

Stage 1 should employ a sufficiently large number of paths $K$, commensurate with the state dimension $d$. As a baseline, we recommend setting $K=10^{4}\times d$. Larger $K$ is needed for deep OTM contracts, to ensure enough informative input-output pairs for ADNN initialization. Similarly, the number of input–output pairs $M$ should scale with the contract dimension $d$; we recommend $M=10^{4}\times d$. Although a larger $M$ can improve the price estimate and reduce the standard error, the resulting gains are generally too modest to justify the additional computational cost.

As a baseline, ReLU serves as a robust activation function. The ADNN hidden-layer width should scale with the option dimension $d$. Specifically, we set the number of nodes to $\max\{30\times d,60\}$, implying that even one-dimensional contracts use 60 nodes. Finally, we recommend a batch size of 64, with the number of training epochs increasing for higher-dimensional options. For one- and two-dimensional contracts, we use 5 training epochs, increasing to 10 for $d\in\{3,4,5\}$, cf. Table [6](https://arxiv.org/html/2606.17545#S4.T6).
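The dimension-based baseline rules stated above can be collected in a small helper, sketched here in Python. The class `BaselineSettings` and the function `baseline_settings` are hypothetical names; the sketch encodes only the rules of thumb from the text, whereas the contract-specific values in Table 6 may deviate from them, and the behaviour for $d>5$ is an assumption since the text only covers $d\in\{1,\ldots,5\}$.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaselineSettings:
    stage1_paths: int      # K, number of Stage 1 paths
    training_inputs: int   # M, input-output pairs per RL loop
    hidden_nodes: int      # ADNN hidden-layer width
    epochs: int            # training epochs
    batch_size: int        # mini-batch size
    activation: str        # activation function


def baseline_settings(d):
    """Rule-of-thumb settings for a contract with state dimension d."""
    if d < 1:
        raise ValueError("state dimension d must be a positive integer")
    return BaselineSettings(
        stage1_paths=10**4 * d,          # K = 10^4 x d
        training_inputs=10**4 * d,       # M = 10^4 x d
        hidden_nodes=max(30 * d, 60),    # at least 60 nodes
        # 5 epochs for d <= 2, 10 for d in {3, 4, 5}; reusing 10 for d > 5
        # is an assumption, not a recommendation from the text.
        epochs=5 if d <= 2 else 10,
        batch_size=64,
        activation="relu",
    )
```

For instance, `baseline_settings(5)` yields 50,000 Stage 1 paths and 150 hidden nodes, in line with the M5a/b column of Table 6.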

For the training density, we favor relatively high exploration initially, $\lambda_{\text{exl}}^{[0]}=0.55$, $\lambda_{+}^{[0]}=\lambda_{-}^{[0]}=0.2$ and $\lambda_{\text{ter}}^{[0]}=0.05$ in ([20](https://arxiv.org/html/2606.17545#S4.E20)), coupled with a gradual reallocation toward exploitation through $c_{\text{expl}}=0.25$ in Equation ([25](https://arxiv.org/html/2606.17545#S4.E25)). A moderately large number of validation paths $V$ is needed to ensure stable grid transitions via the adaptive rule ([32](https://arxiv.org/html/2606.17545#S4.E32)). Note that a larger $V$ makes the transition rule stricter (since the confidence interval shrinks), which can ultimately reduce runtime by avoiding premature transitions to finer grids, where ADNN training is more computationally expensive; see Table [A.1](https://arxiv.org/html/2606.17545#A2.T1).

For the grid schedule, we start with roughly $N=20$ grid steps in Stage 1 and halve the step size at each transition, up to around $\Delta t^{tr}\simeq\tfrac{1}{150}$, even if the exercise grid $\Delta t^{ex}$ is denser. Each grid transition makes further improvement of the timing value estimates roughly twice as computationally expensive, so the price gains from further refinement saturate rapidly relative to the added runtime. An initial learning rate $\eta^{[0]}=10^{-4}$ is a good default, reduced by the decay factor $\alpha_{\text{dec}}=0.7$ at each grid transition. This allows for larger ADNN updates on the coarser grids, while yielding more measured updates on the finer grids, where the timing value estimates are mainly refined near the stopping boundary.
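A minimal sketch of the learning-rate decay described above, assuming that "reduced by the decay factor" means multiplying the rate by $\alpha_{\text{dec}}$ once per grid transition, so that grid level $b$ uses $\eta^{[0]}\alpha_{\text{dec}}^{\,b}$. The function name `learning_rates` is hypothetical.

```python
def learning_rates(num_levels, eta0=1e-4, decay=0.7):
    """Learning rate per grid level b = 0, ..., num_levels - 1.

    Assumption: one multiplicative decay step per grid transition,
    i.e. eta_b = eta0 * decay**b.
    """
    return [eta0 * decay ** b for b in range(num_levels)]


# Five grid levels, as in the base schedule with steps 1/6 down to 1/96.
rates = learning_rates(5)
```

Under this reading, the rate on the fifth grid level is roughly a quarter of the initial one ($0.7^{4}\approx 0.24$), which is what yields the more measured updates on fine grids.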

The waiting period $\Delta^{b}_{\text{wait}}$ enforces exploration in the stopping region, and should decrease sublinearly as the step size shrinks. To preserve sufficient exploration for updating timing value estimates inside $\mathcal{S}^{[\ell]}$, we recommend $c_{\text{dlst}}=1.3$ in ([31](https://arxiv.org/html/2606.17545#S4.E31)), assuming that $\Delta t^{tr,b}=\Delta t^{tr,0}/2^{b}$.

Ultimately, one may run the algorithm in a high-fidelity mode, where accuracy and stability are paramount. In that case, one should use a wide/deep network, large $K$, $M$, $V$, and many epochs. If speed is equally important, the values in Table [6](https://arxiv.org/html/2606.17545#S4.T6), used in our main benchmarking, offer good accuracy-runtime trade-offs and a guide for other contracts.

| Parameter | B1 | B2 | M2a/b | M3 | M5a/b |
| --- | --- | --- | --- | --- | --- |
| Stage 1 paths $K$ | 10,000 | 10,000 | 20,000 | 30,000 | 50,000 |
| NN Nodes $q$ | 60 | 60 | 60 | 90 | 150 |
| NN Epochs | 5 | 10 | 10 | 10 | 10 |
| RL training inputs $M$ | 10,000 | 20,000 | 20,000 | 30,000 | 50,000 |

Table 6: Contract-specific parameter settings for the CARLOS algorithm, cf. Table [2](https://arxiv.org/html/2606.17545#S3.T2).

## 5 Conclusion

The CARLOS algorithm makes it possible, for the first time, to solve continuous-time optimal stopping problems within a Monte Carlo simulation-based framework. To do so, we leveraged neural network techniques to train a single space-time aggregate surrogate that provides a stopping rule for any $t$. We then employed reinforcement learning logic that simultaneously trains this surrogate while gradually refining the exercise grid. Our main innovation is the iterative refinement, which serves a dual purpose, “killing two birds with one stone”. First, it integrates with the mini-batch-based NN training while controlling for concept drift, i.e., the shift in the timing-value distribution as the time grid changes. Second, it substantially lowers the running time compared to the brute-force alternative of directly training a fine-grid ADNN. Instead, in CARLOS most of the training is done on coarser grids, achieving significant speed-up *and* higher expected payoffs. As we show, 5 to 12 RL loops are generally sufficient to numerically converge to the American-style formulation and provide an accurate continuous-time stopping boundary. Once RL training is complete, the ADNN can price American options on arbitrarily fine time grids, effectively enabling continuous-time stopping. For cases where PDE-based benchmarks are unavailable, CARLOS produces prices higher than all previously reported values in the literature.

Two other important innovations in CARLOS are the exploratory stopping and the adaptive training samples\. The continuation region expands as the grid is refined, motivating us to introduce a “waiting interval” for training inputs that start in the stopping region, in order to allow the ADNN to shift the stopping boundaries “outward”\. In addition, we prioritize ADNN training near the estimated stopping boundary—where even small timing\-value updates matter—and near the contract maturity where timing values are very close to zero and it is important to stabilize them\. To that end, we use a mix of exploratory, exploitative, and terminal inputs based on anchor sets\.

Looking ahead, two aspects of the method could be investigated further. First, while we restrict the ADNN to fully connected feedforward networks, the optimal-stopping literature has explored alternative architectures, including convolutional NNs \[[32](https://arxiv.org/html/2606.17545#bib.bib34)\], Long Short-Term Memory (LSTM) networks \[[10](https://arxiv.org/html/2606.17545#bib.bib32)\], recurrent NNs \[[9](https://arxiv.org/html/2606.17545#bib.bib47)\], and randomized neural networks \[[17](https://arxiv.org/html/2606.17545#bib.bib21)\]. In particular, CNNs may be an appealing alternative for the ADNN because they are designed to recognize local patterns, which could help with adaptively refining the near-maturity region without substantially altering the learned timing-value estimates across the rest of the horizon. Second, additional ideas can be leveraged for constructing the training inputs, for instance exploiting contract specifics such as the ATM diagonal present in max-call options, or using batched training designs \[[27](https://arxiv.org/html/2606.17545#bib.bib58)\].

It would be worthwhile to generalize the idea of gradual and adaptive temporal grid refinement underlying CARLOS to related control problems, including multiple\-stopping \(swing option pricing\)\[[21](https://arxiv.org/html/2606.17545#bib.bib56),[11](https://arxiv.org/html/2606.17545#bib.bib62)\], optimal switching\[[18](https://arxiv.org/html/2606.17545#bib.bib59)\]and optimal impulse control\[[25](https://arxiv.org/html/2606.17545#bib.bib63)\]\. The approach should also prove fruitful for settings where time discretization is a limitation, such as deep BSDE solvers\[[14](https://arxiv.org/html/2606.17545#bib.bib60)\]\.

## References

- \[1\] L. Andersen and M. Broadie (2004). Primal-dual simulation algorithm for pricing multidimensional American options. Management Science 50(9), pp. 1222–1234. ISSN 0025-1909. [doi:10.1287/mnsc.1040.0258](https://doi.org/10.1287/mnsc.1040.0258)
- \[2\] J. G. Andréasson and P. V. Shevchenko (2022). A bias-corrected least-squares Monte Carlo for solving multi-period utility models. European Actuarial Journal 12(1), pp. 349–379. ISSN 2190-9733. [doi:10.2139/ssrn.2985828](https://doi.org/10.2139/ssrn.2985828)
- \[3\] S. Becker, P. Cheridito, A. Jentzen, and T. Welti (2021). Solving high-dimensional optimal stopping problems using deep learning. European Journal of Applied Mathematics 32(3), pp. 470–514. ISSN 0956-7925. [doi:10.1017/S0956792521000073](https://doi.org/10.1017/S0956792521000073)
- \[4\] S. Becker, P. Cheridito, and A. Jentzen (2019). Deep optimal stopping. Journal of Machine Learning Research 20(1), pp. 2712–2736. ISSN 1532-4435. [Link](https://dl.acm.org/doi/10.5555/3322706.3362015)
- \[5\] C. Borsa (2026). American option pricing in continuous time via reinforcement learning. Ph.D. Thesis, UC Santa Barbara.
- \[6\] Y. Chen and J. W. Wan (2021). Deep neural network framework based on backward stochastic differential equations for pricing and hedging American options in high dimensions. Quantitative Finance 21(1), pp. 45–67.
- \[7\] M. Dai, Y. Sun, Z. Q. Xu, and X. Y. Zhou (2026). Learning to optimally stop diffusion processes, with financial applications. Management Science.
- \[8\] R. Daluiso, E. Nastasi, A. Pallavicini, and G. Sartorelli (2024). Swing option pricing consistent with futures smiles. Applied Stochastic Models in Business and Industry 40(2), pp. 224–242.
- \[9\] N. Damera-Venkata and C. Bhattacharyya (2023). Deep recurrent optimal stopping. Advances in Neural Information Processing Systems 36, pp. 12222–12244.
- \[10\] (2021). Deep reinforcement learning for optimal stopping with application in financial engineering. Cornell University Library, arXiv.org.
- \[11\] T. Deschatre and J. Mikael (2022). Deep combinatorial optimisation for optimal stopping time problems: application to swing options pricing. MathematicS In Action 11(1), pp. 243–258.
- \[12\] P. Dupuis and H. Wang (2005). On the convergence from discrete to continuous time in an optimal stopping problem. The Annals of Applied Probability 15(2), pp. 1339–1366. ISSN 1050-5164. [doi:10.1214/105051605000000034](https://doi.org/10.1214/105051605000000034)
- \[13\] J. Ery and L. Michel (2024). Solving optimal stopping problems with deep Q-learning. Cornell University Library, arXiv.org, Ithaca. arXiv:2101.09682. ISSN 2331-8422. [doi:10.48550/arXiv.2101.09682](https://doi.org/10.48550/arXiv.2101.09682)
- \[14\] C. Gao, S. Gao, R. Hu, and Z. Zhu (2023). Convergence of the backward deep BSDE method with applications to optimal stopping problems. SIAM Journal on Financial Mathematics 14(4), pp. 1290–1303.
- \[15\] L. Gonon (2024). Deep neural network expressivity for optimal stopping problems. Finance and Stochastics 28(3), pp. 865–910.
- \[16\] I. Guo, N. Langrené, and J. Wu (2025). Simultaneous upper and lower bounds of American-style option prices with hedging via neural networks. Quantitative Finance 25(4), pp. 509–525.
- \[17\] C. Herrera, F. Krach, P. Ruyssen, and J. Teichmann (2024). Optimal stopping via randomized neural networks. Frontiers of Mathematical Finance 3(1), pp. 31–77.
- \[18\] R. Hu (2020). Deep learning for ranking response surfaces with applications to optimal stopping problems. Quantitative Finance 20(9), pp. 1567–1581.
- \[19\] M. Kohler, A. Krzyżak, and N. Todorovic (2010). Pricing of high-dimensional American options by neural networks. Mathematical Finance 20(3), pp. 383–410. ISSN 0960-1627. [doi:10.1111/j.1467-9965.2010.00404.x](https://doi.org/10.1111/j.1467-9965.2010.00404.x)
- \[20\] B. Lapeyre and J. Lelong (2021). Neural network regression for Bermudan option pricing. Monte Carlo Methods and Applications 27(3), pp. 227–247. [Link](https://arxiv.org/abs/1907.06474)
- \[21\] M. Laurière and M. Talbi (2025). Deep learning for the multiple optimal stopping problem. arXiv preprint arXiv:2512.22961.
- \[22\] X. Li and C. Lee (2023). $\Delta$V-learning: an adaptive reinforcement learning algorithm for the optimal stopping problem. Expert Systems with Applications 231, pp. 120702. ISSN 0957-4174. [doi:10.1016/j.eswa.2023.120702](https://doi.org/10.1016/j.eswa.2023.120702)
- \[23\] Y. Li, C. Szepesvari, and D. Schuurmans (2009). Learning exercise policies for American options. In Artificial intelligence and statistics, pp. 352–359.
- \[24\] F. A. Longstaff and E. S. Schwartz (2001). Valuing American options by simulation: a simple least-squares approach. The Review of Financial Studies 14(1), pp. 113–147. ISSN 0893-9454. [doi:10.1093/rfs/14.1.113](https://doi.org/10.1093/rfs/14.1.113)
- \[25\] M. Ludkovski (2022). Regression Monte Carlo for impulse control. MathematicS In Action 11(1), pp. 73–90.
- \[26\] M. Ludkovski (2023). mlOSP: towards a unified implementation of regression Monte Carlo algorithms. Journal of Computational Finance 17(1).
- \[27\] X. Lyu and M. Ludkovski (2022). Adaptive batching for Gaussian process surrogates with application in noisy level set estimation. Statistical Analysis and Data Mining: The ASA Data Science Journal 15(2), pp. 225–246.
- \[28\] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis (2015). Human-level control through deep reinforcement learning. Nature (London) 518(7540), pp. 529–533. ISSN 0028-0836. [doi:10.1038/nature14236](https://doi.org/10.1038/nature14236)
- \[29\] G. Peskir and A. Shiryaev (2006). Optimal stopping and free-boundary problems. Lectures in mathematics ETH Zurich, Birkhauser Verlag, Basel. ISBN 3764324198, LCCN 2006049876.
- \[30\] A. M. Reppen, H. M. Soner, and V. Tissot-Daguette (2025). Neural optimal stopping boundary. Mathematical Finance 35, pp. 100–128. [Link](https://arxiv.org/abs/2205.04595)
- \[31\] J. Sirignano and K. Spiliopoulos (2018). DGM: a deep learning algorithm for solving partial differential equations. Journal of Computational Physics 375, pp. 1339–1364.
- \[32\] (2022). Solving the optimal stopping problem with reinforcement learning: an application in financial option exercise. pp. 1–8.
- \[33\] J. N. Tsitsiklis and B. Van Roy (2002). Optimal stopping of Markov processes: Hilbert space theory, approximation algorithms, and an application to pricing high-dimensional financial derivatives. IEEE Transactions on Automatic Control 44(10), pp. 1840–1851.
- \[34\] P. Wilmott, S. Howison, and J. Dewynne (1995). The mathematics of financial derivatives: a student introduction. Cambridge University Press, Oxford. ISBN 0521496993, LCCN 95016466.
- \[35\] J. Yang and G. Li (2024). A deep primal-dual BSDE method for optimal stopping problems. arXiv preprint arXiv:2409.06937.
- \[36\] J. Yang and G. Li (2025). Gradient-enhanced sparse Hermite polynomial expansions for pricing and hedging high-dimensional American options. SIAM Journal on Financial Mathematics 16(3), pp. 959–987. [doi:10.1137/24M1659066](https://doi.org/10.1137/24M1659066)
- \[37\]X\. Yang, A\. Kratsios, F\. Krach, M\. Grasselli, and A\. Lucchi\(2026\)Synchronizing pretrained kernel regressors with applications to American option pricing\.Frontiers of Mathematical Finance8,pp\. 23–77\.Cited by:[§1\.1](https://arxiv.org/html/2606.17545#S1.SS1.p4.1)\.
- \[38\]H\. Yu and D\. P\. Bertsekas\(2007\)Q\-learning algorithms for optimal stopping based on least squares\.In2007 European Control Conference \(ECC\),pp\. 2368–2375\.Cited by:[§1\.1](https://arxiv.org/html/2606.17545#S1.SS1.p2.4)\.
- \[39\]R\. Zhang, N\. Langrené, Y\. Tian, Z\. Zhu, F\. Klebaner, and K\. Hamza\(2019\)Dynamic portfolio optimization with liquidity cost and market impact: a simulation\-and\-regression approach\.Quantitative Finance19\(3\),pp\. 519–532\.Cited by:[§1\.2](https://arxiv.org/html/2606.17545#S1.SS2.p6.1)\.

## Appendix A Supplementary Plots

![Refer to caption](https://arxiv.org/html/2606.17545v1/x8.png)\(a\)
![Refer to caption](https://arxiv.org/html/2606.17545v1/x9.png)\(b\)

Figure A.1: Exploratory stopping for the B1 Put contract.
(a) Timing values $\hat{\mathscr{T}}$ estimated at $t=0.89$ on $1{,}000$ discretized locations $x$ spanning $[30,40]$. The three curves show estimates from the ADNN $R^{[0]}$ at Stage 1 ($\ell=0$, blue), the ADNN $R^{[1,0]}$ after the first learning loop ($\ell=1$, black), and a segmented cubic smoothing spline (red) fitted to the outputs of Equation ([30](https://arxiv.org/html/2606.17545#S4.E30)). To capture the piecewise structure of $y_{\text{dpf}}$ in ([29](https://arxiv.org/html/2606.17545#S4.E29)), two splines are fitted separately. Each output $y$ is the average timing value of $10{,}000$ simulated paths starting at $(t,x)$ with exercise frequency $\Delta t^{ex}=1/12$, subjected to a waiting period $\Delta_{\text{wait}}=0.065$, and stopped according to the ADNN $R^{[0]}$. Dashed vertical lines mark the zero-timing-value contours.
(b) Sample paths (at the finest frequency $\Delta t^{ex}=1/192$) illustrating the path labeling in Algorithm [3](https://arxiv.org/html/2606.17545#alg3). Yellow paths must exit the stopping region within a waiting period of $\Delta_{\text{wait}}=1/12$, indicated by vertical dashed gray lines. The stopping boundary is derived from the final ADNN after Stage 2.

![Refer to caption](https://arxiv.org/html/2606.17545v1/new-fig/B1-Stopping-Times-Histogram-Actual.png)(a) Paths initialized at $X_0$. Approximately 76.7% of these paths are stopped.
![Refer to caption](https://arxiv.org/html/2606.17545v1/new-fig/B1-Stopping-Times-Histogram-Adjusted.png)(b) Paths initialized at $x_0^{p}$, cf. Equation ([21](https://arxiv.org/html/2606.17545#S4.E21)). Approximately 65.1% of these paths are stopped.

Figure A.2: Distributions of the optimal stopping time $\tau^{*}_{0}$ for the B1 contract in Table [2](https://arxiv.org/html/2606.17545#S3.T2) across PDE solver and exercise grids, estimated from $1.6\times 10^{6}$ Monte Carlo paths. The underlying stopping boundaries are computed with the Crank–Nicolson scheme with $\Delta t^{ex}=\Delta t^{\text{PDE}}=1/192$ and $\Delta x=0.02$.

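
The averaging step described in the caption of Figure A.1(a) can be illustrated with a minimal Python sketch. Several ingredients are assumptions made only for illustration: the contract parameters (`strike`, `r`, `sigma`, `maturity`) are placeholders rather than the B1 values from Table 2, the timing value is taken to be the mean discounted payoff under the stopping rule minus the immediate payoff, the dynamics are geometric Brownian motion, and the hypothetical `stop_rule` callable (a fixed threshold in the usage line) stands in for the trained ADNN.

```python
import numpy as np

def timing_value_estimate(t, x, stop_rule, *, strike=40.0, r=0.06, sigma=0.2,
                          maturity=1.0, dt_ex=1/12, wait=0.065,
                          n_paths=10_000, seed=0):
    """Monte Carlo timing-value estimate for a put at (t, x).

    All default parameter values are illustrative placeholders.
    stop_rule(date, prices) must return a boolean array (True = stop).
    """
    rng = np.random.default_rng(seed)
    # Exercise grid {dt_ex, 2*dt_ex, ..., maturity}; keep only the dates
    # that fall after the waiting period has elapsed.
    n_dates = int(round(maturity / dt_ex))
    grid = dt_ex * np.arange(1, n_dates + 1)
    dates = grid[grid >= t + wait - 1e-12]
    if dates.size == 0:
        raise ValueError("no exercise date left after the waiting period")

    def payoff(s):
        return np.maximum(strike - s, 0.0)

    s = np.full(n_paths, float(x))
    value = np.zeros(n_paths)            # payoff discounted back to time t
    alive = np.ones(n_paths, dtype=bool)
    prev = t
    for k, date in enumerate(dates):
        dt = date - prev
        z = rng.standard_normal(n_paths)
        # Exact GBM transition between consecutive exercise dates.
        s = s * np.exp((r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z)
        is_last = k == len(dates) - 1
        # Stop where the rule says so; every surviving path stops at maturity.
        stop_now = alive & np.logical_or(is_last, stop_rule(date, s))
        value[stop_now] = np.exp(-r * (date - t)) * payoff(s[stop_now])
        alive &= ~stop_now
        prev = date
    return value.mean() - payoff(x)

# Usage with a hypothetical fixed-threshold rule in place of the ADNN.
print(timing_value_estimate(0.89, 35.0, lambda date, s: s <= 36.0))
```

With these placeholder settings, $t=0.89$ and $\Delta_{\text{wait}}=0.065$ leave maturity as the only admissible date on the $1/12$ grid, so the stopping rule only matters for earlier starting times or shorter waiting periods.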
## Appendix B Supplementary Tables

| Setting | Price | Loops $L$ | Runtime |
|---|---|---|---|
| **Input-output pairs $M$** | | | |
| $M=10{,}000$ | $14.1646_{(0.011)}$ | $8.8_{(1.94)}$ | $44.96_{(2.36)}$ |
| $M=20{,}000$ | $14.1714_{(0.015)}$ | $9.0_{(1.67)}$ | $57.00_{(5.11)}$ |
| $M=40{,}000$ | $14.1637_{(0.015)}$ | $8.0_{(0.63)}$ | $72.10_{(2.04)}$ |
| **Validation paths $V$** | | | |
| $V=12{,}500$ | $14.1619_{(0.025)}$ | $8.4_{(1.02)}$ | $62.20_{(3.14)}$ |
| $V=25{,}000$ | $14.1733_{(0.022)}$ | $9.4_{(1.96)}$ | $68.51_{(7.92)}$ |
| $V=50{,}000$ | $14.1688_{(0.004)}$ | $8.2_{(2.48)}$ | $67.52_{(6.28)}$ |
| **Exploration window ([31](https://arxiv.org/html/2606.17545#S4.E31))** | | | |
| $c_{\text{dlst}}=0$ | $14.1573_{(0.012)}$ | $10.2_{(2.04)}$ | $58.45_{(6.79)}$ |
| $c_{\text{dlst}}=1.0$ | $14.1621_{(0.016)}$ | $9.2_{(2.40)}$ | $57.05_{(8.90)}$ |
| $c_{\text{dlst}}=1.1$ | $14.1686_{(0.020)}$ | $8.8_{(0.75)}$ | $56.43_{(2.67)}$ |
| $c_{\text{dlst}}=1.3$ | $14.1714_{(0.015)}$ | $9.0_{(1.67)}$ | $57.00_{(5.11)}$ |
| $c_{\text{dlst}}=1.5$ | $14.1524_{(0.023)}$ | $8.4_{(2.06)}$ | $53.08_{(4.88)}$ |
| $c_{\text{dlst}}=2$ | $14.1066_{(0.046)}$ | $9.2_{(2.93)}$ | $56.47_{(7.87)}$ |

Table A.1: Tuning the number of input-output pairs $M$ per learning loop and the exploration power factor $c_{\text{dlst}}$ in Equation ([31](https://arxiv.org/html/2606.17545#S4.E31)) for the M2 contract in Table [2](https://arxiv.org/html/2606.17545#S3.T2).

| LSMC $K$ | Stage 1 Price | Final Price | Loops $L$ | Stage 1 Time | Total Runtime |
|---|---|---|---|---|---|
| $10{,}000$ | $14.0757_{(0.023)}$ | $14.1574_{(0.011)}$ | $9.4_{(1.96)}$ | $15.60_{(0.05)}$ | $42.44_{(4.21)}$ |
| $20{,}000$ | $13.9697_{(0.152)}$ | $14.1714_{(0.015)}$ | $9.0_{(1.67)}$ | $29.50_{(0.67)}$ | $57.00_{(5.11)}$ |
| $40{,}000$ | $13.7377_{(0.467)}$ | $14.1768_{(0.010)}$ | $9.6_{(1.02)}$ | $60.85_{(1.33)}$ | $88.07_{(1.70)}$ |

Table A.2: Tuning the number of paths $K$ in Stage 1 for the M2 contract in Table [2](https://arxiv.org/html/2606.17545#S3.T2). Stage 1 and Stage 2 prices are computed using $1.6\times 10^{6}$ Monte Carlo paths at exercise frequency $\Delta t^{ex}=\frac{1}{192}$; the standard error of these estimates is approximately $0.013$.

| Contract | $\Delta t^{tr}$ | MC price, $\Delta t^{ex}=1/12$ | MC price, $\Delta t^{ex}=1/24$ | MC price, $\Delta t^{ex}=1/48$ | MC price, $\Delta t^{ex}=1/96$ | MC price, $\Delta t^{ex}=1/192$ | PDE Price |
|---|---|---|---|---|---|---|---|
| B2 | $1/12$ | $1.459$ | $1.467$ | $1.469$ | $1.468$ | $1.466$ | $1.455$ |
| | $1/24$ | $1.457$ | $1.469$ | $1.473$ | $1.473^{**}$ | $1.473^{**}$ | $1.465$ |
| | $1/48$ | $1.454$ | $1.468$ | $1.474$ | $1.475$ | $1.476$ | $1.471$ |
| | $1/96$ | $1.451$ | $1.467$ | $1.474$ | $1.476$ | $1.477$ | $1.473$ |
| | $1/192$ | $1.448$ | $1.465$ | $1.473$ | $1.476$ | $1.478$ | $1.475$ |
| M2.A | $1/12$ | $14.137$ | $14.172$ | $14.177^{*}$ | $14.167$ | $14.157$ | $14.148$ |
| | $1/24$ | $14.132$ | $14.181$ | $14.196$ | $14.198^{**}$ | $14.192$ | $14.190$ |
| | $1/48$ | $14.111$ | $14.174$ | $14.201$ | $14.209$ | $14.211^{**}$ | $14.212$ |
| | $1/96$ | $14.096$ | $14.168$ | $14.200$ | $14.210$ | $14.214^{*}$ | $14.222$ |
| | $1/192$ | $14.080$ | $14.157$ | $14.192$ | $14.209$ | $14.214$ | $14.228$ |
| M2.B | $1/12$ | $15.718$ | $15.738$ | $15.747$ | $15.741$ | $15.734$ | $15.742$ |
| | $1/24$ | $15.714$ | $15.745$ | $15.759$ | $15.759^{**}$ | $15.756^{*}$ | $15.774$ |
| | $1/48$ | $15.705$ | $15.744$ | $15.761$ | $15.769$ | $15.768^{**}$ | $15.790$ |
| | $1/96$ | $15.692$ | $15.738$ | $15.762$ | $15.773$ | $15.776^{**}$ | $15.798$ |
| | $1/192$ | $15.682$ | $15.733$ | $15.760$ | $15.771$ | $15.777$ | $15.802$ |

Table A.3: Expected prices for the B1, B2, and M2 options from Table [2](https://arxiv.org/html/2606.17545#S3.T2), estimated using $1.6\times 10^{6}$ Monte Carlo simulated paths and PDE-based stopping rules. Row-wise price differences between adjacent $\Delta t^{ex}$ values are evaluated for statistical significance. If the difference between a price and its right neighbor is not statistically significant, the right-side price is marked with $^{*}$ at the 0.05 level and $^{**}$ at the 0.01 level.
Standard deviations of the Monte Carlo estimated prices are approximately 0.0025 for B1, 0.0015 for B2, and 0.0115 for M2. For reference, the PDE-based prices for each $\Delta t^{tr}$ are also provided. The parameters for the CN and explicit finite-difference methods are: B1 ($\Delta t^{\text{PDE}}=1/192$, $\Delta x=0.02$), B2 ($\Delta t^{\text{PDE}}=10^{-5}$, $\Delta X^{i}=0.2$ for $i=1,2$), and M2 ($\Delta t^{\text{PDE}}=10^{-5}$, $\Delta X^{i}=0.8$ for $i=1,2$).

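
The caption of Table A.3 does not spell out which test is applied to adjacent prices. As a hedged illustration only, the sketch below uses a two-sided z-test under a normal approximation, treating the quoted standard deviations as standard errors of the price estimates. The function name `difference_z_test` and the correlation parameter `rho` are hypothetical: `rho=0` corresponds to independent simulations, whereas estimates computed on shared sample paths would have `rho>0` and hence a smaller standard error of the difference.

```python
import math

def difference_z_test(price_a, price_b, se_a, se_b, rho=0.0):
    """Two-sided z-test for the difference of two Monte Carlo price estimates.

    rho is the assumed correlation between the two estimators
    (rho=0 means the estimates come from independent simulations).
    """
    se_diff = math.sqrt(se_a**2 + se_b**2 - 2.0 * rho * se_a * se_b)
    if se_diff == 0.0:
        raise ValueError("standard error of the difference is zero")
    z = (price_b - price_a) / se_diff
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Two adjacent M2.A entries from the row with training step 1/24,
# using the approximate standard deviation 0.0115 quoted for M2.
z, p = difference_z_test(14.196, 14.198, 0.0115, 0.0115)
print(f"z = {z:.3f}, p = {p:.3f}")
```

Under the independence assumption this pair is far from significant at either level. Because the independence assumption is conservative when paths are shared, this sketch should not be expected to reproduce every marker in the table.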
## Appendix C Full Benchmark Descriptions

ADNN Parameter Settings

| Parameter | B1 | B2 | M2a/b | M3 | M5a/b |
|---|---|---|---|---|---|
| $K$ | $100{,}000$ | $100{,}000$ | $100{,}000$ | $200{,}000$ | $200{,}000$ |
| $\Delta t^{(N)}$ | 1/24 | 1/24 | 1/12 | 1/12 | 1/12 |
| Hidden nodes | $60$ | $60$ | $60$ | $90$ | $150$ |
| Epochs | $5$ | $5$ | $5$ | $10$ | $10$ |

Table A.4: Parameter configuration for the Stage 1 high-capacity ADNNs used to price the option contracts in Table [2](https://arxiv.org/html/2606.17545#S3.T2). Settings are grouped into LSMC parameters and ADNN architecture and training hyperparameters. All runs use ReLU activation and a batch size of 64.