Mahalanobis-Guided Latent OOD Detection for Hybrid ES-DRL Control in Time-Varying Systems
Summary
This paper presents a Mahalanobis-guided latent out-of-distribution detection method using a VAE to switch between a reinforcement learning controller and an extremum seeking controller in time-varying systems, validated in particle accelerator control.
# Mahalanobis-Guided Latent OOD Detection for Hybrid ES–DRL Control in Time-Varying Systems
Source: [https://arxiv.org/html/2606.11474](https://arxiv.org/html/2606.11474)
###### Abstract
In this paper, we study Mahalanobis\-guided latent out\-of\-distribution \(OOD\) detection for test\-time RL controller switching in nonlinear time\-varying systems\. RL controllers can quickly control high\-dimensional systems within the training distribution, but their performance can degrade when time\-varying dynamics produce unseen observations\. We consider a combined ES–DRL controller, where RL provides fast in\-distribution actions and bounded extremum seeking \(ES\) provides robust model\-independent control under OOD operation\. The key challenge is deciding when to switch\. We train a variational autoencoder \(VAE\) on in\-distribution beam\-profile observations and use Mahalanobis distance in the VAE latent space to detect OOD beam profiles at test time\. This OOD decision sets a binary switch that selects either the RL controller or the ES controller\. We evaluate the approach in safety\-critical particle accelerator control\. In this setting, spatial magnet motion creates OOD beam profiles that were not seen during RL training\. Visualization of the VAE latent space shows that the proposed method identifies this OOD scenario and provides an interpretable signal for switching between RL and ES in the combined controller\.
Machine Learning, ICML
## 1 Introduction
Reinforcement learning \(RL\) is a promising approach for high\-dimensional control because it can learn fast policies through interaction in simulation \(Sutton and Barto, [2018](https://arxiv.org/html/2606.11474#bib.bib31)\), and it has seen great success in real\-world robotic systems \(Gu et al\., [2017](https://arxiv.org/html/2606.11474#bib.bib1)\)\. However, learned RL policies are typically reliable only near the distribution on which they were trained \(Danesh and Fern, [2021](https://arxiv.org/html/2606.11474#bib.bib13)\)\. When the system evolves over time because of actuator drift, calibration error, changing operating conditions, or geometric perturbations, the deployed observations may become out\-of\-distribution \(OOD\), leading to degraded or unsafe actions\.
Many ongoing efforts aim to improve the safety and robustness of deep learning methods in general, and of RL methods in particular, for changing systems\. Action\-constrained RL \(ACRL\) is a generic framework for learning control policies with zero action\-constraint violation, which is required by various safety\-critical and resource\-constrained applications \(Hung et al\., [2025](https://arxiv.org/html/2606.11474#bib.bib5)\)\. Safe reinforcement learning targets real\-world problems in which unsafe states can be avoided by planning a short time into the future, provided a sufficiently accurate model is available \(Thomas et al\., [2021](https://arxiv.org/html/2606.11474#bib.bib3)\)\. The difficulty of designing deep RL algorithms for novel problems is being studied with new automated RL frameworks \(Parker\-Holder et al\., [2022](https://arxiv.org/html/2606.11474#bib.bib6)\)\. For the difficulties LLMs face in adapting to new data distributions, retrieval\-augmented LMs are being studied \(Asai et al\., [2024](https://arxiv.org/html/2606.11474#bib.bib4)\)\.
Nonlinear time\-varying systems make this problem particularly important because deployment conditions may differ from those seen during training\. A hybrid controller is useful in this setting because it combines the complementary strengths of learning\-based and model\-free control\. The RL policy can produce fast, coordinated actions for high\-dimensional systems when the observation remains close to the training distribution\. In contrast, a robust model\-free controller such as extremum seeking \(ES\) can continue to adapt when the dynamics drift, but may converge more slowly in large action spaces\. The hybrid architecture therefore uses RL for rapid in\-distribution control and ES for robustness when the system becomes uncertain or distributionally unfamiliar\.
Particle accelerator control is a representative example of this challenge\. Particle accelerators support a wide range of scientific applications, including neutron production, materials characterization, nuclear physics, medical isotope production, and high\-energy physics experiments\. These systems are safety\-critical: poor control actions can increase beam loss, drive the beam toward aperture limits, or degrade experimental operation\. As a result, learned controllers must be deployed carefully and should not be trusted blindly when beam dynamics move outside the conditions seen during training\. In our setting, the RL controller acts on quadrupole magnet strengths, while the supervisor monitors beam\-profile observations\. Time\-varying magnet motion and geometry changes can push the beam dynamics outside the RL training distribution\.
The central challenge is selecting the switch between RL and ES at test time\. We investigate Mahalanobis\-guided latent OOD detection for RL controller switching\. A probabilistic latent model is trained on in\-distribution beam\-profile observations and used to embed each test\-time observation into a low\-dimensional latent space\. The Mahalanobis distance between the current latent representation and the in\-distribution latent model is then used to select the binary switching coefficient $\beta_t$\. Small distances select the RL controller, while large distances indicate OOD behavior and trigger a switch to ES\. In this way, the supervisor in the combined ES–DRL controller is formulated as a latent OOD detection problem\.
## 2 Related Work
#### Hybrid RL and fallback control\.
Hybrid control architectures combine the fast decision\-making capability of learned RL policies with the robustness of classical, adaptive, or model\-free controllers\. Recent work combines DRL with bounded extremum seeking to improve robustness in nonlinear time\-varying systems, including accelerator tuning, where RL provides fast nominal control while the fallback controller maintains performance under drift \(Saxena et al\., [2025](https://arxiv.org/html/2606.11474#bib.bib7)\)\. A related hybrid controller has also been studied for robotic manipulation under distribution shift, including time\-varying goals and spatially varying friction \(Saxena et al\., [2026](https://arxiv.org/html/2606.11474#bib.bib8)\)\. Other hybrid RL\-control approaches combine learned policies with model predictive control or adaptive control, for example through actor\-critic MPC \(Romero et al\., [2025](https://arxiv.org/html/2606.11474#bib.bib19)\) or MRAC\-RL \(Guha and Annaswamy, [2021](https://arxiv.org/html/2606.11474#bib.bib20)\)\. These methods motivate hybrid control as a practical architecture for deploying RL in changing environments\. In contrast, our focus is not on designing a new fallback controller, but on learning the switching supervisor that decides when the RL policy should be trusted\.
#### OOD and anomaly detection in reinforcement learning\.
Detecting OOD observations is important for reliable RL deployment because unfamiliar states can cause learned policies to select low\-performance or unsafe actions\. Early work formulated OOD classification in deep RL using uncertainty estimates and policy entropy \(Sedlmeier et al\., [2020a](https://arxiv.org/html/2606.11474#bib.bib17), [b](https://arxiv.org/html/2606.11474#bib.bib16)\)\. Other work studies OOD dynamics detection, where the objective is to identify changes in the environment dynamics relative to the training distribution \(Danesh and Fern, [2021](https://arxiv.org/html/2606.11474#bib.bib13)\)\. Probabilistic dynamics models and bootstrapped ensembles have also been used to detect OOD situations for RL agents \(Haider et al\., [2023](https://arxiv.org/html/2606.11474#bib.bib14)\)\. More recently, OOD detection in RL has been revisited with benchmarks that include temporally correlated anomalies and time\-series\-based detection methods \(Nasvytis et al\., [2024](https://arxiv.org/html/2606.11474#bib.bib15)\)\. These works focus on detecting anomalous states or dynamics\. Our work uses OOD detection as an actionable control signal: the latent distance determines the binary switching coefficient $\beta_t$ in a hybrid controller\.

Figure 1: VAE\-guided ES/RL switching setup\.

Figure 2: Effect of spatial magnet movement on beam profiles\. The moving magnet creates non\-smooth envelope and slope behavior with large excursions\. The $z$ coordinate is in meters; other values are normalized\.
#### Mahalanobis and latent\-space anomaly detection\.
Mahalanobis distance is widely used to measure whether a feature vector lies far from a reference distribution\. In supervised deep learning, class\-conditional Gaussian feature models have been used to detect OOD and adversarial samples with a Mahalanobis confidence score \(Lee et al\., [2018](https://arxiv.org/html/2606.11474#bib.bib10)\)\. In reinforcement learning, MDX extends Mahalanobis\-distance detection to RL by estimating class\-conditional feature distributions from policy\-network representations and detecting random, adversarial, and OOD state outliers \(Zhang et al\., [2024](https://arxiv.org/html/2606.11474#bib.bib9)\)\. Our approach builds on this distance\-based perspective but differs in both representation and use\. We compute Mahalanobis distance in a learned latent representation of physical system observations, rather than in action\-class features of a policy network, and use the resulting distance to select between the RL and ES controllers\.
## 3 Problem Formulation
### 3\.1 Particle Accelerator Tuning Problem
Particle accelerator tuning is challenging because the beam dynamics are nonlinear, strongly coupled, and sensitive to both magnet settings and incoming beam conditions\. A change in one quadrupole magnet can affect the downstream beam envelope, and poor settings can produce large envelope excursions, non\-smooth beam evolution, and beam loss\. The controller must therefore tune many coupled actuators while maintaining a compact, smooth, and well\-aligned beam\.
We evaluate the proposed supervisor in a particle accelerator tuning problem based on the Kapchinskij–Vladimirskij \(KV\) envelope model \(Kapchinskij and Vladimirskij, [1959](https://arxiv.org/html/2606.11474#bib.bib28)\)\. The system represents a low\-energy beam transport section of a linear accelerator\. The beam state is described by the horizontal and vertical envelope radii $X(z,t)$ and $Y(z,t)$ and their slopes $X'(z,t)$ and $Y'(z,t)$ along the beamline coordinate $z$\. The controller acts on quadrupole magnet strengths\.

At each control step, the observation is the sampled beam\-profile vector

$$o_t = \left[X(z,t),\; Y(z,t),\; X'(z,t),\; Y'(z,t)\right],$$

and the control input is the vector of setpoints for the 22 quadrupole magnet strengths,

$$Q_t = [Q_1(t), \ldots, Q_{22}(t)]^{\top} \in \mathbb{R}^{22}.$$

The RL controller is trained using Deep Deterministic Policy Gradient \(DDPG\) \(Lillicrap et al\., [2020](https://arxiv.org/html/2606.11474#bib.bib29)\) in simulation over a finite set of beamline configurations and operating regimes\. During training, beam initial conditions are randomized across episodes so that the policy observes different initial beam profiles\. The actor is trained to maximize a reward that encourages compact beam envelopes, smooth beam evolution, and terminal alignment\. During test time, the trained actor is frozen to prevent policy drift\. Additional DDPG training details and the reward definition are provided in Appendix [A](https://arxiv.org/html/2606.11474#A1)\.
### 3\.2 OOD\-Aware Test\-Time RL–ES Switching
We combine the trained RL actor with bounded extremum seeking \(ES\) because the two controllers have complementary strengths\. The RL actor provides fast, coordinated magnet commands when the observed beam profile is close to the distribution encountered during DDPG training\. ES provides robustness under unknown and time\-varying dynamics because it optimizes a measured scalar objective without requiring an analytic model of the accelerator, while keeping parameter updates bounded \(Scheinker et al\., [2013](https://arxiv.org/html/2606.11474#bib.bib2); Scheinker and Scheinker, [2016](https://arxiv.org/html/2606.11474#bib.bib26)\)\. It has also been demonstrated for particle accelerator beam\-loss minimization \(Scheinker et al\., [2021](https://arxiv.org/html/2606.11474#bib.bib27)\)\. However, ES is a local feedback\-based optimizer: in high\-dimensional tuning problems, such as the 22\-dimensional quadrupole setting considered here, convergence can be slow and the search may settle in suboptimal local regions\. Thus, using ES alone can be unnecessarily slow, while using RL alone can be unreliable under OOD conditions\.
We use RL in two ways\. First, when the beam profile is detected as in\-distribution, the RL actor directly supplies the control action\. Second, the ES controller is warm\-started from the RL\-recommended magnet setting, reducing the transient associated with starting local search from an arbitrary magnet setting\. ES uses the same beam\-quality objective as the RL reward and produces an action $u_{\mathrm{ES},t}$ for the quadrupole magnet strengths\.
The hybrid controller selects between the RL and ES actions through a binary switch,

$$u_t = \beta_t\, u_{\mathrm{RL},t} + (1-\beta_t)\, u_{\mathrm{ES},t}, \qquad \beta_t \in \{0,1\}. \tag{1}$$

When $\beta_t = 1$, the RL action is applied\. When $\beta_t = 0$, the ES action is applied\. Thus, $\beta_t$ is not a continuous authority weight in this work; it is a test\-time selector between the two controllers\.
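
As a minimal sketch of Eq. (1), the snippet below shows that a binary switch reduces the weighted sum to plain selection of one controller's action. The function name `hybrid_action` and the placeholder command values are illustrative assumptions, not taken from the paper's implementation.

```python
import numpy as np


def hybrid_action(beta_t: int, u_rl: np.ndarray, u_es: np.ndarray) -> np.ndarray:
    """Eq. (1): u_t = beta_t * u_RL + (1 - beta_t) * u_ES with beta_t in {0, 1}."""
    if beta_t not in (0, 1):
        raise ValueError("beta_t must be 0 (ES) or 1 (RL)")
    return beta_t * u_rl + (1 - beta_t) * u_es


# Placeholder 22-dimensional quadrupole commands (illustrative values only).
u_rl = np.full(22, 0.5)
u_es = np.full(22, -0.25)

u_in_dist = hybrid_action(1, u_rl, u_es)  # RL action is applied
u_ood = hybrid_action(0, u_rl, u_es)      # ES action is applied
```

Rejecting non-binary values makes explicit that the switch is a selector here and not a blending weight.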
We select $\beta_t$ using latent OOD evidence\. As shown in Fig\. [1](https://arxiv.org/html/2606.11474#S2.F1), during test\-time deployment the supervisor observes the current beam profile, embeds it into the learned latent space, computes its Mahalanobis distance from the training\-distribution latent model, and uses this score to select either the RL controller or the ES controller\.
## 4 Mahalanobis\-Guided Latent Supervisor
Figure 3: The top row shows the locations of 8192 embedded test beam envelopes in the 3D latent space of the trained VAE together with the embeddings of 128 beam envelopes from a time\-varying lattice in which one of the magnets moves 1 meter\. The points are colored by reconstruction error, by how far the magnet has moved, and by the Mahalanobis distance of each point from the latent distribution learned by the VAE, which is modeled as $\mathcal{N}(\mathbf{0}, I_{3\times 3})$\. The next three rows show orthogonal 2D projections of the 3D view\.

### 4\.1 Learning a Latent Model of Beam Profiles
The VAE\-based switching supervisor is trained to learn a compact representation of in\-distribution beam\-profile observations\. To construct the VAE dataset, we start from an expert\-tuned quadrupole setting that produces a stable beam profile\. We then generate beam\-profile data by applying random perturbations around this reference quadrupole setting and solving the KV envelope model\. This produces a dataset of 700,000 beam profiles, which was split into 680,000 training, 10,000 validation, and 10,000 test profiles\. VAE architecture, training details, and latent\-dimension comparison are provided in Appendix [B](https://arxiv.org/html/2606.11474#A2)\.
The OOD detector takes the beam\-profile observation as input\. Since the beam profile is sampled along the longitudinal coordinate $z$, the input has four channels corresponding to $X(z,t)$, $Y(z,t)$, $X'(z,t)$, and $Y'(z,t)$\. Each channel is sampled at $N_z = 4000$ longitudinal locations, so each input beam profile is represented as

$$x_t \in \mathbb{R}^{4 \times 4000}.$$

We train a variational autoencoder \(VAE\) \(Kingma and Welling, [2013](https://arxiv.org/html/2606.11474#bib.bib30)\) to embed these high\-dimensional beam profiles into a low\-dimensional latent space\. The encoder is implemented using one\-dimensional convolutional layers along the beamline coordinate, reducing the spatial resolution as

$$4000 \rightarrow 2000 \rightarrow 1000 \rightarrow 500 \rightarrow 250 \rightarrow 125 \rightarrow \mathrm{Dense} \rightarrow L_{\mathrm{dim}}.$$
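
The halving sequence above is consistent with five stride\-2 convolutions. The sketch below checks this with the standard 1D convolution output\-length formula; the kernel size 3, stride 2, and padding 1 are assumptions chosen only to reproduce the stated lengths, since the actual hyperparameters are given in Appendix B and not in this section.

```python
def conv1d_out_len(length: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Output length of a 1D convolution: floor((L + 2p - k) / s) + 1."""
    return (length + 2 * padding - kernel) // stride + 1


# Assumed hyperparameters (kernel=3, stride=2, padding=1); not confirmed by the source.
lengths = [4000]
for _ in range(5):
    lengths.append(conv1d_out_len(lengths[-1]))

print(lengths)  # [4000, 2000, 1000, 500, 250, 125]
```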
The dense layers output the latent mean and variance,

$$\mu_\phi(x_t),\; \sigma_\phi^2(x_t) \in \mathbb{R}^{L_{\mathrm{dim}}}.$$

As shown in Figure [3](https://arxiv.org/html/2606.11474#S4.F3), we use $L_{\mathrm{dim}} = 3$\.
The encoder maps each beam profile to a Gaussian latent posterior,

$$q_\phi(z_t | x_t) = \mathcal{N}\left(\mu_\phi(x_t),\; \operatorname{diag}(\sigma_\phi^2(x_t))\right),$$

and the decoder reconstructs the beam profile from the latent variable\. The VAE is trained by minimizing

$$\mathcal{L}_{\mathrm{VAE}} = \mathbb{E}_{q_\phi(z|x)}\left[-\log p_\psi(x|z)\right] + \lambda_{\mathrm{KL}}\, D_{\mathrm{KL}}\left(q_\phi(z|x) \,\|\, p(z)\right), \tag{2}$$

where $p(z) = \mathcal{N}(0, I)$\.
### 4\.2 Latent Mahalanobis OOD Score
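
To make the two terms of Eq. (2) concrete, here is a minimal NumPy sketch for a single sample. It assumes a unit\-variance Gaussian decoder, so the negative log\-likelihood reduces to half the squared error up to an additive constant, and it parametrizes the encoder variance as `log_var`; both choices, the function names, and the default `lambda_kl=1.0` are assumptions, not details from the paper. The KL term uses the standard closed form for a diagonal Gaussian against $\mathcal{N}(0, I)$.

```python
import numpy as np


def kl_to_standard_normal(mu: np.ndarray, log_var: np.ndarray) -> float:
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return float(0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var))


def vae_loss(x, x_hat, mu, log_var, lambda_kl: float = 1.0) -> float:
    """Single-sample estimate of Eq. (2) under an assumed unit-variance Gaussian decoder."""
    recon = 0.5 * float(np.sum((x - x_hat) ** 2))  # -log p(x|z) up to a constant
    return recon + lambda_kl * kl_to_standard_normal(mu, log_var)


# Illustrative inputs with the beam-profile shape (4 channels, 4000 samples).
x = np.ones((4, 4000))
x_hat = np.zeros((4, 4000))
mu = np.array([1.0, 0.0, 0.0])
log_var = np.zeros(3)

loss = vae_loss(x, x_hat, mu, log_var, lambda_kl=0.5)
```

When the posterior equals the prior (`mu = 0`, `log_var = 0`), the KL term vanishes, which is why the prior pulls in\-distribution embeddings toward a standard Gaussian cloud.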
After training the VAE, a beam profile $x_t$ is embedded using the encoder mean,

$$z_t = \mu_\phi(x_t).$$

The VAE prior regularizes the latent space toward a standard Gaussian distribution\. In the reported experiments, we use a three\-dimensional latent space and model the reference latent distribution as

$$\mathcal{Z}_{\mathrm{ID}} = \mathcal{N}(0, I_3).$$

More generally, the reference mean and covariance can be estimated from training\-distribution embeddings,

$$\mathcal{Z}_{\mathrm{ID}} = \mathcal{N}\left(\bar{z}_{\mathrm{ID}}, \Sigma_{\mathrm{ID}}\right).$$

For the standard Gaussian reference used in Fig\. [3](https://arxiv.org/html/2606.11474#S4.F3), this corresponds to $\bar{z}_{\mathrm{ID}} = 0$ and $\Sigma_{\mathrm{ID}} = I_3$\.

We compute the squared Mahalanobis distance

$$d_M^2(x_t) = (z_t - \bar{z}_{\mathrm{ID}})^{\top} \left(\Sigma_{\mathrm{ID}} + \epsilon I\right)^{-1} (z_t - \bar{z}_{\mathrm{ID}}), \tag{3}$$

where $\epsilon I$ is a small regularization term for numerical stability\. This score is small when the current beam profile lies near the training\-distribution latent cloud and large when the beam profile is distributionally unfamiliar\.
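
The score in Eq. (3) can be sketched in a few lines of NumPy. The helper names `fit_reference` and `mahalanobis_sq`, the default `eps`, and the synthetic embeddings are illustrative assumptions; in practice the embeddings would come from the trained encoder mean.

```python
import numpy as np


def fit_reference(z_id: np.ndarray):
    """Estimate the reference mean and covariance from in-distribution embeddings (N, L_dim)."""
    z_bar = z_id.mean(axis=0)
    sigma = np.cov(z_id, rowvar=False)
    return z_bar, sigma


def mahalanobis_sq(z: np.ndarray, z_bar: np.ndarray, sigma: np.ndarray, eps: float = 1e-6) -> float:
    """Squared Mahalanobis distance of Eq. (3) with eps * I regularization."""
    diff = z - z_bar
    reg = sigma + eps * np.eye(sigma.shape[0])
    # Solve the linear system instead of forming the explicit inverse.
    return float(diff @ np.linalg.solve(reg, diff))


# Standard Gaussian reference: the score reduces to the squared Euclidean norm.
d2 = mahalanobis_sq(np.array([1.0, 2.0, 2.0]), np.zeros(3), np.eye(3), eps=0.0)  # 1 + 4 + 4 = 9

# General case: reference statistics estimated from synthetic stand-in embeddings.
rng = np.random.default_rng(0)
z_bar, sigma = fit_reference(rng.standard_normal((5000, 3)))
```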
The Mahalanobis score is used as the OOD evidence for test\-time switching\. As shown in Fig\. [3](https://arxiv.org/html/2606.11474#S4.F3), time\-varying magnet drift produces increasing Mahalanobis distance in latent space\.
The reconstruction error used to color the “recon error” plots in Fig\. [3](https://arxiv.org/html/2606.11474#S4.F3) is a simple sum of absolute values over channels and accelerator locations, calculated as:

$$E = \sum_{c=1}^{4} \sum_{z=1}^{4000} \left|\hat{o}(z,c) - o(z,c)\right|. \tag{4}$$

### 4\.3 Binary Test\-Time Switch
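
Eq. (4) is an unnormalized L1 error over the full profile. A minimal sketch, assuming profiles are stored as arrays of shape `(4, 4000)` (the function name and the test values are illustrative):

```python
import numpy as np


def reconstruction_error(o_hat: np.ndarray, o: np.ndarray) -> float:
    """Eq. (4): sum of absolute differences over 4 channels and 4000 locations."""
    return float(np.sum(np.abs(o_hat - o)))


# Illustrative check: a constant offset of 0.5 on all 4 * 4000 entries.
o = np.zeros((4, 4000))
o_hat = np.full((4, 4000), 0.5)
err = reconstruction_error(o_hat, o)  # 0.5 * 16000 = 8000.0
```

Because the sum is not divided by the number of entries, its scale depends on the profile length and the channel normalization.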
The Mahalanobis score is used to set the binary switch $\beta_t$\. We use

$$\beta_t = \begin{cases} 1, & d_M^2(x_t) \leq \tau, \\ 0, & d_M^2(x_t) > \tau, \end{cases} \tag{5}$$

where $\tau$ is the OOD threshold chosen from the training\-distribution latent distances\.

Thus, $\beta_t = 1$ selects the RL controller when the beam profile is close to the training\-distribution latent model, while $\beta_t = 0$ selects ES when the beam profile is detected as OOD\.

The applied control is given by Eq\. \([1](https://arxiv.org/html/2606.11474#S3.E1)\)\. The VAE acts as a test\-time supervisor that determines whether the current beam profile is reliable for the trained RL policy or whether control should switch to ES\.
## 5 Results
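
The text does not state how $\tau$ is derived from the training\-distribution distances. One common choice, shown here purely as an assumption, is an empirical quantile of the in\-distribution scores; the 0.99 level, the function names, and the synthetic standard\-normal embeddings are all hypothetical.

```python
import numpy as np


def calibrate_threshold(d2_id: np.ndarray, quantile: float = 0.99) -> float:
    """Hypothetical calibration: tau as an empirical quantile of in-distribution scores."""
    return float(np.quantile(d2_id, quantile))


def select_beta(d2: float, tau: float) -> int:
    """Eq. (5): 1 selects RL (in-distribution), 0 selects ES (OOD)."""
    return 1 if d2 <= tau else 0


# Synthetic stand-in for training-distribution embeddings under the N(0, I_3) reference.
rng = np.random.default_rng(0)
z_id = rng.standard_normal((10_000, 3))
d2_id = np.sum(z_id**2, axis=1)  # squared Mahalanobis distance for identity covariance
tau = calibrate_threshold(d2_id)

beta_near = select_beta(float(np.sum(np.array([0.5, -0.5, 0.0]) ** 2)), tau)  # close to the cloud
beta_far = select_beta(float(np.sum(np.array([4.0, 4.0, 4.0]) ** 2)), tau)    # far from the cloud
```

If the embeddings were exactly $\mathcal{N}(0, I_3)$, the squared distance would follow a chi\-square distribution with 3 degrees of freedom, whose 0.99 quantile is about 11.34; a learned latent space only approximates this, which is one reason to calibrate empirically.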
#### Physical effect of magnet drift\.
During training, the quadrupole magnet locations are fixed\. At test time, one quadrupole magnet is moved spatially along the beamline, creating a time\-varying lattice that changes the beam dynamics\. This perturbation is therefore an OOD condition for the trained RL controller: although the RL actor has learned to tune magnet strengths for the training lattice, it has not seen beam profiles generated by a moving magnet\.
Figure [2](https://arxiv.org/html/2606.11474#S2.F2) illustrates the physical effect of this perturbation\. As the magnet is moved, the beam profiles transition from smooth, compact envelopes to non\-smooth profiles with large excursions\. In particular, the horizontal and vertical envelopes and their slopes develop sharp changes near and downstream of the moving magnet\. These changes indicate that the beam is no longer following the smooth behavior encouraged by the RL reward\. When the envelope grows toward or beyond the allowed aperture, the beam effectively hits the beam\-pipe wall, which corresponds to high beam loss and low reward\. In this regime, the RL action should not be trusted because the observed beam profile is outside the distribution encountered during training\.
#### Latent OOD evidence\.
To quantify the distribution shift, we pass beam\-profile observations through the trained VAE and analyze their embeddings in the three\-dimensional latent space\. Figure [3](https://arxiv.org/html/2606.11474#S4.F3) shows 8192 embedded test beam envelopes from the training\-distribution lattice together with 128 beam envelopes generated during the time\-varying magnet\-drift trajectory\.
The training\-distribution beam profiles form a compact latent cloud\. In contrast, the moving\-magnet trajectory appears as a structured path that leaves this cloud as the magnet drift increases\. This is important because the VAE is not using the magnet position directly as an input; it only observes the resulting beam profile\. Therefore, the separation in latent space indicates that the beam\-profile measurements themselves contain enough information to reveal the OOD condition\.
Figure [3](https://arxiv.org/html/2606.11474#S4.F3) also shows three complementary OOD indicators\. First, the reconstruction error increases along the drift trajectory, indicating that the VAE has more difficulty reconstructing beam profiles produced by the moving lattice\. Second, the magnet\-drift coloring confirms that the trajectory moves progressively as the magnet is displaced from its training\-time location\. Third, the Mahalanobis distance increases as the trajectory moves away from the training\-distribution latent cloud\. In the plotted color scale, the central training\-distribution region is mostly associated with lower Mahalanobis values, roughly in the range 2–4, while the high\-drift points move toward larger values, approximately 5–6\.
These trends connect the physical and latent views of OOD behavior\. In Figure [2](https://arxiv.org/html/2606.11474#S2.F2), increasing magnet movement produces non\-smooth beam envelopes and potential beam loss\. In Figure [3](https://arxiv.org/html/2606.11474#S4.F3), the same perturbation produces increasing reconstruction error and increasing Mahalanobis distance\. Thus, the Mahalanobis distance in the VAE latent space provides a compact test\-time signal that reflects physically meaningful degradation in the beam profile\.
#### Implication for switching\.
The Mahalanobis score is used to choose the binary switch $\beta_t$\. For the three\-dimensional latent model in Figure [3](https://arxiv.org/html/2606.11474#S4.F3), the plotted values show separation between the dense training\-distribution latent cloud and the high\-drift portion of the moving\-magnet trajectory\. This suggests that an OOD threshold $\tau$ can be selected to distinguish beam profiles that remain close to the training\-distribution behavior from those that move into the higher\-distance OOD region\.
With this thresholding rule, beam profiles close to the training\-distribution latent cloud select the RL controller, while profiles with larger Mahalanobis distance select ES\. The threshold $\tau$ should be interpreted as a calibration parameter rather than a universal constant\. A lower threshold would switch to ES earlier, while a higher threshold would keep RL active longer but may delay intervention\.
Overall, Figures [2](https://arxiv.org/html/2606.11474#S2.F2) and [3](https://arxiv.org/html/2606.11474#S4.F3) show that spatial magnet drift creates both a physical signature of OOD behavior and a corresponding latent\-space signature\. The physical signature is the transition to non\-smooth beam envelopes and increased beam\-loss risk\. The latent signature is the increase in reconstruction error and Mahalanobis distance\. This supports the use of latent Mahalanobis distance as the test\-time switching signal between RL and ES\.
## 6 Conclusion
We presented a Mahalanobis\-guided latent OOD supervisor for test\-time switching in a combined ES–DRL controller for nonlinear time\-varying systems\. A VAE is trained to learn a compact latent representation of in\-distribution beam\-profile observations, and the Mahalanobis distance in this latent space is used to decide whether the trained RL policy should be trusted or whether control should switch to ES\. This allows the supervisor to use the full beam\-profile structure rather than relying only on hand\-designed physical thresholds\. In the accelerator tuning problem, spatial magnet motion creates beam profiles that were not seen during RL training\. These profiles become non\-smooth, indicate increased beam\-loss risk, and appear in the VAE latent space as a structured trajectory with increasing reconstruction error and Mahalanobis distance\. These results show that the proposed supervisor can identify OOD beam\-profile behavior and provide an interpretable switching signal between RL and ES\. This work is a step toward safer deployment of RL controllers in time\-varying physical systems\. Future work will evaluate the complete closed\-loop controller under broader accelerator perturbations and extend the framework so that detected OOD conditions can also trigger targeted RL fine\-tuning, allowing the policy to adapt to newly observed operating regimes\.
## References
- A\. Asai, Z\. Zhong, D\. Chen, P\. W\. Koh, L\. Zettlemoyer, H\. Hajishirzi, and W\. Yih \(2024\) Reliable, adaptable, and attributable language models with retrieval\. arXiv preprint arXiv:2403\.03187\.
- M\. H\. Danesh and A\. Fern \(2021\) Out\-of\-distribution dynamics detection: rl\-relevant benchmarks and results\. arXiv preprint arXiv:2107\.04982\.
- S\. Gu, E\. Holly, T\. Lillicrap, and S\. Levine \(2017\) Deep reinforcement learning for robotic manipulation with asynchronous off\-policy updates\. In 2017 IEEE international conference on robotics and automation \(ICRA\), pp\. 3389–3396\.
- A\. Guha and A\. M\. Annaswamy \(2021\) Online policies for real\-time control using mrac\-rl\. In 2021 60th IEEE Conference on Decision and Control, pp\. 1808–1813\.
- T\. Haider, K\. Roscher, F\. Schmoeller da Roza, and S\. Günnemann \(2023\) Out\-of\-distribution detection for reinforcement learning agents with probabilistic dynamics models\. In Proceedings of the 22nd International Conference on Autonomous Agents and Multiagent Systems, pp\. 851–859\.
- W\. Hung, S\. Sun, and P\. Hsieh \(2025\) Efficient action\-constrained reinforcement learning via acceptance\-rejection method and augmented mdps\. arXiv preprint arXiv:2503\.12932\.
- I\. Kapchinskij and V\. Vladimirskij \(1959\) Limitations of proton beam current in a strong focusing linear accelerator associated with the beam space charge\. In Proceedings of the International Conference on High Energy Accelerators and Instrumentation, Vol\. 1957, pp\. 274–288\.
- D\. P\. Kingma and M\. Welling \(2013\) Auto\-encoding variational bayes\. arXiv preprint arXiv:1312\.6114\.
- K\. Lee, K\. Lee, H\. Lee, and J\. Shin \(2018\) A simple unified framework for detecting out\-of\-distribution samples and adversarial attacks\. In Advances in Neural Information Processing Systems\.
- T\. P\. Lillicrap, J\. J\. Hunt, A\. Pritzel, N\. M\. O\. Heess, T\. Erez, Y\. Tassa, D\. Silver, and D\. P\. Wierstra \(2020\) Continuous control with deep reinforcement learning\. Google Patents\. Note: US Patent 10,776,692\.
- L\. Nasvytis, K\. Sandbrink, J\. Foerster, T\. Franzmeyer, and C\. S\. de Witt \(2024\) Rethinking out\-of\-distribution detection for reinforcement learning: advancing methods for evaluation and detection\. arXiv preprint arXiv:2404\.07099\.
- J\. Parker\-Holder, R\. Rajan, X\. Song, A\. Biedenkapp, Y\. Miao, T\. Eimer, B\. Zhang, V\. Nguyen, R\. Calandra, A\. Faust, et al\. \(2022\) Automated reinforcement learning \(autorl\): a survey and open problems\. Journal of Artificial Intelligence Research 74, pp\. 517–568\.
- A\. Romero, E\. Aljalbout, Y\. Song, and D\. Scaramuzza \(2025\) Actor–critic model predictive control: differentiable optimization meets reinforcement learning for agile flight\. IEEE Transactions on Robotics 42, pp\. 673–692\.
- S\. Saxena, R\. Fierro, and A\. Scheinker \(2026\) Deep reinforcement learning for robotic manipulation under distribution shift with bounded extremum seeking\. arXiv preprint arXiv:2604\.01142\.
- S\. Saxena, A\. Williams, R\. Fierro, and A\. Scheinker \(2025\) Improved robustness of deep reinforcement learning for control of time\-varying systems by bounded extremum seeking\. arXiv preprint arXiv:2510\.02490\.
- A\. Scheinker, E\. Huang, and C\. Taylor \(2021\) Extremum seeking\-based control system for particle accelerator beam loss minimization\. IEEE Transactions on Control Systems Technology 30\(5\), pp\. 2261–2268\.
- A\. Scheinker et al\. \(2013\) Model independent beam tuning\. In Proceedings of the 2013 International Particle Accelerator Conference, Shanghai, China, pp\. 12–17\.
- A\. Scheinker and D\. Scheinker \(2016\) Bounded extremum seeking with discontinuous dithers\. Automatica 69, pp\. 250–257\.
- A\. Sedlmeier, T\. Gabor, T\. Phan, L\. Belzner, and C\. Linnhoff\-Popien \(2020a\) Uncertainty\-based out\-of\-distribution classification in deep reinforcement learning\. In Proceedings of the 12th International Conference on Agents and Artificial Intelligence, pp\. 522–529\.
- A\. Sedlmeier, R\. Müller, S\. Illium, and C\. Linnhoff\-Popien \(2020b\) Policy entropy for out\-of\-distribution classification\. In Artificial Neural Networks and Machine Learning – ICANN 2020, pp\. 420–431\.
- R\. S\. Sutton and A\. G\. Barto \(2018\) Reinforcement learning: an introduction\. 2 edition, MIT Press\.
- G\. Thomas, Y\. Luo, and T\. Ma \(2021\) Safe reinforcement learning by imagining the near future\. Advances in Neural Information Processing Systems 34, pp\. 13859–13869\.
- H\. Zhang, K\. Sun, B\. Xu, L\. Kong, and M\. Müller \(2024\) A distance\-based anomaly detection framework for deep reinforcement learning\. Transactions on Machine Learning Research\.
## Appendix A DDPG training details
Table 1: DDPG training settings for the accelerator controller.

#### Reward function.
The DDPG actor is trained with a beam-quality reward that encourages the beam to remain compact along the transport line, vary smoothly, and reach a well-aligned terminal profile. Let $\langle a\rangle_{+}=\max(0,a)$ denote the hinge function. For beam envelopes sampled at $N_{z}=4000$ longitudinal grid points, define the path-averaged beam sizes

$$\bar{X}(t)=\frac{1}{N_{z}}\sum_{k=1}^{N_{z}}X(z_{k},t),\qquad\bar{Y}(t)=\frac{1}{N_{z}}\sum_{k=1}^{N_{z}}Y(z_{k},t).$$

We set the envelope band and terminal target as

$$r_{\mathrm{band}}=\frac{1}{2}r_{\max},\qquad r_{\mathrm{tt}}^{2}=\frac{1}{2}r_{\max}^{2},$$

where $r_{\max}$ is the operational beam-pipe radius.

The total penalty is

$$P_{t}=P_{\mathrm{env}}+P_{\mathrm{smooth}}+P_{\mathrm{term}},$$

with

$$P_{\mathrm{env}}=w_{e}\left(\langle\bar{X}(t)-r_{\mathrm{band}}\rangle_{+}+\langle\bar{Y}(t)-r_{\mathrm{band}}\rangle_{+}\right),$$

$$P_{\mathrm{smooth}}=w_{s}\left(\overline{X^{\prime 2}}(t)+\overline{Y^{\prime 2}}(t)\right),$$

where $\overline{X^{\prime 2}}$ and $\overline{Y^{\prime 2}}$ are path averages of the squared envelope slopes. The terminal penalty is

$$P_{\mathrm{term}}=w_{r}\left|X(z_{\max},t)-Y(z_{\max},t)\right|+w_{w}\left(\left|X^{\prime}(z_{\max},t)\right|+\left|Y^{\prime}(z_{\max},t)\right|\right)+w_{t}\left|X^{2}(z_{\max},t)+Y^{2}(z_{\max},t)-r_{\mathrm{tt}}^{2}\right|.$$

The instantaneous reward is the bounded inverse

$$R_{t}=\frac{1}{1+P_{t}}\in(0,1].$$

Thus, the reward increases when the beam envelopes remain small, the slopes remain smooth, and the terminal beam profile is circular, flat, and close to the desired radius.
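The reward above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `beam_reward`, the array layout (envelopes and slopes sampled on the $z$ grid, with the last entry at $z_{\max}$), and the unit default weights are all hypothetical, since the weight values are not given in the text.

```python
import numpy as np

def hinge(a):
    """Hinge function <a>_+ = max(0, a)."""
    return np.maximum(0.0, a)

def beam_reward(X, Xp, Y, Yp, r_max,
                w_e=1.0, w_s=1.0, w_r=1.0, w_w=1.0, w_t=1.0):
    """Bounded reward R_t = 1 / (1 + P_t) for one time step.

    X, Y   : beam envelopes on the longitudinal grid (last entry is z_max)
    Xp, Yp : envelope slopes on the same grid
    r_max  : operational beam-pipe radius
    The weights are placeholders (assumed); the paper's values are not shown here.
    """
    r_band = 0.5 * r_max
    r_tt_sq = 0.5 * r_max ** 2

    # Envelope penalty: path-averaged sizes above the allowed band.
    p_env = w_e * (hinge(X.mean() - r_band) + hinge(Y.mean() - r_band))
    # Smoothness penalty: path averages of the squared slopes.
    p_smooth = w_s * (np.mean(Xp ** 2) + np.mean(Yp ** 2))
    # Terminal penalty: roundness, flatness, and target radius at z_max.
    p_term = (w_r * abs(X[-1] - Y[-1])
              + w_w * (abs(Xp[-1]) + abs(Yp[-1]))
              + w_t * abs(X[-1] ** 2 + Y[-1] ** 2 - r_tt_sq))

    return 1.0 / (1.0 + p_env + p_smooth + p_term)

# A flat, round beam with X = Y = r_max / 2 incurs no penalty.
n_z, r_max = 4000, 2.0
flat = np.zeros(n_z)
ideal = np.full(n_z, 0.5 * r_max)
r_ideal = beam_reward(ideal, flat, ideal, flat, r_max)

# A larger beam violates both the band and the terminal target.
wide = np.full(n_z, 1.5)
r_wide = beam_reward(wide, flat, wide, flat, r_max)
```

With these definitions, a constant round envelope of size $r_{\max}/2$ satisfies $X^2+Y^2=r_{\mathrm{tt}}^2$ at the exit, so every penalty term vanishes and the reward reaches its upper bound of 1.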
## Appendix B VAE Details
Each input to the VAE has size $[4000,4]$: the four beam-envelope quantities $(x,x^{\prime},y,y^{\prime})$ sampled at 4000 locations along the beamline. A 1D residual convolutional neural network repeatedly reduces the initial length of 4000 by factors of 2 using 1D convolutional layers with kernel size 3 and stride 2, while doubling the number of channels up to 256. For the encoder, the [tensor size, channel number] pairs are:

$$[4000,4]\rightarrow[4000,32]\rightarrow[2000,64]\rightarrow[1000,128]\rightarrow[500,256]\rightarrow[250,256]\rightarrow[125,256]\rightarrow[125,1]\rightarrow[128]\rightarrow[L_{d}],$$

where the tensor of size $[125,1]$ was flattened and then passed through dense layers, transformed into a vector of size $[128]$, which was finally compressed to the latent dimension $L_{d}$. Between successive factor-of-2 size reductions there are two residual blocks. Each block has two 1D convolutional layers with the number of channels shown for that stage above, and each convolution is followed by a GroupNorm and a SiLU activation function. The mean and log-variance are then each produced from the last vector of size $[L_{d}]$ by one dense layer before being passed to the sampling layer. The decoder architecture mirrors that of the encoder. The network was trained with a weight of 1e-3 on the KL divergence.
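As a quick consistency check on the encoder sizes listed above, the following sketch applies the standard output-length formula for a strided 1D convolution. The padding of 1 is an assumption (the text gives only the kernel size and stride), and the helper names `conv1d_out_len` and `encoder_shapes` are hypothetical.

```python
def conv1d_out_len(length, kernel=3, stride=2, padding=1):
    """Output length of a 1D convolution (dilation 1).

    padding=1 is assumed; it is the value that gives exact halving
    for kernel size 3 and stride 2 on even input lengths.
    """
    return (length + 2 * padding - kernel) // stride + 1

def encoder_shapes(length=4000, in_channels=4, latent_dim=3):
    """List the [tensor size, channel number] pairs through the encoder."""
    channels = [32, 64, 128, 256, 256, 256]
    # The first stage lifts 4 input channels to 32 without downsampling.
    shapes = [(length, in_channels), (length, channels[0])]
    for ch in channels[1:]:
        length = conv1d_out_len(length)  # stride-2 convolution halves the length
        shapes.append((length, ch))
    shapes.append((length, 1))    # single-channel tensor that is flattened
    shapes.append((128,))         # dense layer output
    shapes.append((latent_dim,))  # latent dimension L_d
    return shapes

shapes = encoder_shapes()
for s in shapes:
    print(list(s))
```

Under this padding assumption, five stride-2 stages take the length from 4000 to 125, matching the sequence in the text.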
Our study found that with $L_{d}=2$ the model struggled to accurately reproduce the beam envelopes, as shown on the left side of Figure [4](https://arxiv.org/html/2606.11474#A2.F4) for 4 random samples from the test data. The VAE with $L_{d}=3$ reproduced the beam envelopes more accurately, as shown on the right side of Figure [4](https://arxiv.org/html/2606.11474#A2.F4) for the same 4 samples. The higher accuracy of $L_{d}=3$ is quantified by the statistics of the reconstruction error from Equation [4](https://arxiv.org/html/2606.11474#S4.E4) for 8192 test samples in Figure [5](https://arxiv.org/html/2606.11474#A2.F5). Another difficulty with the $L_{d}=2$ model was a strongly non-Gaussian latent space, as shown on the right side of Figure [5](https://arxiv.org/html/2606.11474#A2.F5). Because the Mahalanobis distance assumes a Gaussian latent distribution, this would make Mahalanobis-based distance measurements inaccurate.
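To illustrate why latent-space Gaussianity matters, here is a minimal sketch of Mahalanobis scoring over latent codes. It is not the paper's pipeline: the random `z_train` array stands in for encoder outputs with $L_{d}=3$, and the 99% empirical-quantile threshold is an assumed decision rule chosen only for illustration.

```python
import numpy as np

def fit_latent_gaussian(z_train):
    """Estimate the mean and inverse covariance of in-distribution latents."""
    mu = z_train.mean(axis=0)
    cov = np.cov(z_train, rowvar=False)
    return mu, np.linalg.inv(cov)

def mahalanobis(z, mu, cov_inv):
    """Mahalanobis distance of each row of z from the fitted Gaussian."""
    d = np.atleast_2d(z) - mu
    return np.sqrt(np.einsum("ij,jk,ik->i", d, cov_inv, d))

# Stand-in for in-distribution latent codes (assumed data, not from the paper).
rng = np.random.default_rng(0)
z_train = rng.normal(size=(8192, 3))

mu, cov_inv = fit_latent_gaussian(z_train)

# Assumed rule: flag anything beyond the 99th percentile of training distances.
threshold = np.quantile(mahalanobis(z_train, mu, cov_inv), 0.99)

z_far = np.array([6.0, 6.0, 6.0])  # a latent code far from the training cloud
is_ood = mahalanobis(z_far, mu, cov_inv)[0] > threshold
```

The distance summarizes the training latents by a single mean and covariance, so it is only a faithful measure of typicality when the latent cloud is roughly ellipsoidal. A strongly non-Gaussian cloud, as reported for $L_{d}=2$, breaks that premise.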


Figure 4: Left: The 4 rows show reconstructions of 4 random test samples overlaid on the correct values for the $L_{d}=2$ VAE. The 4 columns from left to right show the $x$, $x^{\prime}$, $y$, and $y^{\prime}$ beam envelopes. Right: The same is shown for $L_{d}=3$.

Figure 5: Left: Error statistics for 8192 test data points compared for the $L_{d}=2$ and $L_{d}=3$ VAEs. Right: Non-Gaussian latent space of the $L_{d}=2$ VAE colored by error.