When Does Deep RL Beat Calibrated Baselines? A Benchmark Study on Adaptive Resource Control
Summary
A benchmark study finding that a calibrated rule-based autoscaler beats six mainstream deep RL algorithms on cost across all tested workloads, with RL only showing benefits on bursty patterns at higher cost. The paper introduces RLScale-Bench to improve evaluation protocol and reproducibility.
# When Does Deep RL Beat Calibrated Baselines? A Benchmark Study on Adaptive Resource Control
Source: [https://arxiv.org/html/2605.26418](https://arxiv.org/html/2605.26418)
###### Abstract
A properly calibrated rule-based autoscaler can beat every one of six mainstream deep reinforcement learning (DRL) algorithms on cost across every workload we test. So when, if ever, does DRL actually help? We study this in RLScale-Bench, a reproducible benchmark and evaluation protocol for DRL on adaptive resource control, where an agent allocates compute to a dynamic workload under cost and service-level constraints. The literature reports conflicting claims about whether model-free RL outperforms well-tuned rule-based controllers, with single-seed runs, uncalibrated baselines, and inconsistent training budgets confounding cross-study comparison. We evaluate PPO, DQN, A2C, SAC, TD3, and DDPG under matched network architectures, training budgets, and reward functions against a properly calibrated rule-based baseline across six workload patterns and five seeds (240 runs), instantiate the benchmark on Kubernetes Horizontal Pod Autoscaling, and probe distribution-shift generalization by training on one workload and deploying on five shifted distributions. Three findings challenge common assumptions: (i) a calibrated rule-based controller achieves the lowest cost on *all* six workloads and zero constraint violations on steady-state traffic, though it trails the best RL agents on bursty and flash patterns; (ii) discrete-action algorithms outperform continuous-action ones by one to two orders of magnitude in constraint violations due to action-space mismatch; and (iii) no single algorithm dominates across workload types, with rankings shifting by up to four positions between steady-state and bursty traffic. On bursty workloads, where RL should shine, PPO reduces constraint violations by 54% relative to the calibrated baseline, but only at 24% higher cost, suggesting that the bottleneck in RL-based resource control is not algorithm selection but baseline calibration, reward engineering, and realistic evaluation protocols.
Keywords: reinforcement learning, benchmark, adaptive resource control, baseline calibration, reproducibility
## 1 Introduction
Figure 1: RLScale-Bench pipeline. The six stages realize our contributions: matched RL agents (C1), calibrated HPA baseline (C2), 5-seed training and 240-run evaluation (C3), deployment on five shifted workloads (C4), and the three counter-intuitive findings (C5).
Adaptive resource control, that is, allocating compute resources to a dynamic workload while respecting cost and service-level constraints, is a canonical decision-making problem that combines offline training on historical traces with online adaptation to shifting traffic patterns. A growing body of work applies deep reinforcement learning (DRL) to this problem (Rossi et al., [2019](https://arxiv.org/html/2605.26418#bib.bib9); Qiu et al., [2020](https://arxiv.org/html/2605.26418#bib.bib10); Wang et al., [2022](https://arxiv.org/html/2605.26418#bib.bib12); Toka et al., [2021](https://arxiv.org/html/2605.26418#bib.bib13)), typically comparing a single RL algorithm against a simple rule-based controller on one or two workload patterns. Yet the conclusions differ sharply across studies: some report that RL improves cost by 30%, others find that rule-based controllers remain competitive. This inconsistency hinders progress: without a shared evaluation protocol, practitioners cannot assess which algorithmic advances are real.
We trace the inconsistency to three gaps in current practice:
(1) Uncalibrated baselines. Rule-based controllers such as threshold-driven autoscalers have tunable parameters (target utilization, cool-down windows) that strongly affect performance. When RL studies compare against an uncalibrated baseline, apparent improvements may reflect baseline weakness rather than algorithmic gains (Dulac-Arnold et al., [2021](https://arxiv.org/html/2605.26418#bib.bib27)).
(2) Single-seed reporting. DRL training exhibits high variance across random seeds, and many reported improvements fall within the noise margin of a single method (Henderson et al., [2018](https://arxiv.org/html/2605.26418#bib.bib22); Islam et al., [2017](https://arxiv.org/html/2605.26418#bib.bib23); Agarwal et al., [2021](https://arxiv.org/html/2605.26418#bib.bib24)). Without error bars computed across seeds, benchmark rankings are unreliable.
(3) Narrow workload coverage. Most studies evaluate on one or two traffic patterns, typically a diurnal or bursty trace collected from a single deployment. Because workload characteristics strongly interact with scaling policies, narrow coverage yields conclusions that do not transfer to other deployments.
We address these gaps with RLScale-Bench, a benchmark that follows the reproducible-evaluation principles advocated by Agarwal et al. ([2021](https://arxiv.org/html/2605.26418#bib.bib24)) and extends them to a real-world decision-making setting. We instantiate the benchmark on Kubernetes Horizontal Pod Autoscaling, a canonical adaptive resource control problem deployed in production at large scale (Kubernetes Authors, [2024](https://arxiv.org/html/2605.26418#bib.bib17); Burns et al., [2016](https://arxiv.org/html/2605.26418#bib.bib19)), and release an open simulator, trained models, and evaluation data. Our contributions are:
- C1. A benchmark and evaluation protocol for adaptive resource control that matches network architectures ([256, 256] MLP), training budgets (50K steps), and reward functions across PPO, DQN, A2C, SAC, TD3, and DDPG, eliminating confounds from implementation choices.
- C2. A calibrated rule-based baseline tuned to realistic production settings (70% target utilization), serving as a strong comparator rather than a strawman.
- C3. Statistically rigorous evaluation across 5 seeds and 6 workload patterns (constant, periodic, variable, bursty, ramp, flash), yielding 240 evaluation runs with error bars reported throughout.
- C4. A distribution-shift generalization study that trains agents on a variable workload and deploys them on five shifted workloads, revealing which algorithms adapt and which collapse.
- C5. Three counter-intuitive findings: (i) the calibrated baseline achieves lowest cost on all six workloads; (ii) discrete-action algorithms outperform continuous-action ones by orders of magnitude due to action-space mismatch; and (iii) no single RL algorithm dominates across workloads.
These findings challenge a common assumption in the decision\-making literature that deep RL straightforwardly outperforms rule\-based control\. We argue that progress requires moving beyond algorithm novelty toward reward engineering, environment calibration, and evaluation protocols that reflect real\-world deployment challenges—themes this benchmark is designed to support\.
## 2 Related Work
#### Decision\-making benchmarks for real\-world control\.
A growing line of work calls for evaluation protocols that reflect the challenges of deploying RL in realistic, sequential decision-making settings. Agarwal et al. ([2021](https://arxiv.org/html/2605.26418#bib.bib24)) argue that single-seed results in deep RL benchmarks are statistically unreliable and propose stratified bootstrap for confidence intervals. Henderson et al. ([2018](https://arxiv.org/html/2605.26418#bib.bib22)) show that hyperparameter choices, random seeds, and implementation details can flip algorithm rankings, with some reported gains falling within single-method variance. Dulac-Arnold et al. ([2021](https://arxiv.org/html/2605.26418#bib.bib27)) identify weak baselines as a systemic issue across applied RL. Our benchmark applies these lessons to adaptive resource control, combining matched training budgets, multiple seeds, and distribution-shift evaluation in a single protocol.
#### RL for cloud and container resource management\.
RL has been applied to cloud resource management with increasing sophistication. Rossi et al. ([2019](https://arxiv.org/html/2605.26418#bib.bib9)) applied Q-learning to horizontal and vertical container scaling. Qiu et al. ([2020](https://arxiv.org/html/2605.26418#bib.bib10)) proposed FIRM for fine-grained microservice resource management under SLO constraints. Wang et al. ([2022](https://arxiv.org/html/2605.26418#bib.bib12)) introduced DeepScaling for stable CPU utilization in large-scale production systems. Toka et al. ([2021](https://arxiv.org/html/2605.26418#bib.bib13)) applied machine learning to Kubernetes edge cluster scaling; Zhang et al. ([2025a](https://arxiv.org/html/2605.26418#bib.bib7)) developed a GPU-aware Kubernetes simulator with PPO-based autoscaling, demonstrating a 75% reward improvement over CPU-only baselines; and Garí et al. ([2021](https://arxiv.org/html/2605.26418#bib.bib16)) survey the broader field of RL-based cloud autoscaling. These studies typically compare one or two RL algorithms against a lightly-tuned rule-based controller on a narrow workload range; cross-study comparison is difficult because network architectures, training budgets, and baseline configurations vary substantially. Our matched-budget, multi-algorithm, multi-workload protocol is designed to remove these confounds.
#### Rule\-based autoscaling as a baseline\.
The Kubernetes HPA adjusts replica counts based on observed CPU or custom metrics (Kubernetes Authors, [2024](https://arxiv.org/html/2605.26418#bib.bib17)), and KEDA (KEDA Authors, [2024](https://arxiv.org/html/2605.26418#bib.bib18)) extends HPA with event-driven scaling from external sources. While rule-based controllers are often dismissed as “simple,” we show that a properly calibrated HPA is a surprisingly strong comparator, consistent with observations from Booth et al. ([2023](https://arxiv.org/html/2605.26418#bib.bib28)) that reward misdesign and baseline weakness can produce misleading comparisons in applied RL.
#### Distribution\-shift generalization\.
A central challenge in real\-world RL deployment is generalizing from a training distribution to shifted deployment conditions—a concern that motivates the offline\-to\-online RL literature and broader work on distribution shift\. We do not study offline\-to\-online RL in the strict sense \(initializing from a fixed offline dataset and fine\-tuning online\); instead, we evaluate distribution\-shift generalization by training each agent online in simulation on a single workload \(*variable*\) and deploying it without retraining on five shifted distributions\. Our finding that the best training\-time algorithm is rarely the best deployment\-time algorithm connects to broader observations about distribution shift in applied RL\.
#### RL infrastructure and environments\.
Gymnasium (Brockman et al., [2016](https://arxiv.org/html/2605.26418#bib.bib20)) and Stable-Baselines3 (Raffin et al., [2021](https://arxiv.org/html/2605.26418#bib.bib21)) provide standardized environments and algorithm implementations. Microservice benchmarks such as DeathStarBench (Gan et al., [2019](https://arxiv.org/html/2605.26418#bib.bib26)) provide realistic workloads but lack integrated RL evaluation frameworks. RLScale-Bench is, to our knowledge, the first benchmark that combines a realistic adaptive resource control environment, matched-budget multi-algorithm evaluation, and explicit distribution-shift evaluation in a single protocol.
## 3 Benchmark Design
Figure [1](https://arxiv.org/html/2605.26418#S1.F1) summarizes the benchmark end-to-end; this section details each stage.
### 3.1 Environment
We instantiate adaptive resource control as a Markov Decision Process (MDP) following the Gymnasium interface (Brockman et al., [2016](https://arxiv.org/html/2605.26418#bib.bib20)), using Kubernetes Horizontal Pod Autoscaling as a concrete testbed. The agent allocates compute resources (pod replicas) to a service handling a dynamic request stream, subject to infrastructure cost and service-level constraints. While we instantiate the benchmark on Kubernetes to ensure production realism, following prior work on simulation-based evaluation of RL autoscalers (Zhang et al., [2025a](https://arxiv.org/html/2605.26418#bib.bib7), [b](https://arxiv.org/html/2605.26418#bib.bib8)), the abstraction applies broadly to cloud scheduling, database provisioning, and edge inference deployment.
#### State space\.
The agent observes a 6-dimensional state vector at each decision step: $s_t = [\text{CPU}_t,\; \text{Mem}_t,\; \text{QPS}_t,\; p95_t,\; \text{ErrRate}_t,\; \text{Replicas}_t]$, where $\text{CPU}_t \in [0, 100]$ is CPU utilization (%), $\text{Mem}_t \in [0, 512]$ is memory usage (MB), $\text{QPS}_t$ is the current request rate, $p95_t$ is 95th-percentile latency (ms), $\text{ErrRate}_t \in [0, 1]$ is the error rate, and $\text{Replicas}_t \in [1, 10]$ is the current replica count.
#### Action space\.
The action space is `Discrete(5)`, representing replica count changes: $a_t \in \{-2, -1, 0, +1, +2\}$. This reflects a fundamental constraint of physical resource allocation: replicas are indivisible units, a property shared by many real-world control problems (allocation of virtual machines, database shards, or hardware accelerators). For continuous-action algorithms (SAC, TD3, DDPG), we apply a `DiscreteToBoxWrapper` that maps $\text{Box}(-1, 1) \to \text{Discrete}(5)$ via uniform bin edges at $[-0.6, -0.2, 0.2, 0.6]$.
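The binning described above can be sketched in a few lines. This is a minimal illustration and not the benchmark's `DiscreteToBoxWrapper`: the function name `box_to_replica_delta` and the handling of values that fall exactly on a bin edge are assumptions.

```python
import bisect

# Bin edges and replica deltas as stated in the text.
BIN_EDGES = [-0.6, -0.2, 0.2, 0.6]
REPLICA_DELTAS = [-2, -1, 0, +1, +2]


def box_to_replica_delta(u: float) -> int:
    """Map a continuous action u in [-1, 1] to a replica change in {-2..+2}."""
    u = max(-1.0, min(1.0, u))  # clip to the Box(-1, 1) range
    # bisect_right counts the edges that are <= u, which is the bin index 0..4.
    # Assumption: a value exactly on an edge falls into the upper bin.
    return REPLICA_DELTAS[bisect.bisect_right(BIN_EDGES, u)]
```

For instance, any actor output between -0.2 and 0.2 maps to "no change", so the five bins each cover an interval of width 0.4.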
#### Reward function\.
The reward balances infrastructure cost against SLO compliance:
$$r_t = -\bigl(\underbrace{c_{\text{rep}} \cdot \text{Replicas}_t}_{\text{cost}} + \lambda \cdot \underbrace{\mathbb{1}[\text{SLO violated}]}_{\text{penalty}}\bigr) \tag{1}$$
where $c_{\text{rep}} = 0.01$ USD per replica per step and $\lambda = 1.0$ controls the SLO violation penalty. We normalize rewards to $[-1, 0]$ via min-max scaling based on empirical bounds, which stabilizes training across all six algorithms.
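As a rough sketch of Equation (1) and the normalization step, the snippet below computes the raw reward and rescales it to $[-1, 0]$. The paper uses empirical bounds for the min-max scaling; here the bounds are derived analytically from the replica range $[1, 10]$, which is an assumption, as are the function names.

```python
C_REP = 0.01   # USD per replica per step (from the text)
LAMBDA = 1.0   # SLO violation penalty weight (from the text)

# Assumed analytical bounds; the paper normalizes with empirical bounds instead.
R_MAX = -(C_REP * 1)             # best case: 1 replica, no violation
R_MIN = -(C_REP * 10 + LAMBDA)   # worst case: 10 replicas and a violation


def raw_reward(replicas: int, slo_violated: bool) -> float:
    """Equation (1): negative replica cost minus the SLO penalty."""
    return -(C_REP * replicas + LAMBDA * float(slo_violated))


def normalized_reward(replicas: int, slo_violated: bool) -> float:
    """Min-max scale the raw reward so that R_MIN -> -1 and R_MAX -> 0."""
    r = raw_reward(replicas, slo_violated)
    return (r - R_MAX) / (R_MAX - R_MIN)
```

With these constants a single SLO violation costs as much as 100 replica-steps, which is why the penalty term dominates whenever violations occur.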
#### Workload generator\.
We implement six workload patterns that span the diversity of production traffic (Table [1](https://arxiv.org/html/2605.26418#S3.T1)):
Table 1: Workload types and their characteristics.
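To make the six pattern names concrete, here is a hypothetical generator for request-rate traces. Only the pattern names and the description of *variable* as a random walk come from the paper; every shape parameter below (base rate, period, burst probability, spike size, spike duration) is an illustrative assumption and not the benchmark's configuration.

```python
import math
import random


def generate_workload(kind: str, steps: int = 240, base: float = 100.0,
                      seed: int = 0) -> list[float]:
    """Return a QPS trace of length `steps` for one of six pattern names."""
    rng = random.Random(seed)
    level = base  # running level for the random walk
    trace = []
    for t in range(steps):
        if kind == "constant":
            value = base
        elif kind == "periodic":
            # Assumed sinusoid with a 60-step period and 50% amplitude.
            value = base * (1.0 + 0.5 * math.sin(2 * math.pi * t / 60))
        elif kind == "variable":
            # Random walk, floored at zero; step size is an assumption.
            level = max(0.0, level + rng.gauss(0.0, 5.0))
            value = level
        elif kind == "bursty":
            # Assumed 5% chance per step of a 3x burst.
            value = base * (3.0 if rng.random() < 0.05 else 1.0)
        elif kind == "ramp":
            # Assumed linear growth from 1x to 2x over the episode.
            value = base * (1.0 + t / steps)
        elif kind == "flash":
            # Assumed single 5x spike lasting 10 steps at mid-episode.
            in_spike = steps // 2 <= t < steps // 2 + 10
            value = base * (5.0 if in_spike else 1.0)
        else:
            raise ValueError(f"unknown workload kind: {kind}")
        trace.append(value)
    return trace
```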
### 3.2 Algorithms
We evaluate six DRL algorithms from Stable-Baselines3 (Raffin et al., [2021](https://arxiv.org/html/2605.26418#bib.bib21)) spanning three families:
- On-policy, discrete: PPO (Schulman et al., [2017](https://arxiv.org/html/2605.26418#bib.bib1)), A2C (Mnih et al., [2016](https://arxiv.org/html/2605.26418#bib.bib3))
- Off-policy, discrete: DQN (Mnih et al., [2015](https://arxiv.org/html/2605.26418#bib.bib2))
- Off-policy, continuous: SAC (Haarnoja et al., [2018](https://arxiv.org/html/2605.26418#bib.bib4)), TD3 (Fujimoto et al., [2018](https://arxiv.org/html/2605.26418#bib.bib5)), DDPG (Lillicrap et al., [2016](https://arxiv.org/html/2605.26418#bib.bib6))
All algorithms use: (i) identical MLP architecture with two hidden layers of 256 units each, (ii) a training budget of 50,000 timesteps on the *variable* workload, and (iii) 5 random seeds per algorithm for statistical robustness. Algorithm-specific hyperparameters (learning rate, batch size, buffer size) follow Stable-Baselines3 defaults except where noted in the appendix.
### 3.3 Baselines
#### Calibrated rule\-based controller \(HPA\)\.
We implement the production-faithful Kubernetes Horizontal Pod Autoscaler with a 70% CPU utilization target: $\text{desired} = \lceil \text{current} \times (\text{CPU} / 70) \rceil$, clamped to $[1, 10]$ replicas. The 70% target follows the default production configuration and provides a built-in safety margin for bursty traffic. We emphasize that this baseline is *calibrated*, not a strawman: the target utilization, clamp bounds, and decision interval match realistic production deployments, enabling a fair comparison against RL agents.
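The scaling rule above is small enough to write out directly. The sketch below assumes CPU utilization is given in percent and that the rule is applied once per decision step; the function and parameter names are illustrative.

```python
import math


def hpa_desired_replicas(current: int, cpu_percent: float,
                         target: float = 70.0,
                         min_replicas: int = 1,
                         max_replicas: int = 10) -> int:
    """desired = ceil(current * CPU / target), clamped to [min, max]."""
    desired = math.ceil(current * cpu_percent / target)
    return max(min_replicas, min(max_replicas, desired))
```

For example, 3 replicas at 90% CPU give `ceil(3 * 90 / 70) = 4`, while 4 replicas at 35% CPU scale down to 2.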
#### Random\.
A uniform random policy that selects actions from $\{-2, -1, 0, +1, +2\}$ with equal probability, establishing a lower bound on performance.
### 3.4 Evaluation Protocol
Each trained model is evaluated on all six workload types over 240 decision steps (60 simulated minutes at 15-second intervals). We report: (1) total infrastructure cost (USD), (2) total SLO violations (count of steps where latency exceeds threshold or error rate exceeds 5%), and (3) mean replica count. All metrics include mean ± std across 5 seeds. For baselines (HPA, Random), we run 5 seeds with different random noise realizations.
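The protocol's bookkeeping can be sketched as follows. Counting six algorithms plus two baselines, six workloads, and five seeds is one reading that reproduces the stated total of 240 runs. Whether the reported std is the sample or population estimator is not stated, so the use of the sample estimator here is an assumption.

```python
from itertools import product
from statistics import mean, stdev

POLICIES = ["PPO", "DQN", "A2C", "SAC", "TD3", "DDPG", "HPA", "Random"]
WORKLOADS = ["constant", "periodic", "variable", "bursty", "ramp", "flash"]
SEEDS = range(5)

STEPS_PER_EPISODE = 240   # decision steps per evaluation episode
STEP_SECONDS = 15         # simulated seconds per decision step

# One evaluation run per (policy, workload, seed) combination.
RUNS = list(product(POLICIES, WORKLOADS, SEEDS))


def summarize(per_seed_values):
    """Return (mean, std) across seeds; std is the sample estimator (assumed)."""
    return mean(per_seed_values), stdev(per_seed_values)
```

`len(RUNS)` is 8 × 6 × 5 = 240, and 240 steps of 15 seconds cover the 60 simulated minutes described above.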
## 4 Experiments
### 4.1 Main Results
Table [2](https://arxiv.org/html/2605.26418#S4.T2) presents infrastructure costs across all algorithms and workloads. Our first and most striking finding is that HPA achieves the lowest cost on all six workloads, outperforming every RL algorithm. This result holds because HPA scales conservatively, maintaining fewer replicas on average (2.0–3.0 vs. 2.7–3.8 for RL agents), while still achieving reasonable SLO compliance.
Table 2: Total infrastructure cost (USD) across 6 algorithms, 2 baselines, and 6 workload types. Each cell shows mean ± std over 5 random seeds. Bold: best RL algorithm per workload. Underline: best overall. Only algorithms with mean SLO violations < 100 are eligible for marking.
Table [3](https://arxiv.org/html/2605.26418#S4.T3) reports SLO violations. Here the picture is more nuanced: while HPA achieves zero violations on constant, periodic, and ramp workloads, it incurs 30.0 violations on bursty traffic, significantly more than PPO (13.7 ± 5.6) and SAC (8.1 ± 7.5). This suggests that RL's advantage emerges specifically on unpredictable workloads where proactive scaling can preempt SLO breaches.
Table 3: SLO violations across algorithms and workloads. Each cell shows mean ± std over 5 seeds. Bold: best RL algorithm. Lower is better; 0 indicates full SLO compliance.
### 4.2 Discrete vs. Continuous Action Spaces
Figure [2](https://arxiv.org/html/2605.26418#S4.F2) reveals a dramatic performance gap between algorithm families. Continuous-action algorithms (SAC, TD3, DDPG) exhibit SLO violations that are one to two orders of magnitude higher than their discrete-action counterparts (PPO, DQN, A2C).
TD3 shows extreme variance \(std = 4\.38 on mean replicas\), indicating that some seeds learn functional policies while others degenerate\. DDPG consistently converges to a single\-replica policy across all seeds, resulting in the lowest cost \($0\.0011\) but catastrophic SLO violations \(300–989 per workload\)\.
This failure mode arises from the *action-space mismatch*: Kubernetes scaling is inherently discrete (integer replicas), yet continuous algorithms output real-valued actions that must be bucketed at bin edges. The discretization aliases the reward signal: within a bin, small changes to the actor's output produce no change in reward (near-zero gradient); at bin edges, infinitesimal changes produce discrete jumps (high-variance gradient). The 50K budget is also short for off-policy continuous algorithms, so part of the gap may reflect undertraining; a learning-curve ablation is left to future work.
Figure 2: Discrete-action algorithms (PPO, DQN, A2C) achieve dramatically lower SLO violations than continuous-action algorithms (SAC, TD3, DDPG). The continuous family's median SLO count is more than 100× higher.
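The aliasing argument in this subsection can be illustrated numerically. The sketch below applies a central finite difference to the binned action. It is a toy demonstration of the piecewise-constant mapping, not an analysis of the actual actor-critic gradients, and the helper names are assumptions.

```python
import bisect

BIN_EDGES = [-0.6, -0.2, 0.2, 0.6]      # bin edges from Section 3.1
REPLICA_DELTAS = [-2, -1, 0, 1, 2]


def discretize(u: float) -> int:
    """Binned replica change for a continuous action u."""
    return REPLICA_DELTAS[bisect.bisect_right(BIN_EDGES, u)]


def finite_difference(u: float, eps: float = 1e-3) -> float:
    """Central difference of the discretized action around u."""
    return (discretize(u + eps) - discretize(u - eps)) / (2 * eps)


inside_bin = finite_difference(0.0)   # flat plateau: no change in action
at_edge = finite_difference(0.2)      # straddles an edge: jump of one replica
```

Inside a bin the difference is exactly zero, while at an edge it equals `1 / (2 * eps)` and grows without bound as `eps` shrinks, which mirrors the near-zero and high-variance regimes described in the text.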
### 4.3 Cost-SLO Trade-off
Figure [3](https://arxiv.org/html/2605.26418#S4.F3) shows the composite score (normalized cost + SLO) across algorithm-workload pairs. HPA achieves the lowest composite scores overall (0.00–0.18); among RL methods, SAC is strongest on SLO-dominated workloads and PPO on cost-dominated ones, together forming the cost–SLO frontier analyzed in Appendix [D](https://arxiv.org/html/2605.26418#A4). Notably, DDPG scores 1.00 (the worst possible value) on every non-constant workload due to its degenerate single-replica policy, confirming that cost minimization without SLO awareness is not a viable strategy.
Figure 3: Composite performance heatmap (0 = best, 1 = worst). HPA and SAC form the Pareto frontier among viable algorithms. DDPG's degenerate policy scores 1.00 on all dynamic workloads.
### 4.4 Ranking Stability Under Workload Shift
Figure [4](https://arxiv.org/html/2605.26418#S4.F4) tracks how algorithm rankings (by SLO violations) shift across workload types, a stress test for the common practice of training on one distribution and deploying on another. No single algorithm maintains a consistent rank: HPA is rank 1 on constant and periodic workloads but drops to rank 5 on bursty traffic; PPO maintains the most stable ranking (rank 2–3); DQN is consistently the weakest viable algorithm (rank 4–5); SAC shows the widest rank variance, excelling on variable workloads (rank 1) but performing poorly on ramp traffic (rank 4).
This instability carries a direct implication for train-to-deploy distribution shift: the best algorithm selected on a training distribution is unlikely to remain best after workload shift. Benchmarks that evaluate only on the training distribution systematically overestimate deployed performance.
Figure 4: Algorithm ranking (by SLO violations) shifts across workloads. No algorithm maintains rank 1 on all patterns. Only viable algorithms shown (TD3, DDPG excluded).
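The per-workload ranking by SLO violations can be sketched as a simple sort. The example uses only the three bursty-workload means quoted in Section 4.1 (HPA, PPO, SAC), so the resulting ranks are relative to that subset and not to the full set of five viable policies in Figure 4. The function name and the tie-breaking rule are assumptions.

```python
def rank_by_violations(violations: dict[str, float]) -> dict[str, int]:
    """Rank policies by mean SLO violations; rank 1 has the fewest.

    Ties keep insertion order (an assumption; the paper does not say).
    """
    ordered = sorted(violations, key=violations.get)
    return {name: position for position, name in enumerate(ordered, start=1)}


# Mean SLO violations on the bursty workload, as quoted in Section 4.1.
bursty = {"HPA": 30.0, "PPO": 13.7, "SAC": 8.1}
bursty_ranks = rank_by_violations(bursty)
```

Within this subset SAC ranks first and HPA last, consistent with HPA falling to the bottom of the viable set on bursty traffic.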
### 4.5 Distribution-Shift Generalization
All agents were trained on the *variable* workload (a random-walk trace designed to expose the policy to diverse conditions) and deployed on the other five workloads without retraining. This probes distribution-shift generalization: can a policy trained on one traffic distribution adapt when deployment conditions shift?
Table [3](https://arxiv.org/html/2605.26418#S4.T3) answers this question through the lens of SLO compliance. PPO, trained on *variable*, achieves 0 violations on *constant*, *periodic*, and *ramp*; it generalizes successfully to these distributions. On *bursty* and *flash*, however, violations rise to 13.7 and 19.1 respectively, showing that the policy fails to extrapolate to heavy-tail traffic it did not see during training. SAC generalizes more robustly on dynamic workloads (1.9 violations on *variable* itself, 8.1 on *bursty*) but at the cost of systematically higher resource usage (3.19–3.78 mean replicas vs. 2.68–3.18 for PPO). DQN and A2C fall between these regimes: both inherit PPO's cost efficiency on steady workloads but degrade more than PPO on bursty traffic. We call this trade-off the *transfer tax*: algorithms that generalize well under distribution shift pay for robustness with additional resource cost.
Critically, the calibrated rule\-based baseline incurs no transfer tax because it does not train on any distribution—its policy is a fixed function of current CPU utilization\. This explains why a properly tuned rule\-based controller beats many RL agents under distribution shift: it simply does not overfit\.
## 5 Discussion
#### Why is the calibrated baseline so competitive?
A properly tuned rule-based controller is surprisingly hard to beat. We attribute this to three factors: (i) CPU utilization is a strong, low-latency proxy for load in most production workloads, making reactive scaling effective for steady-state and mildly dynamic traffic; (ii) the 70% target provides a built-in safety margin that prevents constraint violations under gradual load changes; (iii) the rule-based policy does not train on any distribution, so it incurs no transfer tax when workload shifts. This echoes observations in broader applied RL that simple, well-tuned baselines are routinely underestimated (Dulac-Arnold et al., [2021](https://arxiv.org/html/2605.26418#bib.bib27); Henderson et al., [2018](https://arxiv.org/html/2605.26418#bib.bib22)).
#### When does RL help?
RL's advantage concentrates on *unpredictable* workloads (bursty, flash) where proactive scaling (learned from experience) can preempt violations that a reactive policy cannot avoid. On bursty traffic, PPO reduces violations by 54% relative to the calibrated baseline (13.7 vs. 30.0) at 24% higher cost. The practical implication: RL is most valuable in deployments where the cost of constraint violations (revenue loss, user churn) substantially exceeds resource cost. In cost-dominated settings, the rule-based controller wins.
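The trade-off in this paragraph reduces to simple arithmetic. The sketch below recomputes the 54% figure from the quoted violation counts and derives a break-even price per violation. The break-even framing and the function names are illustrative assumptions, and the baseline cost is left as a caller-supplied input because its value on the bursty workload is not quoted in this section.

```python
def relative_reduction(baseline: float, candidate: float) -> float:
    """Fractional reduction of `candidate` relative to `baseline`."""
    return (baseline - candidate) / baseline


def break_even_violation_price(baseline_cost: float,
                               cost_increase: float,
                               violations_avoided: float) -> float:
    """USD per violation above which the extra spend pays for itself."""
    extra_cost = baseline_cost * cost_increase
    return extra_cost / violations_avoided


# Violation counts on bursty traffic, as quoted in the text.
reduction = relative_reduction(30.0, 13.7)   # about 0.54
avoided = 30.0 - 13.7                        # violations avoided by PPO
```

Given a baseline cost `C`, PPO is worthwhile on bursty traffic only when one violation costs more than `break_even_violation_price(C, 0.24, avoided)`.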
#### The action\-space mismatch problem\.
TD3 and DDPG fail catastrophically because the discretization wrapper aliases the reward signal: the actor sees a piecewise\-constant reward surface with flat plateaus inside bins and sharp jumps at bin edges, producing near\-zero gradients most of the time and high\-variance estimates at transitions\. SAC partially overcomes this through entropy\-regularized exploration but still pays 15–30% higher cost than discrete alternatives\. The lesson generalizes beyond autoscaling: any decision\-making problem with inherently discrete actions \(allocation of indivisible units, discrete network flows, integer programming\) should prefer discrete\-action algorithms or develop differentiable abstractions\.
#### Implications for benchmark design in decision\-making research\.
Our findings carry three implications for benchmark designers in the decision-making community: (1) Calibrate baselines. A weak baseline inflates apparent gains; a strong baseline reveals where progress is real. Benchmarks should document baseline tuning as carefully as RL hyperparameters. (2) Evaluate under distribution shift. Rankings on the training distribution do not predict rankings under deployment shift. Benchmarks should include held-out distributions as a first-class requirement. (3) Report error bars across seeds. Single-seed runs can flip rankings between discrete and continuous algorithms, or between PPO and DQN. Claims based on one seed are unfalsifiable.
#### Implications for practitioners\.
For production deployments: \(i\) always calibrate the rule\-based controller before evaluating RL—many reported gains may reflect baseline weakness rather than algorithmic superiority; \(ii\) prefer discrete\-action algorithms \(PPO, A2C\) for indivisible\-unit resource allocation; \(iii\) evaluate on a diversity of workload patterns, because algorithm rankings are workload\-dependent\.
#### Limitations\.
The benchmark uses a simulated environment calibrated to real Kubernetes metrics from a Kind cluster\. This choice enables reproducibility and scale \(240 runs\), but may not capture all dynamics of production clusters \(network latency under load, node failures, multi\-tenancy\)\. The training budget \(50K steps\) is modest; longer training may narrow the gap between algorithms, though our matched\-budget design ensures fair cross\-algorithm comparison\. We evaluate single\-service scaling; multi\-service and cluster\-level scaling remain open challenges that we plan to address in future work via learned graph\-based dynamics models\.
## 6 Conclusion
We introduced RLScale-Bench, a reproducible benchmark and evaluation protocol for deep reinforcement learning on adaptive resource control, instantiated on Kubernetes Horizontal Pod Autoscaling. A systematic comparison of six algorithms across six workload patterns and five random seeds reveals that (1) a calibrated rule-based controller remains the most cost-effective solution; (2) discrete-action algorithms outperform continuous-action ones by orders of magnitude due to action-space mismatch; and (3) no single algorithm dominates across workloads, with rankings shifting by up to four positions under distribution shift.
These findings challenge the prevailing narrative that deep RL straightforwardly outperforms rule-based control in resource management. We argue instead that the bottleneck lies in reward engineering, baseline calibration, and evaluation protocols that reflect real-world deployment challenges, problems the decision-making community can address through shared benchmarks and higher evaluation standards. We release the full benchmark suite, trained models, and evaluation data to support this goal.¹
¹ Code available upon publication.
#### Future work\.
We plan to extend RLScale-Bench in three directions: (i) learned world models for cascade-aware planning, where an agent rolls out trajectories in a predicted dynamics model to preempt chain failures before they occur; (ii) safe planning via constrained rollouts that guarantee zero violations during deployment; and (iii) graph-structured resource control across multi-service topologies, where scaling decisions propagate through a dependency graph.
## References
- R\. Agarwal, M\. Schwarzer, P\. S\. Castro, A\. C\. Courville, and M\. Bellemare \(2021\)Deep reinforcement learning at the edge of the statistical precipice\.InAdvances in Neural Information Processing Systems,Vol\.34,pp\. 29304–29320\.Cited by:[§1](https://arxiv.org/html/2605.26418#S1.p4.1),[§1](https://arxiv.org/html/2605.26418#S1.p6.1),[§2](https://arxiv.org/html/2605.26418#S2.SS0.SSS0.Px1.p1.1)\.
- S\. Booth, W\. B\. Knox, J\. Shah, S\. Niekum, P\. Stone, and A\. Allievi \(2023\)The perils of trial\-and\-error reward design: misdesign through overfitting and invalid task specifications\.InAAAI Conference on Artificial Intelligence,Vol\.37,pp\. 5920–5929\.Cited by:[§2](https://arxiv.org/html/2605.26418#S2.SS0.SSS0.Px3.p1.1)\.
- G\. Brockman, V\. Cheung, L\. Pettersson, J\. Schneider, J\. Schulman, J\. Tang, and W\. Zaremba \(2016\)OpenAI Gym\.arXiv preprint arXiv:1606\.01540\.Cited by:[§2](https://arxiv.org/html/2605.26418#S2.SS0.SSS0.Px5.p1.1),[§3\.1](https://arxiv.org/html/2605.26418#S3.SS1.p1.1)\.
- B\. Burns, B\. Grant, D\. Oppenheimer, E\. Brewer, and J\. Wilkes \(2016\)Borg, omega, and Kubernetes: lessons learned from three container\-management systems over a decade\.ACM Queue14\(1\),pp\. 70–93\.Cited by:[§1](https://arxiv.org/html/2605.26418#S1.p6.1)\.
- G\. Dulac\-Arnold, N\. Levine, D\. J\. Mankowitz, J\. Li, C\. Paduraru, S\. Gowal, and T\. Hester \(2021\)Challenges of real\-world reinforcement learning: definitions, benchmarks and analysis\.Machine Learning110,pp\. 2419–2468\.Cited by:[§1](https://arxiv.org/html/2605.26418#S1.p3.1),[§2](https://arxiv.org/html/2605.26418#S2.SS0.SSS0.Px1.p1.1),[§5](https://arxiv.org/html/2605.26418#S5.SS0.SSS0.Px1.p1.1)\.
- S\. Fujimoto, H\. Hoof, and D\. Meger \(2018\)Addressing function approximation error in actor\-critic methods\.InInternational Conference on Machine Learning,pp\. 1587–1596\.Cited by:[3rd item](https://arxiv.org/html/2605.26418#S3.I1.i3.p1.1)\.
- Y\. Gan, Y\. Zhang, D\. Cheng, A\. Shetty, P\. Rathi, N\. Katarki, A\. Bruno, J\. Hu, B\. Ritchken, B\. Jackson, et al\. \(2019\) An open\-source benchmark suite for microservices and their hardware\-software implications for cloud & edge systems\. ACM International Conference on Architectural Support for Programming Languages and Operating Systems \(ASPLOS\), pp\. 3–18\. Cited by: [§2](https://arxiv.org/html/2605.26418#S2.SS0.SSS0.Px5.p1.1)\.
- Y\. Garí, D\. A\. Monge, E\. Pacini, C\. Mateos, and C\. García Garino \(2021\) Reinforcement learning\-based application autoscaling in the cloud: a survey\. Engineering Applications of Artificial Intelligence 102, pp\. 104288\. External Links: [Document](https://dx.doi.org/10.1016/j.engappai.2021.104288)\. Cited by: [§2](https://arxiv.org/html/2605.26418#S2.SS0.SSS0.Px2.p1.1)\.
- T\. Haarnoja, A\. Zhou, P\. Abbeel, and S\. Levine \(2018\) Soft actor\-critic: off\-policy maximum entropy deep reinforcement learning with a stochastic actor\. In International Conference on Machine Learning, pp\. 1861–1870\. Cited by: [3rd item](https://arxiv.org/html/2605.26418#S3.I1.i3.p1.1)\.
- P\. Henderson, R\. Islam, P\. Bachman, J\. Pineau, D\. Precup, and D\. Meger \(2018\) Deep reinforcement learning that matters\. In AAAI Conference on Artificial Intelligence, Vol\. 32\. Cited by: [§1](https://arxiv.org/html/2605.26418#S1.p4.1), [§2](https://arxiv.org/html/2605.26418#S2.SS0.SSS0.Px1.p1.1), [§5](https://arxiv.org/html/2605.26418#S5.SS0.SSS0.Px1.p1.1)\.
- R\. Islam, P\. Henderson, M\. Gomrokchi, and D\. Precup \(2017\) Reproducibility of benchmarked deep reinforcement learning tasks for continuous control\. In ICML Workshop on Reproducibility in Machine Learning\. Note: arXiv:1708\.04133\. Cited by: [§1](https://arxiv.org/html/2605.26418#S1.p4.1)\.
- KEDA Authors \(2024\) KEDA: Kubernetes\-based event driven autoscaling\. Note: [https://keda\.sh/](https://keda.sh/)\. Accessed: 2026\-03\-15\. Cited by: [§2](https://arxiv.org/html/2605.26418#S2.SS0.SSS0.Px3.p1.1)\.
- Kubernetes Authors \(2024\) Kubernetes horizontal pod autoscaler\. Note: [https://kubernetes\.io/docs/tasks/run\-application/horizontal\-pod\-autoscale/](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/)\. Accessed: 2026\-03\-15\. Cited by: [§1](https://arxiv.org/html/2605.26418#S1.p6.1), [§2](https://arxiv.org/html/2605.26418#S2.SS0.SSS0.Px3.p1.1)\.
- T\. P\. Lillicrap, J\. J\. Hunt, A\. Pritzel, N\. Heess, T\. Erez, Y\. Tassa, D\. Silver, and D\. Wierstra \(2016\) Continuous control with deep reinforcement learning\. arXiv preprint arXiv:1509\.02971\. Cited by: [3rd item](https://arxiv.org/html/2605.26418#S3.I1.i3.p1.1)\.
- V\. Mnih, A\. P\. Badia, M\. Mirza, A\. Graves, T\. Lillicrap, T\. Harley, D\. Silver, and K\. Kavukcuoglu \(2016\) Asynchronous methods for deep reinforcement learning\. In International Conference on Machine Learning, pp\. 1928–1937\. Cited by: [1st item](https://arxiv.org/html/2605.26418#S3.I1.i1.p1.1)\.
- V\. Mnih, K\. Kavukcuoglu, D\. Silver, A\. A\. Rusu, J\. Veness, M\. G\. Bellemare, A\. Graves, M\. Riedmiller, A\. K\. Fidjeland, G\. Ostrovski, et al\. \(2015\) Human\-level control through deep reinforcement learning\. Nature 518\(7540\), pp\. 529–533\. External Links: [Document](https://dx.doi.org/10.1038/nature14236)\. Cited by: [2nd item](https://arxiv.org/html/2605.26418#S3.I1.i2.p1.1)\.
- H\. Qiu, S\. S\. Banerjee, S\. Jha, Z\. T\. Kalbarczyk, and R\. K\. Iyer \(2020\) FIRM: an intelligent fine\-grained resource management framework for SLO\-oriented microservices\. USENIX Symposium on Operating Systems Design and Implementation \(OSDI\), pp\. 805–825\. Cited by: [§1](https://arxiv.org/html/2605.26418#S1.p1.1), [§2](https://arxiv.org/html/2605.26418#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Raffin, A\. Hill, A\. Gleave, A\. Kanervisto, M\. Ernestus, and N\. Dormann \(2021\) Stable\-Baselines3: reliable reinforcement learning implementations\. Journal of Machine Learning Research 22\(268\), pp\. 1–8\. Cited by: [Appendix A](https://arxiv.org/html/2605.26418#A1.p1.1), [§2](https://arxiv.org/html/2605.26418#S2.SS0.SSS0.Px5.p1.1), [§3\.2](https://arxiv.org/html/2605.26418#S3.SS2.p1.1)\.
- F\. Rossi, M\. Nardelli, and V\. Cardellini \(2019\) Horizontal and vertical scaling of container\-based applications using reinforcement learning\. In IEEE International Conference on Cloud Computing \(CLOUD\), pp\. 329–338\. Cited by: [§1](https://arxiv.org/html/2605.26418#S1.p1.1), [§2](https://arxiv.org/html/2605.26418#S2.SS0.SSS0.Px2.p1.1)\.
- J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov \(2017\) Proximal policy optimization algorithms\. arXiv preprint arXiv:1707\.06347\. Cited by: [1st item](https://arxiv.org/html/2605.26418#S3.I1.i1.p1.1)\.
- L\. Toka, G\. Dobreff, B\. Fodor, and B\. Sonkoly \(2021\) Machine learning\-based scaling management for Kubernetes edge clusters\. IEEE Transactions on Network and Service Management 18\(1\), pp\. 958–972\. External Links: [Document](https://dx.doi.org/10.1109/TNSM.2021.3052837)\. Cited by: [§1](https://arxiv.org/html/2605.26418#S1.p1.1), [§2](https://arxiv.org/html/2605.26418#S2.SS0.SSS0.Px2.p1.1)\.
- Z\. Wang, S\. Zhu, J\. Li, W\. Jiang, K\. K\. Ramakrishnan, Y\. Zheng, M\. Yan, X\. Zhang, and A\. X\. Liu \(2022\) DeepScaling: microservices autoscaling for stable CPU utilization in large scale cloud systems\. In ACM Symposium on Cloud Computing \(SoCC\), pp\. 16–30\. External Links: [Document](https://dx.doi.org/10.1145/3542929.3563469)\. Cited by: [§1](https://arxiv.org/html/2605.26418#S1.p1.1), [§2](https://arxiv.org/html/2605.26418#S2.SS0.SSS0.Px2.p1.1)\.
- G\. Zhang, W\. Guo, Z\. Tan, Q\. Guan, and H\. Jiang \(2025a\) KIS\-S: a GPU\-aware Kubernetes inference simulator with RL\-based auto\-scaling\. In IEEE International Performance, Computing, and Communications Conference \(IPCCC\)\. External Links: [Document](https://dx.doi.org/10.1109/IPCCC62082.2025.11304654)\. Cited by: [§2](https://arxiv.org/html/2605.26418#S2.SS0.SSS0.Px2.p1.1), [§3\.1](https://arxiv.org/html/2605.26418#S3.SS1.p1.1)\.
- G\. Zhang, W\. Guo, Z\. Tan, and H\. Jiang \(2025b\) AMP4EC: adaptive model partitioning framework for efficient deep learning inference in edge computing environments\. In IEEE International Conference on Fog and Mobile Edge Computing \(FMEC\)\. External Links: [Document](https://dx.doi.org/10.1109/FMEC65595.2025.11119384)\. Cited by: [§3\.1](https://arxiv.org/html/2605.26418#S3.SS1.p1.1)\.
## Appendix A Hyperparameter Details
Table [4](https://arxiv.org/html/2605.26418#A1.T4) lists the full hyperparameter configuration for each algorithm\. All algorithms share the same network architecture and training budget; algorithm\-specific parameters follow Stable\-Baselines3 \(Raffin et al\., [2021](https://arxiv.org/html/2605.26418#bib.bib21)\) defaults except where noted\.
Table 4: Hyperparameter configuration\. Shared parameters are listed under “Common”; algorithm\-specific parameters follow SB3 defaults\.
## Appendix B Environment Calibration
The simulated environment is calibrated against metrics collected from a real Kubernetes cluster running Online Boutique on Kind with Prometheus monitoring\. Key calibration points:
- CPU response: $\text{CPU\%} = 5.0 + (\text{QPS}/\text{replicas}) \times 0.7$, calibrated so HPA triggers scale\-up at $\sim$100 req/min per replica\.
- Latency model: $p95 = 50 + (\text{QPS}/\text{replicas})^{1.5} \times 0.08$ ms, with exponential degradation under overload\.
- SLO threshold: $p95 < 500$ ms and error rate $< 5\%$\.
- Cost model: \$0\.01 per replica per decision step \(15 seconds\), approximating on\-demand cloud pricing at \$0\.04/vCPU\-hour\.
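The calibration points above can be sketched as a few plain functions. This is a minimal illustration and not the benchmark's actual code: the function and constant names are hypothetical, the load argument is treated as whatever per\-replica request rate the formulas assume, and the exponential overload degradation and the error\-rate model are left out because the text gives no closed form for them.

```python
# Hypothetical names; only the numeric constants come from Appendix B.
STEP_SECONDS = 15              # length of one decision step
COST_PER_REPLICA_STEP = 0.01   # dollars per replica per decision step
P95_SLO_MS = 500.0             # latency SLO threshold in milliseconds


def cpu_percent(load: float, replicas: int) -> float:
    """CPU% = 5.0 + (load / replicas) * 0.7."""
    return 5.0 + (load / replicas) * 0.7


def p95_latency_ms(load: float, replicas: int) -> float:
    """p95 = 50 + (load / replicas)^1.5 * 0.08 ms, without the overload term."""
    return 50.0 + (load / replicas) ** 1.5 * 0.08


def step_cost(replicas: int) -> float:
    """Cost charged for a single decision step."""
    return COST_PER_REPLICA_STEP * replicas


def violates_latency_slo(load: float, replicas: int) -> bool:
    """True when the modeled p95 is not strictly below the SLO threshold."""
    return p95_latency_ms(load, replicas) >= P95_SLO_MS


if __name__ == "__main__":
    # Sweep replica counts for one fixed load value chosen for illustration.
    for replicas in (1, 2, 4, 8):
        print(
            replicas,
            round(cpu_percent(400.0, replicas), 1),
            round(p95_latency_ms(400.0, replicas), 1),
            step_cost(replicas),
            violates_latency_slo(400.0, replicas),
        )
```

Under these formulas, a per\-replica load of 100 corresponds to roughly 75% CPU, which is consistent with the stated scale\-up point if the HPA target sits near that utilization.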
## Appendix C Full Results: Viable Algorithm Comparison
Figure [5](https://arxiv.org/html/2605.26418#A3.F5) shows the complete cost and SLO comparison for viable algorithms \(HPA, PPO, DQN, A2C, SAC\) across all six workloads with 95% confidence intervals\.
Figure 5: Cost and SLO violations for viable algorithms across all workloads\. Error bars show 95% CI over 5 seeds\. HPA achieves the lowest cost on all workloads\. RL algorithms show an advantage only in SLO compliance on bursty and flash workloads\.
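The text does not state how the 95% confidence intervals are computed. One common choice for five seeds is a Student\-t interval on the per\-seed means, sketched below under that assumption. The function name is hypothetical, and the critical value 2\.776 is the two\-sided 95% t quantile for 4 degrees of freedom, so the sketch is deliberately restricted to exactly five values.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for df = 4 (i.e., 5 seeds).
T_CRIT_DF4 = 2.776


def mean_ci95_five_seeds(values):
    """Return (mean, half_width) of a t-based 95% CI over exactly 5 seeds."""
    if len(values) != 5:
        raise ValueError("this sketch hardcodes the t value for 5 seeds")
    mean = statistics.fmean(values)
    # Standard error of the mean using the sample standard deviation.
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, T_CRIT_DF4 * sem


if __name__ == "__main__":
    # Made-up per-seed metric values, for illustration only.
    mean, half = mean_ci95_five_seeds([10.0, 12.0, 11.0, 13.0, 14.0])
    print(f"{mean:.2f} +/- {half:.2f}")
```

With so few seeds the t critical value is much larger than the normal\-approximation 1\.96, so the intervals are noticeably wider than a z\-based interval would suggest.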
## Appendix D Cost\-SLO Pareto Analysis
Figure [6](https://arxiv.org/html/2605.26418#A4.F6) shows the cost\-SLO scatter for all algorithms\. Each point represents one \(algorithm, workload\) pair\. The Pareto front connects methods that are not dominated on both cost and SLO simultaneously\.
Figure 6: Cost\-SLO Pareto front\. Left panel: all algorithms \(symlog scale\)\. Right panel: viable algorithms only\. HPA dominates on cost; SAC achieves the lowest SLO violations on dynamic workloads\.
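The dominance rule behind such a front can be sketched in a few lines. This is an illustrative implementation that assumes lower is better on both axes. The function name and the example points are hypothetical and are not results from the paper.

```python
def pareto_front(points):
    """Return the non-dominated points, sorted by cost.

    points: iterable of (label, cost, slo_violations), lower is better on both.
    A point is dominated if another point is no worse on both axes
    and strictly better on at least one.
    """
    points = list(points)
    front = []
    for label, cost, slo in points:
        dominated = any(
            c <= cost and s <= slo and (c < cost or s < slo)
            for _, c, s in points
        )
        if not dominated:
            front.append((label, cost, slo))
    return sorted(front, key=lambda p: p[1])


if __name__ == "__main__":
    # Placeholder (label, cost, SLO violations) values for illustration only.
    example = [("A", 1.0, 5.0), ("B", 2.0, 2.0), ("C", 3.0, 3.0), ("D", 4.0, 1.0)]
    for point in pareto_front(example):
        print(point)
```

In the example, point C is dropped because B is better on both cost and SLO violations, while A, B, and D each win on at least one axis and therefore stay on the front.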