Performance Variation in Deep Reinforcement Learning
Summary
This paper identifies limitations of conventional uncertainty estimates for deep reinforcement learning and proposes percentile-based statistics and visualization to better assess run-to-run performance variation. Case studies demonstrate the method on PPO, SAC, TD-MPC, DQN, and Rainbow algorithms.
# Performance Variation in Deep Reinforcement Learning
Source: [https://arxiv.org/html/2606.06746](https://arxiv.org/html/2606.06746)
Haruto Tanaka (haruto@ualberta.ca), Department of Computing Science, University of Alberta; Alberta Machine Intelligence Institute (Amii)
A. Rupam Mahmood (armahmood@ualberta.ca), Department of Computing Science, University of Alberta; Alberta Machine Intelligence Institute (Amii); CIFAR AI Chair
###### Abstract
Deep reinforcement learning (RL) algorithms often suffer from low run-to-run robustness, manifesting as significant performance variation across independent runs of identically configured agents. Although this issue poses a spectrum of challenges across research and practice, relatively few studies develop methods to evaluate it; RL research instead often reports uncertainty in the estimated mean performance. In this paper, we outline the limitations of conventional uncertainty and variation estimates, particularly their misalignment with purpose and the risk of underreporting. We then propose an alternative percentile-based statistic and visualization method, min-max IPR and run-wise percentile highlighting, respectively. These percentile-based tools are easy to interpret and rely on standard properties of sample percentiles, providing rich information about run-to-run performance variation. We demonstrate this through three case studies. First, we show that LayerNorm and penultimate-layer normalizations narrow performance variation in PPO, whereas the variation is mostly unchanged in SAC. Second, we compare PPO, SAC, TD-MPC, and TD-MPC2, and show TD-MPC exhibits the least variation while being the most data efficient among the four. Finally, in a comparison of DQN and Rainbow on five Atari environments, we show that both algorithms exhibit similar levels of performance variation. The code and data are available at [https://github.com/WINUprj/eval-perf-variation](https://github.com/WINUprj/eval-perf-variation).
Keywords: Deep Reinforcement Learning, Performance Variation, Evaluation
## 1 Introduction
Although deep reinforcement learning (RL) algorithms are known to learn complex behaviors (e.g., Mnih et al., [2015](https://arxiv.org/html/2606.06746#bib.bib98); Bellemare et al., [2020](https://arxiv.org/html/2606.06746#bib.bib13); Wurman et al., [2022](https://arxiv.org/html/2606.06746#bib.bib12); Haarnoja et al., [2024](https://arxiv.org/html/2606.06746#bib.bib10)), they are also notorious for performing differently with slight changes to their settings. Here, we specifically consider online learning variants of deep RL algorithms and let performance denote the average of online episodic returns across all episodes in a single run. A spectrum of components can trigger performance differences, including the design of deep neural networks, independent runs, hyperparameter configurations, stochasticity in the environment or learning dynamics, hardware specifications (Hausknecht and Stone, [2015](https://arxiv.org/html/2606.06746#bib.bib89); Henderson et al., [2018](https://arxiv.org/html/2606.06746#bib.bib14)), and delays in real-time learning systems (Mahmood et al., [2018](https://arxiv.org/html/2606.06746#bib.bib68)).
(a) Supervised Learning (b) Test Accuracies (c) Test Losses (d) Reinforcement Learning (e) DMC (f) ALE
Figure 1: Variation of standard supervised learning (SL) (a & b), continuous deep RL (c), and discrete deep RL (d) settings. The first two plots present (a) test accuracy and (b) test loss of 100 independent runs of an MLP on MNIST and ResNet-18 on CIFAR-10. The step-size of the Adam optimizer is decayed from $3\times 10^{-4}$ to $3\times 10^{-5}$ at the 100th epoch. The third and fourth plots depict (c) episodic return of PPO on pendulum-swingup and (d) Human Normalized Score of DQN on BattleZone. The number beside each set of curves is min-max IPR-90, our proposed measurement of spread (smaller is better; see [Section 5](https://arxiv.org/html/2606.06746#S5)). All curves are plotted with RPH (see [Section 6](https://arxiv.org/html/2606.06746#S6)). Curves related to SL settings exhibit lower performance variation than those for deep RL settings.
In particular, the performance sensitivity across independent runs profoundly hinders progress in deep RL, both in research and in practical applications. Within the context of research, such brittleness makes it difficult to reproduce results (Islam et al., [2017](https://arxiv.org/html/2606.06746#bib.bib46)), compare algorithms fairly (Clary et al., [2019](https://arxiv.org/html/2606.06746#bib.bib48)), and tune hyperparameters (Eimer et al., [2023](https://arxiv.org/html/2606.06746#bib.bib44); Hertel et al., [2020](https://arxiv.org/html/2606.06746#bib.bib45)). The current best practice to ensure rigor in these aspects is to conduct a sufficiently large number of independent trials (Colas et al., [2018](https://arxiv.org/html/2606.06746#bib.bib47); Eggensperger et al., [2019](https://arxiv.org/html/2606.06746#bib.bib55); Patterson et al., [2024](https://arxiv.org/html/2606.06746#bib.bib6)). However, such a procedure demands substantial computational resources, creating a high barrier to producing scientifically sound evidence from large-scale experiments.
The lottery-like behavior over independent runs also severely undermines the practicality of deep RL-driven systems in real-world tasks. Ineffective behaviors are not only futile but can also pose a safety concern if the task requires stringent safety measures. Such real-world risk is one of the primary reasons why performance variability in online deep RL is important in comparison to, for example, the variability of training results in supervised learning (SL) or offline RL. SL training is often performed offline, effectively minimizing the risks associated with poor training outcomes. Additionally, SL tends to exhibit smaller variations, which reduces the importance of investigating its performance variability across runs, as we show in our results in [Figure 1](https://arxiv.org/html/2606.06746#S1.F1) (see [Appendix E](https://arxiv.org/html/2606.06746#A5) for more examples and experiment details). Nevertheless, performance sensitivity across independent runs on a single task, which we refer to as performance variation, warrants further investigation for the development of deep RL algorithms.
Despite its importance, performance variation is often overlooked in empirical deep RL studies. This is attributed to the focus of many recent works on overall improvements in the aggregated performance of proposed methods relative to baselines. Such a trend encourages many works to merely report uncertainty in aggregated performance across multiple tasks. One of the most popular uncertainty measures is the bootstrapped confidence interval of the interquartile mean (IQM) (Agarwal et al., [2021](https://arxiv.org/html/2606.06746#bib.bib4)). Although similar in appearance, confidence intervals and other uncertainty estimates are not measures of variation ([Section 3](https://arxiv.org/html/2606.06746#S3)). These estimates also appear in rigorous comparisons of algorithm performance over many tasks. As a result, many empirical deep RL studies neglect the performance variation in a single task. Some work reports standard deviation as a measure of performance variation (e.g., Liang et al., [2016](https://arxiv.org/html/2606.06746#bib.bib95); Bjorck et al., [2022](https://arxiv.org/html/2606.06746#bib.bib3)). However, we argue that reporting data variation with standard deviation entails the risk of underreporting and yields an inaccurate summary of the data due to the specific characteristics of the performance distributions of control policies learned by modern deep RL algorithms ([Section 4](https://arxiv.org/html/2606.06746#S4)). Furthermore, statistically rigorous methods have been proposed to measure performance variation, such as performance profiles and tolerance intervals (TIs) (Agarwal et al., [2021](https://arxiv.org/html/2606.06746#bib.bib4); Patterson et al., [2024](https://arxiv.org/html/2606.06746#bib.bib6)). While statistically rigorous methods are often robust and accurate, they are also often costly and difficult to interpret ([Section 5](https://arxiv.org/html/2606.06746#S5)).
Therefore, there is a need for evaluation tools that accurately capture variability of performance on a single task, while remaining easily interpretable\.
In this paper, we propose quantification and visualization methods that achieve an appropriate balance between accuracy and interpretability in capturing performance variation in deep RL. In particular, we claim that min-max IPR-90, a min-max normalized interpercentile range (IPR) from the 5th to the 95th percentile, is a practical and reliable quantification of performance variation. We argue that min-max IPR-90 captures performance variation more robustly than the standard deviation, while being more interpretable than rigorous options such as TI ([Section 5](https://arxiv.org/html/2606.06746#S5)). In parallel, we also propose a learning curve visualization method called run-wise percentile highlighting (RPH). The core idea of this visualization technique is to highlight the individual learning curves corresponding to the 5th, 50th, and 95th percentiles of performance. We show that RPH further clarifies how each individual learning curve behaves and allows investigators to readily examine performance variability across runs ([Section 6](https://arxiv.org/html/2606.06746#S6)). Then, we demonstrate the use cases of min-max IPR-90 and RPH on three case studies ([Section 7](https://arxiv.org/html/2606.06746#S7)). In the first case study, we analyze changes in performance variation after applying LayerNorm, penultimate layer normalization, or both to PPO and SAC (Bjorck et al., [2022](https://arxiv.org/html/2606.06746#bib.bib3)). Using our proposed methods, we show that normalization techniques narrow performance variation in PPO, whereas it remains mostly unchanged in SAC.
In the second case study, we use all DeepMind Control Suite (DMC) tasks to systematically compare four deep RL algorithms: PPO, SAC, TD-MPC, and TD-MPC2 (Schulman et al., [2017](https://arxiv.org/html/2606.06746#bib.bib8); Haarnoja et al., [2018](https://arxiv.org/html/2606.06746#bib.bib7); Hansen et al., [2022](https://arxiv.org/html/2606.06746#bib.bib94), [2024](https://arxiv.org/html/2606.06746#bib.bib97)). Through comparisons using our proposed methods, we find that TD-MPC exhibits the lowest performance variation and the highest data efficiency. The third case study repeats the same process as the second case study for two discrete control algorithms, DQN and Rainbow (Mnih et al., [2015](https://arxiv.org/html/2606.06746#bib.bib98); Hessel et al., [2018](https://arxiv.org/html/2606.06746#bib.bib99)), on Atari-5 tasks (Aitchison et al., [2023](https://arxiv.org/html/2606.06746#bib.bib101)). With our methods, we find that both algorithms exhibit significant performance variation in some tasks.
## 2 Experiment Settings
To illustrate and examine the performance variation issue, we mainly use the PPO and SAC algorithms on two robotic control suites (Schulman et al., [2017](https://arxiv.org/html/2606.06746#bib.bib8); Haarnoja et al., [2018](https://arxiv.org/html/2606.06746#bib.bib7)). For both algorithms, our implementation is based on CleanRL (Huang et al., [2022](https://arxiv.org/html/2606.06746#bib.bib23)). We use 59 task environments selected from Gymnasium's MuJoCo environments and the DeepMind Control Suite (DMC) as the testbed (Todorov et al., [2012](https://arxiv.org/html/2606.06746#bib.bib26); Tassa et al., [2018](https://arxiv.org/html/2606.06746#bib.bib27); Towers et al., [2025](https://arxiv.org/html/2606.06746#bib.bib91)). Specifically, we choose 11 tasks from MuJoCo and 48 tasks from DMC. The time limit per episode for all tasks is 1000 steps. In addition to the time limit, each MuJoCo environment has its own termination condition. The details of the tasks used in this paper are summarized in LABEL:table:rl\_tasks. Note that, to better align with real-world RL settings, we do not use parallelized environments, as is sometimes done (e.g., Stooke and Abbeel [2019](https://arxiv.org/html/2606.06746#bib.bib90); Li et al. [2023](https://arxiv.org/html/2606.06746#bib.bib92); Lee et al. [2025](https://arxiv.org/html/2606.06746#bib.bib29)). For each task, we perform 100 independent runs with different random seeds. Each run lasts 10 million environment steps for PPO and 1 million environment steps for SAC. Other hyperparameters of the algorithms are given in [Tables 6](https://arxiv.org/html/2606.06746#A4.T6) and [7](https://arxiv.org/html/2606.06746#A4.T7) in [Appendix D](https://arxiv.org/html/2606.06746#A4). All learning curves are binned for visual clarity (see [Appendix A](https://arxiv.org/html/2606.06746#A1) for the details). Visualizations of all binned learning curves for PPO and SAC are provided in [Appendix F](https://arxiv.org/html/2606.06746#A6).
(a) reacher-hard
(b) walker-stand
(c) pendulum-swingup
Figure 2: Visualization of performance distributions and variation for some PPO experiments. The three performance distributions are roughly Gaussian, unimodal-skewed, and bimodal, respectively. Vertical red, pink, and orange lines represent the ranges covered by the average performance ± standard error, the stratified bootstrapped 95% CI of the IQM, and the average performance ± standard deviation, respectively. The green boxplots depict the IQR, and the whiskers represent the range between the 5th and 95th percentiles. The boxplots robustly cover most of the data range, unlike the other options.
## 3 Uncertainty Estimates Do Not Capture Performance Variation
A large volume of deep RL studies focus on the aggregated performance of their proposed algorithm relative to a baseline. This has naturally led many RL studies to report uncertainty estimates of the aggregated performance. For instance, confidence intervals or standard errors are often used (Henderson et al., [2018](https://arxiv.org/html/2606.06746#bib.bib14); Agarwal et al., [2021](https://arxiv.org/html/2606.06746#bib.bib4); Tang and Berseth, [2024](https://arxiv.org/html/2606.06746#bib.bib96)). Despite their popularity, these uncertainty estimates are not suitable for capturing performance variation.
Intuitively, uncertainty estimates reflect the probable discrepancy between the sample and the true statistics. For example, the 95% stratified bootstrap confidence interval for IQM by Agarwal et al. ([2021](https://arxiv.org/html/2606.06746#bib.bib4)) is an interval estimate of the range within which the population IQM lies with a certain confidence. Although these values provide additional information about the performance sample statistics, they do not reflect the spread of the performance distribution itself. Thus, by design, the uncertainty estimates do not reflect the variation.
Furthermore, uncertainty estimates vanish as the number of independent runs increases. For instance, the standard error converges to zero at a rate of $\mathcal{O}(1/\sqrt{n})$, where $n$ is the number of runs. This is an undesirable property for a measure of variation because the degree of spread in data is independent of the number of data points. Also, due to this property, uncertainty estimates are often smaller than variation measurements, as shown in [Figure 2](https://arxiv.org/html/2606.06746#S2.F2). The solid red, pink, and orange vertical lines represent the ranges covered by the standard error, the 95% stratified bootstrap confidence interval of the IQM, and the standard deviation, respectively. Visually, the standard error and confidence interval merely cover a small subinterval of the range covered by the standard deviation. In contrast, the standard deviation covers a relatively wide range of the performance distributions. Hence, reporting uncertainty as performance variation is a misuse and risks obscuring high performance variation.
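The contrast between an uncertainty estimate and a spread measure can be checked numerically. The following is a minimal sketch, not the authors' code: it draws hypothetical Gaussian "performances" (mean 500, standard deviation 100, both made up for illustration) and compares the standard deviation, the standard error, and a plain 5th-to-95th percentile range as the number of runs grows. The helper `percentile` is an assumed linear-interpolation implementation.

```python
import math
import random
import statistics


def percentile(values, p):
    """Linearly interpolated p-th percentile (0 <= p <= 100) of a sample."""
    xs = sorted(values)
    pos = (len(xs) - 1) * p / 100.0
    lo = math.floor(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def summarize(sample):
    """Return (standard deviation, standard error, 5th-95th percentile range)."""
    sd = statistics.stdev(sample)
    se = sd / math.sqrt(len(sample))
    ipr90 = percentile(sample, 95) - percentile(sample, 5)
    return sd, se, ipr90


rng = random.Random(0)  # fixed seed so the sketch is repeatable
results = {}
for n in (10, 100, 1000, 10000):
    # Hypothetical per-run performances, not data from the paper.
    sample = [rng.gauss(500.0, 100.0) for _ in range(n)]
    results[n] = summarize(sample)
    sd, se, ipr90 = results[n]
    print(f"n={n:>5}  sd={sd:7.2f}  se={se:6.2f}  ipr90={ipr90:7.2f}")
```

Under these assumptions, the standard error keeps shrinking as runs are added, while the standard deviation and the percentile range settle around values determined by the distribution itself.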
## 4 Limitations of Standard Deviation as a Measurement of Performance Variation
The standard deviation is a popular measure of the variability of performance in a single task (e.g., Liang et al. [2016](https://arxiv.org/html/2606.06746#bib.bib95); Bjorck et al. [2022](https://arxiv.org/html/2606.06746#bib.bib3)). However, deep RL returns often have distributional features that can make the standard deviation a misleading summary of performance variation. In this section, we analyze its sensitivity to the shape of the return distribution and show how it can understate risk in common deep RL evaluation settings.
(a) PPO
(b) SAC
Figure 3: Histograms and kernel density estimates (KDEs) of the performance distributions for PPO and SAC experiments with a default configuration. Each subplot (the small boxes) corresponds to a control task. Results are obtained from 100 independent runs, and the number of histogram bins is 10. Each KDE is a Gaussian KDE with Scott's rule for bandwidth selection (Scott, [1992](https://arxiv.org/html/2606.06746#bib.bib2)). Different experiments exhibit different performance distributions, many of which are not Gaussian.
### Performance Distribution is Often Not Gaussian
The standard deviation provides comprehensive information about the spread of samples when the distribution is close to Gaussian (Altman and Bland, [2005](https://arxiv.org/html/2606.06746#bib.bib106); Campbell, [2021](https://arxiv.org/html/2606.06746#bib.bib107); Green et al., [2026](https://arxiv.org/html/2606.06746#bib.bib108)). However, in non-Gaussian distributions, it may provide misleading information. Thus, it is important to verify that the data roughly conform to the Gaussian assumption when applying the standard deviation.
Our empirical results show that the Gaussian assumption is rarely satisfied by the PPO and SAC algorithms. [Figure 3](https://arxiv.org/html/2606.06746#S4.F3) depicts the histograms and kernel density estimates (KDEs) of the performance from the PPO and SAC experiments described in [Section 2](https://arxiv.org/html/2606.06746#S2). Visually, performance distributions are either (1) roughly Gaussian, (2) unimodal but skewed, or (3) bimodal. This visual taxonomy already suggests the existence of non-Gaussian performance distributions. To solidify this finding, we conduct a Shapiro-Wilk test, a hypothesis test of normality, on each sample distribution (Shapiro and Wilk, [1965](https://arxiv.org/html/2606.06746#bib.bib1)). The null hypothesis of the Shapiro-Wilk test states that the underlying distribution of the given data points is normal. With a significance level of $\alpha=0.05$, we find that for only about 32% of the results shown in [Figure 3(a)](https://arxiv.org/html/2606.06746#S4.F3.sf1) and [Figure 3(b)](https://arxiv.org/html/2606.06746#S4.F3.sf2) the test yields no significant evidence against normality. In addition to these insights, Mathieu et al. ([2024](https://arxiv.org/html/2606.06746#bib.bib103)) also note the non-Gaussianity of the performance distribution. Overall, the performance distribution often does not conform to the Gaussian assumption, making the standard deviation an insufficient measure of variation.
### Risk of Underreporting
Standard deviation is also prone to underreporting, making it insufficient for measuring variation. This is illustrated in [Figure 2](https://arxiv.org/html/2606.06746#S2.F2), which shows three performance distributions of PPO. The orange vertical lines show the range of the sample mean ± sample standard deviation. The underreporting by standard deviation is prominent in [Figure 2(c)](https://arxiv.org/html/2606.06746#S2.F2.sf3). There, the standard deviation does not cover the mode lying around a performance of 0. This is highly problematic because it implies that the standard deviation does not capture the 30% of PPO runs that fail in the pendulum-swingup task. Underreporting can also be seen in [Figure 2(b)](https://arxiv.org/html/2606.06746#S2.F2.sf2), where the range covered by the standard deviation is shifted towards the mode of the distribution, resulting in insufficient coverage of the heavy tail. Therefore, due to the high risk of underreporting, the standard deviation is not suitable for capturing performance variation.
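A small numeric illustration makes the underreporting concrete. This is a sketch on made-up data, not the paper's results: it builds a bimodal sample with 30 failed runs at 0 and 70 successful runs at 800 (both values are hypothetical) and checks whether the mean ± standard deviation band and the 5th-to-95th percentile range contain the failure mode. The `percentile` helper is an assumed linear-interpolation implementation.

```python
import math
import statistics


def percentile(values, p):
    """Linearly interpolated p-th percentile (0 <= p <= 100) of a sample."""
    xs = sorted(values)
    pos = (len(xs) - 1) * p / 100.0
    lo = math.floor(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


# Hypothetical bimodal sample: 30 failed runs and 70 successful runs.
performances = [0.0] * 30 + [800.0] * 70

mean = statistics.mean(performances)
sd = statistics.stdev(performances)
band = (mean - sd, mean + sd)
p5, p95 = percentile(performances, 5), percentile(performances, 95)

# Does each range contain the failure mode at 0?
failed_covered_by_sd = band[0] <= 0.0 <= band[1]
failed_covered_by_ipr = p5 <= 0.0 <= p95

print(f"mean +/- sd band: [{band[0]:.1f}, {band[1]:.1f}]")
print(f"5th-95th percentile range: [{p5:.1f}, {p95:.1f}]")
print(failed_covered_by_sd, failed_covered_by_ipr)
```

In this toy sample, the lower edge of the mean ± standard deviation band sits well above 0, so nearly a third of the runs fall outside it, whereas the percentile range spans both modes.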
## 5 Inter-percentile Range as a Measurement of Performance Variation
Nonparametric statistics are preferred when it is infeasible to make distributional assumptions\. Here, we consider using percentiles as a measure of performance variation in a single task\. We first discuss why percentile\-based statistics could be a suitable choice\. Then, we discuss how the existing percentile\-based method, the tolerance interval \(TI\), can be further simplified\. Lastly, we formalize a performance variation metric: min\-max normalized inter\-percentile range \(IPR\)\.
### Suitability of Percentile
An alternative measure of variation to standard deviation is range estimation using percentiles. The advantage of percentiles is that they do not require assumptions about the data distribution. Hence, they often capture the characteristics of different distributions more robustly. For instance, the box plots in [Figure 2](https://arxiv.org/html/2606.06746#S2.F2) describe the essential features of the performance distributions, including symmetry, skewness, and variability. The robust ability to capture data variability makes percentile-based variation measurement well-suited for our study.
Furthermore, the sample percentile is known to be a consistent estimator of the true population percentile. One reason for the frequent adoption of the sample mean and standard deviation is that they are asymptotically guaranteed to match their population values. Because sample percentiles share this theoretical property, it further supports IPR as a measure of performance spread (see [Appendix B](https://arxiv.org/html/2606.06746#A2) for mathematical details).
### Tolerance Interval and Performance Profile
Although IPR appears to be a favorable option, choosing a suitable percentile range is challenging. This is because the choice of the range essentially decides the portion of outliers that do not contribute to representing the performance variation. As a solution to this problem, Patterson et al. ([2024](https://arxiv.org/html/2606.06746#bib.bib6)) suggest using a TI. Formally, an $(\alpha,\beta)$-TI is an interval that captures the central $\beta$ proportion of the population with a nominal error of $\alpha$. In other words, a TI takes an IPR that covers the central $\beta\times 100\%$ of the distribution and expands that range based on the number of data points. If the number of data points is small, the TI broadens further. In contrast, if the number of data points is sufficiently large, the TI coincides with IPR-$(\beta\times 100)$. In this way, TI aims to estimate the population IPR-$(\beta\times 100)$. While TI rigorously reasons about the population IPR-$(\beta\times 100)$, this rigor adds a layer of complexity that may not be necessary when the only goal is to capture performance variation. In the worst case, such complexity can reduce the interpretability of the results. Hence, a simple IPR that covers a sufficiently large portion of the data is well-suited for analyzing performance variability.
The issue of choosing the IPR range can also be avoided by plotting the entire empirical distribution function, namely a performance profile, as proposed by Agarwal et al. ([2021](https://arxiv.org/html/2606.06746#bib.bib4)). The performance profile provides a comprehensive visual of the performance variation at different IPRs. Although it provides a comprehensive view, the method is specifically designed for cross-algorithm comparisons. In other words, it primarily captures how the performance of different deep RL algorithms is distributed when deployed over multiple tasks. This fails to capture our desired information on performance variation in a single task. It is possible to restrict the analysis to a single task by plotting a performance profile on a per-task basis. However, since such a process generates a cumulative distribution curve per algorithm-task pair, it scales poorly with the number of algorithms and tasks. For example, in our PPO and SAC experiments, we run both algorithms on 59 tasks. In our case, a per-task performance profile analysis requires inspecting 118 plots (2 algorithms × 59 tasks), which is laborious. In contrast, IPR provides a representative summary of the performance profile for each algorithm-task pair with a single quantity, which is easy to interpret. Overall, in a single-task analysis, the IPR is a more suitable option than the performance profile.
### Formalization of Performance Variation as Min\-max Normalized IPR
As a simple variation metric, we propose a min-max normalized IPR of the given data. We normalize IPR using min-max normalization to allow the direct comparison of variation across different algorithm-task pairs. For all DMC tasks, the min-max values are 0 to 1000 by their reward design and termination condition. In contrast, deriving the range of episodic returns analytically for MuJoCo tasks is difficult. Instead, we set the min-max values to the minimum and maximum episodic returns observed across all experiments in this paper. See [Appendix K](https://arxiv.org/html/2606.06746#A11) for a complete list of min-max values. Min-max normalized IPR can be mathematically formalized as
$$\text{Min-max Normalized IPR}(\mathcal{U}_T, X) := \frac{\mathcal{U}_T^{\left(50+\frac{X}{2}\right)} - \mathcal{U}_T^{\left(50-\frac{X}{2}\right)}}{M_T - m_T}\ (\%),$$
where $\mathcal{U}_T$ is the set of performances on task $T$, $\mathcal{U}_T^{(x)}$ is the $x$-th percentile value in $\mathcal{U}_T$, $X\in[0,100]$ is the width of the central region of the data to cover, and $M_T$ ($m_T$) is the theoretical/empirical maximum (minimum) performance of task $T$.
We choose to cover the central 90% portion, since it allows us to use IPR for the risk assessment of the algorithm. Although our primary goal is to estimate performance variation, one underlying motivation for variation estimation is to assess the risk of using a deep RL algorithm, as outlined in [Section 1](https://arxiv.org/html/2606.06746#S1). To achieve this purpose, the min-max normalized IPR must cover a sufficiently wide range of the performance distribution. On the other hand, it is undesirable for IPR to rely on extremes, as a single run can drastically distort the resulting quantity. Such high sensitivity undermines the reliability of the resulting statistic as a measure of spread. Covering the central 90% is a trade-off between these two competing considerations. We refer to this specific variant of min-max normalized IPR as min-max IPR-90. Note that we omit the term "normalized" for conciseness. As a heuristic, we choose 5% as a reasonable value of min-max IPR-90, which is the maximum min-max IPR-90 value in our SL experiments (see [Appendix E](https://arxiv.org/html/2606.06746#A5)).
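The definition of min-max normalized IPR translates into a few lines of code. The sketch below is an illustration under stated assumptions rather than the authors' implementation: `percentile` uses linear interpolation between order statistics (the paper does not specify an interpolation rule in this section), the result is multiplied by 100 to express it in percent, and the sample values are hypothetical.

```python
import math


def percentile(values, p):
    """Linearly interpolated p-th percentile (0 <= p <= 100) of a sample."""
    xs = sorted(values)
    pos = (len(xs) - 1) * p / 100.0
    lo = math.floor(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def min_max_ipr(performances, x, m_t, M_t):
    """Min-max normalized IPR (in percent) covering the central x% of the data.

    m_t and M_t are the minimum and maximum attainable performance on the task.
    """
    if not 0 <= x <= 100:
        raise ValueError("x must lie in [0, 100]")
    if M_t <= m_t:
        raise ValueError("M_t must be greater than m_t")
    upper = percentile(performances, 50 + x / 2)
    lower = percentile(performances, 50 - x / 2)
    return 100.0 * (upper - lower) / (M_t - m_t)


# Hypothetical performances spread evenly over the DMC range [0, 1000].
spread_runs = [float(v) for v in range(0, 1010, 10)]
# Hypothetical performances that barely differ between runs.
tight_runs = [900.0 + 0.1 * i for i in range(101)]

ipr_spread = min_max_ipr(spread_runs, 90, 0.0, 1000.0)
ipr_tight = min_max_ipr(tight_runs, 90, 0.0, 1000.0)
print(f"spread runs: {ipr_spread:.1f}%  tight runs: {ipr_tight:.1f}%")
```

With `x = 90`, this computes min-max IPR-90; the tight sample lands below the 5% heuristic, while the evenly spread sample is far above it.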
Despite its simplicity, min-max IPR-90 provides clear insights into the performance variation. For example, the bar plots in [Figure 4](https://arxiv.org/html/2606.06746#S5.F4) show the min-max IPR-90 of PPO and SAC for each task, from which we can readily see that SAC exhibits lower performance variation than PPO.
(a) PPO IPR-90
(b) SAC IPR-90
Figure 4: Task-wise min-max IPR-90 and median for the PPO and SAC experiments. In each plot, the bars are sorted by IPR-90, and the units are %. The color of each bar represents a non-overlapping range of percentages with cut-off points of 2%, 5%, 10%, 25%, and 50%. Warmer colors represent higher variation, and vice versa. Both PPO and SAC exhibit high performance variation on some tasks. Moreover, SAC generally shows lower performance variation than PPO over the 59 robotic control tasks.
## 6 Reporting Learning Curve Variability Time-wise vs. Run-wise
Visualization of learning curves is often achieved by plotting a representative statistic aggregated across runs at each timestep, with a shaded region indicating the associated measure of variability or uncertainty\. Here, we emphasize that this time\-wise learning curve aggregation is not necessarily suitable for visualizing performance variation\. We then propose a visualization method that takes a run\-wise perspective and demonstrate its suitability for presenting performance variation\.
### Limitations of Time\-wise Approach
A popular way to visualize learning curves is to plot the sample mean aggregated across runs at each timestep, along with the associated standard deviation/error as a shaded region. Despite its popularity, this time-wise format of mean ± standard error/deviation potentially leads to inaccurate conclusions, due to the risk of underreporting the variability by standard error/deviation ([Section 4](https://arxiv.org/html/2606.06746#S4)) and the misrepresentation of actual learning curves. A prominent case where both of these issues arise is presented in the bottom plots of the first two columns in [Figure 5](https://arxiv.org/html/2606.06746#S6.F5). Here, the shaded region underestimates the variation, and both the mean curves and the shaded regions represent trends not followed by any individual learning curve, making these plots an inaccurate summary of the learning curves as a whole. Even in a case with a skewed, unimodal performance distribution, the mean ± standard deviation band leaves many individual learning curves outside the shaded region, illustrating that point-wise aggregation is not a reliable run-wise visualization of performance variation. Due to the tendency to underreport and the risk of misrepresenting learning curves, the use of mean ± standard deviation/error with time-wise aggregation is not suitable for visualizing performance variation.
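The misrepresentation caused by time-wise aggregation can be reproduced with a toy example. The sketch below uses invented learning curves (30 runs that stay at 0 and 70 runs that climb to 1000 over five time bins) and computes the time-wise mean and mean ± standard deviation band; none of these numbers come from the paper.

```python
import statistics

# Hypothetical binned learning curves over five time bins.
failed = [0.0, 0.0, 0.0, 0.0, 0.0]
solved = [0.0, 250.0, 500.0, 750.0, 1000.0]
curves = [failed] * 30 + [solved] * 70

# Time-wise aggregation: statistics are computed independently at each bin.
mean_curve = [statistics.mean(step) for step in zip(*curves)]
sd_curve = [statistics.stdev(step) for step in zip(*curves)]
band_low = [m - s for m, s in zip(mean_curve, sd_curve)]
band_high = [m + s for m, s in zip(mean_curve, sd_curve)]

# Distance between the final mean value and the closest individual run.
closest_gap = min(abs(curve[-1] - mean_curve[-1]) for curve in curves)

print("mean curve:", mean_curve)
print(f"final band: [{band_low[-1]:.1f}, {band_high[-1]:.1f}]")
print("gap between final mean and nearest run:", closest_gap)
```

In this toy setting the mean curve ends at a value that no run attains, the lower edge of the band stays above the failed runs, and the upper edge exceeds the maximum return of 1000, so the summary describes a trajectory and a range that do not match any actual run.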
Figure 5: Examples of different visualization methods applied to PPO's learning curves for the reacher-hard, walker-stand, and pendulum-swingup tasks. From left to right, the columns correspond to mean ± standard error, mean ± standard deviation, functional boxplot, and RPH. All learning curves are preprocessed with binning (details in [Appendix A](https://arxiv.org/html/2606.06746#A1)). Unlike the other three, RPH provides a simple run-wise perspective, which aligns with the notion of performance variation.
(a) PPO Learning Curves
(b) SAC Learning Curves
Figure 6: Learning curves for PPO and SAC across all 59 continuous control tasks, with variation visualized using RPH.
### Run\-wise Percentile Highlighting
Inheriting the idea of using percentiles from [Section 5](https://arxiv.org/html/2606.06746#S5), the most straightforward method is to highlight the learning curves of the runs at the 5th, 50th, and 95th percentiles of performance. We draw those curves with high opacity and the remaining curves with transparency. The 5th and 95th percentile curves are depicted with a line style different from the others to emphasize the variation (we choose a dotted line in this paper). Because we highlight learning curves at particular percentiles, we refer to this visualization protocol as run-wise percentile highlighting (RPH).
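The run selection behind RPH can be sketched as follows. This is an assumed reading of the protocol, not the authors' code: each run's performance is taken as the mean of its binned learning curve, and the run for each percentile is picked by nearest rank among the sorted performances (the source does not state how ties or interpolation are handled). The function names `select_rph_runs` and `plot_rph` are hypothetical, and `plot_rph` expects a Matplotlib `Axes` object supplied by the caller.

```python
def select_rph_runs(curves, percentiles=(5, 50, 95)):
    """Map each percentile to the index of the run with that nearest rank.

    Performance of a run is approximated by the mean of its learning curve.
    """
    performance = [sum(curve) / len(curve) for curve in curves]
    order = sorted(range(len(curves)), key=performance.__getitem__)
    selected = {}
    for p in percentiles:
        rank = round(p * (len(order) - 1) / 100.0)  # nearest-rank assumption
        selected[p] = order[rank]
    return selected


def plot_rph(curves, selected, ax):
    """Draw every run faintly, then highlight the selected runs on `ax`."""
    for curve in curves:
        ax.plot(range(len(curve)), curve, color="tab:blue", alpha=0.1, linewidth=0.8)
    for p, idx in selected.items():
        style = "-" if p == 50 else ":"  # dotted lines for the 5th and 95th
        ax.plot(range(len(curves[idx])), curves[idx], color="tab:blue",
                linestyle=style, linewidth=2.0, label=f"{p}th percentile run")
    ax.legend()


# Hypothetical data: 101 flat learning curves whose level decreases with index.
curves = [[float(100 - i)] * 4 for i in range(101)]
selected = select_rph_runs(curves)
print(selected)

# To render (requires Matplotlib):
#   import matplotlib.pyplot as plt
#   fig, ax = plt.subplots()
#   plot_rph(curves, selected, ax)
#   plt.show()
```

Because whole runs are selected rather than per-timestep statistics, every highlighted curve is a trajectory that an agent actually followed.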
The main competitor to RPH is the functional boxplot \(FB\), a percentile\-based visualization method considered rigorous \(Sun and Genton, [2011](https://arxiv.org/html/2606.06746#bib.bib63)\)\. FB constructs the error band by ranking all curves by their closeness to the center, which makes it robust to the distribution of learning curves\. Indeed, the third column of [Figure 5](https://arxiv.org/html/2606.06746#S6.F5) shows that FB covers a wide area and, by highlighting an actual curve that resides in the center of the distribution, misrepresents the curves less than mean ± standard error/deviation\. Note that the “center” in FB is not necessarily equivalent to the median, but is instead derived from the ranking of closeness \(see [Appendix C](https://arxiv.org/html/2606.06746#A3) for the technical details\)\. Although FB is compelling, RPH has advantages over FB for visualizing performance variation\.
The first advantage of RPH over FB is its computational efficiency and simplicity\. Even without the costly and complex operations of FB, RPH is more representative of the actual learning curve distribution, as shown in the examples in the right\-most column of [Figure 5](https://arxiv.org/html/2606.06746#S6.F5)\. Second, RPH retains the shape of individual learning curves, effectively avoiding the shortcomings of time\-wise aggregation\. In contrast, FB uses time\-wise aggregation to compute its band, which risks misrepresenting the learning curves\. Finally, RPH only requires the definition of a percentile as prior knowledge to interpret the results, while FB requires knowledge of band depth and other technical details described in [Appendix C](https://arxiv.org/html/2606.06746#A3)\. Overall, RPH aligns with our purpose of capturing performance variation while achieving high interpretability without advanced statistical knowledge\. All curves from the PPO and SAC runs, annotated by RPH, are shown in [Figure 6](https://arxiv.org/html/2606.06746#S6.F6) for reference\.
## 7 Case Studies
Figure 7: Plotting format for variation change and overhead\. The plot shows the $\log_{10}$\-scaled values of overhead \($\kappa$\) and change in variation \($\rho$\) on its $x$ and $y$ axes, respectively\. The modification is ideal if most or all points fall within the blue\-shaded region, as this indicates reduced performance variation and higher overall performance\.
So far, we have shown that both min\-max IPR\-90 and RPH are relatively suitable tools for analyzing performance variation\. The natural next question is their effectiveness in practice\. To address this, we present three case studies\. First, we examine whether LayerNorm and penultimate layer normalization \(PNorm\) reduce performance variation in PPO and SAC \(Ba et al\., [2016](https://arxiv.org/html/2606.06746#bib.bib53); Bjorck et al\., [2022](https://arxiv.org/html/2606.06746#bib.bib3)\)\. Second, we compare the performance distributions of popular continuous control algorithms, including PPO, SAC, TD\-MPC, and TD\-MPC2\. Lastly, we repeat a similar comparative analysis for discrete control algorithms, namely DQN and Rainbow\.
### Case Study 1: Effect of Normalization on Performance Variation
As a first case, we examine the change in performance variation after applying LayerNorm and PNorm to the default PPO/SAC settings described in [Section 2](https://arxiv.org/html/2606.06746#S2)\. In this experiment, LayerNorm is applied to the pre\-activations of each layer in the actor and critic networks\. PNorm applies $L^{2}$ normalization to the penultimate layer outputs of both networks\. Otherwise, the configurations remain the same\. We apply both methods individually and jointly\. Variants with a single modification are referred to as LayerNorm PPO/SAC and PNorm PPO/SAC, and the variant with both modifications as Normalized PPO/SAC\.
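The two normalization operations can be sketched on a single feature vector as follows\. This is a simplified illustration rather than the configuration used in the experiments: the learnable gain and bias of LayerNorm are omitted, the `eps` values are assumptions, and the function names `layer_norm` and `pnorm` are hypothetical\.

```python
import math


def layer_norm(pre_activations, eps=1e-5):
    """Normalize a vector to zero mean and unit variance across its features.

    Learnable gain and bias are omitted for brevity (an assumption of this sketch).
    """
    n = len(pre_activations)
    mean = sum(pre_activations) / n
    var = sum((x - mean) ** 2 for x in pre_activations) / n
    return [(x - mean) / math.sqrt(var + eps) for x in pre_activations]


def pnorm(features, eps=1e-8):
    """Rescale penultimate-layer outputs to (approximately) unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in features))
    return [x / (norm + eps) for x in features]


hidden = layer_norm([1.0, 2.0, 3.0, 4.0])   # applied before the activation
unit = pnorm([3.0, 4.0])                    # applied to penultimate outputs
print(hidden)
print(unit)
```

In a network, `layer_norm` would be applied to each layer's pre\-activations and `pnorm` only to the output of the penultimate layer, so the final linear layer always receives a vector of bounded norm\.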
\(a\) PPO vs\. LayerNorm PPO
\(b\) PPO vs\. PNorm PPO
\(c\) PPO vs\. Normalized PPO
\(d\) SAC vs\. LayerNorm SAC
\(e\) SAC vs\. PNorm SAC
\(f\) SAC vs\. Normalized SAC
Figure 8: Change in performance variation and median after applying modifications to PPO and SAC\. Zoomed\-in plots compare RPH learning curves for the default and modified algorithms on finger\-spin\. The positive effect of normalization techniques, especially their combination, is evident for PPO\. In contrast, modifications to SAC generally leave performance variation unchanged or increase it\.
For each task, we report the ratio of performance variation between the modified and baseline variants of an algorithm as the change in performance variation, denoted by $\rho$\. Mathematically, $\rho$ is defined as
$$\rho(\mathcal{U}_{\textnormal{baseline}}, \mathcal{U}_{\textnormal{modified}}) := \frac{\textnormal{Min-max IPR}(\mathcal{U}_{\textnormal{modified}}, 90)}{\textnormal{Min-max IPR}(\mathcal{U}_{\textnormal{baseline}}, 90) + \epsilon}, \tag{1}$$
where $\mathcal{U}_{\textnormal{baseline}}$ is the set of performances obtained by running the algorithm with its default configuration, $\mathcal{U}_{\textnormal{modified}}$ is the set of performances obtained by running the same algorithm with the normalization techniques applied, and $\epsilon > 0$ is a small constant to avoid division by zero\. While [Equation 1](https://arxiv.org/html/2606.06746#S7.E1) captures the change in performance variation, it cannot identify the cause of the change\. For instance, [Equation 1](https://arxiv.org/html/2606.06746#S7.E1) cannot tell whether a reduction in variation stems from all learning curves collapsing to the minimum episodic return or from an improvement of the lower\-performing learning curves\. To disambiguate the cause of the change in performance variation, we also compute the overhead, the ratio between the median performances\. The overhead, denoted by $\kappa$, is formulated as
$$\kappa(\widehat{\mathcal{U}}_{\textnormal{baseline}}, \widehat{\mathcal{U}}_{\textnormal{modified}}) := \frac{\widehat{\mathcal{U}}_{\textnormal{baseline}}^{(50)}}{\widehat{\mathcal{U}}_{\textnormal{modified}}^{(50)} + \epsilon}. \tag{2}$$
Note that we use $\widehat{\mathcal{U}}_{\textnormal{baseline}}$ and $\widehat{\mathcal{U}}_{\textnormal{modified}}$, which are $\mathcal{U}_{\textnormal{baseline}}$ and $\mathcal{U}_{\textnormal{modified}}$ shifted by $-\min(\{\min(\mathcal{U}_{\textnormal{baseline}}), \min(\mathcal{U}_{\textnormal{modified}}), 0\})$\. This shift ensures that $\kappa \geq 0$\. For a compact visualization, we report both $\rho$ and $\kappa$ as a scatter plot in $\log_{10}$ scale, and treat points in the third quadrant, where both $\log_{10}\kappa$ and $\log_{10}\rho$ are negative, as instances where the modification reduces performance variation while improving the median performance \(see [Figure 7](https://arxiv.org/html/2606.06746#S7.F7)\)\.
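Equations 1 and 2 can be computed as in the sketch below\. It is an illustration under stated assumptions: the min\-max IPR\-90 values are taken as precomputed inputs \(their definition is given in Section 5\), the 50th percentile is approximated by `statistics.median`, the value of `EPS` is a guess, and all function names are hypothetical\.

```python
import math
import statistics

EPS = 1e-8  # small constant to avoid division by zero (value is an assumption)


def rho(ipr_baseline, ipr_modified, eps=EPS):
    """Change in variation (Equation 1) from precomputed min-max IPR-90 values."""
    return ipr_modified / (ipr_baseline + eps)


def kappa(baseline, modified, eps=EPS):
    """Overhead (Equation 2): ratio of medians after a shift to non-negative values."""
    shift = -min(min(baseline), min(modified), 0.0)
    base_median = statistics.median(x + shift for x in baseline)
    mod_median = statistics.median(x + shift for x in modified)
    return base_median / (mod_median + eps)


def log_point(rho_value, kappa_value):
    """(x, y) scatter-plot coordinates; None if either value is not positive."""
    if rho_value <= 0 or kappa_value <= 0:
        return None
    return math.log10(kappa_value), math.log10(rho_value)


def in_third_quadrant(point):
    """True when both coordinates are negative: less variation, higher median."""
    return point is not None and point[0] < 0 and point[1] < 0


baseline_perf = [-2.0, 0.0, 4.0]
modified_perf = [1.0, 3.0, 5.0]
k = kappa(baseline_perf, modified_perf)        # medians 2 and 5 after a shift of 2
r = rho(ipr_baseline=0.2, ipr_modified=0.1)    # variation halves
print(k, r, in_third_quadrant(log_point(r, k)))
```

The guard in `log_point` reflects that a ratio of exactly zero has no finite logarithm; how such points are drawn in the paper's plots is not specified here\.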
[Figure 8](https://arxiv.org/html/2606.06746#S7.F8) compares the default and normalized variants of PPO and SAC\. Graphically, for PPO, normalization techniques either reduce performance variation with little overhead or increase the overall performance\. On the other hand, performance variation generally stays the same or increases after applying normalization techniques to SAC\. These results imply that, at least when the hyperparameters are inherited from the default configuration, normalization techniques benefit PPO considerably, while they do not benefit SAC\. Notice that these results are easily captured by applying min\-max IPR\-90\.
Improvements in performance variation are also visible when overlaying the RPH curves\. Each subfigure in [Figure 8](https://arxiv.org/html/2606.06746#S7.F8) includes a zoomed\-in view of the RPH\-visualized learning curves for finger\-spin\. Blue curves represent the learning curves from default runs, and alternative colors represent those from modified runs\. We readily observe how PNorm, especially when combined with LayerNorm, narrows the gap between PPO learning curves\. In contrast, the gap between learning curves remains unchanged in the case of SAC\. The learning curve comparisons for all algorithm\-task pairs are shown in [Appendix H](https://arxiv.org/html/2606.06746#A8), and bar plots for performance variation and median performance are in [Appendix G](https://arxiv.org/html/2606.06746#A7)\.
### Case Study 2: Cross\-algorithm Comparative Study
Another use case of our proposed methods is the comparative study of different algorithms\. Here, we consider two deep RL algorithms, TD\-MPC and TD\-MPC2, in addition to PPO and SAC\. TD\-MPC is a major model\-based deep RL algorithm for continuous control, which combines a short\-term reward estimate from a latent dynamics model with a long\-term return estimate from a value function\. TD\-MPC2 is an upgraded version of TD\-MPC that offers greater scalability and improved robustness across tasks\. In this paper, we adopt the original implementations and perform 100 independent runs with the hyperparameters given by Hansen et al\. \([2022](https://arxiv.org/html/2606.06746#bib.bib94)\) and Hansen et al\. \([2024](https://arxiv.org/html/2606.06746#bib.bib97)\) \(also listed in [Tables 8](https://arxiv.org/html/2606.06746#A4.T8) and [9](https://arxiv.org/html/2606.06746#A4.T9)\)\. Performance variation and median performance of TD\-MPC/TD\-MPC2 are given in [Figure 17](https://arxiv.org/html/2606.06746#A7.F17)\. Now, we consider the following question: among PPO, SAC, TD\-MPC, and TD\-MPC2, which algorithm is superior in average sample mean, sample standard deviation, IPR\-90, 5th percentile, median, and 95th percentile performance? We answer this question by comparing the histograms of performance statistics across the 48 DMC tasks\.
\(a\) Full Timestep
\(b\) Last 100k Steps
Figure 9: Distribution of task\-level performance statistics computed from the corresponding algorithm’s performance across multiple DMC tasks\. Performance in \(a\) considers episodic returns over all training steps, whereas \(b\) only considers episodic returns over the last 100k training steps\. Rows and columns in each subfigure correspond to statistics and algorithms, respectively\. Each subplot shows a histogram with KDE, where the x\-axis represents probability and the y\-axis shows min\-max normalized statistics\. Histograms for min\-max IPR\-90 \(labeled as M\-m IPR\-90\) use performances from the 21 DMC tasks where none of the algorithms’ 95th percentile performance is under 200\. As a variant, histograms labeled as Per\-algo M\-m IPR\-90 use performances from the tasks where the corresponding algorithm’s 95th percentile performance is above 200\. All other histograms are produced from the performances on all 48 tasks\. TD\-MPC tends to exhibit less performance variation while achieving relatively higher performance\.
[Figure 9](https://arxiv.org/html/2606.06746#S7.F9) shows the histograms of the per\-task performance statistics\. Each row and column correspond to a statistic and an algorithm, respectively, and all quantities are min\-max normalized\. Note that in computing the average min\-max IPR\-90 \(referred to as M\-m IPR\-90 in [Figure 9](https://arxiv.org/html/2606.06746#S7.F9)\), we exclude the 27 tasks where at least one algorithm’s 95th percentile performance is below 200\. This is because failed runs often have a low performance spread, which can obscure comparisons of performance variation among algorithms\. Of the 27 excluded tasks, the 95th percentile performance of PPO, SAC, TD\-MPC, and TD\-MPC2 is under 200 in 26, 24, 17, and 19 tasks, respectively\.
For completeness, we also report the distribution of min\-max IPR\-90 for each algorithm, excluding only the tasks on which that algorithm failed to surpass a performance of 200 \(referred to as Per\-algo M\-m IPR\-90 in [Figure 9](https://arxiv.org/html/2606.06746#S7.F9)\)\. Additionally, [Table 1](https://arxiv.org/html/2606.06746#S7.T1) summarizes the sample means of the histograms in [Figure 9](https://arxiv.org/html/2606.06746#S7.F9)\. Both [Figure 9](https://arxiv.org/html/2606.06746#S7.F9) and [Table 1](https://arxiv.org/html/2606.06746#S7.T1) indicate the dominance of TD\-MPC and TD\-MPC2 over the other two algorithms in all aspects\. The average median performance of the TD\-MPC algorithms is almost 1\.5 times that of PPO/SAC\. Similar patterns also hold for the mean, 5th percentile, and 95th percentile performance\. Furthermore, the min\-max IPR\-90 histogram for TD\-MPC concentrates more around the lower values\. In fact, the average min\-max IPR\-90 for TD\-MPC is around 10%, which is surprisingly low\. Thus, when all four algorithms are trained on the 48 DMC tasks, TD\-MPC outperforms the others in a data\-efficient manner, while exhibiting lower or at least equivalent variation\.
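The two task\-selection rules used for M\-m IPR\-90 and Per\-algo M\-m IPR\-90 can be sketched as below\. The data are made up for illustration, the function names are hypothetical, and treating a value of exactly 200 as a success is an assumption, since the text only distinguishes "under" and "above" 200\.

```python
THRESHOLD = 200.0  # 95th percentile performance below this counts as a failed task


def common_tasks(p95, threshold=THRESHOLD):
    """Tasks on which every algorithm's 95th percentile performance reaches the threshold."""
    algorithms = list(p95)
    tasks = p95[algorithms[0]]
    return [t for t in tasks if all(p95[a][t] >= threshold for a in algorithms)]


def per_algorithm_tasks(p95, threshold=THRESHOLD):
    """For each algorithm, the tasks on which that algorithm itself reaches the threshold."""
    return {
        algo: [task for task, value in scores.items() if value >= threshold]
        for algo, scores in p95.items()
    }


# Hypothetical 95th percentile performances (algorithm -> task -> value).
p95 = {
    "PPO": {"walker-stand": 900.0, "finger-spin": 150.0, "reacher-hard": 50.0},
    "TD-MPC": {"walker-stand": 980.0, "finger-spin": 950.0, "reacher-hard": 120.0},
}
shared = common_tasks(p95)
per_algo = per_algorithm_tasks(p95)
print(shared)
print(per_algo)
```

The shared set keeps the comparison between algorithms on identical tasks, while the per\-algorithm sets retain more tasks for the stronger algorithms at the cost of comparing over different task subsets\.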
Table 1: Average performance statistics of continuous control algorithms over multiple DMC tasks\. All statistics are min\-max normalized, and hence reported as percentages\. Mean, standard deviation, median, and 5th/95th percentiles are calculated over all 48 tasks\. Min\-max IPR\-90 \(labeled as M\-m IPR\-90\) is calculated over the 21 tasks where none of the algorithms’ 95th percentile performance is under 200\. As a variant of M\-m IPR\-90, we also report min\-max IPR\-90 over the tasks where each algorithm’s 95th percentile performance is above 200, labeled as Per\-algo M\-m IPR\-90\. Parentheses in each column name indicate the ranges of final timesteps used to compute each statistic, separated by a slash\. Each table entry is likewise separated by a slash, with each value computed over the corresponding number of final timesteps\. TD\-MPC achieves the lowest average performance variation while being more data efficient than PPO/SAC\.
Figure 10: Comparison of learning curves of four continuous control algorithms on two selected DMC tasks in different plotting styles\. For visual clarity, RPH curves are plotted without non\-highlighted curves\. The x\-axis of each subfigure is log\-scaled\. RPH most accurately represents the fact that TD\-MPC/TD\-MPC2 learns more rapidly than PPO/SAC, while exhibiting less performance variation\.
Although TD\-MPC achieves a relatively low min\-max IPR\-90 of around 10%, this is still higher than the reasonable value of 5% \(see [Section 5](https://arxiv.org/html/2606.06746#S5)\)\. This implies that, even with modern advanced methods, deep RL algorithms have yet to achieve sufficiently low performance variation for practical use\. Also, recall that the 95th percentile performance of TD\-MPC stays below 200 in 17 of the 48 tasks, a failure rate of around 35%\. In contrast, none of the runs in our SL experiments collapse, and they achieve low variation \(see [Appendix E](https://arxiv.org/html/2606.06746#A5)\)\. These insights pose another important challenge for modern advanced deep RL methods: consistently learning successfully with low performance variation\.
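As a quick check of the failure rate quoted above, dividing the per\-algorithm counts of tasks whose 95th percentile performance is under 200 by the 48 DMC tasks gives:

$$
\textnormal{PPO: } \frac{26}{48} \approx 54.2\%, \qquad \textnormal{SAC: } \frac{24}{48} = 50.0\%, \qquad \textnormal{TD-MPC: } \frac{17}{48} \approx 35.4\%, \qquad \textnormal{TD-MPC2: } \frac{19}{48} \approx 39.6\%.
$$

Even the algorithm with the fewest failures therefore fails on roughly one task in three under this criterion\.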
Note that insights similar to those gained from [Figure 9](https://arxiv.org/html/2606.06746#S7.F9) and [Table 1](https://arxiv.org/html/2606.06746#S7.T1) can also be drawn by observing [Figure 10](https://arxiv.org/html/2606.06746#S7.F10), thanks to RPH\. For instance, on the walker\-stand task, [Figure 10](https://arxiv.org/html/2606.06746#S7.F10) clearly shows a wide range of performance variation in the PPO algorithm, from an episodic return of 200 to 900\. It also captures the remarkably small performance variation of the TD\-MPC and TD\-MPC2 algorithms on the same task\. Furthermore, information that is not fully described in [Figure 9](https://arxiv.org/html/2606.06746#S7.F9) and [Table 1](https://arxiv.org/html/2606.06746#S7.T1) is perceivable from the RPH curves in [Figure 10](https://arxiv.org/html/2606.06746#S7.F10)\. For example, some failure modes of TD\-MPC2 can be observed from the 5th percentile curve of the cartpole\-balance\_sparse task\. These insights cannot be drawn from the standard error/deviation plots in [Figure 10](https://arxiv.org/html/2606.06746#S7.F10), where the shaded regions representing the standard error are narrow or barely visible\. For a full comparison between different plotting styles, see [Figures 24](https://arxiv.org/html/2606.06746#A10.F24), [25](https://arxiv.org/html/2606.06746#A10.F25), and [26](https://arxiv.org/html/2606.06746#A10.F26)\. Overall, RPH is advantageous and practical for visualizing performance variation\.
### Case Study 3: Cross\-algorithm Comparative Study with Discrete Control Algorithms
The empirical results in the previous sections focus solely on continuous control settings\. For completeness, we repeat the cross\-algorithm comparative study from the previous section for discrete control deep RL algorithms, specifically Deep Q\-Network \(DQN\) and Rainbow \(Mnih et al\., [2015](https://arxiv.org/html/2606.06746#bib.bib98); Hessel et al\., [2018](https://arxiv.org/html/2606.06746#bib.bib99)\)\. We compare the performance statistics of these two algorithms by running them for 50 million steps on five environments from the Arcade Learning Environment \(ALE\) \(Bellemare et al\., [2013](https://arxiv.org/html/2606.06746#bib.bib100)\)\. We adopted the CleanRL implementation and performed 100 independent runs per environment \(Huang et al\., [2022](https://arxiv.org/html/2606.06746#bib.bib23)\)\. These five environments have been claimed to be essential among all ALE environments by Aitchison et al\. \([2023](https://arxiv.org/html/2606.06746#bib.bib101)\)\. To ensure stochasticity in the environments, we reset all environments with a random no\-op period and enable sticky actions \(Machado et al\., [2018](https://arxiv.org/html/2606.06746#bib.bib102)\)\. Hyperparameter details are provided in [Tables 10](https://arxiv.org/html/2606.06746#A4.T10) and [11](https://arxiv.org/html/2606.06746#A4.T11)\.
Figure 11: Comparison of learning curves of discrete control algorithms on five Atari tasks\. Each subfigure shows the RPH curves of Human Normalized Score for DQN and Rainbow on the corresponding task\. Rainbow generally outperforms DQN, and both algorithms exhibit significant performance variation in some tasks\.
Table 2: Average performance statistics of discrete control algorithms over 5 Atari tasks\. All statistics are min\-max normalized, and hence reported as percentages\. Parentheses in each column name indicate the ranges of final timesteps used to compute each value, separated by a slash\. Each table entry is likewise separated by a slash, with each value computed over the corresponding number of final timesteps\. In all aspects, Rainbow outperforms DQN\. Also, both algorithms exhibit similar levels of performance variation\.
Statistics in [Table 2](https://arxiv.org/html/2606.06746#S7.T2) indicate the dominance of Rainbow over DQN\. Thanks to the learning curves with RPH in [Figure 11](https://arxiv.org/html/2606.06746#S7.F11), we readily observe that this dominance arises because the near\-worst runs of Rainbow outperform the near\-best runs of DQN\. Additionally, min\-max IPR\-90 in [Table 2](https://arxiv.org/html/2606.06746#S7.T2) shows that both algorithms experience a similar level of high performance variation\. Taken together, Rainbow indeed improves overall performance over DQN, but does not solve the problem of high performance variation\.
## 8 Limitations
One of the prominent limitations of IPR\-90 is the choice of 90% as the range of variation\. Although we argue that this number strikes a reasonable tradeoff between the range it covers and the sensitivity to extrema, it still lacks theoretical motivation\. A potentially superior approach is to report bar plots for various IPR ranges, but this can decrease interpretability by increasing the amount of information \(but see [Appendix I](https://arxiv.org/html/2606.06746#A9)\)\. The pursuit of a concise multi\-level IPR is certainly an exciting direction\.
Another limitation is the number of independent runs required for our proposed methods to function well\. Since both min\-max IPR\-90 and RPH directly use the 5th and 95th percentile data, the number of independent runs is a deciding factor for their accuracy\. This property restricts the range of applicable settings\. In particular, the methods are less applicable when a single run consumes substantial computational resources \(e\.g\., large language models\)\. Nevertheless, our methods remain beneficial compared to conventional methods even with fewer runs\. With a small number of samples \(e\.g\., 5 runs\), IPR is still more faithful to the observed spread than parametric statistics such as standard deviation\. RPH provides a clear and accurate depiction of all the learning curves, which offers richer information about how the learning curves behave than plots with a shaded region\. Although plotting all the learning curves may raise concerns about a drop in visual clarity, the visualizations remain sufficiently clear\. In fact, while [Figure 11](https://arxiv.org/html/2606.06746#S7.F11) depicts all 100 independent runs in RPH format, visual clarity is retained\.
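The dependence on the number of runs can be made concrete with a small sketch\. It assumes the nearest\-rank rule for sample percentiles, which may differ from the definition used in the paper, and the helper name `nearest_rank` is hypothetical\.

```python
import math


def nearest_rank(n_runs, percentile):
    """1-based rank of the nearest-rank sample percentile among n_runs sorted values."""
    return min(max(math.ceil(percentile * n_runs / 100), 1), n_runs)


for n in (5, 20, 100):
    low, high = nearest_rank(n, 5), nearest_rank(n, 95)
    print(f"n={n:3d}: 5th percentile -> rank {low}, 95th percentile -> rank {high}")
```

Under this rule, 5 runs map the 5th and 95th percentiles to the worst and best run, so an inter\-percentile range of 90% coincides with the full observed range, whereas 100 runs leave several runs beyond each percentile\.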
## 9 Conclusion
In this paper, we proposed min\-max IPR\-90 and RPH as intuitive yet robust methods for capturing performance variation in a single task, in contrast to previous approaches such as standard error/deviation\. These characteristics make our proposed methods broadly applicable to deep RL research\. Indeed, using our proposed methods, we have shown that normalizations reduce performance variation in PPO, that TD\-MPC exhibits surprisingly low performance variation, and that DQN and Rainbow experience similar levels of performance variation\. Although effective, both proposed methods are based on specific statistics, so relying on them alone can lead to misleading conclusions\. To avoid such a situation and to provide complementary insights, our methods should be used alongside other statistical tools\. We hope that our methods will aid future deep RL research by providing additional insights into empirical results\.
### Acknowledgments
The authors sincerely thank John D\. Martin and Levi H\. S\. Lelis for their thorough feedback and comments on this work\. The authors also thank the University of Alberta, Amii, the Natural Sciences and Engineering Research Council of Canada \(NSERC\), the Canada CIFAR AI Chairs program, and the Digital Research Alliance of Canada for the funding and computational resources\.
## References
- Deep reinforcement learning at the edge of the statistical precipice\. In Neural Information Processing Systems\.
- M\. Aitchison, P\. Sweetser, and M\. Hutter \(2023\) Atari\-5: Distilling the arcade learning environment down to five games\. In International Conference on Machine Learning\.
- D\. G\. Altman and J\. M\. Bland \(2005\) Standard deviations and standard errors\. BMJ: British Medical Journal 331\.
- B\. C\. Arnold, N\. Balakrishnan, and H\. N\. Nagaraja \(2008\) A first course in order statistics\. Society for Industrial and Applied Mathematics\.
- J\. L\. Ba, J\. R\. Kiros, and G\. E\. Hinton \(2016\) Layer normalization\. CoRR abs/1607\.06450\.
- M\. Bellemare, Y\. Naddaf, J\. Veness, and M\. Bowling \(2013\) The arcade learning environment: An evaluation platform for general agents\. Journal of Artificial Intelligence Research 47\.
- M\. G\. Bellemare, S\. Candido, P\. S\. Castro, J\. Gong, M\. C\. Machado, S\. Moitra, S\. S\. Ponda, and Z\. Wang \(2020\) Autonomous navigation of stratospheric balloons using reinforcement learning\. Nature 588\.
- J\. Bjorck, C\. P\. Gomes, and K\. Q\. Weinberger \(2022\) Is high variance unavoidable in RL? A case study in continuous control\. In International Conference on Learning Representations\.
- M\. J\. Campbell \(2021\) Statistics at square one\. John Wiley & Sons, Ltd\.
- K\. Clary, E\. Tosch, J\. Foley, and D\. Jensen \(2019\) Let’s play again: Variability of deep reinforcement learning agents in atari environments\. CoRR abs/1904\.06312\.
- C\. Colas, O\. Sigaud, and P\. Oudeyer \(2018\) How many random seeds? Statistical power analysis in deep reinforcement learning experiments\. CoRR abs/1806\.08295\.
- K\. Eggensperger, M\. Lindauer, and F\. Hutter \(2019\) Pitfalls and best practices in algorithm configuration\. Journal of Artificial Intelligence Research 64\(1\)\.
- T\. Eimer, M\. Lindauer, and R\. Raileanu \(2023\) Hyperparameters in reinforcement learning and how to tune them\. In International Conference on Machine Learning\.
- D\. J\. Green, M\. J\. Campbell, and E\. Koutoumanou \(2026\) When means and standard deviations are an incomplete summary of a continuous variable: problems, solutions, and utilising the reference ranges to check normality\. BMJ Medicine 5\(1\)\.
- T\. Haarnoja, B\. Moran, G\. Lever, S\. H\. Huang, D\. Tirumala, J\. Humplik, M\. Wulfmeier, S\. Tunyasuvunakool, N\. Y\. Siegel, R\. Hafner, M\. Bloesch, K\. Hartikainen, A\. Byravan, L\. Hasenclever, Y\. Tassa, F\. Sadeghi, N\. Batchelor, F\. Casarini, S\. Saliceti, C\. Game, N\. Sreendra, K\. Patel, M\. Gwira, A\. Huber, N\. Hurley, F\. Nori, R\. Hadsell, and N\. Heess \(2024\) Learning agile soccer skills for a bipedal robot with deep reinforcement learning\. Science Robotics 9\.
- T\. Haarnoja, A\. Zhou, P\. Abbeel, and S\. Levine \(2018\) Soft actor\-critic: Off\-policy maximum entropy deep reinforcement learning with a stochastic actor\. In International Conference on Machine Learning\.
- N\. A\. Hansen, H\. Su, and X\. Wang \(2022\) Temporal difference learning for model predictive control\. In International Conference on Machine Learning\.
- N\. Hansen, H\. Su, and X\. Wang \(2024\) TD\-MPC2: Scalable, robust world models for continuous control\. In International Conference on Learning Representations\.
- M\. J\. Hausknecht and P\. Stone \(2015\) Deep recurrent q\-learning for partially observable mdps\. In AAAI Fall Symposia\.
- P\. Henderson, R\. Islam, P\. Bachman, J\. Pineau, D\. Precup, and D\. Meger \(2018\) Deep reinforcement learning that matters\. In AAAI Conference on Artificial Intelligence\.
- L\. Hertel, P\. Baldi, and D\. L\. Gillen \(2020\) Quantity vs\. quality: on hyperparameter optimization for deep reinforcement learning\. CoRR abs/2007\.14604\.
- M\. Hessel, J\. Modayil, H\. van Hasselt, T\. Schaul, G\. Ostrovski, W\. Dabney, D\. Horgan, B\. Piot, M\. Azar, and D\. Silver \(2018\) Rainbow: Combining improvements in deep reinforcement learning\. In AAAI Conference on Artificial Intelligence\.
- S\. Huang, R\. F\. J\. Dossa, C\. Ye, J\. Braga, D\. Chakraborty, K\. Mehta, and J\. G\.M\. Araújo \(2022\) CleanRL: high\-quality single\-file implementations of deep reinforcement learning algorithms\. Journal of Machine Learning Research\.
- R\. Islam, P\. Henderson, M\. Gomrokchi, and D\. Precup \(2017\) Reproducibility of benchmarked deep reinforcement learning tasks for continuous control\. In Reproducibility in Machine Learning Workshop \(ICML\)\.
- G\. Jocher and J\. Qiu \(2024\) Ultralytics YOLO11\. Ultralytics\. \[Computer software\] [Link](https://github.com/ultralytics/ultralytics)\.
- H\. Lee, D\. Hwang, D\. Kim, H\. Kim, J\. J\. Tai, K\. Subramanian, P\. R\. Wurman, J\. Choo, P\. Stone, and T\. Seno \(2025\) SimBa: Simplicity bias for scaling up parameters in deep reinforcement learning\. In International Conference on Learning Representations\.
- Z\. Li, T\. Chen, Z\. Hong, A\. Ajay, and P\. Agrawal \(2023\) Parallel $Q$\-learning: scaling off\-policy reinforcement learning under massively parallel simulation\. In International Conference on Machine Learning\.
- Y\. Liang, M\. C\. Machado, E\. Talvitie, and M\. Bowling \(2016\) State of the art control of atari games using shallow reinforcement learning\. In International Conference on Autonomous Agents & Multiagent Systems\.
- S\. López\-Pintado and J\. Romo \(2009\) On the concept of depth for functional data\. Journal of the American Statistical Association 104\.
- M\. C\. Machado, M\. G\. Bellemare, E\. Talvitie, J\. Veness, M\. Hausknecht, and M\. Bowling \(2018\) Revisiting the Arcade Learning Environment: Evaluation protocols and open problems for general agents\. Journal of Artificial Intelligence Research 61\.
- A\. R\. Mahmood, D\. Korenkevych, B\. J\. Komer, and J\. Bergstra \(2018\) Setting up a reinforcement learning task with a real\-world robot\. In IEEE/RSJ International Conference on Intelligent Robots and Systems\.
- T\. Mathieu, M\. M\. Centa, R\. D\. Vecchia, H\. Kohler, A\. Shilova, O\. Maillard, and P\. Preux \(2024\) AdaStop: adaptive statistical testing for sound comparisons of deep rl agents\. Transactions on Machine Learning Research\.
- V\. Mnih, K\. Kavukcuoglu, D\. Silver, A\. A\. Rusu, J\. Veness, M\. G\. Bellemare, A\. Graves, M\. Riedmiller, A\. K\. Fidjeland, G\. Ostrovski, S\. Petersen, C\. Beattie, A\. Sadik, I\. Antonoglou, H\. King, D\. Kumaran, D\. Wierstra, S\. Legg, and D\. Hassabis \(2015\) Human\-level control through deep reinforcement learning\. Nature 518\.
- A\. Patterson, S\. Neumann, M\. White, and A\. White \(2024\) Empirical design in reinforcement learning\. Journal of Machine Learning Research 25\.
- J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov \(2017\) Proximal policy optimization algorithms\. CoRR abs/1707\.06347\.
- D\. W\. Scott \(1992\) Multivariate density estimation: Theory, practice, and visualization\. John Wiley & Sons\.
- S\. S\. Shapiro and M\. B\. Wilk \(1965\) An analysis of variance test for normality \(complete samples\)\. Biometrika 52\.
- A\. Stooke and P\. Abbeel \(2019\) Accelerated methods for deep reinforcement learning\. CoRR abs/1803\.02811\.
- Y\. Sun and M\. G\. Genton \(2011\) Functional boxplots\. Journal of Computational and Graphical Statistics 20\(2\)\.
- H\. Tang and G\. Berseth \(2024\) Improving deep reinforcement learning by reducing the chain effect of value and policy churn\. In Neural Information Processing Systems\.
- Y\. Tassa, Y\. Doron, A\. Muldal, T\. Erez, Y\. Li, D\. de Las Casas, D\. Budden, A\. Abdolmaleki, J\. Merel, A\. Lefrancq, T\. Lillicrap, and M\. Riedmiller \(2018\) DeepMind control suite\. CoRR abs/1801\.00690\.
- E\. Todorov, T\. Erez, and Y\. Tassa \(2012\) MuJoCo: A physics engine for model\-based control\. In IEEE/RSJ International Conference on Intelligent Robots and Systems\.
- M\. Towers, A\. Kwiatkowski, J\. U\. Balis, G\. D\. Cola, T\. Deleu, M\. Goulão, K\. Andreas, M\. Krimmel, A\. KG, R\. D\. L\. Perez\-Vicente, J\. K\. Terry, A\. Pierré, S\. V\. Schulhoff, J\. J\. Tai, H\. Tan, and O\. G\. Younis \(2025\)Gymnasium: A standard interface for reinforcement learning environments\.InNeural Information Processing Systems Datasets and Benchmarks Track,Cited by:[§2](https://arxiv.org/html/2606.06746#S2.p1.7)\.
- A\. M\. Walker \(1968\)A note on the asymptotic distribution of sample quantiles\.Journal of the Royal Statistical Society: Series B \(Methodological\)30\.Cited by:[Appendix B](https://arxiv.org/html/2606.06746#A2.p1.14)\.
- P\. R\. Wurman, S\. Barrett, K\. Kawamoto, J\. MacGlashan, K\. Subramanian, T\. J\. Walsh, R\. Capobianco, A\. Devlic, F\. Eckert, F\. Fuchs, L\. Gilpin, P\. Khandelwal, V\. Kompella, H\. Lin, P\. MacAlpine, D\. Oller, T\. Seno, C\. Sherstan, M\. D\. Thomure, H\. Aghabozorgi, L\. Barrett, R\. Douglas, D\. Whitehead, P\. Dürr, P\. Stone, M\. Spranger, and H\. Kitano \(2022\)Outracing champion gran turismo drivers with deep reinforcement learning\.Nature602\.Cited by:[§1](https://arxiv.org/html/2606.06746#S1.p1.1)\.
## Appendix A Binning Operation
(a) Pre-binned
(b) Binned
Figure 12: Exemplar binning process for synthetic data. Datapoints in different time series can be collected at different timesteps, as shown in (a). By aggregating the data on a per-bin basis (where each bin is represented as a shaded or unshaded region in (a)), one can compare two time series more easily, as shown in (b).
By the nature of learning through interaction, much of the data and statistics collected from RL experiments has a chronological structure. Because such time-series data are challenging to handle, we often employ multiple preprocessing procedures to simplify them. Here, we cover one such method that is employed throughout this paper, the binning operation.
The primary motivation for binning is to provide a common basis for distinct time-series data. As a premise, a pair of scalar time-series data is not guaranteed to align in time. For example, consider two independent runs of the RL agent in an episodic environment with an explicit termination condition other than the time limit. Depending on the termination condition, the timesteps at which the episodic return is received may differ across runs. Such misalignment in time hinders time-wise comparisons between pairs of time series, underscoring the necessity of a common basis for comparison. Binning provides this basis by dividing the time space into non-overlapping intervals (bins) and aggregating each time series on a per-bin basis. Formally, let us define a scalar time series collected over discrete timesteps $1$ through $N$ as $\mathcal{Q}=\{q_{t}\in\mathbb{R}\mid t\in\mathcal{I}\}$, where $\mathcal{I}\subseteq[N]$. Here, $[N]$ denotes the set of positive integers $\{1,\dots,N\}$ for a positive integer $N$. Then, the binned statistics of $\mathcal{Q}$ are given as $\bar{\mathcal{Q}}=\left\{f\left(\{q_{i}\}_{i\in\mathcal{I}\cap\{(b-1)\cdot C+1,\dots,b\cdot C\}}\right)\mid q_{i}\in\mathcal{Q},b\in[B]\right\}$. Here, $B$ denotes the number of bins, $C=\left\lceil\frac{N}{B}\right\rceil$ is the size of each bin, and $f:2^{\mathcal{Q}}\rightarrow\mathbb{R}$ is the aggregation function. Unless explicitly specified, $B$ defaults to $100$ and $f$ defaults to the sample mean.
[Figure 12](https://arxiv.org/html/2606.06746#A1.F12) shows the benefit of binning via a synthetic example. Suppose we collect two series of scalar data, $\mathcal{Q}_{1}$ and $\mathcal{Q}_{2}$, over timesteps $1$ through $100$, at timesteps $\mathcal{I}_{1}$ and $\mathcal{I}_{2}$, respectively. For example, let $\mathcal{Q}_{1}:=\{0.5t\mid t\in\mathcal{I}_{1}\}$, where $\mathcal{I}_{1}:=\{20x+5y\mid x\in\{0,\dots,4\},y\in[3]\}$. Similarly, let $\mathcal{Q}_{2}:=\{0.25t\mid t\in\mathcal{I}_{2}\}$, where $\mathcal{I}_{2}:=\{20x+5y+2\mid x\in\{0,\dots,4\},y\in[3]\}$. [Figure 12(a)](https://arxiv.org/html/2606.06746#A1.F12.sf1) depicts both $\mathcal{Q}_{1}$ and $\mathcal{Q}_{2}$. Because the timesteps at which the datapoints are collected do not align, comparing a single datapoint from $\mathcal{Q}_{1}$ to another in $\mathcal{Q}_{2}$ (and vice versa) is non-trivial. Now, consider dividing the whole timeline from $1$ to $100$ into $5$ non-overlapping bins, as presented by the alternating shaded regions in [Figure 12(a)](https://arxiv.org/html/2606.06746#A1.F12.sf1). By taking an average within each bin, we get $\bar{\mathcal{Q}}_{1}:=\{10b-5\mid b\in[5]\}$ and $\bar{\mathcal{Q}}_{2}:=\{5b-2\mid b\in[5]\}$. For instance, the first bin $\{1,\dots,20\}$ contains the $\mathcal{Q}_{1}$ datapoints at $t=5,10,15$, with values $2.5$, $5$, and $7.5$, whose mean is $5=10\cdot 1-5$. As [Figure 12(b)](https://arxiv.org/html/2606.06746#A1.F12.sf2) and the shared range of $b$ in $\bar{\mathcal{Q}}_{1}$ and $\bar{\mathcal{Q}}_{2}$ show, the binning process provides a common basis for comparing different time series side by side.
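The binning operation and the synthetic example above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `bin_series`, its parameters, and the choice to report empty bins as `None` are hypothetical and do not come from the paper.

```python
import math


def bin_series(timesteps, values, n_total, n_bins=100, agg=None):
    """Aggregate a scalar time series into n_bins equal-width bins.

    timesteps are 1-based integers in {1, ..., n_total}. Bin b (1-based)
    covers timesteps (b-1)*C + 1 through b*C with C = ceil(n_total / n_bins).
    """
    if agg is None:
        # Sample mean, the default aggregation function f in the text.
        agg = lambda xs: sum(xs) / len(xs)
    bin_size = math.ceil(n_total / n_bins)
    buckets = [[] for _ in range(n_bins)]
    for t, q in zip(timesteps, values):
        buckets[(t - 1) // bin_size].append(q)  # zero-based bin index
    # Assumption: bins that received no datapoint are reported as None.
    return [agg(xs) if xs else None for xs in buckets]


# Synthetic example from the text: N = 100 timesteps, B = 5 bins.
I1 = [20 * x + 5 * y for x in range(5) for y in (1, 2, 3)]
I2 = [t + 2 for t in I1]
Q1_binned = bin_series(I1, [0.5 * t for t in I1], n_total=100, n_bins=5)
Q2_binned = bin_series(I2, [0.25 * t for t in I2], n_total=100, n_bins=5)
print(Q1_binned)  # [5.0, 15.0, 25.0, 35.0, 45.0], i.e. 10b - 5
print(Q2_binned)  # [3.0, 8.0, 13.0, 18.0, 23.0], i.e. 5b - 2
```

After binning, both series are indexed by the same bin number $b$, so they can be compared element by element even though their original timesteps never coincide.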
## Appendix B Order Statistics is a Consistent Estimator
In the main text, we mentioned that the sample percentile is a consistent estimator. Here, we provide further mathematical details on how the sample percentile (or, more broadly, an order statistic) is consistent with its population value. Let $X_{1},X_{2},\dots,X_{n}$ be a sequence of $n$ random variables, and let $X_{(1)}\leq X_{(2)}\leq\dots\leq X_{(n)}$ denote the same variables sorted in ascending order. Also, let $q:=\frac{p}{100}$ for convenience. Then, the sample $p$-th percentile is $X_{([nq])}$, where $[\cdot]$ denotes the nearest integer to a given value. When the target distribution is absolutely continuous, the sample order statistic $X_{([nq])}$ at quantile $q$ is asymptotically Gaussian:
$$X_{([nq])}\sim\mathcal{N}\left(\xi_{q},\frac{q(1-q)}{nf(\xi_{q})^{2}}\right),\tag{3}$$
where $\xi_{q}$ is the population quantile at $q$ and $f$ is the true pdf (for an elementary proof, see, for example, Walker [1968](https://arxiv.org/html/2606.06746#bib.bib5)). Because the variance in Equation 3 shrinks to zero as $n$ grows, this property also implies that the sample order statistic $X_{([nq])}$ is a consistent estimator of the population quantile $\xi_{q}$ (Arnold et al., [2008](https://arxiv.org/html/2606.06746#bib.bib75)). Together, these theoretical results support the suitability of IPR for representing performance spread.
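The asymptotic result in Equation 3 can be checked numerically. The sketch below is an illustration only, assuming i.i.d. standard normal samples; the function names, sample size, number of repetitions, and seed are arbitrary choices rather than settings from the paper. It compares the empirical mean and variance of the sample percentile $X_{([nq])}$ against $\xi_{q}$ and $q(1-q)/(nf(\xi_{q})^{2})$.

```python
import random
from statistics import NormalDist, fmean, pvariance


def sample_percentile(xs, p):
    """Return the order statistic X_([nq]) with q = p / 100."""
    n = len(xs)
    # Python's round() resolves exact .5 ties to the even integer; the
    # index is clamped to the valid range 1..n.
    k = min(max(round(n * p / 100), 1), n)
    return sorted(xs)[k - 1]


def simulate(p=90, n=2000, reps=300, seed=0):
    """Compare the sample percentile with its asymptotic mean and variance."""
    rng = random.Random(seed)
    dist = NormalDist()  # standard normal, chosen only for illustration
    q = p / 100
    xi = dist.inv_cdf(q)  # population quantile xi_q
    asym_var = q * (1 - q) / (n * dist.pdf(xi) ** 2)
    estimates = [
        sample_percentile([rng.gauss(0.0, 1.0) for _ in range(n)], p)
        for _ in range(reps)
    ]
    return xi, asym_var, fmean(estimates), pvariance(estimates)


xi, asym_var, mean_est, var_est = simulate()
print(f"population quantile {xi:.4f}, mean of estimates {mean_est:.4f}")
print(f"asymptotic variance {asym_var:.6f}, empirical variance {var_est:.6f}")
```

With these settings the empirical values should land close to their asymptotic counterparts; increasing `n` tightens the estimates around $\xi_{q}$, which is the consistency property stated above.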
## Appendix C Functional Boxplot
A functional boxplot (FB) is an order-statistics-based method for plotting time-series (functional) data (Sun and Genton, [2011](https://arxiv.org/html/2606.06746#bib.bib63)). In standard practice, it highlights the median curve, shades the 50% central region (called the 50% envelope), and emphasizes the outliers if applicable. An FB generalizes the ordinary boxplot to time-series data. In this paper, we use the FB as a competing baseline for the other visualization methods.
For the succeeding arguments, we first introduce the notation and basic concepts related to an FB. Let $\mathcal{Y}:=\left\{\left\{y_{i}(t)\mid t\in\mathcal{I}\right\}\mid i=1,\dots,n\right\}$ denote a set of time-series data, where $y_{i}$ is a real function, $n$ is the number of time series, and $\mathcal{I}$ is an interval in $\mathbb{R}$. Given an arbitrary real function $y(t)$, its graph is the subset of the plane $G(y):=\left\{(t,y(t))\mid t\in\mathcal{I}\right\}$. Also, the band in $\mathbb{R}^{2}$ delimited by $K$ curves from $\mathcal{Y}$ is given as
$$B(y_{i_{1}},\dots,y_{i_{K}}):=\left\{(t,x(t))\;\middle|\;t\in\mathcal{I},\min_{k=1,\dots,K}y_{i_{k}}(t)\leq x(t)\leq\max_{k=1,\dots,K}y_{i_{k}}(t)\right\}.$$
With this notation, we now proceed to the method of constructing an FB.
FB employs the concept of band depth (BD) to order time-series data (López-Pintado and Romo, [2009](https://arxiv.org/html/2606.06746#bib.bib76)). Intuitively, BD ranks how close each time series in $\mathcal{Y}$ is to the center of all data. Hence, when time series are sorted according to BD, they are ordered from the center outwards. For instance, the deepest curve (with the highest BD) is closest to the center and is therefore the median among the given set of time series. Mathematically, the canonical BD is defined through the fraction of bands determined by $k$ sample curves in $\mathcal{Y}$ that contain the whole graph of $y(t)$:
$$\textnormal{BD}_{n,K}(y)=\sum_{k=2}^{K}\textnormal{BD}^{(k)}_{n}(y),\tag{4}$$
$$\textnormal{where }\textnormal{BD}^{(k)}_{n}(y)=\binom{n}{k}^{-1}\sum_{1\leq i_{1}<i_{2}<\dots<i_{k}\leq n}I\left\{G(y)\subseteq B(y_{i_{1}},\dots,y_{i_{k}})\right\},$$
where $I\{\cdot\}$ is an indicator function, $\textnormal{BD}^{(k)}_{n}(y)$ is the band depth of a given curve $y$ derived from $k$ curves out of $n$ curves, and $\textnormal{BD}_{n,K}(y)$ is the overall band depth of $y$. Although this is the canonical form of BD, we use a more flexible variant called modified BD (MBD). Instead of the indicator function in BD, MBD measures the proportion of time $t$ during which $y(t)$ resides within a band. Suppose
$$A_{k}(y)\equiv A(y\mid y_{i_{1}},y_{i_{2}},\dots,y_{i_{k}})\equiv\left\{t\in\mathcal{I}\;\middle|\;\min_{r=i_{1},\dots,i_{k}}y_{r}(t)\leq y(t)\leq\max_{r=i_{1},\dots,i_{k}}y_{r}(t)\right\}.$$
Then, MBD is formally given as
$$\textnormal{MBD}_{n,K}(y)=\sum_{k=2}^{K}\textnormal{MBD}^{(k)}_{n}(y),\tag{5}$$
$$\textnormal{where }\textnormal{MBD}^{(k)}_{n}(y)=\binom{n}{k}^{-1}\sum_{1\leq i_{1}<i_{2}<\dots<i_{k}\leq n}\lambda_{r}\left(A_{k}(y\mid y_{i_{1}},y_{i_{2}},\dots,y_{i_{k}})\right),$$
where $\lambda_{r}(A_{k}(y))=\frac{\lambda(A_{k}(y))}{\lambda(\mathcal{I})}$ and $\lambda$ is the Lebesgue measure on $\mathcal{I}$. While $K$ can take any integer from $2$ to $n$, we use $K=2$, following Sun and Genton ([2011](https://arxiv.org/html/2606.06746#bib.bib63)).
Figure 13: FB applied to the learning curves of PPO on the finger-spin task. The necessary components of the FB are highlighted: the median, the 50% envelope, and the outlier curves.
Now, all the elements that constitute an FB (the median, the 50% envelope, and the outliers) can be determined by using the MBD of each curve in $\mathcal{Y}$. Suppose all curves in $\mathcal{Y}$ are ranked according to their $\textnormal{MBD}_{n,2}$, and let $y_{[j]}$ denote the curve with the $j$-th largest $\textnormal{MBD}_{n,2}$. Then, the median curve according to the MBD is $y_{[1]}$ by definition. We refer to this median curve as the FB median curve. A 50% envelope can be constructed by taking the time-wise minimum and maximum over the 50% deepest curves:
$$C_{0.5}:=\left\{(t,y(t))\;\middle|\;t\in\mathcal{I},\min_{r=1,\dots,\lceil\frac{n}{2}\rceil}y_{[r]}(t)\leq y(t)\leq\max_{r=1,\dots,\lceil\frac{n}{2}\rceil}y_{[r]}(t)\right\}.$$
Lastly, a curve is classified as an outlier if its value falls outside the fence at any point in time. The fence is determined by inflating the 50% envelope by 1.5 times its own range, which is analogous to the 1.5 times IQR criterion for outlier detection in a canonical boxplot. Plotting these together with appropriate graphing styles yields an FB (see [Figure 13](https://arxiv.org/html/2606.06746#A3.F13) for an example).
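The construction above can be sketched for curves sampled on a shared, evenly spaced time grid. This is a minimal sketch and not the authors' code: the function names are hypothetical, the Lebesgue-measure ratio $\lambda_{r}$ is approximated by the fraction of grid points, ties in depth are broken by input order, and the fence is assumed to be inflated pointwise by 1.5 times the envelope width.

```python
from itertools import combinations
from math import ceil


def modified_band_depth(curves):
    """MBD with K = 2 for curves given as equal-length lists of values."""
    n_times = len(curves[0])
    pairs = list(combinations(range(len(curves)), 2))
    depths = []
    for y in curves:
        inside = 0
        for i, j in pairs:
            a, b = curves[i], curves[j]
            # Count grid points where y lies within the band of curves i, j.
            inside += sum(
                min(a[t], b[t]) <= y[t] <= max(a[t], b[t])
                for t in range(n_times)
            )
        depths.append(inside / (len(pairs) * n_times))
    return depths


def functional_boxplot(curves, factor=1.5):
    """Return the median index, 50% envelope, and outlier indices."""
    depths = modified_band_depth(curves)
    order = sorted(range(len(curves)), key=lambda i: -depths[i])
    central = [curves[i] for i in order[: ceil(len(curves) / 2)]]
    n_times = len(curves[0])
    lower = [min(c[t] for c in central) for t in range(n_times)]
    upper = [max(c[t] for c in central) for t in range(n_times)]
    # Assumption: the fence widens the envelope pointwise by factor * width.
    fence_lo = [lo - factor * (hi - lo) for lo, hi in zip(lower, upper)]
    fence_hi = [hi + factor * (hi - lo) for lo, hi in zip(lower, upper)]
    outliers = [
        i for i, c in enumerate(curves)
        if any(c[t] < fence_lo[t] or c[t] > fence_hi[t] for t in range(n_times))
    ]
    return {"median": order[0], "envelope": (lower, upper), "outliers": outliers}


# Five constant curves; the last one sits far from the rest.
curves = [[0, 0, 0], [1, 1, 1], [2, 2, 2], [3, 3, 3], [20, 20, 20]]
fb = functional_boxplot(curves)
print(fb["median"], fb["envelope"], fb["outliers"])
```

In this toy example the curve at level 2 is the deepest, the envelope spans levels 1 to 3, and only the curve at level 20 falls outside the fence. The pairwise loop costs on the order of $n^{2}$ band checks per curve, which is fine for a sketch but would call for a vectorized implementation with many runs.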
## Appendix D Hyperparameters For The Experiments
Table 3: MNIST Configurations.
Table 4: CIFAR-10 Configurations.
Table 5: Pascal VOC Configurations.
Table 6: PPO Configurations.

| Parameter | Configuration |
| --- | --- |
| Total timesteps ($T$) | $10^{7}$ |
| Parameter Update Frequency ($B$) | 8192 |
| Number of Epochs ($N$) | 10 |
| Minibatch Size ($M$) | 256 |
| Clipping Parameter ($\epsilon$) | 0.2 |
| Value Loss Coefficient | 0.5 |
| Maximum Gradient Norm | 0.5 |
| GAE $\gamma$ | 0.99 |
| GAE $\lambda$ | 0.95 |
| Network Architecture ($\boldsymbol{\theta}$ & $\boldsymbol{\psi}$) | Fully-connected NN |
| Hidden Layer Dims ($\boldsymbol{\theta}$ & $\boldsymbol{\psi}$) | 256 |
| Number of Hidden Layers ($\boldsymbol{\theta}$ & $\boldsymbol{\psi}$) | 2 |
| Activation Function (Actor & Critic) | Tanh |
| Optimizer (Actor & Critic) | Adam |
| Step-size for Actor & Critic ($\eta$) | $3\times 10^{-4}$ |

Table 7: SAC Configurations.

| Parameter | Configuration |
| --- | --- |
| Total timesteps ($T$) | $10^{6}$ |
| Length of Initial Sampling Phase ($B$) | $5\times 10^{3}$ |
| Buffer Size | $10^{6}$ |
| Minibatch Size ($M$) | 256 |
| Frequency of Policy Update ($N_{p}$) | 2 |
| Frequency of Target Update ($N_{t}$) | 1 |
| $\tau$ | $5\times 10^{-3}$ |
| Minimum std. deviation in log-scale | $-5$ |
| Maximum std. deviation in log-scale | 2 |
| Network Architecture ($\boldsymbol{\theta}$ & $\{\boldsymbol{\psi}\}_{i=1,2},\{\boldsymbol{\psi}_{\textnormal{targ}_{i}}\}_{i=1,2}$) | Fully-connected NN |
| Hidden Layer Dims ($\boldsymbol{\theta}$ & $\{\boldsymbol{\psi}\}_{i=1,2},\{\boldsymbol{\psi}_{\textnormal{targ}_{i}}\}_{i=1,2}$) | 256 |
| Number of Hidden Layers ($\boldsymbol{\theta}$ & $\{\boldsymbol{\psi}\}_{i=1,2},\{\boldsymbol{\psi}_{\textnormal{targ}_{i}}\}_{i=1,2}$) | 2 |
| Activation Function (Actor & Critic) | ReLU |
| Optimizer (Actor & Critic) | Adam |
| Step-size for Actor & Critic ($\eta$) | $3\times 10^{-4}$ |
| Initial entropy coefficient ($\alpha$) | 1.0 |
| Optimizer (Entropy Coefficient) | Adam |
| Step-size (Entropy Coefficient) | $10^{-4}$ |
| Target Entropy ($\mathcal{H}$) | $\lvert\mathcal{A}\rvert$ |

Table 8: TD-MPC Hyperparameter Configurations (from Hansen et al., [2022](https://arxiv.org/html/2606.06746#bib.bib94)).
Table 9: TD-MPC2 Hyperparameter Configurations (from Hansen et al., [2024](https://arxiv.org/html/2606.06746#bib.bib97)).
Table 10: DQN Configurations. Network architecture is written in PyTorch syntax without activation function between layers.
Table 11: Rainbow Configurations. Network architecture is written in PyTorch syntax without activation function between layers.
## Appendix E Supervised Learning Tasks
(a) Test Accuracies
(b) Test Losses
Figure 14: Variation of test accuracies and losses in standard supervised learning (SL) settings. The plots present (a) test accuracy and (b) test loss of 100 independent runs of an MLP on MNIST and a ResNet-18 on CIFAR-10. By default, the step-size of the optimizer does not change throughout learning. The plots also provide the results with step-size decay from $3\times 10^{-4}$ to $3\times 10^{-5}$ at the 100th epoch (labeled as “(ss decay)” in the legends). While some loss curves exhibit relatively high variation, many sets of curves achieve low variation.
Figure 15: Metrics across 100 independent YOLO runs on the Pascal VOC dataset. All metrics are computed on the test data at the end of each epoch, and curves are plotted using RPH. For all the metrics, variation over runs is mostly unrecognizable.
Here, we empirically show that SL algorithms tend to exhibit only minor performance variability. We consider three SL tasks: MNIST with an MLP, CIFAR-10 with a ResNet-18, and Pascal VOC with YOLO (Jocher and Qiu, [2024](https://arxiv.org/html/2606.06746#bib.bib105)). Each experiment consists of 100 independent runs, each lasting 200 epochs. All experiments use the Adam optimizer with step-size $3\times 10^{-4}$. We set the mini-batch size to 256 for the first two tasks and 32 for the last. Also, for the first two tasks, we conduct two variants of the experiment, one with step-size annealing and one without. With annealing, the step-size drops to 10% of its original value at the 50% point of the total number of epochs. For the other hyperparameters, see [Tables 3](https://arxiv.org/html/2606.06746#A4.T3), [4](https://arxiv.org/html/2606.06746#A4.T4) and [5](https://arxiv.org/html/2606.06746#A4.T5).
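The step-size annealing described above amounts to a single step decay. The sketch below is only an illustration: the function name and parameters are hypothetical, and it assumes epochs are counted from zero so that the drop takes effect once 100 of the 200 epochs have completed; the paper does not state the indexing convention.

```python
def annealed_step_size(epoch, total_epochs=200, base=3e-4,
                       drop_at=0.5, factor=0.1):
    """Step-size for a zero-indexed epoch under a single step decay."""
    # Assumption: the drop applies from epoch drop_at * total_epochs onward.
    if epoch >= drop_at * total_epochs:
        return base * factor
    return base


print(annealed_step_size(0))    # step-size before the drop
print(annealed_step_size(150))  # step-size after the drop
```

Setting `factor=1.0` recovers the variant without annealing, in which the step-size stays at its base value for the whole run.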
[Figure 14](https://arxiv.org/html/2606.06746#A5.F14) reports test accuracies and losses over training epochs for the first two tasks in RPH format. We observe that the spread in test accuracies is nearly imperceptible across all experiment configurations; the majority of the accuracy curves of each experiment variant overlap. Although the spread of test losses is wider than that of test accuracies, it is still small. In particular, the MNIST experiments exhibit almost no fluctuation across independent runs. While the CIFAR-10 experiments show relatively wider variation in test losses, its degree is small compared to that in deep RL. Notably, step-size annealing increases the loss and reduces its variation in the CIFAR-10 experiments. This suggests that step-size annealing promotes greater learning stability but does not necessarily improve the loss. Despite such loss behavior, step-size annealing improves test accuracy in the CIFAR-10 experiments. Such behavior may occur for several reasons, such as low logits for correct predictions. Nevertheless, the variation over independent runs tends to be small in the first two SL tasks.
A similar conclusion also holds for YOLO on the Pascal VOC dataset. Curves in [Figure 15](https://arxiv.org/html/2606.06746#A5.F15) represent various test metric values. The first three correspond to different loss functions: binary cross-entropy (BCE), complete intersection over union (CIoU), and distribution focal loss (DFL). The remaining four are evaluation metrics: precision, recall, mean average precision 50 (mAP50), and mAP50-95. Despite the complexity of the task, YOLO yields nearly identical values across runs in all the metrics. This further supports the claim of lower performance variation in SL tasks.
## Appendix F All the Learning Curves for Default Experiments
(a) Default PPO Learning Curves
(b) Default SAC Learning Curves
Figure 16: Learning curves for PPO and SAC. Each subplot (the small boxes) corresponds to one robotic control task. Results are obtained from 100 independent runs.
## Appendix G Performance Variation and Median Bar Plots from Case Studies
(a) TD-MPC Min-max IPR-90
(b) TD-MPC Median
(c) TD-MPC2 Min-max IPR-90
(d) TD-MPC2 Median
Figure 17: Bar plots for performance variation and median performance of TD-MPC and TD-MPC2 experiments.
(a) Default IPR-90
(b) Default Median
(c) LayerNorm IPR-90
(d) LayerNorm Median
(e) PNorm IPR-90
(f) PNorm Median
(g) Normalized IPR-90
(h) Normalized Median
Figure 18: Comparison of performance variation and median bar plots for PPO with different normalization techniques.
(a) Default IPR-90
(b) Default Median
(c) LayerNorm IPR-90
(d) LayerNorm Median
(e) PNorm IPR-90
(f) PNorm Median
(g) Normalized IPR-90
(h) Normalized Median
Figure 19: Comparison of performance variation and median bar plots for SAC with different normalization techniques.
## Appendix H Comparison Plots of Learning Curves Before and After Applying Normalization Techniques
(a) PPO
(b) SAC
Figure 20: Visual comparison of performance variation between PPO/SAC and LayerNorm PPO/SAC. Blue and orange curves represent the baseline and LayerNorm variants of each algorithm, respectively. Learning curves are visualized by RPH.
(a) PPO
(b) SAC
Figure 21: Visual comparison of performance variation between PPO/SAC and PNorm PPO/SAC. Blue and green curves represent the baseline and PNorm variants of each algorithm, respectively. Learning curves are visualized by RPH.
(a) PPO
(b) SAC
Figure 22: Visual comparison of performance variation between PPO/SAC and Normalized PPO/SAC. Blue and red curves represent the baseline and Normalized variants of each algorithm, respectively. Learning curves are visualized by RPH.
## Appendix I Performance Variation Bar Plots with Different IPR Ranges
(a) PPO IPR-80
(b) SAC IPR-80
(c) PPO IPR-90
(d) SAC IPR-90
(e) PPO IPR-95
(f) SAC IPR-95
Figure 23: Comparison of IPR with different ranges of central coverage. Different rates of central coverage result in different sensitivity to the tail behavior of performance distributions.
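The sensitivity to tail behavior can be illustrated with a small sketch. It assumes, since the formal definition sits in the main text rather than in this appendix, that IPR with central coverage $c$ is the difference between the $(50+c/2)$-th and $(50-c/2)$-th sample percentiles; the helper names and the synthetic data are hypothetical, and the min-max step of min-max IPR is not modeled here.

```python
def sample_percentile(xs, p):
    """Order statistic X_([nq]) with q = p / 100, as in Appendix B."""
    n = len(xs)
    k = min(max(round(n * p / 100), 1), n)  # clamp index to 1..n
    return sorted(xs)[k - 1]


def ipr(xs, coverage):
    """Inter-percentile range covering the central `coverage` percent."""
    upper = sample_percentile(xs, 50 + coverage / 2)
    lower = sample_percentile(xs, 50 - coverage / 2)
    return upper - lower


# Hypothetical final returns of 200 runs: 190 ordinary runs (1..190)
# and 10 runs with far larger returns (1000..1009).
returns = list(range(1, 191)) + [1000 + i for i in range(10)]
for c in (80, 90, 95):
    print(f"IPR-{c}: {ipr(returns, c)}")
```

In this synthetic data the ten extreme runs make up the top 5% of the sample, so IPR-80 and IPR-90 exclude them (160 and 180) while IPR-95 reaches into them and jumps to 999. A wider central coverage therefore reacts more strongly to the tails.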
## Appendix J Comparison of Learning Curves in Different Plotting Styles
Figure 24: Comparison of learning curves of continuous control algorithms. Each subfigure shows the RPH learning curves of PPO, SAC, TD-MPC, and TD-MPC2 on a given task. For visual clarity, the learning curves are shown with RPH without non-highlighted curves, the number of bins is reduced to 20, and the x-axis is log-scaled. TD-MPC/TD-MPC2 learn more rapidly than PPO/SAC in many tasks, while exhibiting less performance variation.
Figure 25: Learning curve comparison plot with standard error. Each subfigure shows the mean and standard error of the learning curves of PPO, SAC, TD-MPC, and TD-MPC2 on a given task. For visual clarity, the x-axis is log-scaled. The standard errors are mostly imperceptible, so this plot does not visually reveal the tighter variability of TD-MPC/TD-MPC2 that is apparent in [Figure 24](https://arxiv.org/html/2606.06746#A10.F24).
Figure 26: Learning curve comparison plot with standard deviation. Each subfigure shows the mean and standard deviation of the learning curves of PPO, SAC, TD-MPC, and TD-MPC2 on a given task. For visual clarity, the x-axis is log-scaled.
## Appendix K Environment Specification
Table 12:List of tasks in this paper and their state/action dimensions, and min/max episodic returns\. Episodic returns of ALE tasks are in human\-normalized scale\.EnvironmentTaskdim\(𝒪\)\\dim\(\\mathcal\{O\}\)dim\(𝒜\)\\dim\(\\mathcal\{A\}\)Min\.Max\.MuJoCoAnt\-v4278\-3655787MuJoCoHalfCheetah\-v4176\-5298718MuJoCoHopper\-v4113133749MuJoCoHumanoid\-v437617966583MuJoCoHumanoidStandup\-v43761726492216170MuJoCoInvertedDoublePendulum\-v4111509355MuJoCoInvertedPendulum\-v44171000MuJoCoPusher\-v4237\-165\-21MuJoCoReacher\-v4112\-76\-4MuJoCoSwimmer\-v482\-52347MuJoCoWalker2d\-v4176\-16704DMCacrobot\-swingup6101000DMCacrobot\-swingup\_sparse6101000DMCball\_in\_cup\-catch8201000DMCcartpole\-balance5101000DMCcartpole\-balance\_sparse5101000DMCcartpole\-swingup5101000DMCcartpole\-swingup\_sparse5101000DMCcartpole\-three\_poles11101000DMCcartpole\-two\_poles8101000DMCcheetah\-run17601000DMCdog\-fetch2323801000DMCdog\-run2233801000DMCdog\-stand2233801000DMCdog\-trot2233801000DMCdog\-walk2233801000DMCfinger\-spin9201000DMCfinger\-turn\_easy12201000DMCfinger\-turn\_hard12201000DMCfish\-swim24501000DMCfish\-upright21501000DMChopper\-hop15401000DMChopper\-stand15401000DMChumanoid\-run672101000DMChumanoid\-run\_pure\_state552101000DMChumanoid\-stand672101000DMChumanoid\-walk672101000DMChumanoid\_CMU\-run1375601000DMChumanoid\_CMU\-stand1375601000DMChumanoid\_CMU\-walk1375601000DMCmanipulator\-bring\_ball44501000DMCmanipulator\-bring\_peg44501000DMCmanipulator\-insert\_ball44501000DMCmanipulator\-insert\_peg44501000DMCpendulum\-swingup3101000DMCpoint\_mass\-easy4201000DMCpoint\_mass\-hard4201000DMCquadruped\-fetch901201000DMCquadruped\-run781201000DMCquadruped\-walk781201000DMCreacher\-easy6201000DMCreacher\-hard6201000DMCstacker\-stack\_249501000DMCstacker\-stack\_463501000DMCswimmer\-swimmer15611401000DMCswimmer\-swimmer625501000DMCwalker\-run24601000DMCwalker\-stand24601000DMCwalker\-walk24601000ALEBattleZone\-v5\(84, 84, 1\)18\-0\.011\.47ALEDoubleDunk\-v5\(84, 84, 
1\)18\-2\.539\.68ALENameThisGame\-v5\(84, 84, 1\)18\-0\.292\.26ALEPhoenix\-v5\(84, 84, 1\)18\-0\.114\.41ALEQbert\-v5\(84, 84, 1\)18\-0\.011\.45