Infra-Bayesian Reinforcement Learning Agents Outperform Classical RL For Worst-Case Robustness

arXiv cs.LG Papers

Summary

This paper presents the first implementation of an infra-Bayesian reinforcement learning agent, demonstrating that it outperforms classical RL in worst-case regret and handles Newcomb's problem optimally, offering a step toward robustness under model misspecification.

arXiv:2605.23146v1 Announce Type: new

# Infra-Bayesian Reinforcement Learning Agents Outperform Classical RL For Worst-Case Robustness
Source: [https://arxiv.org/html/2605.23146](https://arxiv.org/html/2605.23146)
Manish Aryal \(Purdue University, aryalm@purdue\.edu\); Faiyaz Azam \(Carnegie Mellon University, fazam@andrew\.cmu\.edu\); Agnivo Banerjee \(WorldQuant University, agnivo\.stem\.iiserk@gmail\.com\); Sai Sidhanth Manoharan Jayanthi \(UC Berkeley, sidmanoharan\_2512@berkeley\.edu\); Allegra Laro \(Independent, allegralaro@gmail\.com\); Clément Legentilhomme \(Aix\-Marseille University, lgtclem@gmail\.com\); Andrew Lin \(MIT, a\_lin@mit\.edu\); Florian Lorkowski \(University of Zurich, florian@lorkow\.ski\); Radman Rakhshandehroo \(University of British Columbia, rdmnr@student\.ubc\.ca\); Patric Rommel \(University of Stuttgart, patric\.rommel@itp1\.uni\-stuttgart\.de\); Emanuel Ruzak \(University of Buenos Aires, eruzak@dc\.uba\.ar\); Nathan Theng \(California State University, Fresno, theng\_nathan@mail\.fresnostate\.edu\); Paul Yushin Rapoport \(University of Chicago, pyrapoport@uchicago\.edu\)

All authors contributed equally and should be considered first authors; order is alphabetical, apart from Paul Rapoport, who was the mentor for the stream\. Corresponding author: pyrapoport@uchicago\.edu\. All funding and logistical support was provided by Kairos and the SPAR program\.

\(May 6, 2026\)

###### Abstract

Classical reinforcement learning assumes the agent interacts with a fixed environment whose behavior does not depend on the agent’s policy\. This assumption breaks down in non\-realizable settings where other actors might anticipate the agent’s behavior – most notably those environments crucial to AI safety, where any given agent interacts with predictors, humans, other AI agents, and institutions\. In such environments, the agent’s model class fails to capture the world in which it operates\. Under such misspecification, classical Bayesian methods can produce confidently wrong posteriors, unreliable decisions, and unbounded regret, as realizability fails to obtain\. Infra\-Bayesianism is a decision\-theoretic framework that addresses these failures by distinguishing ordinary probabilistic uncertainty – where priors can be reasonably chosen – from Knightian uncertainty, where no grounds exist for the construction of such a prior\. Infra\-Bayesianism does so by evaluating actions on their worst\-case outcomes in environments, rather than from posterior expectations or weighted averaging\.

We present the first proof\-of\-concept implementation of an infra\-Bayesian reinforcement learning architecture for finite\-outcome stateless decision problems\. Our agent maintains a set of imprecise hypotheses, updates them using infra\-Bayesian conditioning, and selects actions by maximizing worst\-case expected value\. We apply this implementation of the infra\-Bayesian maximin decision process to an environment with Knightian uncertainty, and demonstrate a lower worst\-case regret as compared to classical reinforcement learning agents\. We also investigate Newcomb’s problem and show that the infra\-Bayesian agent picks the optimal strategy, outperforming classical decision theory agents\. Our results provide a step towards reinforcement learning agents that remain robust under model misspecification and policy\-dependent uncertainty\.

## 1 Introduction

Reinforcement learning \(RL\) is usually analyzed under the assumption that the agent interacts with a fixed environment: a Markov decision process \(MDP\), a partially observable Markov decision process \(POMDP\), or a Bayesian posterior over such models\. This assumption is mathematically convenient and has enabled a large body of convergence and regret guarantees\. However, it is poorly suited to settings in which the agent is embedded in an environment too complex for it to represent, or in which the environment contains other actors that model the agent’s behavior and thereby respond to the agent’s policy rather than merely to its realized actions\. Examples include autonomous vehicles whose driving style is anticipated by human drivers, recommendation systems whose users adapt to the system’s known behavior, and decision\-theoretic problems such as Newcomb’s problem\. In these settings, the environment is not merely unknown; it can be policy\-dependent and non\-realizable from the agent’s point of view \(Bell et al\., [2021](https://arxiv.org/html/2605.23146#bib.bib9)\)\.

Classical Bayesian RL provides a principled treatment of ordinary uncertainty by placing a prior over possible environments\. Its strongest guarantees, however, rely on a *realizability* or *grain of truth* assumption: the true environment, or a sufficiently exact model of it, must lie in the agent’s hypothesis class\. This assumption is implausible for open\-ended deployment\. The real world contains the agent itself and other systems of comparable or greater complexity, so no tractable hypothesis class can be expected to contain a complete description of the environment\. Under such misspecification, Bayesian agents can become confidently wrong, and value\-based RL agents can fail to converge to optimal policies \(Gelb, [2025b](https://arxiv.org/html/2605.23146#bib.bib10)\)\.

Infra\-Bayesianism \(IB\) was developed to address this problem by replacing precise Bayesian hypotheses with imprecise, worst\-case\-evaluated hypotheses\. Instead of representing uncertainty only by a probability distribution over worlds, an IB agent may represent *Knightian uncertainty* over possible worlds and evaluate actions by a lower expectation: the value guaranteed against the worst admissible member of the hypothesis class\. It then selects the action that maximizes this worst\-case expected value\. This maximin structure is safety\-relevant because it turns the agent’s objective into a lower\-bound guarantee, rather than an average\-case prediction\. IB also supplies an update rule for these objects that is designed to preserve dynamic consistency across sequential observations \(Diffractor and Kosoy, [2020](https://arxiv.org/html/2605.23146#bib.bib5); Appel, [2020](https://arxiv.org/html/2605.23146#bib.bib11)\)\.

More broadly, IB belongs to the agent foundations and AI safety research agenda, which seeks mathematically precise models of embedded agency\. From this perspective, non\-realizability is not an edge case but a central feature of advanced agents operating in the real world: the agent is itself part of the environment it is trying to model\.

Despite substantial theoretical work on IB and its role in the Learning\-Theoretic Agenda for AI alignment, comparatively little work has translated its mathematical objects into concrete reinforcement learning agents\. This leaves a gap between the formal promise of IB – robust reasoning under non\-realizability, Knightian uncertainty, and policy\-dependent environments – and the practical question of how an agent could actually represent, update, and plan within it\. In this paper, we present a proof\-of\-concept IB RL architecture\. The implementation\-level object is not a posterior distribution over MDPs, but a finite set of extremal affine evaluators whose lower envelope defines the agent’s value estimate\. This representation allows the agent to perform tractable lower\-expectation evaluation, distinguish classical probabilistic mixtures from Knightian uncertainty, and apply dynamically consistent IB\-style updates without enumerating full history trees\.

Our contributions are:

1. We formulate a finite\-outcome infra\-Bayesian reinforcement learning \(IBRL\) architecture based on a\-measures and infradistributions, including classical and Knightian mixtures, lower\-expectation evaluation, and IB\-style updates\.
2. We show that the architecture recovers ordinary Bayesian behavior in the degenerate case where each infradistribution has a single minimal point and all uncertainty is classical\.
3. We demonstrate empirically that an IB agent outperforms a Bayesian agent in terms of worst\-case performance in a Knightian uncertain environment\.
4. We show that IB decision theory obtains optimal rewards in simple policy\-dependent environments, where classical decision theories fail to comprehensively model the causal structure\.

## 2 Background

### 2\.1 Classical and Bayesian reinforcement learning

In RL, an agent repeatedly chooses actions and receives observations from an environment\. The usual formalism is a Markov decision process, which models the environment as a set of states through which the agent can transition by choosing from a set of actions, receiving a reward upon each transition\. The agent aims to optimize its policy so as to maximize the cumulative expected reward over future actions\. Maximizing rewards is equivalent to minimizing *regret*, the difference between the expected reward obtainable by the optimal policy and the actually realized reward\.

Classical Bayesian RL places a prior over environments and updates by Bayes’ rule, which is appropriate in realizable settings where the agent’s hypothesis class contains the truth, and is the basis for many standard convergence arguments \(Sutton and Barto, [2018](https://arxiv.org/html/2605.23146#bib.bib12); Ghavamzadeh et al\., [2015](https://arxiv.org/html/2605.23146#bib.bib13)\)\. This assumption is a poor fit for embedded or open\-world agents, where no computationally feasible hypothesis class can contain the exact environment\. In such non\-realizable settings, Bayesian updating still produces a posterior, but the posterior need not converge to a useful model for decision\-making\. This is one of the core motivations for IB in the Learning\-Theoretic Agenda \(Gelb, [2025b](https://arxiv.org/html/2605.23146#bib.bib10); Kosoy and Kosoy, [2023](https://arxiv.org/html/2605.23146#bib.bib7)\)\.

### 2\.2 Non\-realizability and policy\-dependent environments

The central motivation for this work is that difficulties in RL arise not merely from statistical uncertainty\. They also include model misspecification, Knightian uncertainty, and environments whose behavior depends on the agent’s policy rather than only on its realized actions\.

Policy\-dependence is especially important for embedded agents\. In ordinary RL, the environment is usually modeled as responding to the current state and action\. In a policy\-dependent environment, however, rewards or transitions may depend directly on the agent’s policy\. This can occur when other agents, predictors, markets, users, or institutions form expectations about the agent’s behavior and respond to those expectations\.

Newcomb\-like decision processes formalize these environments by allowing rewards or transitions to depend on the policy itself\. Bell et al\. show that value\-based RL agents in such environments can converge only to ratifiable policies, and may fail to converge to optimal policies at all \(Bell et al\., [2021](https://arxiv.org/html/2605.23146#bib.bib9)\)\. This suggests that policy\-dependent environments require a decision\-theoretic treatment that goes beyond ordinary value learning in fixed MDPs\.

### 2\.3 Brief description of Newcomb’s Problem

Newcomb’s problem \(named after physicist William Newcomb\) is a classic decision theory dilemma \(Nozick, [1969](https://arxiv.org/html/2605.23146#bib.bib17)\)\. In this problem, the agent is presented with a transparent box and an opaque box, and may choose either to take only the opaque box \(one\-boxing\) or to take both boxes \(two\-boxing\)\. The agent can see that the transparent box contains one thousand dollars\. The opaque box contains either nothing or one million dollars, depending on a prediction that has already been made\. Crucially, the prediction was about the agent’s choice: the opaque box contains a million dollars if and only if the agent is predicted to take just the opaque box\. The prediction is highly accurate, but not necessarily perfect\. All aspects of the scenario are known to the agent\. This scenario poses a dilemma because the principle of dominance conflicts with the principle of expected\-utility maximization\. In particular, agents following causal or evidential decision theories come to different conclusions, and both fail to capture the causal structure of the environment accurately\.

### 2\.4 Infra\-Bayesianism at a high level

Infra\-Bayesianism generalizes Bayesian reasoning by allowing hypotheses to contain Knightian uncertainty\. Instead of representing uncertainty only by a probability distribution over possible worlds, an IB agent may represent a set of admissible evaluators\. The agent then evaluates a policy by its lower expectation: the value guaranteed under the least favorable admissible evaluator\.

This distinction is important\. A Bayesian agent averages over hypotheses using posterior probabilities\. An IB agent can also represent ambiguity that should not be averaged away\. In this sense, IB separates ordinary probabilistic uncertainty from Knightian uncertainty\. The resulting decision rule is maximin: choose the policy whose worst admissible value is best\.

This lower\-bound perspective is especially relevant in safety\-motivated settings\. When the agent’s model class may be misspecified, or when the environment may respond to the agent’s policy, average\-case posterior value may be the wrong object to optimize\. IB instead provides a formalism for reasoning with partial information and worst\-case guarantees\.

### 2\.5 Affine measures and infradistributions

The basic implementation\-level object used in this paper is an affine measure, or a\-measure\. In the finite non\-signed setting considered here, an a\-measure is a pair

$$a = (\lambda\mu, b), \tag{1}$$

where $\mu$ is a probability measure over possible observation histories, $\lambda \geq 0$ is a scale factor, and $b \geq 0$ is an offset\. Given a bounded return function $f$ over histories, the a\-measure evaluates $f$ by

$$a(f) = \lambda\,\mathbb{E}_{\mu}[f] + b. \tag{2}$$

The probability component $\mu$ represents ordinary stochastic uncertainty over histories\. The scale $\lambda$ determines the weight assigned to that component\. The offset $b$ records value associated with branches of the history tree that have been ruled out by observation\. Intuitively, when an observation eliminates some branches, IB does not simply delete the value associated with those branches\. Instead, their contribution is carried forward into the affine offset, which helps preserve dynamic consistency between ex ante and ex post evaluation\.

An infradistribution $\Psi$ can be viewed, for our finite purposes, as a set of affine evaluators\. Its lower expectation is

$$\underline{\mathbb{E}}_{\Psi}[f] = \inf_{a \in \Psi} a(f). \tag{3}$$

Thus, a policy is evaluated by the worst admissible a\-measure in the infradistribution\. In reward\-maximization language, the agent chooses a policy maximizing this lower expectation\.
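As a minimal sketch of equations \(2\) and \(3\), assume a finite set of histories, so that $\mu$ is a probability vector and $f$ a vector of returns\. The names `AMeasure`, `evaluate`, and `lower_expectation` are illustrative and are not taken from the paper's implementation\.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class AMeasure:
    """Hypothetical container for a finite a-measure (lambda * mu, b)."""
    scale: float             # lambda >= 0
    probs: Sequence[float]   # mu: probabilities over a finite set of histories
    offset: float            # b >= 0

    def evaluate(self, f: Sequence[float]) -> float:
        """Equation (2): a(f) = lambda * E_mu[f] + b."""
        expectation = sum(p * x for p, x in zip(self.probs, f))
        return self.scale * expectation + self.offset


def lower_expectation(psi: Sequence[AMeasure], f: Sequence[float]) -> float:
    """Equation (3): over a finite set, the infimum of a(f) is a minimum."""
    return min(a.evaluate(f) for a in psi)


# Two admissible evaluators over two histories, with returns f = (1, 0).
f = (1.0, 0.0)
psi = [
    AMeasure(scale=1.0, probs=(0.25, 0.75), offset=0.0),   # a(f) = 0.25
    AMeasure(scale=0.5, probs=(0.5, 0.5), offset=0.125),   # a(f) = 0.375
]
worst_case_value = lower_expectation(psi, f)  # the first evaluator is least favorable
```

Here the lower expectation is 0\.25, the value under the less favorable of the two evaluators\.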

### 2\.6 Classical mixtures and Knightian uncertainty

IB distinguishes between classical probabilistic uncertainty and Knightian uncertainty\. Classical uncertainty is represented by a mixture\. If $\Psi_i$ are infradistributions and $w_i$ are prior weights, their classical mixture is

$$\sum_i w_i \Psi_i = \Big\{ \sum_i w_i a_i \;:\; a_i \in \Psi_i \Big\}, \qquad \sum_i w_i = 1. \tag{4}$$

This corresponds to uncertainty that can be averaged over using prior weights, as in Bayesian reasoning\.

Knightian uncertainty is different\. It represents ambiguity that remains exposed to the outer infimum rather than being averaged away\. A singleton infradistribution recovers ordinary probabilistic evaluation\. A non\-singleton infradistribution represents ambiguity over several admissible evaluators, where the value of a policy is determined by the least favorable one\.

This distinction is central to the behavior of an IB agent\. A Bayesian agent facing several possible models assigns probabilities to them and optimizes posterior expected value\. An IB agent may instead treat several models as admissible without assigning meaningful probabilities between them, and then choose the policy whose lower expectation is highest\.

### 2\.7 IB\-style updating

The IB update rule can be understood as a dynamically consistent analogue of Bayesian conditioning\. After observing an event $L$, each a\-measure is restricted to the observed branch\. However, unlike ordinary Bayesian conditioning, the value associated with the unobserved branch is not simply discarded\. Instead, it is transferred into the offset term\.

To formalize this procedure, we must define the return function $g$\. Unlike classical RL, where reward is typically a stepwise scalar $R(s,a)$ evaluated at each transition, IB evaluates policies using a bounded return \(or loss\) function $g$ defined over entire histories, formally, over infinite sequences of actions and observations \(destinies\)\. During evaluation, we will set $g \equiv f$\. Let $\mu(g)$ denote the expected value of $g$ under the measure $\mu$\. Viewing the observation $L$ as an indicator function for the realized event, the raw update takes the following schematic form:

$$(\lambda\mu, b) \mapsto \big(\lambda\mu L,\; b + \lambda\mu((1-L)g)\big). \tag{5}$$

Here, $\mu L$ denotes the restriction of $\mu$ to the observed event\. The complementary indicator $(1-L)$ isolates the branches of the history tree that were ruled out by the observation\. The term $\lambda\mu((1-L)g)$ calculates the exact expected return of those unrealized branches, which gets added to the offset $b$\. Intuitively, the offset records value that has already been settled by the observation, so that future lower\-expectation evaluation remains dynamically consistent with the original ex ante evaluation\.
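The raw update \(5\) can be sketched on a finite set of histories as follows\. This is an illustration under stated assumptions, not the paper's code: an a\-measure is stored as a pair of an unnormalized mass vector \(the product of the scale and the probability measure\) and an offset, and the helper names `evaluate` and `raw_update` are hypothetical\.

```python
def evaluate(a_measure, f):
    """a(f) = sum_h mass[h] * f[h] + b, with mass = lambda * mu stored unnormalized."""
    mass, offset = a_measure
    return sum(m * x for m, x in zip(mass, f)) + offset


def raw_update(a_measure, observed, g):
    """Schematic raw update (5) on a finite set of histories.

    observed[h] is 1 if history h is consistent with the observation, else 0.
    """
    mass, offset = a_measure
    # Restrict the measure to the observed event.
    restricted = tuple(m * l for m, l in zip(mass, observed))
    # Expected return of the ruled-out branches moves into the offset.
    settled = sum(m * (1 - l) * x for m, l, x in zip(mass, observed, g))
    return restricted, offset + settled


g = (1.0, 0.0, 0.5)                 # return of each of three histories
prior = ((0.25, 0.25, 0.5), 0.0)    # lambda = 1, b = 0
observed = (1, 1, 0)                # the third history is ruled out

posterior = raw_update(prior, observed, g)
# posterior == ((0.25, 0.25, 0.0), 0.25): the ruled-out branch adds 0.5 * 0.5 to b
```

In this example `evaluate(prior, g)` and `evaluate(posterior, g)` both equal 0\.5: moving the value of the ruled\-out branch into the offset leaves the evaluation of the return function unchanged, which is the dynamic\-consistency property described above\.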

After this restriction step, the resulting infradistribution is renormalized, such that

$$\underline{\mathbb{E}}_{\Psi}(0) = 0, \qquad \underline{\mathbb{E}}_{\Psi}(1) = 1. \tag{6}$$

This normalization plays a role analogous to posterior normalization in Bayesian conditioning\. In the special case where every infradistribution has exactly one minimal point and all uncertainty is represented by classical mixtures, the IB update reduces to ordinary Bayesian updating\. This recovery of Bayes’ rule is an important sanity check for both the formalism and our implementation\.
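The text above states only the target conditions \(6\)\. One affine rescaling that achieves them, assumed here for illustration, subtracts the lower expectation of the constant 0 from every offset and divides masses and offsets by the gap between the lower expectations of the constants 1 and 0\. The function names and the pair representation \(unnormalized mass vector, offset\) are hypothetical\.

```python
def lower_expectation_const(psi, c):
    """Lower expectation of the constant function c over a finite set of a-measures."""
    return min(c * sum(mass) + offset for mass, offset in psi)


def renormalize(psi):
    """Assumed affine rescaling so that E(0) = 0 and E(1) = 1 afterwards."""
    e0 = lower_expectation_const(psi, 0.0)   # smallest offset
    e1 = lower_expectation_const(psi, 1.0)   # smallest total mass plus offset
    width = e1 - e0
    if width <= 0:
        raise ValueError("cannot renormalize: E(1) - E(0) must be positive")
    return [
        (tuple(m / width for m in mass), (offset - e0) / width)
        for mass, offset in psi
    ]


# Singleton case: a prior mu = (0.25, 0.25, 0.5) after a raw update that ruled
# out the third history and moved its value (0.5 * 0.5) into the offset.
updated = [((0.25, 0.25, 0.0), 0.25)]
normalized = renormalize(updated)
# normalized == [((0.5, 0.5, 0.0), 0.0)], the Bayesian conditional of mu
```

In the singleton case the offset cancels and the mass is divided by the probability of the observed event, which is ordinary Bayesian conditioning, consistent with the sanity check described above\.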

An important property of the raw IB update \([5](https://arxiv.org/html/2605.23146#S2.E5)\) is its linearity, implying that the IB update is a transformation in the space of a\-measures mapping straight lines to straight lines\. Intuitively, this means that the IB update does not produce new vertices\. To reconstruct all relevant information contained in the infradistribution, the implementation thus only needs to track and update the extremal minimal points, which are the vertices of the set of minimal points\.

## 3 Related Work

Existing IB work is mostly theoretical, covering inframeasures, update rules, dynamic consistency, learnability, and regret\-style guarantees \(Diffractor and Kosoy, [2020](https://arxiv.org/html/2605.23146#bib.bib5); Appel, [2020](https://arxiv.org/html/2605.23146#bib.bib11); Gelb, [2025a](https://arxiv.org/html/2605.23146#bib.bib16)\)\. Our work complements this by giving a finite RL implementation that explicitly represents infradistributions, performs IB\-style updates, and plans using lower expectations\.

We situate this contribution relative to three nearby areas: robust RL, imprecise probability and credal sets, and policy\-dependent RL\.

### 3\.1 Robust reinforcement learning

Robust RL studies agents that optimize performance under worst\-case uncertainty over environment models\. In robust MDPs, uncertainty is usually represented as a set of possible transition kernels or reward functions, and the agent chooses a policy that performs well against the least favorable model in that set \(Iyengar, [2005](https://arxiv.org/html/2605.23146#bib.bib14); Nilim and El Ghaoui, [2005](https://arxiv.org/html/2605.23146#bib.bib15)\)\.

This literature is closely related to our work because both approaches replace average\-case evaluation with worst\-case evaluation\. The main difference is the representation of uncertainty\. Robust MDPs typically retain a fixed, policy\-independent environment model with uncertainty over transition or reward parameters\. Our approach instead represents uncertainty using infradistributions over affine evaluators and updates those evaluators using the IB update rule\.

### 3\.2 Imprecise probability and credal sets

Imprecise probability theory represents uncertainty using sets of probability measures rather than a single precise distribution\. This includes credal sets, lower and upper probabilities, and lower previsions\. Walley’s framework of coherent lower previsions is one of the standard foundations for reasoning with imprecise probabilities, while Levi’s work on credal probability helped establish sets of probability functions as representations of belief states \(Walley, [1991](https://arxiv.org/html/2605.23146#bib.bib19); Levi, [1980](https://arxiv.org/html/2605.23146#bib.bib20)\)\.

Credal sets are closed and convex sets of probability distributions\. They provide a simple representation of Knightian uncertainty, where uncertainty cannot be represented by a single precise probability distribution\. Credal\-set methods are closely related to our work because they also evaluate choices using lower expectations over a set of admissible probability models\. However, IB is not merely a direct application of credal sets: in the finite setting considered here, an a\-measure contains both a probabilistic component and an offset term\.

Recent domain\-theoretic work on imprecise probability further emphasizes credal sets as structured objects supporting refinement and convergence \(Edalat et al\., [2026](https://arxiv.org/html/2605.23146#bib.bib18)\); our work shares this finite\-representation perspective, but implements lower\-expectation planning using IB affine evaluators rather than ordinary probability measures\.

### 3\.3 Policy\-dependent and Newcomb\-like environments

Policy\-dependent environments challenge the standard RL assumption that the environment is fixed independently of the agent’s policy\. Bell et al\. formalize this issue for RL through Newcomb\-like decision processes, where rewards or transitions may depend directly on the policy\. They show that value\-based RL agents in such environments cannot converge to non\-ratifiable policies and may fail to converge to optimal policies \(Bell et al\., [2021](https://arxiv.org/html/2605.23146#bib.bib9)\)\. This motivates decision procedures that evaluate policies more directly, rather than only learning action values in a fixed environment\. Our work provides a finite IB implementation for studying lower\-expectation planning under non\-realizability and policy\-dependent uncertainty\.

## 4 Methods

The belief state of the agent is stored as a single infradistribution and a world model\. The world model describes the type of environment in which the agent operates\. In particular, the representation of the probability measure $\mu$ of each a\-measure depends on the world model\.

### 4\.1 Representing infradistributions

General infradistributions are infinite sets of \(signed\) affine measures\. Operationally, measures can be discarded from this set if they are never needed to determine the lower expectation of any relevant function through equation \([3](https://arxiv.org/html/2605.23146#S2.E3)\)\. For any infradistribution, only its minimal points contribute to the expectation values \(Appel, [2020](https://arxiv.org/html/2605.23146#bib.bib11)\)\. Minimal points that can be written as non\-trivial convex combinations of other minimal points also cannot contribute\. This means that only the extremal minimal points need to be stored, analogous to representing a convex polytope by its vertices\. This is the key computational representation used in our implementation\.

Infradistributions are represented by their extremal minimal points, which form a finite set of a\-measures\. Three constructions cover every belief state we use:

1. Singleton: a set containing a single a\-measure encodes a hypothesis without uncertainty\. These infradistributions correspond to knowing the true environment exactly\.
2. Classical \(Bayesian\) mixture: a weighted combination of infradistributions, according to equation \([4](https://arxiv.org/html/2605.23146#S2.E4)\)\. These represent classical probabilistic uncertainty\. If each component is a singleton, this mixture reduces to standard Bayesian averaging\.
3. Knightian mixture: the set\-union of the constituent infradistributions\. This mixture carries no weights\. It is evaluated by the worst case across components\.

The two mixture operations compose and nest freely, which allows representing the rich belief states on which an IB agent operates\.
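The three constructions and their nesting can be sketched as below\. This is a minimal illustration, assuming a\-measures are stored as pairs of an unnormalized mass vector and an offset; the function names are hypothetical, and the sketch does not prune non\-extremal points, which leaves the minimum unaffected but would matter for efficiency\.

```python
from itertools import product


def evaluate(a_measure, f):
    mass, offset = a_measure
    return sum(m * x for m, x in zip(mass, f)) + offset


def lower_expectation(psi, f):
    return min(evaluate(a, f) for a in psi)


def singleton(probs):
    """A hypothesis without Knightian uncertainty: one a-measure, lambda = 1, b = 0."""
    return [(tuple(probs), 0.0)]


def classical_mixture(weights, components):
    """Weighted sums of one a-measure drawn from each component infradistribution."""
    mixed = []
    for choice in product(*components):
        n_outcomes = len(choice[0][0])
        mass = tuple(
            sum(w * a[0][h] for w, a in zip(weights, choice))
            for h in range(n_outcomes)
        )
        offset = sum(w * a[1] for w, a in zip(weights, choice))
        mixed.append((mass, offset))
    return mixed


def knightian_mixture(components):
    """Set union of the components; no weights are involved."""
    return [a for psi in components for a in psi]


f = (1.0, 0.0)                       # reward 1 on the first outcome
low = singleton((0.25, 0.75))
high = singleton((0.75, 0.25))

bayes_value = lower_expectation(classical_mixture((0.5, 0.5), [low, high]), f)
knight_value = lower_expectation(knightian_mixture([low, high]), f)
nested = knightian_mixture([classical_mixture((0.5, 0.5), [low, high]), low])
```

With these inputs the classical mixture averages the two hypotheses to a value of 0\.5, while the Knightian mixture is evaluated at the less favorable hypothesis, 0\.25\. The `nested` belief state shows a Knightian mixture whose first component is itself a classical mixture\.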

### 4\.2 World models

Conceptually, measures are probability distributions over infinite histories\. Making them computationally tractable requires compressed representations\. Each type of environment uses a separate world model that defines how measures are represented and how predictions are derived from them, how past observations are represented, and how weighted mixtures of a\-measures are computed\. This paper focuses on two world models: Bernoulli bandits and Newcomb\-like problems\.

In a one\-armed Bernoulli bandit, the history can be represented as a pair of integers $(N, R)$, where $N$ is the number of times the arm was pulled and $R$ is the number of times a reward was obtained\. This representation uses the fact that each round is independent, such that the order of observations does not matter\. A non\-mixed measure is represented by a fixed reward probability $p$\. Mixed measures are represented as a set of pairs $(c_i, p_i)$, where $p_i$ is the probability and $c_i$ the weight of the $i^{\text{th}}$ component of the mixture and $\sum_i c_i = 1$\. Given such a history and measure, the probability of a certain branch is computed as $\sum_i c_i (1 - p_i)^{N-R} p_i^{R}$\. Predictions for future outcomes can be computed similarly\. Learning proceeds by updating the history $(N, R)$, implicitly recreating Bayes’ rule\. For a $k$\-armed Bernoulli bandit, histories are represented by $k$ pairs $(N_j, R_j)$ and measures by $k$ sets of pairs $(c_{ji}, p_{ji})$\. Expectation values are computed per\-arm as for the one\-armed bandit\. This factorization is possible because arms are independent of each other, i\.e\. observing one arm does not give any information about the others\.
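A short sketch of the one\-armed Bernoulli world model follows\. The branch probability is the formula from the text; the predictive probability of the next reward is written here as a ratio of branch probabilities, which is one way to read "computed similarly" and is an assumption rather than the paper's stated formula\. Function names are hypothetical\.

```python
def branch_probability(components, pulls, rewards):
    """Probability of one specific history with `rewards` successes in `pulls` pulls.

    components is a list of (c_i, p_i) pairs with the c_i summing to 1.
    """
    return sum(
        c * (1 - p) ** (pulls - rewards) * p ** rewards
        for c, p in components
    )


def predictive_reward_probability(components, pulls, rewards):
    """Assumed predictive form: P(history followed by a reward) / P(history)."""
    joint = branch_probability(components, pulls + 1, rewards + 1)
    return joint / branch_probability(components, pulls, rewards)


mixed_measure = [(0.5, 0.25), (0.5, 0.75)]   # two equally weighted reward probabilities
history = (2, 2)                             # (N, R): two pulls, two rewards

p_branch = branch_probability(mixed_measure, *history)
p_next = predictive_reward_probability(mixed_measure, *history)
```

Before any pulls the predictive reward probability is the prior mean 0\.5; after two rewarded pulls it rises to about 0\.7, because the history $(N, R)$ reweights the components toward the higher reward probability without the weights $c_i$ ever being rewritten\.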

We consider Newcomb\-like environments with imperfect predictors\. The world model itself contains the entire reward matrix and the predictor accuracy, i\.e\. the agent knows the entire structure of the environment\. In this setting, there is nothing to be learned\. Therefore, measures and histories do not have an internal state and are not updated upon observations\.
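To illustrate why one\-boxing is optimal when the predictor is accurate, the following sketch evaluates the two pure policies\. It assumes the payoffs from Section 2\.3 \(one thousand and one million dollars\), a single accuracy parameter that applies equally to both policies, and an accuracy of 0\.99; the accuracy value, the restriction to pure policies, and all names are assumptions made for this example, not details of the paper's world model\.

```python
BOX_TRANSPARENT = 1_000       # dollars visible in the transparent box
BOX_OPAQUE = 1_000_000        # dollars in the opaque box if one-boxing was predicted


def expected_reward(policy, accuracy):
    """Expected payoff of a pure policy against a predictor with the given accuracy."""
    if policy == "one-box":
        # Predicted correctly with probability `accuracy`, so the opaque box is full.
        return accuracy * BOX_OPAQUE
    if policy == "two-box":
        # Predicted correctly: opaque box empty. Mispredicted: both boxes pay out.
        return (accuracy * BOX_TRANSPARENT
                + (1 - accuracy) * (BOX_TRANSPARENT + BOX_OPAQUE))
    raise ValueError(f"unknown policy: {policy}")


def best_policy(accuracy):
    """Pick the pure policy with the higher policy-dependent expected reward."""
    policies = ("one-box", "two-box")
    return max(policies, key=lambda pi: expected_reward(pi, accuracy))


choice = best_policy(accuracy=0.99)   # the accuracy value is an assumption
```

Because the reward distribution is conditioned on the policy itself, one\-boxing is preferred whenever the accuracy exceeds 0\.5005 under this payoff model; at an accuracy of 0\.5 the predictor carries no information and two\-boxing is preferred\.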

### 4\.3 Decision and update procedure

The aim of the agent is to select the optimal policy, which might be non\-deterministic, based on its belief state\. The agent achieves this by discretizing policy space and maintaining a set of candidate policies $\Pi$\. At every step of the interaction, the agent will:

1. Compute the lower expectation of each policy $\pi \in \Pi$ as $\underline{\mathbb{E}}_{\Psi(\pi)}[f] = \inf_{a \in \Psi(\pi)} a(f)$, where $\Psi(\pi)$ is the infradistribution and $f$ is the reward function\. The infradistribution can depend on the policy explicitly, such as in Newcomb\-like environments, or implicitly, by sampling an action from the policy\. With $|\Psi| = 1$ this operation reduces to an ordinary expected value\. With $|\Psi| > 1$ it picks the worst\-case environment\.
2. Select the optimal policy $\pi^{*} = \operatorname{argmax}_{\pi \in \Pi} \underline{\mathbb{E}}_{\Psi(\pi)}[f]$, and sample an action from $\pi^{*}$\. Both the policy and the action are passed to the environment\.
3. Update its belief state upon receiving an observation from the environment\. This includes raw updates of a\-measures according to equation \([5](https://arxiv.org/html/2605.23146#S2.E5)\), updates of the probability measures as specified by the world model, and the renormalization step\. Because of linearity of the raw update, no new extremal points appear, so it suffices to update the existing set\.
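Steps 1 and 2 can be sketched as a maximin search over a discretized policy grid\. The two\-environment example, the grid resolution, and the names `select_policy` and `psi_of` are hypothetical; the belief update of step 3 is omitted\.

```python
import random


def lower_expectation(psi, f):
    """Minimum of a(f) over a-measures stored as (mass, offset) pairs."""
    return min(sum(m * x for m, x in zip(mass, f)) + b for mass, b in psi)


def select_policy(policies, psi_of, f):
    """Steps 1 and 2: score every candidate policy and return the maximin one."""
    return max(policies, key=lambda pi: lower_expectation(psi_of(pi), f))


# Hypothetical two-armed example. A policy is the probability q of pulling arm 2;
# outcomes are (no reward, reward), and f gives the reward of each outcome.
f = (0.0, 1.0)
environments = [(0.75, 0.25), (0.25, 0.75)]   # admissible (p1, p2) pairs


def psi_of(q):
    """One a-measure per admissible environment, induced by the mixed policy q."""
    psi = []
    for p1, p2 in environments:
        p_reward = (1 - q) * p1 + q * p2
        psi.append(((1 - p_reward, p_reward), 0.0))
    return psi


policy_grid = [i / 4 for i in range(5)]        # q in {0, 0.25, 0.5, 0.75, 1}
q_star = select_policy(policy_grid, psi_of, f)
action = 2 if random.random() < q_star else 1  # sample an action from the policy
```

In this symmetric example each pure policy has a worst\-case value of 0\.25, while the mixed policy with $q = 0.5$ guarantees 0\.5 in both environments, so the maximin choice is non\-deterministic\.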

## 5 Results and Discussion

In the following sections, we introduce experiments that validate our implementation of an infra\-Bayesian agent\. First, we demonstrate proper behavior under Knightian uncertainty; next, we validate behavior in Newcomb’s problem\. Additional results in the appendix show that an IB agent reduces to a Bayesian agent when initialized equivalently in a standard stochastic bandit setting \(see Appendix [A](https://arxiv.org/html/2605.23146#A1)\), and that an infra\-Bayesian agent can meaningfully learn under Knightian uncertainty \(see Appendix [B](https://arxiv.org/html/2605.23146#A2)\)\.

### 5.1 Knightian Uncertainty

To demonstrate Knightian uncertainty, we consider a two-armed, adversarial Bernoulli bandit environment. Each arm yields reward 1 with a probability chosen anew at the beginning of each step, and otherwise reward 0. The mechanism by which probabilities are chosen is unknown to the agent and may be time-dependent or even adversarial. This makes it impossible to learn reward probabilities over multiple episodes: past observations do not provide useful data for assigning a prior to the current interaction.

The reward probabilities are constrained to the ranges $p_{1}\in[0.3,0.7]$ and $p_{2}\in[0.4,0.8]$. The left panel of figure [1](https://arxiv.org/html/2605.23146#S5.F1) displays the space of possible environments. There are two regions: in the upper triangle of the diagram $p_{2}>p_{1}$, and thus it is optimal to pull arm 2. In the lower triangle, the opposite is true. For decision making, it only matters whether the agent believes it is in the upper or lower region. Both regions are possible within the constraint.

Figure 1: Left: Visualization of environment space $(p_{1},p_{2})$. In the blue (green) area, arm 1 (2) yields the higher expected reward. The white box indicates the region that is allowed by the constraint. The red dot indicates the worst allowed environment. Right: cumulative regret of classical and IB agents. The shaded areas show the theoretically allowed ranges. The lines show simulated results from a single roll-out of the worst-case configuration for each agent. [Axes, left: Probability $p_{1}$ against Probability $p_{2}$; right: Cumulative Regret against Episode. Legend: Classical allowed, Classical worst-case, IB allowed, IB worst-case.]

A classical Bayesian agent cannot act from the interval constraints alone. It must first replace them with a precise prior over environments. Different choices of this prior can lead to different policies. For example, a prior concentrated on an environment where $p_{1}>p_{2}$ recommends arm 1, while a prior concentrated on an environment where $p_{2}>p_{1}$ recommends arm 2. In this experiment, we illustrate this prior-dependence by considering classical agents whose priors are point masses at the corners of the allowed set; of course, this should not be read as the only possible classical choice. A uniform prior over the allowed set would recommend arm 2 in this example, matching the infra-Bayesian action. The point is instead that the classical agent’s behavior depends on an additional prior-selection assumption that is not specified by the interval constraints themselves.

By contrast, the infra-Bayesian agent can represent the constraint directly as Knightian uncertainty, which will lead it to maximize the worst-case outcome. The worst allowed environment is $(p_{1},p_{2})=(0.3,0.4)$, as indicated in the figure. In this environment $p_{2}>p_{1}$, so the agent will always pull arm 2. It is guaranteed to achieve an average reward of at least $0.4$, and does not risk getting $0.3$ on arm 1.
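The maximin choice and the prior-dependence of the classical agents can be checked with a short sketch. Because the expected reward of each arm is linear in $(p_{1},p_{2})$, it is enough to inspect the four corners of the allowed box. The function names below are hypothetical and arms are indexed from 0 in the code.

```python
from itertools import product

P1_RANGE = (0.3, 0.7)
P2_RANGE = (0.4, 0.8)
corners = list(product(P1_RANGE, P2_RANGE))  # the four (p1, p2) corner environments

def worst_case_reward(arm):
    """Lowest expected reward of an arm over the allowed environments."""
    return min(env[arm] for env in corners)

# Maximin choice: index 0 is arm 1, index 1 is arm 2.
maximin_arm = max((0, 1), key=worst_case_reward)

def point_prior_arm(env):
    """Arm recommended by a classical agent whose prior is a point mass at env."""
    return 0 if env[0] > env[1] else 1

# Report classical recommendations with the paper's 1-based arm numbering.
classical_choices = {env: point_prior_arm(env) + 1 for env in corners}
print(maximin_arm + 1, worst_case_reward(maximin_arm), classical_choices)
```

Only the point mass at $(0.7, 0.4)$ recommends arm 1; the other three corners recommend arm 2, which illustrates how the classical recommendation moves with the chosen prior.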

The exact reward and regret realized depend on the true environment. The right panel of figure [1](https://arxiv.org/html/2605.23146#S5.F1) shows the possible ranges of regret achieved by the two agents, for any true environment consistent with the constraint. Also shown are simulated regret curves for the worst-case configurations for each agent. We find that the worst-case outcome for the IB agent realizes a lower regret than the worst-case of the classical agent.
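The upper ends of these regret ranges can be sketched under two assumptions that are ours rather than the paper's: regret is the expected per-step gap to the better arm of the true environment, and the classical agent under consideration is the one that commits to arm 1. The horizon of 100 episodes is read off the figure's axis. Since this regret is convex in $(p_{1},p_{2})$, its maximum over the box is attained at a corner.

```python
from itertools import product

corners = list(product((0.3, 0.7), (0.4, 0.8)))  # corner environments (p1, p2)
T = 100  # horizon, taken from the episode axis of figure 1

def per_step_regret(env, arm):
    """Expected reward lost by pulling `arm` instead of the better arm of env."""
    return max(env) - env[arm]

def worst_cumulative_regret(arm, horizon=T):
    """Largest expected cumulative regret of always pulling `arm`."""
    return horizon * max(per_step_regret(env, arm) for env in corners)

ib_worst = worst_cumulative_regret(arm=1)         # IB agent always pulls arm 2
classical_worst = worst_cumulative_regret(arm=0)  # classical agent stuck on arm 1
print(ib_worst, classical_worst)
```

Under these assumptions the worst case for arm 2 arises at $(0.7, 0.4)$ and the worst case for arm 1 at $(0.3, 0.8)$.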

Note that the results here are not meant to demonstrate a meaningful form of learning, either by the infra-Bayesian or classical agents. Indeed, even if arm 1 repeatedly gives higher rewards, an infra-Bayesian agent will not update to exploit them, which is the correct behavior under Knightian uncertainty. Rewards could be controlled by an unknown or adversarial mechanism that makes arm 1 look attractive, only to give it a lower reward when the agent chooses it. Sticking to arm 2 is therefore the robust and safe strategy in this scenario. In appendix [B](https://arxiv.org/html/2605.23146#A2), we present a stochastic bandit setting in which the IB agent does learn despite Knightian uncertainty.

### 5.2 Policy dependence

Table 1: Reward matrix for Newcomb’s problem

Policy dependence is investigated via Newcomb’s problem with an imperfect predictor. We consider a setting in which the transparent box always contains \$1. The opaque box contains \$10 if the agent is predicted to one-box, and \$0 otherwise. The resulting payoffs are shown in table [1](https://arxiv.org/html/2605.23146#S5.T1). The agent chooses a policy that one-boxes with probability $p$. A predictor with accuracy $\alpha\in[0.5,1]$ reads the agent’s policy and predicts one-boxing with probability $p\cdot(2\alpha-1)+0.5\cdot(2-2\alpha)$. A perfect predictor ($\alpha=1$) predicts one-boxing at the same rate $p$ that the agent one-boxes. A random predictor ($\alpha=0.5$) guesses at random, thus its prediction is independent of the agent’s policy.
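The expected reward implied by this setup can be written out explicitly. The derivation below is a sketch under two assumptions of ours: the payoff is 10 whenever one-boxing is predicted plus 1 whenever the agent two-boxes, and the sampled action is independent of the prediction once the policy $p$ is fixed. The symbols $q$ and $R$ are introduced here for illustration.

```latex
\begin{align*}
q(p,\alpha) &= p\,(2\alpha-1) + 0.5\,(2-2\alpha) = p\,(2\alpha-1) + (1-\alpha), \\
\mathbb{E}[R \mid p,\alpha] &= 10\,q(p,\alpha) + 1\cdot(1-p) \\
  &= (20\alpha-11)\,p + (11-10\alpha).
\end{align*}
```

The expression is linear in $p$, so under these assumptions the optimal policy sits at an endpoint of $[0,1]$ unless the coefficient $20\alpha-11$ vanishes.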

We implement this setting using the Newcomb-like world model described in section [4.2](https://arxiv.org/html/2605.23146#S4.SS2). The optimal policy and reward are shown in figure [2](https://arxiv.org/html/2605.23146#S5.F2). For a sufficiently accurate predictor ($\alpha>0.55$), one-boxing is the optimal strategy. For a nearly random predictor ($\alpha<0.55$), two-boxing is optimal. The figure also shows simulated results from our implementation. We find that the agent consistently selects the optimal policy and achieves the corresponding optimal reward.
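The switch at $\alpha=0.55$ can be reproduced with a small grid search over one-boxing rates. This is a sketch rather than the paper's world model: it assumes the expected reward is 10 times the probability that one-boxing is predicted plus 1 times the probability of two-boxing, and the function names are hypothetical.

```python
import numpy as np

def predict_one_box_prob(p, alpha):
    """Probability that the predictor predicts one-boxing, as given in the text."""
    return p * (2 * alpha - 1) + 0.5 * (2 - 2 * alpha)

def expected_reward(p, alpha):
    """Assumed payoff: $10 if one-boxing is predicted, plus $1 if the agent two-boxes."""
    return 10.0 * predict_one_box_prob(p, alpha) + 1.0 * (1.0 - p)

def optimal_one_box_rate(alpha, n_grid=101):
    """Best one-boxing rate on a discretized policy grid."""
    grid = np.linspace(0.0, 1.0, n_grid)
    return float(grid[int(np.argmax(expected_reward(grid, alpha)))])

for alpha in (0.5, 0.6, 0.9, 1.0):
    p_star = optimal_one_box_rate(alpha)
    print(alpha, p_star, expected_reward(p_star, alpha))
```

At $\alpha=0.55$ the reward no longer depends on the one-boxing rate, so the grid search would return an arbitrary maximizer there.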

Figure 2: Average reward (left) and one-boxing rate (right) in Newcomb’s problem as a function of the predictor accuracy. Shown are both optimal and simulated values, averaged over 1000 episodes. For $\alpha=0.55$, the reward is independent of the one-boxing rate and thus every rate is optimal. [Axes, left: Average reward against Predictor accuracy $\alpha$; right: One-boxing rate against Predictor accuracy $\alpha$. Legend: Optimal, Simulated.]

An agent following causal decision theory would two-box in Newcomb’s problem, arguing that its decision cannot influence the already fixed box contents after the predictor’s move, thus missing out on the large reward when the predictor is sufficiently accurate. An agent following evidential decision theory also one-boxes in this scenario, but for a different reason. It fails to accurately model the causal structure of the problem and can misbehave in similar scenarios. This scenario demonstrates IB decision theory selecting the optimal action, and for the correct reason, given full information of the reward structure. The same approach also achieves optimal behavior in the (Asymmetric) Death in Damascus and Coordination Game scenarios, some of which require mixed strategies (Bell et al., [2021](https://arxiv.org/html/2605.23146#bib.bib9)).

## 6 Conclusion, Limitations, and Future Work

This paper presents a proof\-of\-concept infra\-Bayesian reinforcement learning architecture for finite\-outcome stateless decision problems\. Our design implements a\-measures, infradistributions, classical and Knightian mixtures, and the IB conditioning rule, while remaining computationally tractable by storing only extremal minimal points and using optimized representations for histories and belief states\. We find that the architecture recovers standard Bayesian behavior as a special case, confirming that IB generalizes rather than replaces classical reasoning\. In a Knightian\-uncertain bandits setting, the IB agent identifies and optimizes against the worst\-case environment within the admissible set, while classical agents depend on arbitrary priors\. This demonstrates concretely that the distinction between probabilistic and Knightian uncertainty, central to the IB formalism, produces meaningfully different agent behavior in settings where classical methods may be ambiguous\. More broadly, real\-world agents must operate under misspecification and policy\-dependent environments\. In such settings, optimizing expected value can lead to brittle behavior, while worst\-case optimization offers robustness\. Bridging IB theory to concrete agent implementations is a step toward RL systems that are robust by design\.

Our implementation is restricted to finite outcomes, nonnegative a-measures, and small hypothesis spaces. Scaling to continuous state spaces, large hypothesis classes, and function approximation remains open. Future work, some of which is already underway, will seek to address these limitations, most notably by permitting optimization of multi-step decision processes under Knightian uncertainty and by fully leveraging IB’s advantages in keeping track of multiple possible environments. Nevertheless, the results show that IB reasoning can be translated into a working RL agent. Additionally, our regret bounds remain linear, although they improve on those of classical RL. Note that the classical bounds presuppose that a classical agent can converge to a ratifiable policy at all.

## References

- Basic inframeasure theory. AI Alignment Forum. [Link](https://www.alignmentforum.org/posts/basic-inframeasure-theory)
- J. Bell, L. Linsefors, C. Oesterheld, and J. Skalse (2021). Reinforcement learning in newcomblike environments. In Advances in Neural Information Processing Systems, Vol. 34, pp. 27284–27295.
- Diffractor and V. Kosoy (2020). Introduction To The Infra-Bayesianism Sequence. LessWrong. [Link](https://www.lesswrong.com/posts/zB4f7QqKhBHa5b37a/introduction-to-the-infra-bayesianism-sequence)
- A. Edalat, P. Di Gianantonio, and A. Farjudian (2026). A domain-theoretic foundation for imprecise probability and credal sets. arXiv:2604.09272. [Link](https://arxiv.org/abs/2604.09272)
- B. Gelb (2025a). An introduction to credal sets and infra-bayes learnability. AI Alignment Forum. [Link](https://www.alignmentforum.org/s/n7qFxakSnxGuvmYAX/p/rkhaRnAc6dLzQT2sJ)
- B. Gelb (2025b). What is inadequate about bayesianism for ai alignment: motivating infra-bayesianism. AI Alignment Forum. [Link](https://www.alignmentforum.org/posts/wzCtwYtojMabyEg2L/what-is-inadequate-about-bayesianism-for-ai-alignment)
- M. Ghavamzadeh, S. Mannor, J. Pineau, and A. Tamar (2015). Bayesian reinforcement learning: a survey. Foundations and Trends® in Machine Learning 8(5–6), pp. 359–489.
- G. N. Iyengar (2005). Robust dynamic programming. Mathematics of Operations Research 30(2), pp. 257–280.
- V. Kosoy and V. Kosoy (2023). The Learning-Theoretic Agenda: Status 2023. LessWrong. [Link](https://www.lesswrong.com/posts/ZwshvqiqCvXPsZEct/the-learning-theoretic-agenda-status-2023)
- I. Levi (1980). The enterprise of knowledge: an essay on knowledge, credal probability, and chance. MIT Press, Cambridge, MA.
- A. Nilim and L. El Ghaoui (2005). Robust control of markov decision processes with uncertain transition matrices. Operations Research 53(5), pp. 780–798.
- R. Nozick (1969). Newcomb’s problem and two principles of choice. In Essays in honor of carl g. hempel: A tribute on the occasion of his sixty-fifth birthday, pp. 114–146.
- R. S. Sutton and A. G. Barto (2018). Reinforcement learning: an introduction. MIT Press.
- P. Walley (1991). Statistical reasoning with imprecise probabilities. Chapman and Hall, London.

## Appendix

## Appendix A Empirical validation that an infra-Bayesian agent reduces to a classical Bayesian agent given identical hypotheses

To validate our implementation, we test whether our infra-Bayesian implementation reproduces ordinary Bayesian behavior when the agents are initialized with identical hypothesis sets. We consider standard two-armed Bernoulli bandits and compare a classical discrete Bayesian agent against an infra-Bayesian agent with a single a-measure. This a-measure contains the same finite Bernoulli hypothesis grid used by the classical agent, so the pessimistic minimum over a-measures is vacuous and the infra-Bayesian posterior predictive should reduce to the classical Bayesian posterior predictive.

Figure [3](https://arxiv.org/html/2605.23146#A1.F3) shows the resulting cumulative regret and action probabilities across four stochastic, two-armed bandit environments. Solid, transparent curves are the classical Bayesian agent and opaque, dashed curves are the corresponding infra-Bayesian agent; colors indicate different bandit environments. The curves coincide exactly under matched random seeds, confirming that the infra-Bayesian agent recovers the classical Bayesian update and action rule in the single-measure setting.

![Refer to caption](https://arxiv.org/html/2605.23146v1/validate_classical.png)

Figure 3: Validation that the infra-Bayesian agent reproduces classical Bayesian bandit behavior in the precise-prior, single-a-measure case. The legend gives the priors of each arm in four unique Bernoulli bandit settings.

## Appendix B Trap Bandit Experiments

We consider a simple experiment illustrating how a robust infra-Bayesian learner can be useful even in a stateless stochastic bandit setting. The environment has two arms, indexed by $i\in\{1,2\}$. At the start of each run, the arm reward probabilities $(p_{1},p_{2})$ are sampled uniformly from

$$\{(0.3,0.7),(0.7,0.3)\},$$

so that one arm has significantly higher expected reward than the other. Independently, the world is sampled to be either safe or risky, with probability $\alpha$ of being risky and probability $1-\alpha$ of being safe.

In a safe world, each arm is Bernoulli with reward probability $p_{i}$. In a risky world, the arm with the larger reward probability is also the trapped arm. Pulling this arm yields a catastrophic reward of $-1000$ with probability $p_{\mathrm{cat}}$, yields reward $1$ with probability $p_{i}$, and yields reward $0$ otherwise. The lower-probability arm remains Bernoulli. Thus each run is generated as follows:

$$
\begin{array}{ll}
\text{sample} & (p_{1},p_{2})\sim\mathrm{Unif}\{(0.3,0.7),(0.7,0.3)\},\\
\text{sample} & \text{world type}\sim\mathrm{Bernoulli}(\alpha),\\[5.69054pt]
\text{if safe:} & r_{i}\sim\mathrm{Bernoulli}(p_{i}),\\[2.84526pt]
\text{if risky:} & \begin{cases}
\text{the trap arm }\arg\max_{i}p_{i}\text{ yields }-1000\text{ with probability }p_{\mathrm{cat}},\\
\text{the trap arm yields }1\text{ with probability }p_{i},\\
\text{the trap arm yields }0\text{ otherwise},\\
\text{the other arm yields }r_{i}\sim\mathrm{Bernoulli}(p_{i}).
\end{cases}
\end{array}
$$

We compare classical Bayesian agents with an infra-Bayesian agent using the same joint hypothesis machinery. Bayesian agents represent uncertainty using `Infradistribution.mix(...)`. The infra-Bayesian agent instead uses Knightian uncertainty over the safe and risky world families via `Infradistribution.mixKU(...)`, while retaining ordinary Bayesian uncertainty over $(p_{1},p_{2})$ within each family.
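The data-generating process above can be sketched as a small sampler. This is an illustrative reading, not the paper's code: `sample_world` and `pull` are hypothetical names, the three outcomes of the trap arm are treated as one categorical draw (which requires $p_{\mathrm{cat}}+p_{i}\le 1$), and the value of $p_{\mathrm{cat}}$ used below is a placeholder because the text does not state it.

```python
import random

def sample_world(alpha, rng):
    """Draw the arm probabilities and whether the world is risky."""
    probs = rng.choice([(0.3, 0.7), (0.7, 0.3)])
    risky = rng.random() < alpha
    return probs, risky

def pull(arm, probs, risky, p_cat, rng):
    """Sample one reward for `arm` (0-indexed) in the given world."""
    p = probs[arm]
    trap_arm = 0 if probs[0] > probs[1] else 1  # the higher-probability arm
    u = rng.random()
    if risky and arm == trap_arm:
        # Categorical draw: catastrophe, then reward 1, then reward 0.
        if u < p_cat:
            return -1000.0
        return 1.0 if u < p_cat + p else 0.0
    return 1.0 if u < p else 0.0  # ordinary Bernoulli arm

rng = random.Random(0)
probs, risky = sample_world(alpha=0.99, rng=rng)
P_CAT = 0.05  # placeholder value, not taken from the paper
rewards = [pull(arm, probs, risky, P_CAT, rng) for arm in (0, 1) for _ in range(5)]
print(probs, risky, rewards)
```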

The experiments vary the relationship between the true risky-world probability, $\alpha_{\mathrm{DGP}}$, and the Bayesian agent’s point prior, $\alpha_{\mathrm{prior}}$. In the main mostly-risky setting, we set $\alpha_{\mathrm{DGP}}=0.99$. We first consider a correctly specified Bayesian prior, $\alpha_{\mathrm{prior}}=0.99$, and then a severely misspecified prior, $\alpha_{\mathrm{prior}}=0.01$. Across these conditions, the infra-Bayesian agent uses the same classical prior over $(p_{1},p_{2})$ as the Bayesian agents, but maintains Knightian uncertainty over whether the world is safe or risky.

For Bayesian agents, we evaluate two exploration strategies: greedy action selection and Thompson sampling. The infra-Bayesian agent uses greedy action selection with respect to its robust lower values, with uniform tie-breaking. Regret is measured relative to the best policy with full knowledge of the true world. We report cumulative expected regret percentiles and trapped-arm pull-rate percentiles. Results are shown in Figure [4](https://arxiv.org/html/2605.23146#A2.F4).

![Refer to caption](https://arxiv.org/html/2605.23146v1/trapped_bandits.png)

Figure 4: Comparing the performance of infra-Bayesian and classical Bayesian agents (with either greedy or Thompson Sampling exploration strategies) in the trap bandit setting. The first row shows results for a correctly specified Bayes prior condition; the second for a severely misspecified Bayes prior condition. The first column shows cumulative expected regret, and the second shows the average pull rate of the risky (trap) arm.

When the Bayesian prior is correctly specified, greedy Bayes and the infra-Bayesian agent behave nearly identically in risky worlds. This is expected: when the risky-world probability is high, expected-value maximization already favors the conservative action. The infra-Bayesian agent nevertheless obtains this behavior without committing to a point prior over the safe/risky model class. Under severe misspecification, however, greedy Bayes initially treats the world as mostly safe, repeatedly pulls the high-reward trapped arm, and incurs much larger expected regret and a high catastrophe rate. Thompson sampling slightly reduces regret in this misspecified condition, but does not remove the failure mode.
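The qualitative difference between the classical mixture and the Knightian mixture can be seen in a one-step calculation. This sketch is a deliberate simplification: it assumes the agent already knows which arm has the higher reward probability, it uses a placeholder value for $p_{\mathrm{cat}}$, and it does not use the `Infradistribution.mix` or `Infradistribution.mixKU` calls themselves. All names below are hypothetical.

```python
P_HI, P_LO = 0.7, 0.3
P_CAT = 0.01  # placeholder catastrophe probability, not taken from the paper

def arm_values(risky):
    """Expected one-step reward of the high (trap) arm and the low arm."""
    hi = P_HI - 1000.0 * P_CAT if risky else P_HI
    return {"hi": hi, "lo": P_LO}

def bayes_choice(alpha_prior):
    """Classical mixture: average the two world families with a point prior."""
    safe, risky = arm_values(False), arm_values(True)
    mixed = {k: (1 - alpha_prior) * safe[k] + alpha_prior * risky[k] for k in safe}
    return max(mixed, key=mixed.get)

def maximin_choice():
    """Knightian mixture: score each arm by its worse value over the two families."""
    safe, risky = arm_values(False), arm_values(True)
    lower = {k: min(safe[k], risky[k]) for k in safe}
    return max(lower, key=lower.get)

print(bayes_choice(0.99), bayes_choice(0.01), maximin_choice())
```

With these placeholder numbers, the well-specified prior and the maximin rule both avoid the trap arm, while the misspecified prior selects it, in line with the behavior described above.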

These results show infra-Bayesian agents matching or outperforming classical Bayesian agents in risky worlds, which raises a natural question: at what cost? Table [2](https://arxiv.org/html/2605.23146#A2.T2) includes an alternative mostly-safe scenario with $\alpha_{\mathrm{DGP}}=0.01$. In this setting, the infra-Bayesian agent incurs substantially higher regret, reflecting the cost of maintaining Knightian uncertainty over the risky-world hypothesis.

Table 2:Final cumulative expected\-regret percentiles with bootstrap confidence intervals\.
