Synthesizing POMDP Policies: Sampling Meets Model-checking via Learning

arXiv cs.AI 05/15/26, 04:00 AM Papers
pomdp policy-synthesis sampling model-checking automata-learning formal-methods
Summary
This paper presents a novel framework for synthesizing finite-state controllers for Partially Observable Markov Decision Processes (POMDPs) by integrating sampling, automata learning, and model-checking. The approach provides formal guarantees for threshold-safety problems that elude existing formal synthesis tools.
arXiv:2605.14440v1 Announce Type: new Abstract: Partially Observable Markov Decision Processes (POMDPs) are the standard framework for decision-making under uncertainty. While sampling-based methods scale well, they lack formal correctness guarantees, making them unsuitable for safety-critical applications. Conversely, formal synthesis techniques provide correctness-by-construction but often struggle with scalability, as general POMDP synthesis is undecidable. To bridge this gap, we propose a synthesis framework that integrates sampling, automata learning, and model-checking. Inspired by Angluin's $L^*$ algorithm, our approach utilizes sampling as a membership oracle and model-checking as an equivalence oracle. This enables the synthesis of finite-state controllers with formal guarantees, provided the sampling-induced policy is regular. We establish a relative completeness result for this framework. Experimental results from our prototypical implementation demonstrate that this method successfully solves threshold-safety problems that remain challenging for existing formal synthesis tools. We believe our algorithm serves as a valuable component in a portfolio approach to tackling the inherent difficulty of POMDP synthesis problems.
Original Article
View Cached Full Text
Cached at: 05/15/26, 06:25 AM
# Synthesizing POMDP Policies: Sampling Meets Model-checking via Learning
Source: [https://arxiv.org/html/2605.14440](https://arxiv.org/html/2605.14440)
11institutetext:Nanyang Technological University, Singapore
11email:debraj\.chakraborty@ntu\.edu\.sg22institutetext:Tata Institute of Fundamental Research, Mumbai, India
22email:anirban\.majumdar@tifr\.res\.in33institutetext:Université Libre de Bruxelles, Brussels, Belgium
33email:\{prince\.mathew,jean\-francois\.raskin\}@ulb\.be44institutetext:IITB Trust Lab, Department of CSE, IIT Bombay, Mumbai, India
44email:sayan@cse\.iitb\.ac\.inAnirban Majumdar[https://orcid.org/0000-0003-4793-1892](https://orcid.org/0000-0003-4793-1892)Prince Mathew[https://orcid.org/0000-0001-6410-1474](https://orcid.org/0000-0001-6410-1474) Sayan Mukherjee[https://orcid.org/0000-0001-6473-3172](https://orcid.org/0000-0001-6473-3172)Jean\-François Raskin[https://orcid.org/0000-0002-3673-1097](https://orcid.org/0000-0002-3673-1097)

###### Abstract

Partially Observable Markov Decision Processes \(POMDPs\) are the standard framework for decision\-making under uncertainty\. While sampling\-based methods scale well, they lack formal correctness guarantees, making them unsuitable for safety\-critical applications\. Conversely, formal synthesis techniques provide correctness\-by\-construction but often struggle with scalability, as general POMDP synthesis is undecidable\. To bridge this gap, we propose a synthesis framework that integrates sampling, automata learning, and model\-checking\. Inspired by Angluin’sL∗L^\{\*\}algorithm, our approach utilizes sampling as a membership oracle and model\-checking as an equivalence oracle\. This enables the synthesis of finite\-state controllers with formal guarantees, provided the sampling\-induced policy is regular\. We establish a relative completeness result for this framework\. Experimental results from our prototypical implementation demonstrate that this method successfully solves threshold\-safety problems that remain challenging for existing formal synthesis tools\. We believe our algorithm serves as a valuable component in a portfolio approach to tackling the inherent difficulty of POMDP synthesis problems\.

## 1Introduction

Partially Observable Markov Decision Processes \(POMDPs\) are a fundamental model for decision\-making under uncertainty and partial observability\. They formalize a wide range of real\-world control problems in which the environment evolves stochastically, and the controller receives only indirect, noisy, or incomplete observations of the underlying state\. As a result, they are extensively studied in fields such as robotics, planning, and artificial intelligence, and offer a powerful framework for formulating the reactive controller synthesis problem in complex and uncertain environments\.

Despite their expressive power and practical importance, tool support for controller synthesis in POMDPs remains limited\. This is primarily due to the inherent algorithmic complexity of the model\. Even the most basic decision problems related to synthesis – existence of a controller that ensures safety \(*i\.e\.*, avoiding bad states\) with a probability above a given threshold – are undecidable in general\[[28](https://arxiv.org/html/2605.14440#bib.bib10)\], ruling out the possibility of complete algorithmic solutions\. As a consequence, most practical approaches rely on sampling\-based techniques, such as Monte Carlo Tree Search \(MCTS\)\[[24](https://arxiv.org/html/2605.14440#bib.bib33)\]and its variants\. These methods are scalable and often yield high\-quality policies in practice, but they offer no static correctness guarantees and are therefore unsuitable for safety\-critical applications\. Moreover, such techniques can be challenging to deploy in practice when they impose unacceptable runtime overhead\. In contrast, formal methods provide correctness by construction, typically through static \(*i\.e*\., pre\-deployment\) synthesis techniques and logical specification languages, but they generally do not scale to POMDPs of realistic size or complexity\. In this work, we propose to bridge the gap between sampling\-based methods and formal methods by combining them through a learning technique inspired by theL∗L^\{\*\}algorithm\[[4](https://arxiv.org/html/2605.14440#bib.bib6)\]\. We call our framework*Counterexample\-guided Policy Learning Using Sampling*\(Cplus\)\. The key insight here is that the synthesis of an observation\-based policy can be framed as a learning problem\.

In the terminology ofL∗L^\{\*\},Cplusemploys membership queries – called*action queries*– that are answered by an*action oracle*\(𝒪A\\mathcal\{O\}^\{A\}\) that, given a history of observations and actions, returns an optimal action to take from that position\. Sampling\-based POMDP analysis techniques are well\-suited to answer these kinds of queries\. The answers to these queries are then generalised to construct the hypotheses – finite\-state controllers \(FSCs\) represented as Mealy machines\.

Once a hypothesis has been constructed, it is passed on to a*model checking oracle*\(𝒪M\\mathcal\{O\}^\{M\}\), that verifies whether the candidate controller satisfies the given specification – these are called*model\-checking queries*\.

If the model checking oracle validates the hypothesis,Cplusterminates with a verified solution, providing the strong guarantees of formal methods\. Otherwise, it returns a set of traces that violate the specification\. These traces are then analyzed to identify a prefix where the candidate controller disagrees with𝒪A\\mathcal\{O\}^\{A\}\. This prefix, called the*counterexample*, is incorporated \(as inL∗L^\{\*\}\) into the learning algorithm to refine the hypothesis, and the process is repeated\.

While the use of a sampling tool alone provides no static formal guarantees, in our framework it is used exclusively to answer membership queries, with correctness ensured entirely by the subsequent model\-checking phase\. Importantly, the decision problem for model\-checking a candidate policy is tractable, as it reduces to model\-checking a Markov chain – the product of the POMDP and the FSC – against an infinite\-duration safety property\. This can be solved in polynomial time w\.r\.t\. the sizes of the POMDP and the FSC representing the candidate policy, in contrast to the undecidable nature of the general synthesis problem\.

Although due to the undecidability of POMDP synthesis, full completeness is unattainable, we establish arelative completenessresult\. First, we prove that finite\-state policies always suffice for*threshold\-safety objectives*\(Theorem[1](https://arxiv.org/html/2605.14440#Thmproblem1)\)\. Then, we show that if the policy implicitly defined by𝒪A\\mathcal\{O\}^\{A\}is both correct and regular \(i\.e\., recognizable by a finite\-state machine\), then our algorithm is guaranteed to terminate and produce a correct solution \(Theorem[4](https://arxiv.org/html/2605.14440#S4.SS0.SSS0.Px4)\)\. In this case, we obtain two important advantages with respect to executing𝒪A\\mathcal\{O\}^\{A\}online\. First, the solution produced by our procedure is certified by construction, prior to deployment\. Second, the solution takes the form of a finite\-state controller, which can be deployed and executed efficiently online, requiring only constant\-time updates – unlike online sampling methods, which may incur latency due to their computational overhead\.

To demonstrate the potential of our approach, we implemented a prototype ofCplusinPython\. For the action oracle𝒪A\\mathcal\{O\}^\{A\}, we rely on the online plannerDespot\[[40](https://arxiv.org/html/2605.14440#bib.bib7)\], which can be queried on demand to compute locally optimal actions without requiring the precomputation of a full policy\. For the model checking oracle𝒪M\\mathcal\{O\}^\{M\}, our implementation usesStorm\[[20](https://arxiv.org/html/2605.14440#bib.bib20)\]\. The implementation is modular\. In particular, other planners – either sampling\-based, or based on formal methods, as discussed later – can be used as𝒪A\\mathcal\{O\}^\{A\}\. Similarly, other probabilistic model checkers, such asPrism\[[26](https://arxiv.org/html/2605.14440#bib.bib28)\]orModest\[[19](https://arxiv.org/html/2605.14440#bib.bib1)\], can be used as𝒪M\\mathcal\{O\}^\{M\}without modifying the overall framework\.

The experimental results are promising\. Our implementation, which was developed to assess the potential of the framework, is able to solve threshold\-safety problems that are challenging for existing formal\-methods algorithms\. At the same time, it provides strong formal correctness guarantees through model checking, which sampling\-based approaches alone cannot provide\. The resulting framework, which combines sampling, learning, and verification, is therefore both theoretically well\-founded and practically relevant\.

However, due to the undecidability of the POMDP synthesis problem, our approach should be seen as an additional method rather than a complete solution\. In particular, it is best used as part of a portfolio of complementary techniques, together with other semi\-algorithms based on formal methods\.

### 1\.0\.1Related Works\.

The work that is most closely related to our approach is\[[42](https://arxiv.org/html/2605.14440#bib.bib38)\]\. This work addresses the learning of nondeterministic FSCs for bounded\-horizon reachability objectives in POMDPs,*i\.e\.*, reaching a set of states within a fixed number of steps with a probability exceeding a given threshold\. In contrast to the infinite\-horizon properties considered in our approach, bounded\-horizon properties are decidable for POMDPs\. Moreover, their method does not aim to solve the synthesis problem but to compute a most\-permissive policy that can later be refined into a concrete controller\. The search for a permissive policy entails an exponential blow\-up in the action space, which does not occur in our framework\. Their approach is illustrated only on a single, hand\-crafted example, and to the best of our knowledge, there is no implementation available for this\.

Online algorithms for policy computation include tree search methods that determine policies on the fly\. For instance, the algorithm POMCP\[[38](https://arxiv.org/html/2605.14440#bib.bib8)\]integrates MCTS with particle filtering, whileDespot\[[40](https://arxiv.org/html/2605.14440#bib.bib7)\]employs deterministic sparse trees to improve scalability\. These approaches scale well but lack strong correctness guarantees, making them less suitable for safety\-critical applications\.

Classical offline POMDP solvers such as Perseus\[[41](https://arxiv.org/html/2605.14440#bib.bib12)\]and SARSOP\[[25](https://arxiv.org/html/2605.14440#bib.bib13)\]use point\-based methods\[[34](https://arxiv.org/html/2605.14440#bib.bib11)\], which approximate the value function over a selected subset of the belief space\. On the other hand, methods based on partial exploration\[[7](https://arxiv.org/html/2605.14440#bib.bib15),[8](https://arxiv.org/html/2605.14440#bib.bib16)\]or abstraction\[[32](https://arxiv.org/html/2605.14440#bib.bib14)\]of the underlying belief MDP have been implemented in probabilistic model checkers such asStorm\[[20](https://arxiv.org/html/2605.14440#bib.bib20)\]and Prism\[[26](https://arxiv.org/html/2605.14440#bib.bib28)\]\.

Inductive synthesis tools such asPaynt\[[3](https://arxiv.org/html/2605.14440#bib.bib18),[2](https://arxiv.org/html/2605.14440#bib.bib19)\]explore the space of controllers by incrementally evaluating candidate solutions, but they struggle to scale as the required memory size increases\. The algorithm described in\[[6](https://arxiv.org/html/2605.14440#bib.bib17)\]applies anL∗L^\{\*\}learning algorithm to synthesize an FSC from a policy precomputed usingStorm\. Our approach differs from theirs in two key aspects:\(i\)\(i\)we do not precompute the full policy but instead query an action oracle only when needed, and\(ii\)\(ii\)we introduce a novel model\-checker\-based equivalence query that ensures correctness\. Other approaches derive FSCs by extracting automata from Recurrent Neural Network \(RNN\)\-based policies for POMDPs\[[11](https://arxiv.org/html/2605.14440#bib.bib22)\], but these do not provide formal guarantees\.

Black\-box checking\[[33](https://arxiv.org/html/2605.14440#bib.bib44),[37](https://arxiv.org/html/2605.14440#bib.bib45)\]is also related to our approach, as it combines automata learning with verification in a larger workflow\. However, in black\-box checking, a positive model\-checking result on the learned hypothesis is not sufficient on its own, since one must additionally establish conformance with the underlying system\. In contrast, in our setting, once the model checker returns true, the synthesized FSC is already a certified solution to the threshold\-safety problem\. Moreover, classical black\-box checking targets deterministic reactive systems representable as Mealy machines, whereas our framework addresses stochastic, partially observable POMDPs, for which the general synthesis problem is undecidable\.

The missing proofs in the article can be found in the Appendix\.

## 2Preliminaries

A*\(discrete\) probability distribution*on a countable setSSis a functiond:S→\[0,1\]d:S\\to\[0,1\]that satisfies∑s∈Sd\(s\)=1\\sum\_\{s\\in S\}d\(s\)=1\. We denote the set of all probability distributions on the setSSas𝖣𝗂𝗌𝗍\(S\)\\mathsf\{Dist\}\(S\)\. Ford∈𝖣𝗂𝗌𝗍\(S\)d\\in\\mathsf\{Dist\}\(S\), the*support*ofddis𝗌𝗎𝗉𝗉\(d\)=\{s∈S∣d\(s\)\>0\}\\mathsf\{supp\}\(d\)=\\\{s\\in S\\mid d\(s\)\>0\\\}\.

###### Definition 1\(Markov Chain\)

AMarkov chain \(MC\)is a tripleℳ=\(S,P,s0\)\\mathcal\{M\}=\(S,P,s\_\{0\}\)whereSSis a countable set of states,P:S→𝖣𝗂𝗌𝗍\(S\)P:S\\rightarrow\\mathsf\{Dist\}\(S\)is the transition function ands0∈Ss\_\{0\}\\in Sis the initial state\.

For statess,s′∈Ss,s^\{\\prime\}\\in S,P\(s\)\(s′\)P\(s\)\(s^\{\\prime\}\)denotes the probability of moving from statessto states′s^\{\\prime\}in a single step and we denote this probability asP\(s,s′\)P\(s,s^\{\\prime\}\)\. A*finite path*ρ=s0s1…si\\rho=s\_\{0\}s\_\{1\}\\ldots s\_\{i\}of lengthi≥0i\\geq 0is a sequence ofi\+1i\+1consecutive states such that for allt∈\[0,i−1\]t\\in\[0,i\-1\],st\+1∈𝗌𝗎𝗉𝗉\(P\(st\)\)s\_\{t\+1\}\\in\\mathsf\{supp\}\(P\(s\_\{t\}\)\)\. Similarly, an infinite path is an infinite sequenceρ=s0s1s2…\\rho=s\_\{0\}s\_\{1\}s\_\{2\}\\ldotsof states such that for allt∈ℕt\\in\\mathbb\{N\},st\+1∈𝗌𝗎𝗉𝗉\(P\(st\)\)s\_\{t\+1\}\\in\\mathsf\{supp\}\(P\(s\_\{t\}\)\)\. The probability of a finite pathρ=s0s1…sn\\rho=s\_\{0\}s\_\{1\}\\dots s\_\{n\}in the MCℳ\\mathcal\{M\}, denotedℳ\(ρ\)\\mathcal\{M\}\(\\rho\), is given by∏i=0n−1P\(si,si\+1\)\\prod\_\{i=0\}^\{n\-1\}P\(s\_\{i\},s\_\{i\+1\}\)\. For a finite pathρ\\rho, let𝖢𝗒𝗅\(ρ\)\\mathsf\{Cyl\}\(\\rho\)denote the set consisting of infinite pathsρ′\\rho^\{\\prime\}such thatρ\\rhois a prefix ofρ′\\rho^\{\\prime\}\. The set𝖢𝗒𝗅\(ρ\)\\mathsf\{Cyl\}\(\\rho\)is called the cylinder set ofρ\\rho\. For a set of finite pathsRR, we define𝖢𝗒𝗅\(R\)=⋃ρ∈R𝖢𝗒𝗅\(ρ\)\\mathsf\{Cyl\}\(R\)=\\bigcup\_\{\\rho\\in R\}\\mathsf\{Cyl\}\(\\rho\)\.

We denote byℙℳ\\mathbb\{P\}\_\{\\mathcal\{M\}\}the probability measure over infinite paths induced by the Markov chainℳ\\mathcal\{M\}\(see, e\.g\.\[[5](https://arxiv.org/html/2605.14440#bib.bib29)\], for a formal definition\)\. Intuitively,ℙℳ\\mathbb\{P\}\_\{\\mathcal\{M\}\}assigns to each measurable set of infinite paths \(called*events*\) the probability that an execution ofℳ\\mathcal\{M\}will follow a path in that set\.

Given two sets of states𝖦𝗈𝗈𝖽,𝖡𝖺𝖽⊆S\\mathsf\{Good\},\\mathsf\{Bad\}\\subseteq S, we first define◇𝖦𝗈𝗈𝖽\\Diamond\\mathsf\{Good\}as the set of paths that eventually reach a state in𝖦𝗈𝗈𝖽\\mathsf\{Good\}:◇𝖦𝗈𝗈𝖽=\{s0s1…∣∃i≥0:si∈𝖦𝗈𝗈𝖽\}\.\\Diamond\\mathsf\{Good\}=\\\{s\_\{0\}s\_\{1\}\\ldots\\mid\\exists i\\geq 0\\colon s\_\{i\}\\in\\mathsf\{Good\}\\\}\.This defines areachability objective\. We also define□¬𝖡𝖺𝖽\\Box\\neg\\mathsf\{Bad\}as the set of paths that avoid all states in𝖡𝖺𝖽\\mathsf\{Bad\}:□¬𝖡𝖺𝖽=\{s0s1…∣∀i≥0:si∉𝖡𝖺𝖽\}\.\\Box\\neg\\mathsf\{Bad\}=\\\{s\_\{0\}s\_\{1\}\\ldots\\mid\\forall i\\geq 0\\colon s\_\{i\}\\not\\in\\mathsf\{Bad\}\\\}\.This defines asafety objective\. Finally, we define¬𝖡𝖺𝖽𝒰𝖦𝗈𝗈𝖽\\neg\\mathsf\{Bad\}\\;\\mathcal\{U\}\\,\\mathsf\{Good\}as the set of paths that reach a state in𝖦𝗈𝗈𝖽\\mathsf\{Good\}while avoiding states in𝖡𝖺𝖽\\mathsf\{Bad\}:¬𝖡𝖺𝖽𝒰𝖦𝗈𝗈𝖽=\{s0s1…∣∃i≥0:si∈𝖦𝗈𝗈𝖽and∀0≤j<i:sj∉𝖡𝖺𝖽\}\\neg\\mathsf\{Bad\}\\;\\mathcal\{U\}\\,\\mathsf\{Good\}=\\\{s\_\{0\}s\_\{1\}\\ldots\\mid\\exists i\\geq 0\\colon s\_\{i\}\\in\\mathsf\{Good\}\\text\{~and~\}\\forall 0\\leq j<i\\colon s\_\{j\}\\not\\in\\mathsf\{Bad\}\\\}\. We refer to them asreach\-avoid objectives\. All these events are measurable\. In the definitions above, we liberally used the notations of temporal logics \(◇\\Diamond,□\\Box, and𝒰\\;\\mathcal\{U\}\\,\)\. An interested reader can see\[[5](https://arxiv.org/html/2605.14440#bib.bib29)\]for a formal introduction\.

###### Definition 2\(POMDP\)

A*Partially Observable Markov Decision Process \(POMDP\)*is a tuple𝒫=\(S,A,P,s0,Z,O\)\\mathcal\{P\}=\(S,A,P,s\_\{0\},Z,O\)whereSSis a finite set of states,AAis a finite set of actions,P:S×A⇀𝖣𝗂𝗌𝗍\(S\)P:S\\times A\\rightharpoonup\\mathsf\{Dist\}\(S\)is a \(partial\) transition function,s0∈Ss\_\{0\}\\in Sis the initial state,ZZis a finite set of observations, andO:S→ZO:S\\rightarrow Zis an observation function that maps each state to an observation111Equivalently, we can see an observationz∈Zz\\in Zas the subset of states\{s∈S∣O\(s\)=z\}\\\{s\\in S\\mid O\(s\)=z\\\}that share the observationzz\.\.

For statess,s′∈Ss,s^\{\\prime\}\\in S, anda∈Aa\\in A,P\(s,a\)\(s′\)P\(s,a\)\(s^\{\\prime\}\)denotes the probability of moving from statessto states′s^\{\\prime\}when taking the actionaaand we denote this probability asP\(s,a,s′\)P\(s,a,s^\{\\prime\}\)\. If for some states∈Ss\\in Sanda∈Aa\\in A,P\(s,a\)P\(s,a\)is not defined, we say that actionaais not*enabled*in statess\. Note that, whenZ=SZ=SandOOis the identity, then the POMDP is a classical fully observable MDP\.

In a POMDP𝒫\\mathcal\{P\}, an*infinite path*is a sequenceρ=s0a0s1a1s2…\\rho=s\_\{0\}a\_\{0\}s\_\{1\}a\_\{1\}s\_\{2\}\\ldotssuch that for allt≥0t\\geq 0,st\+1∈𝗌𝗎𝗉𝗉\(P\(st,at\)\)s\_\{t\+1\}\\in\\mathsf\{supp\}\(P\(s\_\{t\},a\_\{t\}\)\)\. A*finite path*ρ=s0a0s1…si\\rho=s\_\{0\}a\_\{0\}s\_\{1\}\\ldots s\_\{i\}of lengthi≥0i\\geq 0is a finite prefix of an infinite path\. Note that multiple states may have the same observation underOO, so the agent cannot in general infer the exact underlying state from its observations\. Thus in a POMDP, the agent observes only*histories*\(sequences of observations and actions\), not the underlying state sequence\. Formally, a history is a finite sequence of observations and actionsh=z0a0…ai−1zih=z\_\{0\}a\_\{0\}\\ldots a\_\{i\-1\}z\_\{i\}\. We call a history*valid*, if there exists a finite pathρ=s0a0s1…si\\rho=s\_\{0\}a\_\{0\}s\_\{1\}\\ldots s\_\{i\}withzj=O\(sj\)z\_\{j\}=O\(s\_\{j\}\)for0≤j≤i0\\leq j\\leq i\. We denote the set of all \(valid\) histories of𝒫\\mathcal\{P\}by𝖧𝗂𝗌𝗍𝒫\\mathsf\{Hist\}\_\{\\mathcal\{P\}\}\. With a slight abuse of notation, we also use a set of observationsY⊆ZY\\subseteq Zto denote the set of states:\{s∈S∣O\(s\)∈Y\}\\\{s\\in S\\mid O\(s\)\\in Y\\\}\.

###### Example 1

Consider the behaviour of a robot in the grid depicted in[Figure˜2](https://arxiv.org/html/2605.14440#S2.F2)\. This can be modelled as a POMDP𝒫\\mathcal\{P\}as follows: Each cell in the grid corresponds to a state inS=\{\(x,y\)∣x∈\{0,1,2,3\},y∈\{0,1,2\}\}S=\\\{\(x,y\)\\mid x\\in\\\{0,1,2,3\\\},y\\in\\\{0,1,2\\\}\\\}\. The cells marked in red are*bad*states that the robot must avoid\. It also must stay inside the grid \(*i\.e\.*, for example, action↑\\ \\uparrow\\from■\\mathbin\{\\,\{\\color\[rgb\]\{0,0,1\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0,0,1\}\\blacksquare\}\\,\}leads to a*bad*state\)\. The robot observes only the cell colour \(gray, white, or blue\) but cannot distinguish between cells of the same colour\. That is, the set of observations isZ=\{■,■,■\}Z=\\\{\\mathbin\{\\,\{\\color\[rgb\]\{1,0,0\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{1,0,0\}\\blacksquare\}\\,\},\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\},\\mathbin\{\\,\{\\color\[rgb\]\{0,0,1\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0,0,1\}\\blacksquare\}\\,\}\\\}, The set of actions isA=\{↑,↓,→\}A=\\\{\\ \\uparrow\\ ,\\ \\downarrow\\ ,\\rightarrow~\\\}\. The initial state iss0=\(0,1\)s\_\{0\}=\(0,1\)\. The transition function is defined as follows: for each state\(x,y\)\(x,y\)and actiona∈Aa\\in A, the system moves to the intended neighbouring cell with probability0\.90\.9and with probability0\.10\.1remains at the same cell\.

0123012Figure 1:A grid world representing the[Example˜1](https://arxiv.org/html/2605.14440#Thmexample1)\.n0n\_\{0\}n1n\_\{1\}n2n\_\{2\}n3n\_\{3\}■⁣/⁣→\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}/\\rightarrow~■⁣/⁣→\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}/\\rightarrow~■⁣/⁣→\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}/\\rightarrow~■⁣/⁣↑\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}/\\ \\uparrow\\■⁣/⁣↓\\mathbin\{\\,\{\\color\[rgb\]\{0,0,1\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0,0,1\}\\blacksquare\}\\,\}/\\ \\downarrow\\Figure 2:A finite\-state controller for the POMDP in[Example˜1](https://arxiv.org/html/2605.14440#Thmexample1)\.
Policies resolve the nondeterminism in POMDPs by specifying, for each finite history, which action the agent should take next\.

###### Definition 3\(Policy\)

A \(deterministic222It is well known \(see\[[13](https://arxiv.org/html/2605.14440#bib.bib39), Lemma 1\]\) that for infinite duration safety property \(and all objectives that can be reduced to it\), randomization is not useful to play optimally; therefore, we focus solely on deterministic policies\.\) policy for a POMDP𝒫\\mathcal\{P\}is a functionσ:𝖧𝗂𝗌𝗍𝒫→A\\sigma:\\mathsf\{Hist\}\_\{\\mathcal\{P\}\}\\to A\.

A policyσ\\sigmafor a POMDP𝒫\\mathcal\{P\}induces an infinite countable MC𝒫σ\\mathcal\{P\}\_\{\\sigma\}defined as follows: states are pairs\(s,h\)\(s,h\)wheres∈Ss\\in Sis a state of the POMDP andh∈𝖧𝗂𝗌𝗍𝒫h\\in\\mathsf\{Hist\}\_\{\\mathcal\{P\}\}is a history that ends with the observationO\(s\)O\(s\)\. The initial state is\(s0,O\(s0\)\)\(s\_\{0\},O\(s\_\{0\}\)\)\. The transition function is defined as:

Pσ\(\(s,h\),\(s′,h′\)\)=\{P\(s,σ\(h\),s′\)ifh′=h⋅σ\(h\)⋅O\(s′\),0otherwise\.P^\{\\sigma\}\(\(s,h\),\(s^\{\\prime\},h^\{\\prime\}\)\)=\\begin\{cases\}P\(s,\\sigma\(h\),s^\{\\prime\}\)&\\text\{if \}h^\{\\prime\}=h\\cdot\\sigma\(h\)\\cdot O\(s^\{\\prime\}\),\\\\ 0&\\text\{otherwise\.\}\\end\{cases\}We denote the probability measure associated with this MC asℙ𝒫σ\\mathbb\{P\}\_\{\\mathcal\{P\}\}^\{\\sigma\}\. We remove𝒫\\mathcal\{P\}from the subscript when the POMDP is clear from the context\.

The agent controlling the POMDP does not observe the exact state of the system during its execution but only the sequence of observations encountered\. The agent maintains abelief, that is a distribution over the states of the POMDP, indicating the probability of being in each state given the current history\.

A particularly relevant class of policies in practice is finite\-state policies \(also called finite\-state controllers\), as they naturally give rise to implementations that can be executed efficiently online\. In[Section˜3](https://arxiv.org/html/2605.14440#S3), we will further establish that such controllers are sufficient to provide solutions for safety threshold specifications\.

###### Definition 4\(Finite\-State Controller\)

A*finite\-state controller*\(FSC\) for a POMDP𝒫=\(S,A,P,s0,Z,O\)\\mathcal\{P\}=\(S,A,P,s\_\{0\},Z,O\)is a tupleℱ=\(N,γ,δ,n0\)\\mathcal\{F\}=\(N,\\gamma,\\delta,n\_\{0\}\)whereNNis a finite set of nodes,γ:N×Z→A\\gamma:N\\times Z\\rightarrow Ais the action mapping,δ:N×Z→N\\delta:N\\times Z\\rightarrow Nis the transition function, andn0n\_\{0\}is the initial node\.

A finite\-state controllerℱ=\(N,γ,δ,n0\)\\mathcal\{F\}=\(N,\\gamma,\\delta,n\_\{0\}\)defines a policyσℱ\\sigma\_\{\\mathcal\{F\}\}in the POMDP𝒫=\(S,A,P,s0,Z,O\)\\mathcal\{P\}=\(S,A,P,s\_\{0\},Z,O\)as follows: it starts from the initial noden0n\_\{0\}\. At any step, if the current controller node isn∈Nn\\in N, and the current POMDP state iss∈Ss\\in Sof the POMDP, the action suggested byℱ\\mathcal\{F\}isγ\(n,O\(s\)\)\\gamma\(n,O\(s\)\)\. The FSCℱ\\mathcal\{F\}then updates its current node ton′=δ\(n,o\)n^\{\\prime\}=\\delta\(n,o\), whereoois the newly received observation\. The state of the POMDP is updated according toPP\. Thus,𝒫\\mathcal\{P\}andℱ\\mathcal\{F\}induce a finite state Markov chain𝒫×ℱ=\(S×N,Pℱ,\(s0,n0\)\)\\mathcal\{P\}\\times\\mathcal\{F\}=\(S\\times N,P^\{\\mathcal\{F\}\},\(s\_\{0\},n\_\{0\}\)\)where

Pℱ\(\(s,n\),\(s′,n′\)\)=\{P\(s,γ\(n,O\(s\)\),s′\)ifn′=δ\(n,O\(s\)\),0otherwise\.P^\{\\mathcal\{F\}\}\(\(s,n\),\(s^\{\\prime\},n^\{\\prime\}\)\)=\\begin\{cases\}P\(s,\\gamma\(n,O\(s\)\),s^\{\\prime\}\)&\\text\{if \}n^\{\\prime\}=\\delta\(n,O\(s\)\),\\\\ 0&\\text\{otherwise\.\}\\end\{cases\}We call a policy*finite memory*or*regular*, if it can be represented by an FSC\.

###### Example 2

Consider the POMDP from[Example˜1](https://arxiv.org/html/2605.14440#Thmexample1)\. The objective is to maximize the probability of avoiding the bad states forever\. A possible policy is to moverightthree times, and then moveuptill the robot reaches thebluecell and then movedowntill it reaches thegraycell\. This policy can be represented by an FSC with four nodes as depicted in[Fig\.˜2](https://arxiv.org/html/2605.14440#S2.F2)\. Starting from the cell\(0,1\)\(0,1\), this policy ensures that the robot reaches the cell\(3,1\)\(3,1\)with probability0\.7290\.729\(as there is a probability of0\.930\.9^\{3\}of moving right three times successfully\), and from there it can stay in the safe region forever by movingupwhen at thegraycell, and movingdownfrom thebluecell\.

## 3The Threshold\-safety Problem

Given a POMDP, our goal is to identify policies that enforce desirable properties\. We first focus on two types of objectives: safety objectives and discounted\-sum reward objectives\.

###### Problem 1\(Threshold\-safety problem\)

Input: A POMDP𝒫\\mathcal\{P\}, a set of bad observations𝖡𝖺𝖽\\mathsf\{Bad\}and a thresholdα∈\[0,1\)\\alpha\\in\[0,1\)\.Output: A policyσ\\sigmarepresented as an FSC s\.t\.ℙ𝒫σ\(□¬𝖡𝖺𝖽\)\>α\\mathbb\{P\}\_\{\\mathcal\{P\}\}^\{\\sigma\}\(\\Box\\neg\{\\sf Bad\}\)\>\\alpha, if it exists,and𝖭𝗈𝗇𝖾\\mathsf\{None\}otherwise\.

While it is well\-known that the threshold\-safety problem is undecidable\[[28](https://arxiv.org/html/2605.14440#bib.bib10)\],[Problem˜1](https://arxiv.org/html/2605.14440#Thmproblem1)establishes that[Problem˜1](https://arxiv.org/html/2605.14440#Thmproblem1)is well\-posed: the existence of a policy enforcing the safety threshold implies the existence of a finite\-state controller achieving the same threshold\. This result justifies our focus on synthesizing FSCs and also ensures that the problem is*recursively enumerable*\.

\{restatable\}

theoremfscExistence Given a POMDP𝒫\\mathcal\{P\}with a set of bad observations𝖡𝖺𝖽⊆Z\{\\sf Bad\}\\subseteq Z, and a thresholdα∈\[0,1\)\\alpha\\in\[0,1\), if there is a policyσ\\sigmasuch thatℙ𝒫σ\(□¬𝖡𝖺𝖽\)\>α\\mathbb\{P\}\_\{\\mathcal\{P\}\}^\{\\sigma\}\(\\Box\\neg\{\\sf Bad\}\)\>\\alpha, then there exists a finite memory policyσ𝖿𝗆\\sigma\_\{\\sf fm\}such thatℙ𝒫σ𝖿𝗆\(□¬𝖡𝖺𝖽\)\>α\\mathbb\{P\}^\{\\sigma\_\{\\sf fm\}\}\_\{\{\\cal P\}\}\(\\Box\\neg\{\\sf Bad\}\)\>\\alpha\.

To prove the statement in[Problem˜1](https://arxiv.org/html/2605.14440#Thmproblem1), we consider the belief MDPℬ𝒫\\mathcal\{B\}\_\{\\mathcal\{P\}\}associated with the POMDP𝒫\\mathcal\{P\}: its set of states is the infinite set of beliefsb∈𝖣𝗂𝗌𝗍\(S\)b\\in\\mathsf\{Dist\}\(S\), and for each actionaaand observationzz, the successor beliefb′b^\{\\prime\}is given by the usual Bayesian update based onPPandOO, defined as follows:

∀s∈S∩z:b′\(s\)=∑s′∈Sb\(s′\)⋅P\(s′,a,s\)∑\(s′,s′′\)∣O\(s′′\)=zb\(s′\)⋅P\(s′,a,s′′\)\\forall s\\in S\\cap z:b^\{\\prime\}\(s\)=\\frac\{\\sum\_\{s^\{\\prime\}\\in S\}b\(s^\{\\prime\}\)\\cdot P\(s^\{\\prime\},a,s\)\}\{\\sum\_\{\(s^\{\\prime\},s^\{\\prime\\prime\}\)\\mid O\(s^\{\\prime\\prime\}\)=z\}b\(s^\{\\prime\}\)\\cdot P\(s^\{\\prime\},a,s^\{\\prime\\prime\}\)\}∀s∈S∩z¯:b′\(s\)=0\.\\forall s\\in S\\cap\\overline\{z\}:b^\{\\prime\}\(s\)=0\.
LetWWdenote the set of belief states from which there exists a policy to ensure safety with probability11\. Then we can show the following two lemmas \(proof in Appendix[0\.A](https://arxiv.org/html/2605.14440#Pt0.A1)\)\.

\{restatable\}

lemmafinFromW There exists a finite memory policyσ\\sigmasuch that for every belief stateb∈Wb\\in W,ℙℬ𝒫σ\(□¬𝖡𝖺𝖽∣initial beliefb\)=1\\mathbb\{P\}^\{\\sigma\}\_\{\\mathcal\{B\}\_\{\\mathcal\{P\}\}\}\(\\Box\\neg\{\\sf Bad\}\\mid\\text\{initial belief \}b\)=1\.

\{restatable\}

lemmasafeORinW For any policyσ\\sigmain𝒫\\mathcal\{P\},ℙℬ𝒫σ\(□¬𝖡𝖺𝖽∧□¬W\)=0\\mathbb\{P\}^\{\\sigma\}\_\{\\mathcal\{B\}\_\{\\mathcal\{P\}\}\}\(\\Box\\neg\{\\sf Bad\}\\wedge\\Box\\neg W\)=0\.

We now have all the ingredients to prove Theorem[1](https://arxiv.org/html/2605.14440#Thmproblem1)\.

###### Proof\(of[Problem˜1](https://arxiv.org/html/2605.14440#Thmproblem1)\)

Assume that there exists a policyσ\\sigmasuch thatℙ𝒫σ\(□¬𝖡𝖺𝖽\)\>α\\mathbb\{P\}^\{\\sigma\}\_\{\\mathcal\{P\}\}\(\\Box\\neg\{\\sf Bad\}\)\>\\alpha\. From[Problem˜1](https://arxiv.org/html/2605.14440#Thmproblem1),ℙℬ𝒫σ\(◇W\)≥ℙℬ𝒫σ\(□¬𝖡𝖺𝖽∧◇W\)=ℙℬ𝒫σ\(□¬𝖡𝖺𝖽\)−ℙℬ𝒫σ\(□¬𝖡𝖺𝖽∧□¬W\)=ℙℬ𝒫σ\(□¬𝖡𝖺𝖽\)\>α\\mathbb\{P\}^\{\\sigma\}\_\{\\mathcal\{B\}\_\{\\mathcal\{P\}\}\}\(\\Diamond W\)\\geq\\mathbb\{P\}^\{\\sigma\}\_\{\\mathcal\{B\}\_\{\\mathcal\{P\}\}\}\(\\Box\\neg\{\\sf Bad\}\\wedge\\Diamond W\)=\\mathbb\{P\}^\{\\sigma\}\_\{\\mathcal\{B\}\_\{\\mathcal\{P\}\}\}\(\\Box\\neg\{\\sf Bad\}\)\-\\mathbb\{P\}^\{\\sigma\}\_\{\\mathcal\{B\}\_\{\\mathcal\{P\}\}\}\(\\Box\\neg\{\\sf Bad\}\\wedge\\Box\\neg W\)=\\mathbb\{P\}^\{\\sigma\}\_\{\\mathcal\{B\}\_\{\\mathcal\{P\}\}\}\(\\Box\\neg\{\\sf Bad\}\)\>\\alpha\. Note thatℙℬ𝒫σ\(◇W\)=limd→∞ℙℬ𝒫σ\(◇≤dW\)\\mathbb\{P\}^\{\\sigma\}\_\{\\mathcal\{B\}\_\{\\mathcal\{P\}\}\}\(\\Diamond W\)=\\lim\_\{d\\to\\infty\}\\mathbb\{P\}^\{\\sigma\}\_\{\\mathcal\{B\}\_\{\\mathcal\{P\}\}\}\(\\Diamond^\{\\leq d\}W\)where◇≤dW\\Diamond^\{\\leq d\}Wdenotes the set of paths that reachWWwithinddsteps\. Asℙℬ𝒫σ\(◇≤dW\)\\mathbb\{P\}^\{\\sigma\}\_\{\\mathcal\{B\}\_\{\\mathcal\{P\}\}\}\(\\Diamond^\{\\leq d\}W\)is non\-decreasing indd, there must exist a finite depthddsuch thatℙℬ𝒫σ\(◇≤dW\)\>α\\mathbb\{P\}^\{\\sigma\}\_\{\\mathcal\{B\}\_\{\\mathcal\{P\}\}\}\(\\Diamond^\{\\leq d\}W\)\>\\alpha\.

Now we constructσ𝖿𝗆\\sigma\_\{\\sf fm\}as follows: for all histories of depth at mostdd, if no belief state inWWhas been reached yet, we play according toσ\\sigma\. Whenever a belief state inWWis reached, we switch to the finite\-memory policy that enforces safety almost surely from that belief state, which exists by Lemma[1](https://arxiv.org/html/2605.14440#Thmproblem1)\. In the remaining case, that is, when we reach a depth greater thanddwithout having reached a belief state inWW, we play an arbitrary action\. Clearly, this policy uses finite memory, as it depends only on the history of observations and actions up to depthdd\. And, since from any belief state inWW,σ𝖿𝗆\\sigma\_\{\\sf fm\}ensures safety almost surely, we haveℙ𝒫σ𝖿𝗆\(□¬𝖡𝖺𝖽\)=ℙℬ𝒫σ𝖿𝗆\(◇W\)≥ℙℬ𝒫σ\(◇≤dW\)\>α\.\\mathbb\{P\}^\{\\sigma\_\{\\sf fm\}\}\_\{\\mathcal\{P\}\}\(\\Box\\neg\{\\sf Bad\}\)=\\mathbb\{P\}^\{\\sigma\_\{\\sf fm\}\}\_\{\\mathcal\{B\}\_\{\\mathcal\{P\}\}\}\(\\Diamond W\)\\geq\\mathbb\{P\}^\{\\sigma\}\_\{\\mathcal\{B\}\_\{\\mathcal\{P\}\}\}\(\\Diamond^\{\\leq d\}W\)\>\\alpha\\,\.∎

#### Threshold Discounted Reward Problem\.

In the following, we use an online solver to compute locally optimal actions after a given history of observations and actions\. However, existing online solvers are mainly designed to optimizediscounted\-sum reward objectives, and do not directly handle infinite\-horizon safety properties\. We therefore start by formally defining the synthesis problem for discounted\-sum objectives\. We then show how the threshold\-safety synthesis problem can be related to this setting andapproximatelyreduced to it\.

For this, we introduce a few additional notations\. A reward functionr:Z×A⇀ℚr:Z\\times A\\rightharpoonup\\mathbb\{Q\}assigns a reward to each observation\-action pair\. Letρ=s0a0s1a1…\\rho=s\_\{0\}a\_\{0\}s\_\{1\}a\_\{1\}\\dotsbe an infinite path\. The*total reward*alongρ\\rhois∑i=0∞r\(𝒪\(si\),ai\)\\sum\_\{i=0\}^\{\\infty\}r\(\{\\cal O\}\(s\_\{i\}\),a\_\{i\}\)\. Given a*discount factor*λ∈\(0,1\)\\lambda\\in\(0,1\), the*discounted reward*alongρ\\rho, denotedRewλ\(ρ\)\\textsf\{Rew\}^\{\\lambda\}\(\\rho\), is defined as∑i=0∞λi⋅r\(𝒪\(si\),ai\)\.\\sum\_\{i=0\}^\{\\infty\}\\lambda^\{i\}\\cdot r\(\{\\cal O\}\(s\_\{i\}\),a\_\{i\}\)\.

We call a POMDP𝒫\\mathcal\{P\}, when equipped with a reward functionrr, a*reward POMDP*\. For a policyσ\\sigmain𝒫\\mathcal\{P\}, we define its expected discounted reward, denoted by𝔼𝒫σ\(Rewλ\)\\mathbb\{E\}^\{\\sigma\}\_\{\\mathcal\{P\}\}\(\\textsf\{Rew\}^\{\\lambda\}\), as the expected value of the discounted rewards over all paths in the Markov chain𝒫σ\\mathcal\{P\}\_\{\\sigma\}that is induced byσ\\sigma\.

###### Problem 2\(Threshold discounted reward problem\)

Input: A POMDP𝒫\\mathcal\{P\}, reward functionrr, discount factorλ∈\(0,1\)\\lambda\\in\(0,1\), andthresholdα∈ℚ\\alpha\\in\\mathbb\{Q\}\.Output: A policyσ\\sigmas\.t\.𝔼𝒫σ\(Rewλ\)\>α\\mathbb\{E\}^\{\\sigma\}\_\{\\mathcal\{P\}\}\(\{\\textsf\{Rew\}\}^\{\\lambda\}\)\>\\alpha, if it exists, and𝖭𝗈𝗇𝖾,\\mathsf\{None\},otherwise\.

Note that this problem is also undecidable\[[28](https://arxiv.org/html/2605.14440#bib.bib10)\]\. Furthermore, it was shown in\[[1](https://arxiv.org/html/2605.14440#bib.bib9)\]that safety \(and reachability\) objectives cannot be structurally represented as discounted reward objectives in anexactway\. However, we show below that safety objectives can be approximated to any desired degree of accuracy, which is the best achievable outcome in this setting\. We also argue that this approximation is sufficient in practice, as supported by our experimental results\.

First, for safety objectives, we proceed with the following reduction\. Given any set𝖡𝖺𝖽⊆Z\\mathsf\{Bad\}\\subseteq Z, without loss of generality, we will assume that the*unsafe*states \(states whose observations are in𝖡𝖺𝖽\\mathsf\{Bad\}\) are sinks\. Then, from𝒫\\mathcal\{P\}, we construct a reward POMDP𝒫r\\mathcal\{P\}^\{r\}as follows: add one additional stateqq, assign a new observationO\(q\)O\(q\), redirect all outgoing transitions from the unsafe states in𝒫\\mathcal\{P\}toqq, and makeqqa sink state\. Define the reward functionrrin𝒫r\\mathcal\{P\}^\{r\}as follows: for every actionaa, we assign a reward−1\-1to the observation\-action pairs\(o,a\)\(o,a\)foro∈𝖡𝖺𝖽o\\in\\mathsf\{Bad\}, and assign0to every other observation\-action pair\. It is straightforward to verify that, under this construction, every infinite path in𝒫r\\mathcal\{P\}^\{r\}that never visits𝖡𝖺𝖽\\mathsf\{Bad\}has a total reward0, while every infinite path that visits𝖡𝖺𝖽\\mathsf\{Bad\}has a \(undiscounted\) total reward−1\-1\.

In online samplers, the focus is not on the total reward, but rather on the discounted reward\. However, when the discount factorλ\\lambdais close to11, the difference between the total reward and the discounted reward becomes negligible\. Additionally, there is a bijection between paths in𝒫\\mathcal\{P\}and𝒫r\\mathcal\{P\}^\{r\}, which can further be lifted to policies in the two POMDPs\. This is because, after visiting a state in𝖡𝖺𝖽\\mathsf\{Bad\}, all transitions are deterministic in both \(and the rest are the same\)\. Combining these two observations, we get that, for all policiesσ\\sigma:

ℙ𝒫σ\(□¬𝖡𝖺𝖽\)=1\+limλ→1𝔼𝒫rσ\(Rewλ\)=limλ→1\(1\+𝔼𝒫rσ\(Rewλ\)\)\.\\mathbb\{P\}\_\{\\mathcal\{P\}\}^\{\\sigma\}\(\\Box\\neg\{\\sf Bad\}\)=1\+\\lim\_\{\\lambda\\rightarrow 1\}\\mathbb\{E\}\_\{\\mathcal\{P\}^\{r\}\}^\{\\sigma\}\(\{\\sf\\textsf\{Rew\}^\{\\lambda\}\}\)=\\lim\_\{\\lambda\\rightarrow 1\}\(1\+\\mathbb\{E\}\_\{\\mathcal\{P\}^\{r\}\}^\{\\sigma\}\(\{\\sf\\textsf\{Rew\}^\{\\lambda\}\}\)\)\\,\.This result implies that, in the limit, asλ\\lambdatends to11, maximizing the expected discounted reward in theλ\\lambda\-discounted POMDP𝒫r\\mathcal\{P\}^\{r\}is equivalent to maximizing the probability of staying safe in𝒫\\mathcal\{P\}\. We formally state this fact below\.

As recalled above, the discounted\-reward problem is undecidable\. Nevertheless, sampling\-based solvers can compute good, near\-optimal policies in many practical settings\.

\{restatable\}

propositionsafetyDrewEq For all POMDPs𝒫\\mathcal\{P\}, and for all safety objectives encoded by observations inBad, let𝒫r\\mathcal\{P\}^\{r\}be the reward POMDP constructed as above\. Then, for any thresholdα∈\[0,1\)\\alpha\\in\[0,1\)andε\>0\\varepsilon\>0, :

1. 1\.there exists a discount factorλ∈\(0,1\)\\lambda\\in\(0,1\), such that if there is a policyσ\\sigmasatisfying1\+𝔼𝒫rσ\(Rewλ\)\>α1\+\\mathbb\{E\}^\{\\sigma\}\_\{\\mathcal\{P\}^\{r\}\}\(\{\\textsf\{Rew\}\}^\{\\lambda\}\)\>\\alpha, thenℙ𝒫σ\(□¬𝖡𝖺𝖽\)\>α−ε\\mathbb\{P\}^\{\\sigma\}\_\{\\mathcal\{P\}\}\(\\Box\\neg\{\\sf Bad\}\)\>\\alpha\-\\varepsilon;
2. 2\.if there exists a policyσ\\sigmasuch thatℙ𝒫σ\(□¬𝖡𝖺𝖽\)\>α\\mathbb\{P\}^\{\\sigma\}\_\{\\mathcal\{P\}\}\(\\Box\\neg\{\\sf Bad\}\)\>\\alpha, then, there exists a discount factorλ∈\(0,1\)\\lambda\\in\(0,1\)such that1\+𝔼𝒫rσ\(Rewλ\)\>α−ε1\+\\mathbb\{E\}^\{\\sigma\}\_\{\\mathcal\{P\}^\{r\}\}\(\{\\textsf\{Rew\}\}^\{\\lambda\}\)\>\\alpha\-\\varepsilon\.

## 4The FrameworkCplus

In this section, we present our policy learning framework, named*Counterexample\-guided Policy Learning Using Sampling*\(Cplus\), for the safety\-synthesis problem that combines sampling and model\-checking through active automata learning\. A diagrammatic description is provided in[Figure˜3](https://arxiv.org/html/2605.14440#S4.F3)\. For the rest of this section, we fix a finite POMDP𝒫=\(S,A,P,s0,Z,O\)\\mathcal\{P\}=\(S,A,P,s\_\{0\},Z,O\)with a set of unsafe observations𝖡𝖺𝖽\\mathsf\{Bad\}and a thresholdα∈\[0,1\)\\alpha\\in\[0,1\)inducing a safety specification \(as formulated in Problem[1](https://arxiv.org/html/2605.14440#Thmproblem1)\)\. Let𝒫r\\mathcal\{P\}^\{r\}be the reward POMDP following the construction described in[Section˜3](https://arxiv.org/html/2605.14440#S3)\.

Our algorithm is an adaptation of a variant of Angluin’sL∗L^\{\*\}for Mealy machines\[[36](https://arxiv.org/html/2605.14440#bib.bib30)\]\. The algorithm maintains a table𝒯\\mathcal\{T\}throughout, whose rows and columns are sequences of observations of𝒫\\mathcal\{P\}\. Formally, an observation table is a triple𝒯=\(𝖲,𝖤,𝖳\)\\mathcal\{T\}=\(\\mathsf\{S\},\\mathsf\{E\},\\mathsf\{T\}\), where𝖲\\mathsf\{S\}is a nonempty prefix\-closed set of observation sequences,𝖤\\mathsf\{E\}is a nonempty set of observation sequences, and𝖳:\(𝖲∪𝖲⋅Z\)×𝖤→A∗\\mathsf\{T\}:\(\\mathsf\{S\}\\cup\\mathsf\{S\}\\cdot Z\)\\times\\mathsf\{E\}\\to A^\{\*\}is a function that assigns a sequence of actions to observation sequences\. The observation table is initialized with𝖲=\{ε\}\\mathsf\{S\}=\\\{\\varepsilon\\\}and𝖤=Z\\mathsf\{E\}=Z\(see Table[1](https://arxiv.org/html/2605.14440#S4.T1)\)\.

■\\mathbin\{\\,\{\\color\[rgb\]\{1,0,0\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{1,0,0\}\\blacksquare\}\\,\}■\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}■\\mathbin\{\\,\{\\color\[rgb\]\{0,0,1\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0,0,1\}\\blacksquare\}\\,\}𝖲\\mathsf\{\\mathsf\{S\}\}ε\\varepsilonx→\\rightarrow~x𝖲⋅Z\\mathsf\{\\mathsf\{S\}\}\\cdot Z■\\mathbin\{\\,\{\\color\[rgb\]\{1,0,0\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{1,0,0\}\\blacksquare\}\\,\}xxx■\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}x→\\rightarrow~x■\\mathbin\{\\,\{\\color\[rgb\]\{0,0,1\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0,0,1\}\\blacksquare\}\\,\}xxxTable 1:Initial observation table for the POMDP in[Figure˜2](https://arxiv.org/html/2605.14440#S2.F2)\.For a rows=z11z12…z1ms=z\_\{1\}^\{1\}z\_\{1\}^\{2\}\\dots z\_\{1\}^\{m\}and a columne=z21z22…z2ne=z\_\{2\}^\{1\}z\_\{2\}^\{2\}\\dots z\_\{2\}^\{n\}, the value of𝖳\(s,e\)\\mathsf\{T\}\(s,e\)is a sequence of actionsa21a22…a2na\_\{2\}^\{1\}a\_\{2\}^\{2\}\\dots a\_\{2\}^\{n\}\. Note that,\|e\|=\|𝖳\(s,e\)\|\|e\|=\|\\mathsf\{T\}\(s,e\)\|\. Further, from the rowss, the columnee, and the value𝖳\(s,e\)\\mathsf\{T\}\(s,e\), we can obtain an action\-observation sequencez11a11z12a12…z1ma1mz21a21z22a22…z2na2nz\_\{1\}^\{1\}a\_\{1\}^\{1\}z\_\{1\}^\{2\}a\_\{1\}^\{2\}\\dots z\_\{1\}^\{m\}a\_\{1\}^\{m\}z\_\{2\}^\{1\}a\_\{2\}^\{1\}z\_\{2\}^\{2\}a\_\{2\}^\{2\}\\dots z\_\{2\}^\{n\}a\_\{2\}^\{n\}, wherea1i=𝖳\(z11z12…z1i−1,z1i\)a\_\{1\}^\{i\}=\\mathsf\{T\}\(z\_\{1\}^\{1\}z\_\{1\}^\{2\}\\dots z\_\{1\}^\{i\-1\},z\_\{1\}^\{i\}\), for1≤i≤m1\\leq i\\leq m\(with the convention thatz10=εz\_\{1\}^\{0\}=\\varepsilon\)\. One can observe that, these values already exist in𝒯\\mathcal\{T\}, since𝖲\\mathsf\{S\}is prefix\-closed withz11z12…z1i−1∈𝖲∪𝖲⋅Zz\_\{1\}^\{1\}z\_\{1\}^\{2\}\\dots z\_\{1\}^\{i\-1\}\\in\\mathsf\{S\}\\cup\\mathsf\{S\}\\cdot Z, andz1i∈Z⊆𝖤z\_\{1\}^\{i\}\\in Z\\subseteq\\mathsf\{E\}\. Lastly, suppose𝖤=\{e1,…,eℓ\}\\mathsf\{E\}=\\\{e\_\{1\},\\ldots,e\_\{\\ell\}\\\}for someℓ∈ℕ\\ell\\in\\mathbb\{N\}\. Then, for anys∈𝖲∪𝖲⋅Zs\\in\\mathsf\{S\}\\cup\\mathsf\{S\}\\cdot Z, we writerow\(s\)row\(s\)to denote the tuple\(𝖳\(s,e1\),𝖳\(s,e2\),…,\(𝖳\(s,eℓ\)\)\(\\mathsf\{T\}\(s,e\_\{1\}\),\\mathsf\{T\}\(s,e\_\{2\}\),\\ldots,\(\\mathsf\{T\}\(s,e\_\{\\ell\}\)\)\.

We use two kinds of queries to construct the observation table:*action queries*and*model\-checking queries*, as described below\.

- •*Action Queries*\(AQ\): given a sequence of observations and actionshh, if it is a valid history in𝒫\\mathcal\{P\}, we query an action oracle \(𝒪A\\mathcal\{O\}^\{A\}\) to identify the actionaawhich is optimal333We use a cache to store the action queries instead of querying𝒪A\\mathcal\{O\}^\{A\}each time; this avoids selecting different actions for the same history on different occasions\.to take afterhh\. Ifhhis not a valid history, we return a special symbol ‘x’ representing ‘don’t care’\. The validity of a history can be checked using classical graph theoretic analysis on the POMDP\.
- •*Model\-Checking Queries*\(MCQ\): given an FSCℋ\\mathcal\{H\}, we query a model\-checking oracle \(𝒪M\\mathcal\{O\}^\{M\}\) to verify ifℋ\\mathcal\{H\}satisfies the safety specification,*i\.e\.*, whetherℙ𝒫×ℋ\(□¬𝖡𝖺𝖽\)\>α\\mathbb\{P\}\_\{\\mathcal\{P\}\\times\\mathcal\{H\}\}\(\\Box\\neg\\mathsf\{Bad\}\)\>\\alphaholds\. The oracle𝒪M\\mathcal\{O\}^\{M\}either returns*yes*\(if the above holds\), or a*finite*set of paths in𝒫×ℋ\\mathcal\{P\}\\times\\mathcal\{H\}\(called*counterexample*\) that are unsafe and have a cumulative probability of at least1−α1\-\\alpha\.

We now elaborate on different steps of our frameworkCplus\.

![Refer to caption](https://arxiv.org/html/2605.14440v1/figures/pseudoalgo.png)Figure 3:Overview of our policy learning frameworkCplus\.#### Action Queries via Sampling\.

Given a*valid*historyhhof observations and actions, we use a sampling\-based𝒪A\\mathcal\{O\}^\{A\}to compute an actionaato play afterhhin order to maximize the expected discounted reward of the associated POMDP𝒫r\\mathcal\{P\}^\{r\}\. Thanks to[Problem˜2](https://arxiv.org/html/2605.14440#Thmproblem2), in the following, we will assume that𝒪A\\mathcal\{O\}^\{A\}is trying to maximize the probability of satisfying the safety specification in𝒫\\mathcal\{P\}\.

#### Hypothesis Construction\.

We ensure that the table is*closed*, i\.e\., for alls′∈𝖲⋅Zs^\{\\prime\}\\in\\mathsf\{S\}\\cdot Z, there existss∈𝖲s\\in\\mathsf\{S\}such thatrow\(s\)=row\(s′\)row\(s\)=row\(s^\{\\prime\}\)444In contrast to the originalL∗L^\{\*\}, we do not need to check for*consistency*, thanks to the counterexample\-processing technique of Rivest\-Schapire\[[35](https://arxiv.org/html/2605.14440#bib.bib41)\]\.\. Once the table is closed, a hypothesis FSCℋ=\(N,γ,δ,n0\)\\mathcal\{H\}=\(N,\\gamma,\\delta,n\_\{0\}\), corresponding to the table, is constructed using standard techniques forL∗L^\{\*\}, as follows\. The nodesNNare the rows in𝖲\\mathsf\{S\}, the initial noden0n\_\{0\}isrow\(ε\)row\(\\varepsilon\)\. The transition function is defined asδ\(row\(s\),o\)=row\(so\)\\delta\(row\(s\),o\)=row\(so\), and the action map is defined asγ\(row\(s\),o\)=𝖳\(s,o\)\\gamma\(row\(s\),o\)=\\mathsf\{T\}\(s,o\)\.

#### Model\-checking Queries\.

Given a hypothesisℋ\\mathcal\{H\}, we do a model\-checking query forℋ\\mathcal\{H\}using a probabilistic model\-checker \(𝒪M\\mathcal\{O\}^\{M\}\)\. We evaluate the quality ofℋ\\mathcal\{H\}by computing the safety probability in the Markov chain𝒫×ℋ\\mathcal\{P\}\\times\\mathcal\{H\}\. If indeedℙ𝒫×ℋ\(□¬𝖡𝖺𝖽\)\>α\\mathbb\{P\}\_\{\\mathcal\{P\}\\times\\mathcal\{H\}\}\(\\Box\\neg\\mathsf\{Bad\}\)\>\\alpha, then the algorithm terminates and returnsℋ\\mathcal\{H\}\. Else,𝒪M\\mathcal\{O\}^\{M\}returns a*finite*counterexampleℛ\\mathcal\{R\}\.

###### Definition 5\(Counterexample\)

For a hypothesisℋ\\mathcal\{H\}that does not satisfy the safety specification, a*counterexample*is a setℛ\\mathcal\{R\}of finite paths in the MC𝒫×ℋ\\mathcal\{P\}\\times\\mathcal\{H\}, such that𝖢𝗒𝗅\(ℛ\)⊆◇𝖡𝖺𝖽\\mathsf\{Cyl\}\(\\mathcal\{R\}\)\\subseteq\\Diamond\\mathsf\{Bad\}andℙ𝒫×ℋ\(𝖢𝗒𝗅\(ℛ\)\)\>1−α\\mathbb\{P\}\_\{\\mathcal\{P\}\\times\\mathcal\{H\}\}\(\\mathsf\{Cyl\}\(\\mathcal\{R\}\)\)\>1\-\\alpha\.

For safety specifications in finite MC, finite counterexamples suffice\[[18](https://arxiv.org/html/2605.14440#bib.bib2)\]:

###### Lemma 1\(\[[18](https://arxiv.org/html/2605.14440#bib.bib2)\]\)

For a finiteMCℳ\\mathcal\{M\}, if it does not satisfy a safety specificationℙℳ\(□¬𝖡𝖺𝖽\)\>α\\mathbb\{P\}\_\{\\mathcal\{M\}\}\(\\Box\\neg\\mathsf\{Bad\}\)\>\\alpha, then there exists a counterexampleℛ\{\\cal R\}that is finite\.

When𝒪M\\mathcal\{O\}^\{M\}returns a counterexampleℛ\\mathcal\{R\},Cplusproceeds to the next step\.

#### Counterexample Processing\.

Letℛ\\mathcal\{R\}be a finite counterexample returned by𝒪M\\mathcal\{O\}^\{M\}in the previous step for hypothesisℋ\\mathcal\{H\}\. Assumeℛ\\mathcal\{R\}consists of the finite pathsρ1,ρ2,…,ρk\\rho\_\{1\},\\rho\_\{2\},\\dots,\\rho\_\{k\}\. Now, all actions taken along these paths are either due to𝒪A\\mathcal\{O\}^\{A\}\(*i\.e\.*, the answers to the action queries\), or to the “generalization” that occurs during the hypothesis building phase of the learning algorithm\. For all the latter paths, we call𝒪A\\mathcal\{O\}^\{A\}to check whether there exists someρj\\rho\_\{j\}and a prefixhhofρj\\rho\_\{j\}such that the action afterhhassigned byℋ\\mathcal\{H\}and the one returned by𝒪A\\mathcal\{O\}^\{A\}are different\. We then add a suffix ofρj\\rho\_\{j\}as the column of𝒯\\mathcal\{T\}using the technique of Rivest\-Schapire\[[35](https://arxiv.org/html/2605.14440#bib.bib41)\]\. In the following lemma, we establish when such a prefix exists\. We introduce the notation: given a pathρ:=s0s1…\\rho:=s\_\{0\}s\_\{1\}\\dots, and a policyσ\\sigma, we noteρ⊧σ\\rho\\models\\sigmawhen for everyi≥0i\\geq 0,σ\(z0a0z1…zi\)=ai\\sigma\(z\_\{0\}a\_\{0\}z\_\{1\}\\dots z\_\{i\}\)=a\_\{i\}, where for everyj≥0j\\geq 0,zjz\_\{j\}is the observation of the statesjs\_\{j\}\. The following lemma then follows from[Definition˜5](https://arxiv.org/html/2605.14440#Thmdefinition5)\.

\{restatable\}

lemmacounterex Letℋ\\mathcal\{H\}be a hypothesis not satisfying the safety specification,*i\.e\.*,ℙ𝒫×ℋ\(□¬𝖡𝖺𝖽\)≤α\\mathbb\{P\}\_\{\\mathcal\{P\}\\times\\mathcal\{H\}\}\(\\Box\\neg\\mathsf\{Bad\}\)\\leq\\alpha\. Further, let𝒢:=\{σ∣ℙ𝒫σ\(□¬𝖡𝖺𝖽\)\>α\}\\mathcal\{G\}:=\\\{\\sigma\\mid\\mathbb\{P\}\_\{\\mathcal\{P\}\}^\{\\sigma\}\(\\Box\\neg\\mathsf\{Bad\}\)\>\\alpha\\\}be the set of all threshold\-safe policies\. Ifℛ\\mathcal\{R\}is a counterexample for the model\-checking query onℋ\\mathcal\{H\}, then∀σ∈𝒢\\forall\\sigma\\in\\mathcal\{G\},∃ρ∈ℛ\\exists\\rho\\in\\mathcal\{R\}such thatρ⊧̸σ\\rho\\not\\models\\sigma\.

Due to the undecidability result, there are two possibilities: the action oracle𝒪A\\mathcal\{O\}^\{A\}may or may not have a threshold\-safe policy\. When it has, the policy of𝒪A\\mathcal\{O\}^\{A\}belongs to𝒢\\mathcal\{G\}of Lemma[4](https://arxiv.org/html/2605.14440#S4.SS0.SSS0.Px4)\. Thus, we are guaranteed to find a prefixhhthat we add to the observation table𝒯\\mathcal\{T\}andCplusmoves to the next iteration\. When𝒪A\\mathcal\{O\}^\{A\}does not have a threshold\-safe policy, we may not be able to find any prefixhhfrom the counterexampleℛ\\mathcal\{R\}where𝒪A\\mathcal\{O\}^\{A\}disagrees with the hypothesis\. In that case,Cplusreturns*fail*\.

We highlight two key properties ofCplus\. First, whenCplusterminates and returns an FSC, the model\-checking step guarantees that the FSC satisfies the safety specification\. Second, since the threshold\-safety problem is generally undecidable for POMDPs, only relative completeness can be established:Cplusmay not always terminate, however, whenever the action oracle has a regular winning policy \(sayσD\\sigma\_\{D\}\),Cplusterminates\. This follows from the termination guarantee ofL∗L^\{\*\}and the fact that a counterexample produced in the model\-checking step contains a pathρ⊧̸σD\\rho\\not\\models\\sigma\_\{D\}, due to[Section˜4](https://arxiv.org/html/2605.14440#S4.SS0.SSS0.Px4)\. The following theorem formalizes this idea, and the detailed proof can be found in Appendix[0\.B](https://arxiv.org/html/2605.14440#Pt0.A2)\.

\{restatable\}

theoremrelCompl Given a POMDP𝒫\\mathcal\{P\}with a set of unsafe observations𝖡𝖺𝖽\\mathsf\{Bad\}and a thresholdα∈\[0,1\)\\alpha\\in\[0,1\)inducing a safety specification \(as formulated in Problem[1](https://arxiv.org/html/2605.14440#Thmproblem1)\), let𝒫r\\mathcal\{P\}^\{r\}be the corresponding reward POMDP\.

1. ∙\\bullet\(Correctness\)IfCplusterminates and returns an FSCℋ\\mathcal\{H\}, then it satisfies the specification,*i\.e\.*,ℙ𝒫×ℋ\(□¬𝖡𝖺𝖽\)\>α\\mathbb\{P\}\_\{\\mathcal\{P\}\\times\\mathcal\{H\}\}\(\\Box\\neg\\mathsf\{Bad\}\)\>\\alpha\.
2. ∙\\bullet\(Relative completeness\)For anyε\>0\\varepsilon\>0, if the action oracle \(with the correspondingλ\\lambdaas in[Problem˜2](https://arxiv.org/html/2605.14440#Thmproblem2)\) follows a regular policyσ\\sigma, such that it satisfies1\+𝔼𝒫rσ\(Rewλ\)\>α\+ε1\+\\mathbb\{E\}^\{\\sigma\}\_\{\\mathcal\{P\}^\{r\}\}\(\{\\textsf\{Rew\}\}^\{\\lambda\}\)\>\\alpha\+\\varepsilonin𝒫r\\mathcal\{P\}^\{r\}, thenCplusterminates and returns an FSC \(guaranteed to be correct by the correctness property above\)\.

### Illustrative example\.

We now demonstrate the execution ofCpluson the POMDP𝒫\\mathcal\{P\}in[Figure˜2](https://arxiv.org/html/2605.14440#S2.F2)\. We first initialise𝖲\\mathsf\{S\}to\{ε\}\\\{\\varepsilon\\\}and𝖤\\mathsf\{E\}to\{■,■,■\}\\\{\\mathbin\{\\,\{\\color\[rgb\]\{1,0,0\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{1,0,0\}\\blacksquare\}\\,\},\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\},\\mathbin\{\\,\{\\color\[rgb\]\{0,0,1\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0,0,1\}\\blacksquare\}\\,\}\\\}\. The initial observation table built using𝒪A\\mathcal\{O\}^\{A\}is shown in[Table˜1](https://arxiv.org/html/2605.14440#S4.T1)\. Since the robot starts in the cell \(0,1\), the first observation seen can only be■\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}\. Therefore, the table containsxin the cells corresponding to the other two observations\. This observation table is closed\. Therefore, we construct the hypothesis FSCℋ\\mathcal\{H\}in[Figure˜4](https://arxiv.org/html/2605.14440#S4.F4)from this observation table as described in Page[4](https://arxiv.org/html/2605.14440#S4.SS0.SSS0.Px2)\. In this example, we query𝒪M\\mathcal\{O\}^\{M\}to model\-check the threshold\-safety formula:ℙ𝒫×ℋ\(□¬𝖡𝖺𝖽\)\>0\.7\\mathbb\{P\}\_\{\\mathcal\{P\}\\times\\mathcal\{H\}\}\(\\Box\\neg\\mathsf\{Bad\}\)\>0\.7\. In this case,𝒪M\\mathcal\{O\}^\{M\}returns a counterexampleρ:=■■■■\\rho:=\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}\. Note now that the observation\-action sequence■⁣→⁣■⁣→⁣■⁣→⁣■⁣→\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}\\rightarrow~\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}\\rightarrow~\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}\\rightarrow~\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}\\rightarrow~results in a bad observation \(i\.e\., it goes outside the grid\) with probability0\.94=0\.65\>0\.30\.9^\{4\}=0\.65\>0\.3\. From the observation sequenceρ\\rho, we obtain the suffix■⁣■⁣■\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}using Rivest\-Schapire counterexample processing and add this to𝖤\\mathsf\{E\}\. We fill the resultant observation table using queries to𝒪A\\mathcal\{O\}^\{A\}\. The resultant table is shown in[Table˜2](https://arxiv.org/html/2605.14440#S4.T2)\. Since this table is not closed \(the row labelled with■\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}is not in𝖲\\mathsf\{S\}\), we make it closed\. This gives us the observation table shown in[Table˜3](https://arxiv.org/html/2605.14440#S4.T3)\. From this, we construct the FSC given in[Figure˜2](https://arxiv.org/html/2605.14440#S2.F2)and query𝒪M\\mathcal\{O\}^\{M\}again\. The query now succeeds, andCplusreturns this FSC\.

n0n\_\{0\}■⁣/⁣→\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}/\\rightarrow~Figure 4:Hypothesis after the first iteration\.■\\mathbin\{\\,\{\\color\[rgb\]\{1,0,0\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{1,0,0\}\\blacksquare\}\\,\}■\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}■\\mathbin\{\\,\{\\color\[rgb\]\{0,0,1\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0,0,1\}\\blacksquare\}\\,\}■⁣■⁣■\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}\\ \\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}\\ \\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}𝖲\\mathsf\{\\mathsf\{S\}\}ε\\varepsilonx→\\rightarrow~x→⁣→⁣→\\rightarrow~\\rightarrow~\\rightarrow~𝖲⋅Z\\mathsf\{\\mathsf\{S\}\}\\cdot Z■\\mathbin\{\\,\{\\color\[rgb\]\{1,0,0\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{1,0,0\}\\blacksquare\}\\,\}xxxxxx■\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}x→\\rightarrow~x→⁣→⁣↑\\rightarrow~\\rightarrow~\\ \\uparrow\\■\\mathbin\{\\,\{\\color\[rgb\]\{0,0,1\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0,0,1\}\\blacksquare\}\\,\}xxxxxx

Table 2:Observation table after counterexample processing\.

■\\mathbin\{\\,\{\\color\[rgb\]\{1,0,0\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{1,0,0\}\\blacksquare\}\\,\}■\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}■\\mathbin\{\\,\{\\color\[rgb\]\{0,0,1\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0,0,1\}\\blacksquare\}\\,\}■⁣■⁣■\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}\\ \\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}\\ \\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}𝖲\\mathsf\{\\mathsf\{S\}\}ε\\varepsilonx→\\rightarrow~x→⁣→⁣→\\rightarrow~\\rightarrow~\\rightarrow~■\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}x→\\rightarrow~x→⁣→⁣↑\\rightarrow~\\rightarrow~\\ \\uparrow\\■⁣■\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}x→\\rightarrow~x→↑↑\\rightarrow~\\ \\uparrow\\ \\ \\uparrow\\■⁣■⁣■\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}x↑\\ \\uparrow\\x↑↑↑\\ \\uparrow\\ \\ \\uparrow\\ \\ \\uparrow\\■⁣■⁣■⁣■\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}x↑\\ \\uparrow\\↓\\ \\downarrow\\↑↑↑\\ \\uparrow\\ \\ \\uparrow\\ \\ \\uparrow\\𝖲⋅Z\\mathsf\{\\mathsf\{S\}\}\\cdot Z■⁣■⁣■⁣■⁣■\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}x↑\\ \\uparrow\\↓\\ \\downarrow\\↑↑↑\\ \\uparrow\\ \\ \\uparrow\\ \\ \\uparrow\\■⁣■⁣■⁣■⁣■\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}\\mathbin\{\\,\{\\color\[rgb\]\{\.5,\.5,\.5\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{\.5,\.5,\.5\}\\pgfsys@color@gray@stroke\{\.5\}\\pgfsys@color@gray@fill\{\.5\}\\blacksquare\}\\,\}\\mathbin\{\\,\{\\color\[rgb\]\{0,0,1\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0,0,1\}\\blacksquare\}\\,\}x↑\\ \\uparrow\\↓\\ \\downarrow\\↑↑↑\\ \\uparrow\\ \\ \\uparrow\\ \\ \\uparrow\\

Table 3:Observation table after making[Table˜2](https://arxiv.org/html/2605.14440#S4.T2)*closed*\. Rows with onlyxare omitted\.

Finally, note that the FSC, if any, returned by our framework need not be the same as the one implicitly defined by the action oracle because the learning procedure may terminate before fully constructing the policy of the action oracle\. Nonetheless, due to the model\-checking step, the returned controller will also satisfy the threshold\-safety specification \(see[Section˜4](https://arxiv.org/html/2605.14440#S4.SS0.SSS0.Px4)\)\.

### Beyond Safety Objectives\.

So far, we have focused on safety objectives\. This choice was mainly motivated by the simplicity of counterexamples, which can be represented as finite sets of finite paths violating the safety threshold, allowing efficient integration of model checking into the learning loop\. Other objectives, such as infinite\-horizon reachability, are more challenging, as counterexamples may involve infinite paths or more complex structures, for instance, subgraphs of the induced Markov chain that violate the reachability constraint\.

To handle reachability, we rely on a standard reduction from unbounded to bounded reachability, where the target observation must be reached within a fixed number of steps\. This reduction turns reachability into a safety problem and provides an approximation that converges to the unbounded objective as the bound increases\. We use this approach to validate reachability synthesis tasks\. Importantly, if the model\-checking phase inCplussucceeds for a bounded reachability objective, then the resulting FSC also satisfies the corresponding unbounded reachability property\.

Finally, our framework naturally supports*reach\-avoid*objectives, which combine reaching a target set of observations with avoiding unsafe observations, and are both expressive and practically relevant\.

## 5Experiments

### Implementation Details\.

We have made a prototype implementation ofCplus555publicly available at[https://github\.com/mukherjee\-sayan/pomdp\-policy\-synth](https://github.com/mukherjee-sayan/pomdp-policy-synth)inPythonusing the automata learning libraryAalpy\[[30](https://arxiv.org/html/2605.14440#bib.bib35), v1\.4\.1\]\. The POMDPs were modelled in PRISM format and then translated to the DRN format using the model checkerStorm\[[20](https://arxiv.org/html/2605.14440#bib.bib20)\]\. We built a simulator for DRN files to handle the action queries, in the online plannerDespot\[[40](https://arxiv.org/html/2605.14440#bib.bib7)\]\.Despotemploys an anytime heuristic search algorithm to incrementally construct a finite\-depth sparse approximation of the belief tree, maintaining and updating lower and upper bounds on the value of each belief node\. These bounds guide the search and expansion process by focusing on the most promising parts of the tree\. The algorithm finally selects and returns the best action at the root belief node based on the current value estimates\. To optimize the amount of computation performed byDespot, and also to get rid of possible non\-determinism in the action queries due to sampling, for every sequence of observations that we query onDespot, we store the actions suggested byDespotin a cache\. This means, given two sequencesh,h′h,h^\{\\prime\}withhhbeing a prefix ofh′h^\{\\prime\}, when computing the actions forh′h^\{\\prime\}, we callDespotonly on the suffix ofh′h^\{\\prime\}which is not present inhh\. If given sufficient time,Despotreturns an optimal action with high probability \(see\[[40](https://arxiv.org/html/2605.14440#bib.bib7)\]for a formal description\)\. We use the default settings ofDespot, in particular, the discount factorλ\\lambdais set to0\.950\.95666The fixedλ\\lambdavalue was chosen as a proof\-of\-concept\. We acknowledge that the resultingε\\varepsilon\(in[Problem2](https://arxiv.org/html/2605.14440#Thmproblem2)\) is not quantified per example, and that longer\-horizon instances may be more sensitive to this choice\.\.

Model\-checking queries were implemented inStorm\[[20](https://arxiv.org/html/2605.14440#bib.bib20)\]\. When a counterexample set exists for the given safety thresholdα\\alphaand a hypothesis FSC,Stormfinds one by computing the shortest paths until the accumulated probability mass of the traces exceedsα\\alpha\. It uses a recursive enumeration algorithm\[[21](https://arxiv.org/html/2605.14440#bib.bib32)\]for finding the shortest paths in a weighted graph \(see\[[18](https://arxiv.org/html/2605.14440#bib.bib2)\]\)\.

### Experimental Details\.

In our implementation, we instantiateCpluswithDespotas action oracle andStormas the model\-checking oracle \(denoted asCplus\(D,S\)\\textsc\{Cplus\}\(\\textsc\{D\},\\textsc\{S\}\)\)\. In the benchmarks whereStormperforms well, we also useStormas both the action and model\-checking oracle \(denoted asCplus\(S,S\)\\textsc\{Cplus\}\(\\textsc\{S\},\\textsc\{S\}\)\)\. When usingDespotas𝒪A\\mathcal\{O\}^\{A\},Cplusqueries it on demand as needed\. When usingStorm, we first compute a policy offline on an approximation of the infinite beliefs MDP, andCplusthen uses this precomputed policy to answer action queries\. In this setting,Stormexplores the explicit model up to its internal cutoffs, which may still involve analyzing parts of the belief space thatCplusnever queries\. Also,Cplusmay query beliefs that lie outsideStorm’s explored region, in which case we fall back toDespotfor the action queries\. All experiments were conducted on an Intel®Xeon®Gold 5218R CPU running Ubuntu 22\.04\.5 LTS, with 20 CPU cores and 251 GB RAM\.

We compare our results with two state\-of\-the\-art approaches based on formal methods:\(i\)\(i\)the*belief exploration*approach ofStorm\[[20](https://arxiv.org/html/2605.14440#bib.bib20)\]constructs a finite fragment of the belief MDP by exploring successor beliefs until a limit is reached, then approximates the value of the*cutoff*beliefs \(the beliefs where the exploration stopped\) using policies derived from the underlying MDP, and\(ii\)\(ii\)the*iterative policy synthesis*approach ofPaynt\[[3](https://arxiv.org/html/2605.14440#bib.bib18)\]that inductively explores and evaluates policies of increasing memory size\. WhileStormis a mature probabilistic model checker, its belief\-exploration approach often struggles with*safety*objectives, as it does not guarantee a safe policy from the*cutoff*beliefs\.Paynt, on the other hand, synthesizes compact FSCs; however, it may take significantly longer on instances where small FSCs777Payntusually does not terminate when solutions require FSCs beyond a few nodes\.are not sufficient\.

InStorm, we allocate6060seconds for the belief exploration phase to construct a substantial belief fragment and report the value achieved\. We set a timeout of600600seconds forStorm,Paynt, and also forCplus, with each ofDespotandStormas𝒪A\\mathcal\{O\}^\{A\}\. ForPayntandCplus, we target the same safety thresholds\. Since we cannot runStormwith different thresholds, we report on the single optimal value computed byStormfor all the benchmarks\. ForPaynt, we extract from the solver logs the time at which an FSC satisfying each target safety threshold is first obtained, or report failure if no such FSC is found within the time limit\. ForCplus, synthesis is performed independently for each threshold provided as input, and we report the corresponding synthesis times\.

The main objective of this section is to demonstrate through empirical results that even with a prototypical implementation of theCplusframework, we can handle models that are challenging for existing formal\-methods\-based approaches\. The key takeaways from the empirical results are the following:\(i\)\(i\)Stormoften fails to produce useful policies for infinite horizon*safety*objectives, since its Belief exploration fails to guarantee safety from cutoff belief states \(Table[4](https://arxiv.org/html/2605.14440#S5.T4)\),\(ii\)\(ii\)Payntstruggles as soon as it needs to compute FSCs even with a few number of nodes, since iterative policy synthesis underperforms due to the rapid exponential growth of the solution space when the number of nodes in the candidate FSCs increases \(Table[5](https://arxiv.org/html/2605.14440#S5.T5)\),\(iii\)\(iii\)Cplusis less sensitive to the size of the learnt FSC \(see the examples*hallway*\)\. However, it struggles when the number of observations in the model grows large \(this trend is visible in Table[5](https://arxiv.org/html/2605.14440#S5.T5)\), due to the inherent weakness ofL∗L^\{\*\}on large alphabets,\(iv\)\(iv\)For the examples whereStormperforms well, usingStormas𝒪A\\mathcal\{O\}^\{A\}improves the performance ofCplus\(see the*cards*examples\)\. These results suggest that no single approach dominates across all instances, underscoring the value of a portfolio approach to policy synthesis for POMDPs \(which is undecidable\)\.

### Evaluation\.

We evaluateCpluson different benchmarks that demonstrate the framework’s capability, and present our results in[Tables˜4](https://arxiv.org/html/2605.14440#S5.T4)and[5](https://arxiv.org/html/2605.14440#S5.T5)\. Here,\|ℋ\|\|\\mathcal\{H\}\|reports on the number of states present in the constructed hypothesis;\|ℬ\|\|\\mathcal\{B\}\|is the number of states in the belief MDP computed byStorm; ‘iter’ refers to the number of iterations inCplus; ‘time’ reports the total time taken by the tools in seconds; ‘value’ is the maximum probability to satisfy the specification with the policy computed byStorm\.

Grid Sizeα\\alphaCplus\(D,S\)\\textsc\{Cplus\}\(\\textsc\{D\},\\textsc\{S\}\)PayntStormsuccess\|ℋ\|\|\\mathcal\{H\}\|itertimesuccess\|ℋ\|\|\\mathcal\{H\}\|time\|ℬ\|\|\\mathcal\{B\}\|valuetime5×55\\times 50\.25/53\.214\.815/510\.00417\.8×10617\.8\\times 10^\{6\}0227\.120\.55/53\.214\.815/510\.00410×1010\\times 100\.25/52\.814\.585/510\.00210\.5×10610\.5\\times 10^\{6\}03060\.52/5213\.365/510\.00215×1515\\times 150\.25/5315\.535/510\.0065\.4×1065\.4\\times 10^\{6\}0280\.430\.51/5315\.545/510\.00620×2020\\times 200\.25/53\.417\.205/510\.013\.2×1063\.2\\times 10^\{6\}0336\.680\.50/5–––4/510\.01Table 4:Evaluation on*Grid\-world*benchmarks\. The columns ‘success’ denote the number of times the tools could construct a policy satisfying threshold \(α\\alpha\)\.#### Grid\-World Benchmarks

is a classical domain in POMDP research: a robot moves in an N×N grid, aiming to avoid holes while navigating in cardinal directions\. Movements are stochastic – with90%90\\%probability the move succeeds, and with10%10\\%the robot remains in place\. The robot starts at a random safe cell, and it cannot distinguish between safe cells, leading to partial observability\. The goal for the robot is to maximize the probability of staying safe, that is, avoiding visiting the holes \(bad cells\) in specific locations\. We generate grids of sizeN×NN\\times NforN∈\{5,10,15,20\}N\\in\\\{5,10,15,20\\\}, with55random instances per size\.10%10\\%of the cells in the grid were chosen at random and were assigned as bad cells\. The average runtime, success rate, and synthesized controller size for differentα\\alphaand grid sizes are reported in[Table˜4](https://arxiv.org/html/2605.14440#S5.T4)\. In these randomly generated instances, even FSCs of size11typically suffice, and thusPayntperforms well, synthesizing controllers quickly\. But because of the stochastic nature, the belief space for these POMDPs is unbounded, even for small grids with 25 cells\. Thus, in these safety examples,Storm’s belief\-exploration approach fails: even with an additional6060seconds exploration budget,Stormfails to find good strategies in any of the instances, even when it explores the belief space up to size approximately10710^\{7\}\.

Model\|S\|\|S\|\|Z\|\|Z\|α\\alphaCplus\(D,S\)\\textsc\{Cplus\}\(\\textsc\{D\},\\textsc\{S\}\)Cplus\(S,S\)\\textsc\{Cplus\}\(\\textsc\{S\},\\textsc\{S\}\)PayntStorm\|ℋ\|\|\\mathcal\{H\}\|itertime\|ℋ\|\|\\mathcal\{H\}\|itertime\|ℋ\|\|\\mathcal\{H\}\|time\|ℬ\|\|\\mathcal\{B\}\|valuetimehallway1630\.9925378\.1124349\.3251\.4431\.000\.01hallway\-22430\.9920468\.9920343\.41\-\-31\.000\.01hallway\-simple\-5015440\.99532138\.7753222\.61\-\-1561\.001\.00cards\-removed\-2960\.5411\.59411\.4810\.01101\.000\.01cards\-removed\-376100\.5712\.102024\.0410\.021030\.971\.00cards\-removed\-4165120\.548212\.282927\.4220\.163220\.912\.00cards\-removed\-5306140\.5118244\.7447216\.13\-\-9190\.858\.00cards\-added\-23580\.5711\.93711\.7710\.01400\.740\.01cards\-added\-394100\.529411\.8021510\.6620\.033200\.621\.00cards\-added\-4197120\.53569224\.8710819184\.11\-\-25260\.560\.01cards\-added\-5356140\.5\-\-\-\-\-\-\-\-196030\.5193\.00cheese\-maze1580\.81417\.951212\.8310\.011\.7×107\\times 10^\{7\}0\.99231\.64cheese\-maze\-det1580\.8916\.10911\.5220\.01221\.000\.01refuel \(N=6,E=8\)270360\.5314\.97311\.4610\.072\.6×107\\times 10^\{7\}0\.72288\.55rocks\(N=4\)332660\.512232\.6126115\.5910\.054851\.007\.00

Table 5:Evaluation on various benchmarks for the toolsCplus\(D,S\),Cplus\(S,S\),Storm, andPaynt\. We report only the bounded version of the cards here\. Bold entries indicate the best\-performing tool \(in terms of time\) among those that produce verified FSCs\. Extended experimental results are included in the Appendix \([Table˜6](https://arxiv.org/html/2605.14440#Pt0.A2.T6)\)\.
#### Hallway Navigation Benchmarks

is a variant of the examples in\[[23](https://arxiv.org/html/2605.14440#bib.bib24)\], where a robot is inside a hallway, and its objective is to maximize the probability of*reaching a specific location*\. The robot can move in different directions, but its movement is restricted by the walls in the hallway\. Similar to the Grid\-World scenario, the robot starts from a random position in the hallway and cannot distinguish among the cells\. To further show the advantage of our approach, we also consider another variant of the hallway problem: the robot must go*right*5050times and then go*down*to reach the target \(*hallway\-simple\-5050*\)\. The objective in all these benchmarks is to compute a policy for the threshold0\.990\.99\.

On these benchmarks,Stormreturns randomized policies \(within<1<1second\) that reach the target with probability11\.Payntsucceeds in finding a policy in the smaller variant, but times out in the longer variants\. SinceCpluscannot handle unbounded reachability, we reformulated this objective as bounded reachability with a horizon of100100\. We verified that the policy computed for bounded reachability is also sufficient for the unbounded reachability setup\. On the*hallway\-simple\-50*benchmark, while the winning policy is conceptually simple, the size of the required controller is rather large \(to track the number of steps taken\)\. On this example,Cpluscould find the optimal policy in just22iterations within 3 minutes, whereasPaynt’s inductive approach struggles to produce one, illustrating thatCplus’s counterexample\-guided learning scales better than exhaustive search when large FSCs are necessary\.

#### Cards Game Benchmarks

is a more intricate example, adapted from\[[15](https://arxiv.org/html/2605.14440#bib.bib34)\]\(see[Appendix˜0\.C](https://arxiv.org/html/2605.14440#Pt0.A3)for a version of the example\)\. The game involves anNN\-deck card\. In the*removed*variant, one card is initially removed at random; in the*added*variant, one card is duplicated\. At each step, the player draws a card from the deck uniformly at random or guesses which card is missing or duplicated\. The objective is to guess correctly \(reachability\) while avoiding incorrect guesses \(safety\)\. We consider both variants under two settings: \(i\)*bounded*– the agent must guess the card within2N2Nsteps; and \(ii\)*unbounded*– there is no fixed window to guess the correct card, but at each step, with probability0\.050\.05, the agent is forced to guess the correct card\. These represent challenging benchmarks requiring substantial controller memory\. In the*removed*variant, the optimal policy is to wait until observing all but one card, then guess the unseen card\. In the*added*variant, the policy is to track which card appears most frequently in draws, then guess that card as the duplicate\.

[Table˜5](https://arxiv.org/html/2605.14440#S5.T5)reports results across both variants for bounded settings for different values ofNN\.Payntperforms well on smaller examples, where the FSC size remains manageable, but its performance degrades rapidly asNNincreases due to the exponential growth of the controller space\. In contrast,Cplusscales effectively: it synthesizes better FSCs, which avoids exhaustive enumeration of the controller space\.Storm’s belief\-exploration approach works well on these benchmarks, quickly computing high\-quality policies\. When we useStorm’s computed policy as the action oracle forCplus\(instead ofDespot\), synthesis times improve significantly, demonstrating the flexibility ofCplus’s framework to incorporate different oracles\.

#### Reach\-Avoid Navigation Benchmarks\.

We evaluateCpluson several bigger reach\-avoid navigation benchmarks inspired by\[[23](https://arxiv.org/html/2605.14440#bib.bib24),[39](https://arxiv.org/html/2605.14440#bib.bib42),[29](https://arxiv.org/html/2605.14440#bib.bib23),[22](https://arxiv.org/html/2605.14440#bib.bib25)\]\. The benchmark*Refuel*involves a rover travelling from one corner to the opposite of a grid, while avoiding obstacles on the diagonal; it must manage energy by recharging at designated stations and operate under noisy observations of position and battery level\.*Rocks*requires the rover to sample two rocks \(valuable or dangerous\), collect a valuable rock, deliver it to the drop\-off zone, and avoid dangerous rocks\. In*Cheese\-Maze*, a mouse must collect cheese while avoiding traps\. It has two variants: deterministic variant \(actions succeed as intended\) and stochastic variant \(intended action succeeds with probability0\.80\.8\)\. In these reach\-avoid examples, bothStormandPayntwork well\. Furthermore, when we useStorm’s computed policy as the action oracle, our synthesis time improves significantly\.

### Data Availability\.

The artifact accompanying this paper\[[12](https://arxiv.org/html/2605.14440#bib.bib43)\]contains source code, benchmark files, and scripts to replicate our experiments\.

## 6Conclusion and Future Work

In this work, we proposeCplus– a framework that combines the scalability of sampling\-based POMDP controller synthesis methods with the formal guarantees of model checking\. Our prototype demonstrates thatCpluscan synthesize verified controllers across several benchmarks\. Theoretically, we prove that if the underlying sampler \(in our case,Despot\) induces a winning policy that is regular, thenCplusis guaranteed to terminate with a correct FSC\.

We show that this new approach for solving infinite\-duration safety objectives in POMDPs often compares favourably with state\-of\-the\-art tools such asStormandPaynt\. Given that POMDPs constitute a particularly challenging class of models,Cpluscan be regarded as a complementary framework that can be used within a portfolio of methods alongside existing approaches proposed in the literature\. We will investigate whether this algorithm can be integrated into state\-of\-the\-art frameworks such asStorm\[[20](https://arxiv.org/html/2605.14440#bib.bib20)\]orPrism\[[26](https://arxiv.org/html/2605.14440#bib.bib28)\]\.

For future work, we envision several directions\. First, to address POMDPs with large observation spaces, we plan to explore other techniques such as symbolic automata learning that can handle large \(or even infinite\) alphabets\[[16](https://arxiv.org/html/2605.14440#bib.bib37)\]\. Recent works on compositional automata learning\[[31](https://arxiv.org/html/2605.14440#bib.bib48),[27](https://arxiv.org/html/2605.14440#bib.bib46),[17](https://arxiv.org/html/2605.14440#bib.bib47)\]suggests another promising direction: instead of learning a single FSC monolithically, one could exploit structural decomposition to improve scalability\. Second,Cplusprovides a flexible foundation: each of its core components \(the oracles\) can be replaced with alternative tools, enabling broader applicability across diverse models and objectives\. Third, we aim to generalize the use of counterexamples to handle \(general\) reachability and, more generally, PCTL properties within our framework\. Finally, the idea of combining heuristic search \(here, sampling\) with decidable model checking and learning can also be applied in other settings where the synthesis problem is undecidable\. For example, this framework could be applied to the synthesis problem for weighted timed automata\. As in the case of POMDPs, this synthesis problem is undecidable\[[10](https://arxiv.org/html/2605.14440#bib.bib27)\], while model checking remains decidable\[[9](https://arxiv.org/html/2605.14440#bib.bib26)\]\.

## Acknowledgements

*Debraj Chakraborty*is currently supported by the National Research Foundation, Singapore, under its RSS Scheme \(NRF\-RSS2022\-009\) and was also partially supported by the MASH \(MUNI/I/1757/2021\) funded by Masaryk University and the ERC project InOVationCS \(grant No\. 101171844\)\.*Anirban Majumdar*is supported by the Department of Atomic Energy, Government of India, under project no\. RTI4014\.*Prince Mathew*is supported by the FNRS–DFG Weave project FORM\-LEARN\-POMDP \(Ref\. 40028647\)\. Most of this work was done while*Sayan Mukherjee*was affiliated with Univ Rennes, Inria, CNRS, IRISA, France; he is currently supported by IITB Trust Lab\.*Jean\-François Raskin*is supported by the Foundation ULB and the Fonds Thelam of Roi Baudouin Foundation\.

## References

- \[1\]R\. Alur, S\. Bansal, O\. Bastani, and K\. Jothimurugan\(2022\)A framework for transforming specifications in reinforcement learning\.InPrinciples of Systems Design: Essays Dedicated to Thomas A\. Henzinger on the Occasion of His 60th Birthday,pp\. 604–624\.Cited by:[§3](https://arxiv.org/html/2605.14440#S3.SS0.SSS0.Px1.p4.1)\.
- \[2\]R\. Andriushchenko, A\. Bork, M\. Češka, S\. Junges, J\. Katoen, and F\. Macák\(2023\)Search and explore: symbiotic policy synthesis in pomdps\.InInternational Conference on Computer Aided Verification,pp\. 113–135\.Cited by:[§1\.0\.1](https://arxiv.org/html/2605.14440#S1.SS0.SSS1.p4.3)\.
- \[3\]R\. Andriushchenko, M\. Češka, S\. Junges, J\. Katoen, and Š\. Stupinskỳ\(2021\)PAYNT: a tool for inductive synthesis of probabilistic programs\.InInternational Conference on Computer Aided Verification,pp\. 856–869\.Cited by:[§1\.0\.1](https://arxiv.org/html/2605.14440#S1.SS0.SSS1.p4.3),[§5](https://arxiv.org/html/2605.14440#S5.SS0.SSSx2.p2.2)\.
- \[4\]D\. Angluin\(1987\)Learning regular sets from queries and counterexamples\.Inf\. Comput\.75\(2\),pp\. 87–106\.Cited by:[§1](https://arxiv.org/html/2605.14440#S1.p2.1)\.
- \[5\]C\. Baier and J\. Katoen\(2008\)Principles of model checking\.MIT press\.Cited by:[§2](https://arxiv.org/html/2605.14440#S2.p3.4),[§2](https://arxiv.org/html/2605.14440#S2.p4.14)\.
- \[6\]A\. Bork, D\. Chakraborty, K\. Grover, J\. Křetínský, and S\. Mohr\(2024\)Learning explainable and better performing representations of pomdp strategies\.InInternational Conference on Tools and Algorithms for the Construction and Analysis of Systems,pp\. 299–319\.Cited by:[§1\.0\.1](https://arxiv.org/html/2605.14440#S1.SS0.SSS1.p4.3)\.
- \[7\]A\. Bork, S\. Junges, J\. Katoen, and T\. Quatmann\(2020\)Verification of indefinite\-horizon pomdps\.InInternational Symposium on Automated Technology for Verification and Analysis,pp\. 288–304\.Cited by:[§1\.0\.1](https://arxiv.org/html/2605.14440#S1.SS0.SSS1.p3.1)\.
- \[8\]A\. Bork, J\. Katoen, and T\. Quatmann\(2022\)Under\-approximating expected total rewards in pomdps\.InInternational Conference on Tools and Algorithms for the Construction and Analysis of Systems,pp\. 22–40\.Cited by:[§1\.0\.1](https://arxiv.org/html/2605.14440#S1.SS0.SSS1.p3.1)\.
- \[9\]P\. Bouyer, T\. Brihaye, V\. Bruyère, and J\. Raskin\(2007\)On the optimal reachability problem of weighted timed automata\.Formal Methods Syst\. Des\.31\(2\),pp\. 135–175\.Cited by:[§6](https://arxiv.org/html/2605.14440#S6.p3.1)\.
- \[10\]T\. Brihaye, V\. Bruyère, and J\. Raskin\(2005\)On optimal timed strategies\.InFORMATS,Lecture Notes in Computer Science, Vol\.3829,pp\. 49–64\.Cited by:[§6](https://arxiv.org/html/2605.14440#S6.p3.1)\.
- \[11\]S\. Carr, N\. Jansen, and U\. Topcu\(2021\)Task\-aware verifiable rnn\-based policies for partially observable markov decision processes\.Journal of Artificial Intelligence Research72,pp\. 819–847\.Cited by:[§1\.0\.1](https://arxiv.org/html/2605.14440#S1.SS0.SSS1.p4.3)\.
- \[12\]D\. Chakraborty, A\. Majumdar, P\. Mathew, S\. Mukherjee, and J\. Raskin\(2026\-05\)Artifact for "synthesizing pomdp policies: sampling meets model\-checking via learning"\.Zenodo\.External Links:[Document](https://dx.doi.org/10.5281/zenodo.20084734)Cited by:[§5](https://arxiv.org/html/2605.14440#S5.SS0.SSSx4.p1.1)\.
- \[13\]K\. Chatterjee, L\. Doyen, H\. Gimbert, and T\. A\. Henzinger\(2010\)Randomness for free\.InInternational Symposium on Mathematical Foundations of Computer Science,pp\. 246–257\.Cited by:[footnote 2](https://arxiv.org/html/2605.14440#footnote2)\.
- \[14\]K\. Chatterjee, L\. Doyen, T\. A\. Henzinger, and J\. Raskin\(2007\)Algorithms for omega\-regular games with imperfect information\.Logical Methods in Computer Science3\.Cited by:[Appendix 0\.A](https://arxiv.org/html/2605.14440#Thmproofx2.p1.1)\.
- \[15\]K\. Chatterjee, L\. Doyen, J\. Raskin, and O\. Sankur\(2025\)The value problem for multiple\-environment mdps with parity objective\.InICALP,LIPIcs, Vol\.334,pp\. 150:1–150:17\.Cited by:[§5](https://arxiv.org/html/2605.14440#S5.SS0.SSSx3.Px3.p1.3),[Example 3](https://arxiv.org/html/2605.14440#Thmexample3.p1.15)\.
- \[16\]S\. Drews and L\. D’Antoni\(2017\)Learning symbolic automata\.InTACAS \(1\),Lecture Notes in Computer Science, Vol\.10205,pp\. 173–189\.Cited by:[§6](https://arxiv.org/html/2605.14440#S6.p3.1)\.
- \[17\]H\. Fujinami, M\. Waga, J\. An, K\. Suenaga, N\. Yanagisawa, H\. Iseri, and I\. Hasuo\(2025\)Componentwise automata learning for system integration\.InInternational Symposium on Automated Technology for Verification and Analysis,pp\. 3–26\.Cited by:[§6](https://arxiv.org/html/2605.14440#S6.p3.1)\.
- \[18\]T\. Han and J\. Katoen\(2007\)Counterexamples in probabilistic model checking\.InInternational Conference on Tools and Algorithms for the Construction and Analysis of Systems,pp\. 72–86\.Cited by:[§4](https://arxiv.org/html/2605.14440#S4.SS0.SSS0.Px3.p2.1),[§5](https://arxiv.org/html/2605.14440#S5.SS0.SSSx1.p2.2),[Lemma 1](https://arxiv.org/html/2605.14440#Thmlemma1)\.
- \[19\]A\. Hartmanns and H\. Hermanns\(2014\)The modest toolset: an integrated environment for quantitative modelling and verification\.InTACAS,Lecture Notes in Computer Science, Vol\.8413,pp\. 593–598\.Cited by:[§1](https://arxiv.org/html/2605.14440#S1.p8.4)\.
- \[20\]C\. Hensel, S\. Junges, J\. Katoen, T\. Quatmann, and M\. Volk\(2022\)The probabilistic model checker Storm\.Int\. J\. Softw\. Tools Technol\. Transf\.24\(4\),pp\. 589–610\.Cited by:[§1\.0\.1](https://arxiv.org/html/2605.14440#S1.SS0.SSS1.p3.1),[§1](https://arxiv.org/html/2605.14440#S1.p8.4),[§5](https://arxiv.org/html/2605.14440#S5.SS0.SSSx1.p1.9),[§5](https://arxiv.org/html/2605.14440#S5.SS0.SSSx1.p2.2),[§5](https://arxiv.org/html/2605.14440#S5.SS0.SSSx2.p2.2),[§6](https://arxiv.org/html/2605.14440#S6.p2.1)\.
- \[21\]V\. M\. Jiménez and A\. Marzal\(1999\)Computing the k shortest paths: a new algorithm and an experimental comparison\.InAlgorithm Engineering: 3rd International Workshop, WAE’99 London, UK, July 19–21, 1999 Proceedings 3,pp\. 15–29\.Cited by:[§5](https://arxiv.org/html/2605.14440#S5.SS0.SSSx1.p2.2)\.
- \[22\]S\. Junges, N\. Jansen, and S\. A\. Seshia\(2021\)Enforcing almost\-sure reachability in pomdps\.InInternational Conference on Computer Aided Verification,pp\. 602–625\.Cited by:[§5](https://arxiv.org/html/2605.14440#S5.SS0.SSSx3.Px4.p1.1)\.
- \[23\]L\. P\. Kaelbling\(1995\)Learning policies for partially observable environments: scaling up\.InMachine Learning Proceedings 1995: Proceedings of the Twelfth International Conference on Machine Learning, Tahoe City, California, July 9\-12 1995,pp\. 362\.Cited by:[§5](https://arxiv.org/html/2605.14440#S5.SS0.SSSx3.Px2.p1.3),[§5](https://arxiv.org/html/2605.14440#S5.SS0.SSSx3.Px4.p1.1)\.
- \[24\]L\. Kocsis and C\. Szepesvári\(2006\)Bandit based monte\-carlo planning\.InEuropean conference on machine learning,pp\. 282–293\.Cited by:[§1](https://arxiv.org/html/2605.14440#S1.p2.1)\.
- \[25\]H\. Kurniawati, D\. Hsu, and W\. S\. Lee\(2008\)Sarsop: efficient point\-based pomdp planning by approximating optimally reachable belief spaces\.\.InRobotics: Science and systems,Vol\.2008\.Cited by:[§1\.0\.1](https://arxiv.org/html/2605.14440#S1.SS0.SSS1.p3.1)\.
- \[26\]M\. Z\. Kwiatkowska, G\. Norman, and D\. Parker\(2011\)PRISM 4\.0: verification of probabilistic real\-time systems\.InCAV,Lecture Notes in Computer Science, Vol\.6806,pp\. 585–591\.Cited by:[§1\.0\.1](https://arxiv.org/html/2605.14440#S1.SS0.SSS1.p3.1),[§1](https://arxiv.org/html/2605.14440#S1.p8.4),[§6](https://arxiv.org/html/2605.14440#S6.p2.1)\.
- \[27\]F\. Labbaf, J\. F\. Groote, H\. Hojjat, and M\. R\. Mousavi\(2023\)Compositional learning for interleaving parallel automata\.InInternational Conference on Foundations of Software Science and Computation Structures,pp\. 413–435\.Cited by:[§6](https://arxiv.org/html/2605.14440#S6.p3.1)\.
- \[28\]O\. Madani, S\. Hanks, and A\. Condon\(1999\)On the undecidability of probabilistic planning and infinite\-horizon partially observable markov decision problems\.AAAI10\(315149\.315395\)\.Cited by:[§1](https://arxiv.org/html/2605.14440#S1.p2.1),[§3](https://arxiv.org/html/2605.14440#S3.SS0.SSS0.Px1.p4.1),[§3](https://arxiv.org/html/2605.14440#S3.p2.1)\.
- \[29\]R\. A\. McCallum\(1992\)First results with utile distinction memory for reinforcement learning\.Cited by:[§5](https://arxiv.org/html/2605.14440#S5.SS0.SSSx3.Px4.p1.1)\.
- \[30\]E\. Muškardin, B\. K\. Aichernig, I\. Pill, A\. Pferscher, and M\. Tappler\(2022\)AALpy: an active automata learning library\.Innovations in Systems and Software Engineering18\(3\),pp\. 417–426\.Cited by:[§5](https://arxiv.org/html/2605.14440#S5.SS0.SSSx1.p1.9)\.
- \[31\]T\. Neele and M\. Sammartino\(2023\)Compositional automata learning of synchronous systems\.InInternational Conference on Fundamental Approaches to Software Engineering,pp\. 47–66\.Cited by:[§6](https://arxiv.org/html/2605.14440#S6.p3.1)\.
- \[32\]G\. Norman, D\. Parker, and X\. Zou\(2017\)Verification and control of partially observable probabilistic systems\.Real\-Time Systems53,pp\. 354–402\.Cited by:[§1\.0\.1](https://arxiv.org/html/2605.14440#S1.SS0.SSS1.p3.1)\.
- \[33\]D\. A\. Peled, M\. Y\. Vardi, and M\. Yannakakis\(2002\)Black box checking\.J\. Autom\. Lang\. Comb\.7\(2\),pp\. 225–246\.Cited by:[§1\.0\.1](https://arxiv.org/html/2605.14440#S1.SS0.SSS1.p5.1)\.
- \[34\]J\. Pineau, G\. Gordon, S\. Thrun,et al\.\(2003\)Point\-based value iteration: an anytime algorithm for pomdps\.InIjcai,Vol\.3,pp\. 1025–1032\.Cited by:[§1\.0\.1](https://arxiv.org/html/2605.14440#S1.SS0.SSS1.p3.1)\.
- \[35\]R\. L\. Rivest and R\. E\. Schapire\(1993\)Inference of finite automata using homing sequences\.Inf\. Comput\.103\(2\),pp\. 299–347\.Cited by:[§4](https://arxiv.org/html/2605.14440#S4.SS0.SSS0.Px4.p1.23),[footnote 4](https://arxiv.org/html/2605.14440#footnote4)\.
- \[36\]M\. Shahbaz and R\. Groz\(2009\)Inferring mealy machines\.InInternational Symposium on Formal Methods,pp\. 207–222\.Cited by:[§4](https://arxiv.org/html/2605.14440#S4.p2.9),[Appendix 0\.B](https://arxiv.org/html/2605.14440#Thmproofx6.p1.21)\.
- \[37\]J\. Shijubo, M\. Waga, and K\. Suenaga\(2021\)Efficient black\-box checking via model checking with strengthened specifications\.InRV,Lecture Notes in Computer Science,pp\. 100–120\.Cited by:[§1\.0\.1](https://arxiv.org/html/2605.14440#S1.SS0.SSS1.p5.1)\.
- \[38\]D\. Silver and J\. Veness\(2010\)Monte\-carlo planning in large pomdps\.Advances in neural information processing systems23\.Cited by:[§1\.0\.1](https://arxiv.org/html/2605.14440#S1.SS0.SSS1.p2.1)\.
- \[39\]T\. Smith and R\. Simmons\(2012\)Heuristic search value iteration for pomdps\.arXiv preprint arXiv:1207\.4166\.Cited by:[§5](https://arxiv.org/html/2605.14440#S5.SS0.SSSx3.Px4.p1.1)\.
- \[40\]A\. Somani, N\. Ye, D\. Hsu, and W\. S\. Lee\(2013\)DESPOT: online pomdp planning with regularization\.Advances in neural information processing systems26\.Cited by:[§1\.0\.1](https://arxiv.org/html/2605.14440#S1.SS0.SSS1.p2.1),[§1](https://arxiv.org/html/2605.14440#S1.p8.4),[§5](https://arxiv.org/html/2605.14440#S5.SS0.SSSx1.p1.9)\.
- \[41\]M\. T\. Spaan and N\. Vlassis\(2005\)Perseus: randomized point\-based value iteration for pomdps\.Journal of artificial intelligence research24,pp\. 195–220\.Cited by:[§1\.0\.1](https://arxiv.org/html/2605.14440#S1.SS0.SSS1.p3.1)\.
- \[42\]B\. Wu, X\. Zhang, and H\. Lin\(2021\)Supervisor synthesis of POMDP via automata learning\.Autom\.129,pp\. 109654\.Cited by:[§1\.0\.1](https://arxiv.org/html/2605.14440#S1.SS0.SSS1.p1.1)\.

## Appendix 0\.AMissing Proofs from[Section˜3](https://arxiv.org/html/2605.14440#S3)

###### Proof

Winning a safety objective with probability11can only be achieved by a policy that entirely avoids bad states\. Indeed, if any bad state is reached, it is reached after a finite execution prefix, and this prefix has non\-zero probability, which would contradict the fact that the policy is winning almost surely\. Thus, the stochastic nodes can be viewed as being controlled by an adversary in a zero\-sum two\-player game with partial observation\. We know that in such games, finite memory \(exponential in the worst case\) suffices\[[14](https://arxiv.org/html/2605.14440#bib.bib4)\]\.∎

###### Proof

Without loss of any generality, assume that we have a single sink state with unique observation𝖡𝖺𝖽\{\\sf Bad\}, and letb𝖡𝖺𝖽b\_\{\\sf Bad\}be the belief that assigns probability11to the bad sink state\.

Consider any beliefb∉Wb\\notin W\. By definition ofWW, for any policyσ\\sigma, we haveℙℬ𝒫σ\(□¬𝖡𝖺𝖽∣initial beliefb\)<1\\mathbb\{P\}^\{\\sigma\}\_\{\\mathcal\{B\}\_\{\\mathcal\{P\}\}\}\(\\Box\\neg\{\\sf Bad\}\\mid\\text\{initial belief \}b\)<1\. Then, frombbunder policyσ\\sigma, there must exist a finite path of size≤\|S\|\\leq\|S\|frombbtob𝖡𝖺𝖽b\_\{\\sf Bad\}\(this path corresponds to the simple path that starts from a state in the support ofbband reaches the bad sink state\)\. Thus, for allb∈Wb\\in W, we haveℙℬ𝒫σ\(◇≤\|S\|𝖡𝖺𝖽∣initial beliefb\)≥η\\mathbb\{P\}^\{\\sigma\}\_\{\\mathcal\{B\}\_\{\\mathcal\{P\}\}\}\(\\Diamond^\{\\leq\|S\|\}\{\\sf Bad\}\\mid\\text\{initial belief \}b\)\\geq\\eta, whereη=\(pmin\)\|S\|\\eta=\(p\_\{\\min\}\)^\{\|S\|\}, wherepminp\_\{\\min\}is the minimum non\-zero transition probability in the underlying MDP\. Therefore, for any beliefb∉Wb\\notin W, we haveℙℬ𝒫σ\(□≤\|S\|¬𝖡𝖺𝖽\)<1−η\\mathbb\{P\}^\{\\sigma\}\_\{\\mathcal\{B\}\_\{\\mathcal\{P\}\}\}\(\\Box^\{\\leq\|S\|\}\\neg\{\\sf Bad\}\)<1\-\\eta\.

Now if the initial belief is inWW, thenℙℬ𝒫σ\(□¬𝖡𝖺𝖽∧□¬W\)=0\\mathbb\{P\}^\{\\sigma\}\_\{\\mathcal\{B\}\_\{\\mathcal\{P\}\}\}\(\\Box\\neg\{\\sf Bad\}\\wedge\\Box\\neg W\)=0trivially holds\. Otherwise, consider the probability of staying safe and not reachingWWwithink⋅\|S\|k\\cdot\|S\|steps, for somek∈ℕk\\in\\mathbb\{N\}:ℙℬ𝒫σ\(□≤k⋅\|S\|¬𝖡𝖺𝖽∧□≤k⋅\|S\|¬W\)<\(1−η\)k\.\\mathbb\{P\}^\{\\sigma\}\_\{\\mathcal\{B\}\_\{\\mathcal\{P\}\}\}\(\\Box^\{\\leq k\\cdot\|S\|\}\\neg\{\\sf Bad\}\\wedge\\Box^\{\\leq k\\cdot\|S\|\}\\neg W\)<\(1\-\\eta\)^\{k\}\\,\.Now, ifkktends to infinity, we haveℙℬ𝒫σ\(□¬𝖡𝖺𝖽∧□¬W\)=0\\mathbb\{P\}^\{\\sigma\}\_\{\\mathcal\{B\}\_\{\\mathcal\{P\}\}\}\(\\Box\\neg\{\\sf Bad\}\\wedge\\Box\\neg W\)=0\.∎

###### Proof

1. 1\.Recall that, for POMDPs𝒫\\mathcal\{P\},𝒫r\\mathcal\{P\}^\{r\}, ℙ𝒫σ\(□¬𝖡𝖺𝖽\)=1\+limλ→1𝔼𝒫rσ\(Rewλ\)=limλ→1\(1\+𝔼𝒫rσ\(Rewλ\)\)\\mathbb\{P\}\_\{\\mathcal\{P\}\}^\{\\sigma\}\(\\Box\\neg\{\\sf Bad\}\)=1\+\\lim\_\{\\lambda\\rightarrow 1\}\\mathbb\{E\}\_\{\\mathcal\{P\}^\{r\}\}^\{\\sigma\}\(\{\\sf\\textsf\{Rew\}^\{\\lambda\}\}\)=\\lim\_\{\\lambda\\rightarrow 1\}\(1\+\\mathbb\{E\}\_\{\\mathcal\{P\}^\{r\}\}^\{\\sigma\}\(\{\\sf\\textsf\{Rew\}^\{\\lambda\}\}\)\)From the above equation, we get that asλ→1\\lambda\\rightarrow 1,\(1\+𝔼𝒫rσ\(Rewλ\)\)→ℙMσ\(□¬𝖡𝖺𝖽\)\(1\+\\mathbb\{E\}^\{\\sigma\}\_\{\\mathcal\{P\}^\{r\}\}\(\{\\textsf\{Rew\}^\{\\lambda\}\}\)\)\\rightarrow\\mathbb\{P\}^\{\\sigma\}\_\{M\}\(\\Box\\neg\{\\sf Bad\}\)\. Then, we can deduce that for everyε\>0\\varepsilon\>0, there exists someδ≥0\\delta\\geq 0such that whenever\|1−λ\|<δ\|1\-\\lambda\|<\\delta,\|\(1\+𝔼𝒫rσ\(Rewλ\)\)−\(ℙ𝒫σ\(□¬𝖡𝖺𝖽\)\)\|<ε\|\(1\+\\mathbb\{E\}^\{\\sigma\}\_\{\\mathcal\{P\}^\{r\}\}\(\{\\textsf\{Rew\}^\{\\lambda\}\}\)\)\-\(\\mathbb\{P\}^\{\\sigma\}\_\{\\mathcal\{P\}\}\(\\Box\\neg\{\\sf Bad\}\)\)\|<\\varepsilon\. Therefore, for everyε\>0\\varepsilon\>0, we can always choose a discount factorλ\\lambda*sufficiently*\-close to11, for which\|\(1\+𝔼𝒫rσ\(Rewλ\)\)−\(ℙ𝒫σ\(□¬𝖡𝖺𝖽\)\)\|<ε\|\(1\+\\mathbb\{E\}^\{\\sigma\}\_\{\\mathcal\{P\}^\{r\}\}\(\{\\textsf\{Rew\}^\{\\lambda\}\}\)\)\-\(\\mathbb\{P\}^\{\\sigma\}\_\{\\mathcal\{P\}\}\(\\Box\\neg\{\\sf Bad\}\)\)\|<\\varepsilonholds\. This implies that if\(1\+𝔼𝒫rσ\(Rewλ\)\)\>α\(1\+\\mathbb\{E\}^\{\\sigma\}\_\{\\mathcal\{P\}^\{r\}\}\(\{\\textsf\{Rew\}^\{\\lambda\}\}\)\)\>\\alphafor someα\\alpha, thenℙ𝒫σ\(□¬𝖡𝖺𝖽\)\>α−ε\\mathbb\{P\}^\{\\sigma\}\_\{\\mathcal\{P\}\}\(\\Box\\neg\{\\sf Bad\}\)\>\\alpha\-\\varepsilon\.
2. 2\.The proof of the second statement is analogous\.∎

## Appendix 0\.BMissing Proofs from[Section˜4](https://arxiv.org/html/2605.14440#S4)

###### Proof

Towards a contradiction, assume that there exists a policyσ∈𝒢\\sigma\\in\\mathcal\{G\}such that for everyρ∈ℛ\\rho\\in\\mathcal\{R\}, we haveρ⊧σ\\rho\\models\\sigma\. Sinceℛ\\mathcal\{R\}is a counterexample, we have𝖢𝗒𝗅\(ℛ\)⊆◇𝖡𝖺𝖽\\mathsf\{Cyl\}\(\\mathcal\{R\}\)\\subseteq\\Diamond\\mathsf\{Bad\}andℙ𝒫×ℋ\(𝖢𝗒𝗅\(ℛ\)\)\>1−α\\mathbb\{P\}\_\{\\mathcal\{P\}\\times\\mathcal\{H\}\}\(\\mathsf\{Cyl\}\(\\mathcal\{R\}\)\)\>1\-\\alpha\. This impliesℙ𝒫σ\(◇𝖡𝖺𝖽\)\>1−α\\mathbb\{P\}\_\{\\mathcal\{P\}\}^\{\\sigma\}\(\\Diamond\\mathsf\{Bad\}\)\>1\-\\alpha, and henceℙ𝒫σ\(□¬𝖡𝖺𝖽\)≤α\\mathbb\{P\}\_\{\\mathcal\{P\}\}^\{\\sigma\}\(\\Box\\neg\\mathsf\{Bad\}\)\\leq\\alpha\. This contradictsσ∈𝒢\\sigma\\in\\mathcal\{G\}\.

###### Proof

Correctness trivially follows since ifCplusterminates, then it does so with a successful model\-checking query\. For the relative completeness, letℋσ\\mathcal\{H\}\_\{\\sigma\}be the FSC corresponding toσ\\sigmaaccording to whichDespotis suggesting actions\. Since action queries are performed according toℋσ\\mathcal\{H\}\_\{\\sigma\}, the observation table𝒯\\mathcal\{T\}maintained byCplusis always consistent withℋσ\\mathcal\{H\}\_\{\\sigma\}\. Now, letℋi\\mathcal\{H\}\_\{i\}be the hypothesis generated at theithi^\{th\}iteration ofCplus, and letℛ\\mathcal\{R\}be the counterexample produced byStormat that iteration\. Note that, by[Problem˜2](https://arxiv.org/html/2605.14440#Thmproblem2),ℋσ\\mathcal\{H\}\_\{\\sigma\}satisfies the safety specification in𝒫\\mathcal\{P\}\. Therefore, by[Section˜4](https://arxiv.org/html/2605.14440#S4.SS0.SSS0.Px4), there existsρ∈ℛ\\rho\\in\\mathcal\{R\}such thatρ⊧̸σ\\rho\\not\\models\\sigma\. Now, sinceρ⊧σℋi\\rho\\models\\sigma\_\{\\mathcal\{H\}\_\{i\}\},ρ\\rhomust have a finite prefixρh\\rho\_\{h\}such thatσℋi\(ρh\)≠σ\(ρh\)\\sigma\_\{\\mathcal\{H\}\_\{i\}\}\(\\rho\_\{h\}\)\\neq\\sigma\(\\rho\_\{h\}\)\. Since𝒯\\mathcal\{T\}is consistent withℋσ\\mathcal\{H\}\_\{\\sigma\}, addingρh\\rho\_\{h\}to𝒯\\mathcal\{T\}rules out this counterexample from subsequent model\-checking queries\. The termination ofCplusthen follows from the termination ofL∗L^\{\*\}for Mealy machines\[[36](https://arxiv.org/html/2605.14440#bib.bib30)\]\. ∎

Model\|S\|\|S\|\|Z\|\|Z\|α\\alphaCplus\(D,S\)\\textsc\{Cplus\}\(\\textsc\{D\},\\textsc\{S\}\)Cplus\(S,S\)\\textsc\{Cplus\}\(\\textsc\{S\},\\textsc\{S\}\)PayntStorm\|ℋ\|\|\\mathcal\{H\}\|itertime\|ℋ\|\|\\mathcal\{H\}\|itertime\|ℋ\|\|\\mathcal\{H\}\|time\|ℬ\|\|\\mathcal\{B\}\|valuetimehallway1630\.9925378\.1124349\.3251\.4431\.000\.01hallway\-22430\.9920468\.9920343\.41\-\-31\.000\.01hallway\-simple\-5015440\.99532138\.7753222\.61\-\-1561\.001\.00cards\-removed\-2\-bounded960\.5411\.59411\.4810\.01101\.000\.01cards\-removed\-3\-bounded76100\.5712\.102024\.0410\.021030\.971\.00cards\-removed\-4\-bounded165120\.548212\.282927\.4220\.163220\.912\.00cards\-removed\-5\-bounded306140\.5118244\.7447216\.13\-\-9190\.858\.00cards\-added\-2\-bounded3580\.5711\.93711\.7710\.01400\.740\.01cards\-added\-3\-bounded94100\.529411\.8021510\.6620\.033200\.621\.00cards\-added\-4\-bounded197120\.53569224\.8710819184\.11\-\-25260\.560\.01cards\-added\-5\-bounded356140\.5\-\-\-\-\-\-\-\-196030\.5193\.00cards\-removed\-2\-unbounded1390\.5612\.74612\.3610\.01110\.950\.010\.2612\.77612\.3410\.01cards\-removed\-3\-unbounded25110\.51518\.291512\.4922\.33320\.850\.010\.21518\.301512\.4210\.02cards\-removed\-4\-unbounded41130\.527113\.672114\.32\-\-790\.740\.010\.227113\.682114\.3510\.18cards\-removed\-5\-unbounded61150\.53374545\.203117\.75\-\-1900\.640\.010\.240128\.203117\.8510\.01cards\-removed\-6\-unbounded85170\.5\-\-\-63239\.79\-\-4450\.553\.000\.234136\.2743114\.922188\.56cards\-removed\-7\-unbounded113190\.5\-\-\-\-\-\-\-\-10200\.477\.000\.22172480\.3158159\.55\-\-cards\-removed\-8\-unbounded145210\.5\-\-\-\-\-\-\-\-22990\.4018\.000\.2\-\-\-60188\.03\-\-cards\-added\-2\-unbounded1790\.5615\.11712\.1010\.014170\.781\.000\.2615\.11712\.1310\.01cards\-added\-3\-unbounded31110\.51218\.7415211\.64\-\-479080\.64183\.000\.21218\.751115\.0510\.04cards\-added\-4\-unbounded49130\.5\-\-\-\-\-\-\-\-44029820\.5428\.880\.217110\.031918\.8910\.44cards\-added\-5\-unbounded71150\.5\-\-\-\-\-\-\-\-108019950\.47217\.660\.228123\.4328120\.1812\.28cards\-added\-6\-unbounded97170\.5\-\-\-\-\-\-\-\-95076800\.40161\.720\.239145\.8938139\.6810\.01cards\-added\-7\-unbounded127190\.5\-\-\-\-\-\-\-\-95838310\.32159\.360\.253186\.3452176\.7510\.01cards\-added\-8\-unbounded161210\.5\-\-\-\-\-\-\-\-92369730\.25152\.180\.2681148\.85681141\.14\-\-cheese\-maze1580\.51417\.951212\.8010\.01172329270\.99231\.640\.81416\.461212\.8010\.01cheese\-maze\-det1580\.5916\.11911\.5220\.01221\.000\.010\.8916\.10911\.5220\.01refuel\(N=6,E=8N=6,E=8\)270360\.5314\.97311\.4610\.07264236720\.72288\.550\.8\-\-\-\-\-\-10\.07rocks\(N=4N=4\)332660\.512232\.6226115\.6010\.054851\.007\.000\.8\-\-\-28234\.5610\.05Table 6:We report here various statistics regarding three tools:Cplus\(D,S\),Cplus\(S,S\),Storm, andPaynt\. Bold entries indicate the best\-performing tool \(in terms of time\) among those that produce verified FSCs\.
## Appendix 0\.CExample: Cards

o0o\_\{0\}011223301122330112233Back to0Back to0Back to013\\frac\{1\}\{3\}13\\frac\{1\}\{3\}13\\frac\{1\}\{3\}12\\frac\{1\}\{2\}12\\frac\{1\}\{2\}12\\frac\{1\}\{2\}12\\frac\{1\}\{2\}12\\frac\{1\}\{2\}12\\frac\{1\}\{2\}

\(a\)
−/draw\-/draw1/draw1/draw2/draw2/draw1/draw1/draw2/draw2/draw3/draw3/draw3/draw3/draw2/guess32/guess\_\{3\}3/guess23/guess\_\{2\}1/guess31/guess\_\{3\}3/guess13/guess\_\{1\}1/guess21/guess\_\{2\}2/guess12/guess\_\{1\}−/draw\-/draw

\(b\)

Figure 5:A POMDP representing[Example˜3](https://arxiv.org/html/2605.14440#Thmexample3)is depicted in[Fig\.˜5\(a\)](https://arxiv.org/html/2605.14440#Pt0.A3.F5.sf1)\. Labels on states denote their corresponding observations\. Only the transitions for the𝖽𝗋𝖺𝗐\{\\sf draw\}action from each state are depicted\. The arrows from the terminal states in each block represent transitions back to the observation0of the corresponding block\.[Fig\.˜5\(b\)](https://arxiv.org/html/2605.14440#Pt0.A3.F5.sf2)represents a winning policy for the threshold\-safety policy for this POMDP\.###### Example 3\(Removing Cards\-33\)

Consider the following card example, adapted from\[[15](https://arxiv.org/html/2605.14440#bib.bib34)\], depicted in[Figure˜5\(a\)](https://arxiv.org/html/2605.14440#Pt0.A3.F5.sf1)\. We have a33\-card deck, numbered1,2,31,2,3, from which one card, initially chosen at random with probability13\\frac\{1\}\{3\}, is missing\. The possible actions are:𝖽𝗋𝖺𝗐\{\\sf draw\}, and\(𝗀𝗎𝖾𝗌𝗌i\)i∈\{1,2,3\}\(\{\\sf guess\}\_\{i\}\)\_\{i\\in\\\{1,2,3\\\}\}\. Only the transitions corresponding to the𝖽𝗋𝖺𝗐\{\\sf draw\}action from each state are depicted in the figure\. The observations are: the numbers on the cards, the initial observationo0o\_\{0\}, and a dummy observation0\. Thus, the agent does not directly observe which block \(*i\.e\.*, which missing\-card configuration\) it is in, but instead observes the cards drawn \(at random\) from the deck\. From the bottom of each block, if the𝖽𝗋𝖺𝗐\{\\sf draw\}action is chosen, the system returns to the root of the corresponding block \(represented as “Back to0”\)\. In theii\-th block, from the0observation, on action𝖽𝗋𝖺𝗐\{\\sf draw\}, there is a transition to each of the present cards with probability12\\frac\{1\}\{2\}\. When the agent becomes confident about which card is missing, it can play the corresponding𝗀𝗎𝖾𝗌𝗌i\{\\sf guess\}\_\{i\}\. Guessing the correct card leads to a winning state, while choosing it incorrectly leads to a separate sink state \(these two states are not shown in the figure\)\. The objective of the agent is to correctly guess the missing card\.

Figure[5\(b\)](https://arxiv.org/html/2605.14440#Pt0.A3.F5.sf2)depicts a policy \(as an FSC\) for Example[3](https://arxiv.org/html/2605.14440#Thmexample3)\. After the first card is drawn \(and hence the block is determined\) this policy waits \(by repeatedly suggesting to𝖽𝗋𝖺𝗐\{\\sf draw\}\) while the same card is shown\. As soon as it observes a different card it can then correctly guess the missing card\. Note that, this policy wins the game with probability11\.
Synthesizing POMDP Policies: Sampling Meets Model-checking via Learning

Similar Articles

Learning POMDP World Models from Observations with Language-Model Priors

Theoretical Foundations and Effective Algorithms for Policy-Aware Simulator Learning

Smaller Abstract State Spaces Enable Cross-Scale Generalization in Reinforcement Learning

Robust Shielding for Safe Reinforcement Learning

Reducing the Safety Tax in LLM Safety Alignment with On-Policy Self-Distillation

Submit Feedback

Similar Articles

Learning POMDP World Models from Observations with Language-Model Priors
Theoretical Foundations and Effective Algorithms for Policy-Aware Simulator Learning
Smaller Abstract State Spaces Enable Cross-Scale Generalization in Reinforcement Learning
Robust Shielding for Safe Reinforcement Learning
Reducing the Safety Tax in LLM Safety Alignment with On-Policy Self-Distillation