Answer-Set-Programming-based Abstractions for Reinforcement Learning
Summary
This paper presents an Answer Set Programming (ASP) based implementation of the CARCASS framework for constructing abstractions in reinforcement learning, demonstrating its effectiveness on Blocks World and Minigrid domains.
View Cached Full Text
Cached at: 06/01/26, 09:27 AM
# Answer-Set-Programming-based Abstractions for Reinforcement Learning
Source: [https://arxiv.org/html/2605.31444](https://arxiv.org/html/2605.31444)
###### Abstract
Reinforcement Learning \(RL\) enables autonomous agents to learn policies from experience, but realistic problems often involve enormous state spaces, making learning and generalisation challenging\. Abstraction and approximation are therefore essential\. Relational Reinforcement Learning \(RRL\) offers a way to reason about objects and their relations, and the CARCASS framework by Martijn van Otterlo demonstrates how logical representations can model Markov Decision Processes \(MDPs\) in first\-order domains\. Originally implemented in Prolog, CARCASS leverages domain knowledge to create powerful abstractions\. We explore Answer\-Set Programming \(ASP\), which is a rich and, contrary to Prolog, fully declarative modelling language, to realise CARCASS abstractions\. We evaluate our ASP\-based implementation in case studies of two domains, viz\. Blocks World and Minigrid\. Our results indicate that CARCASS with ASP provides a promising approach to constructing abstractions for RL, especially when domain knowledge is available\.111Our implementation is available at[https://github\.com/rbankosegger/RLASP\-core](https://github.com/rbankosegger/RLASP-core)\. Further material \(data, encodings, extended documentation\) can be found here:[https://www\.bankosegger\.at/iclp26/](https://www.bankosegger.at/iclp26/)\.
###### keywords:
Relational Reinforcement Learning, ASP, State\-Action\-Space Abstractions
## 1Introduction
*Reinforcement Learning*\(RL\)\(Sutton and Barto[2018](https://arxiv.org/html/2605.31444#bib.bib37)\)has emerged as a central paradigm in AI, providing a principled framework for agents to learn behaviour through interaction with an environment\. By formalising sequential decision\-making as a Markov Decision Process \(MDP\), RL enables agents to improve their behaviour by trial and error, guided by the maximisation of cumulative reward\. Although this paradigm has produced impressive results with significant practical applications, a challenge that remains is that realistic problem settings often involve enormous state and action spaces, which render naive learning approaches intractable\. Effective techniques for coping with this curse of dimensionality, such as abstraction and approximation, are therefore indispensable\.
Approximations, e\.g\., Deep Q\-Learning\(Mnihet al\.[2015](https://arxiv.org/html/2605.31444#bib.bib130)\), aim at learning compact representations of expected long\-term rewards, typically using deep neural networks, thereby enabling generalisation across large or continuous state spaces\. In practice, however, this often requires large amounts of training data, and learning can be unstable\(Sutton and Barto[2018](https://arxiv.org/html/2605.31444#bib.bib37), Chapter 11\.3\)\.
*Relational Reinforcement Learning*\(RRL\) addresses some of these limitations by using first\-order logic to introduce powerful abstractions for reasoning about complex problem domains\. The key idea is to extend representations to a first\-order setting, allowing environments to be described naturally in terms of objects and their relations\(van Otterlo[2005](https://arxiv.org/html/2605.31444#bib.bib103)\)\. A major advantage of this approach is that it supports generalisation across similar states and can even transfer learned knowledge to related tasks\.
One particular approach to RRL, rooted in logic programming, is the CARCASS framework \(short for*Compact Abstraction using Relational Conjunctions for Aggregation of State\-action Spaces*\) introduced by Martijn van Otterlo \([2004](https://arxiv.org/html/2605.31444#bib.bib102)\)\. CARCASS aims at abstracting the state\-action space by representing abstract states as conjunctions of first\-order literals, possibly enriched with background knowledge, and associates them with sets of admissible actions\. An abstraction is thus an ordered list of rules of the form*State*→\\rightarrow\{Action1, …, Actionn\}, each capturing a situation described by the state and the available actions\. For example, the rule
clear\(X\),clear\(Y\),on\(Y,0\)→\{move\(X,Y\),move\(Y,X\)\}\\textbf\{clear\}\(X\),\\,\\textbf\{clear\}\(Y\),\\,\\textbf\{on\}\(Y,0\)\\;\\;\\rightarrow\\;\\;\\\{\\textbf\{move\}\(X,Y\),\\,\\textbf\{move\}\(Y,X\)\\\}represents an abstract blocks world state in which two blocksXXandYYare clear and can be stacked on one another\. A*policy*\(what action to choose when in a given state\) can then be learned via, e\.g\.,QQ\-learning\(Watkins and Dayan[1992](https://arxiv.org/html/2605.31444#bib.bib125)\)over the abstract state\-action space\. One of CARCASS’s key advantages is its ability to leverage domain knowledge in the abstraction rules, thereby reducing the effective size of the state–action space and supporting generalisation across related situations\. The framework was realised in Prolog, making use of SLDNF\-resolution to handle first\-order reasoning over states and actions\(van Otterlo[2009](https://arxiv.org/html/2605.31444#bib.bib113)\)\.
Building on this, we explore Answer\-Set Programming \(ASP\)\(Brewkaet al\.[2011](https://arxiv.org/html/2605.31444#bib.bib88)\)as a rich and fully declarative modelling language for revisiting CARCASS\.222Pure Prolog is declarative in principle, but in practice many implementations rely on procedural features such as negation as failure, cuts, or the ordering of rules to achieve efficiency\.ASP is a logic\-based knowledge representation and reasoning framework, featuring a concise yet highly expressive modelling language\. Unlike Prolog, the meaning of an ASP program is independent of the ordering of rules and of literals in rule bodies\. It supports nonmonotonic inference, reasoning under incomplete knowledge, expressing preferences and optimisation, as well as modelling actions and change, making it particularly suitable for describing state abstractions in RRL\.
Reimplementing CARCASS in ASP highlights several practical advantages\. Admissible states, transitions, and optimisation criteria can be specified declaratively at a high level, without explicit procedural search strategies\. Defaults and integrity constraints are supported naturally, allowing concise and robust encodings that would otherwise require more algorithmic, control\-intensive implementations in Prolog\. This also enables direct representation of complex reasoning patterns, including partial observability, non\-determinism, and preference\-based selection, with domain constraints expressed directly in the rules\.
In this work, we introduce a general method for encoding abstractions in ASP, which we evaluate using two case studies\. The first considers Blocks World, a classical planning problem that involves stacking blocks to reach a desired configuration, and the second MiniGrid, a suite of grid world navigation tasks involving different subtasks such as key collection and door opening to reach a goal\. Both domains have state spaces that are infeasibly large without abstraction\.
Our main contribution can thus be summarised as follows:
1. 1\.we introduce a general method for encoding CARCASS abstractions in ASP, showing how a fully declarative and expressive modelling language can realise relational state–action abstractions;
2. 2\.furthermore, we show how to use ASP\-based CARCASS abstractions for online learning; and
3. 3\.we evaluate our ASP\-based implementation in two case studies: Blocks World and MiniGrid\. The experiments show that, usingQQ\-learning on the abstract representations, high\-quality policies can be learned consistently\. These policies can also be obtained with significantly fewer samples than without the abstraction\.
In conclusion, our results indicate that CARCASS with ASP provides a promising approach to constructing abstractions for RL, in particular when domain knowledge is available\.
The rest of this paper is organised as follows\. After preliminaries in Section[2](https://arxiv.org/html/2605.31444#S2), we present in Section[3](https://arxiv.org/html/2605.31444#S3)our ASP\-based approach for encoding CARCASS abstractions\. Sections[4](https://arxiv.org/html/2605.31444#S4)and[5](https://arxiv.org/html/2605.31444#S5)are devoted to the case studies, followed by an empirical evaluation with setup, results, and discussion in Section[6](https://arxiv.org/html/2605.31444#S6), which also touches modelling differences between ASP and Prolog\. In Section[7](https://arxiv.org/html/2605.31444#S7), we review related work, and in Section[8](https://arxiv.org/html/2605.31444#S8), we conclude with directions for future work\.
## 2Preliminaries
We next review relevant concepts from logic programming \(SLDNF\-resolution and ASP\), reinforcement learning \(relational MDP andQQ\-learning\), and of the CARCASS framework\.
### 2\.1Logic Programming
We assume a first\-order language with constantscc, functional termsf\(t¯\)f\(\\bar\{t\}\), and predicate atomsp\(t¯\)\\textbf\{p\}\(\\bar\{t\}\)over a PL1\-signatureΣ=\(Func,Pred\)\\Sigma=\(\\textit\{Func\},\\textit\{Pred\}\)and a setVar=\{V1,…,Vm\}\\textit\{Var\}=\\\{V\_\{1\},\\ldots,V\_\{m\}\\\}of variables, wheret¯=t1,…,tn\\bar\{t\}=t\_\{1\},\\ldots,t\_\{n\}is a list of terms matching the arity offf, resp\.,pp\.
A*naf\-literal*is an atomp\(t¯\)\\textbf\{p\}\(\\bar\{t\}\)or an expressionnotp\(t¯\)\\textit\{not\}\\ \\textbf\{p\}\(\\bar\{t\}\)with negation as failure\. A*normal rule*is of the formh←b1,…,bn\.h~\\leftarrow~b\_\{1\},\\ldots,b\_\{n\}\., wherehhis an atom andb1,…,bnb\_\{1\},\\ldots,b\_\{n\}are naf\-literals\. A*normal program*P=\{r1,…,rn\}P=\\\{r\_\{1\},\\ldots,r\_\{n\}\\\}is a set of normal rules\. The application of a*substitution*θ:Var→Term\\theta:\\textit\{Var\}\\to\\textit\{Term\}from a set of variables to a set of terms is defined as usual\. Syntactic objects without variables are called*ground*\. Given a programPPand a*normal goal*←b1,…,bn\.\\leftarrow~b\_\{1\},\\ldots,b\_\{n\}\., an*SLDNF\-refutation*ofP∪\{←b1,…,bn\.\}P\\cup\\\{\\leftarrow~b\_\{1\},\\ldots,b\_\{n\}\.\\\}is denoted byP⊢SLDNFb1,…,bnP\\vdash\_\{\\text\{SLDNF\}\}b\_\{1\},\\ldots,b\_\{n\}\(Nienhuys\-Cheng and de Wolf[1997](https://arxiv.org/html/2605.31444#bib.bib83), Chapter 8\)\.
Based on a signatureΣ=\(Func,Pred\)\\Sigma=\(\\textit\{Func\},\\textit\{Pred\}\)we denote by𝒰\(Σ\)\\mathcal\{U\}\(\\Sigma\)the Herbrand universe overFunc, byℬ\(Σ\)\\mathcal\{B\}\(\\Sigma\)the Herbrand base over𝒰\(Σ\)\\mathcal\{U\}\(\\Sigma\)andPred, and byℐ\(Σ\)=2ℬ\(Σ\)\\mathcal\{I\}\(\\Sigma\)=2^\{\\mathcal\{B\}\(\\Sigma\)\}the set of Herbrand interpretations\. In an interpretationI∈ℐ\(Σ\)I\\in\\mathcal\{I\}\(\\Sigma\), an atomp\(t¯\)\\textbf\{p\}\(\\bar\{t\}\)is*true*ifp\(t¯\)∈I\\textbf\{p\}\(\\bar\{t\}\)\\in Iand*false*ifp\(t¯\)∉I\\textbf\{p\}\(\\bar\{t\}\)\\not\\in I\.
We consider ASP programs as defined in the*ASP\-Core\-2 input language format*\(Calimeriet al\.[2020](https://arxiv.org/html/2605.31444#bib.bib35)\), with the usual features such as strong negation, disjunctive rule heads, and optimisation\. In particular, our encodings make use of choice rules, aggregates, and weak constraints\. The set of all answer sets of a programPPis denoted by𝒜𝒮\(P\)\\mathcal\{AS\}\(P\)\.
### 2\.2Reinforcement Learning
Reinforcement Learning\(Sutton and Barto[2018](https://arxiv.org/html/2605.31444#bib.bib37)\)is a discrete\-time, stochastic control process governed by the dynamics of a*task environment*and by the actions of a learning*agent*\. At every time pointtt, the agent perceives the current*state*𝖲t\\mathsf\{S\}\_\{t\}of the environment and takes an*action*𝖠t\\mathsf\{A\}\_\{t\}\. The environment transitions to a new state𝖲t\+1\\mathsf\{S\}\_\{t\+1\}based on the effects of the action and a*reward*𝖱t\+1\\mathsf\{R\}\_\{t\+1\}is obtained\. The result is a history of interactions𝖧=\(𝖲0,𝖠0,𝖱1,𝖲1,𝖠1,𝖱2,𝖲2,…\)\\mathsf\{H\}=\(\\mathsf\{S\}\_\{0\},\\mathsf\{A\}\_\{0\},\\mathsf\{R\}\_\{1\},\\mathsf\{S\}\_\{1\},\\mathsf\{A\}\_\{1\},\\mathsf\{R\}\_\{2\},\\mathsf\{S\}\_\{2\},\\ldots\)\. The performance of the agent is measured in terms of the*γ\\gamma\-discounted return*𝖦t≐∑k=0∞γk𝖱t\+k\+1\\mathsf\{G\}\_\{t\}\\doteq\\sum\_\{k=0\}^\{\\infty\}\\gamma^\{k\}\\mathsf\{R\}\_\{t\+k\+1\}\.
The task environment is formalised as a*relational Markov decision process \(RMDP\)*\(van Otterlo[2009](https://arxiv.org/html/2605.31444#bib.bib113), p\. 168\)\. Given a PL1\-signatureΣ=\(Func,PredS∪PredA\)\\Sigma=\(\\textit\{Func\},\\textit\{Pred\}\_\{S\}\\cup\\textit\{Pred\}\_\{A\}\), an RMDP consists of a setS⊆ℐ\(\(Func,PredS\)\)S\\subseteq\\mathcal\{I\}\(\(\\textit\{Func\},\\textit\{Pred\}\_\{S\}\)\)of states, which are Herbrand interpretations, a setA⊆ℬ\(\(Func,PredA\)\)A\\subseteq\\mathcal\{B\}\(\(\\textit\{Func\},\\textit\{Pred\}\_\{A\}\)\)of actions, where each action is an atom, and a setR⊆ℝR\\subseteq\\mathbb\{R\}of rewards\. For every states∈Ss\\in S, a setAsA\_\{s\}of admissible actions is defined\. The set of admissible state\-action pairs isΨ≐\{\(s,a\)∣s∈S,a∈As\}\\Psi\\doteq\\\{\(s,a\)\\mid s\\in S,a\\in A\_\{s\}\\\}\. The*dynamics*of the environment are defined by the probabilityp\(s′,r∣s,a\)p\(s^\{\\prime\},r\\mid s,a\)of transitioning to states′s^\{\\prime\}with a reward ofrrafter performing actionaain statess, i\.e\.𝖲t\+1,𝖱t\+1∼p\(⋅∣𝖲t,𝖠t\)\\mathsf\{S\}\_\{t\+1\},\\mathsf\{R\}\_\{t\+1\}\\sim p\(\\ \\cdot\\mid\\mathsf\{S\}\_\{t\},\\mathsf\{A\}\_\{t\}\)\. With the discount rateγ\\gamma, an RMDP is thus a tupleM=\(S,A,R,Ψ,p,γ\)M=\(S,A,R,\\Psi,p,\\gamma\)\. We further define the logic programsPs≐\{p\(t¯\)\.∣p\(t¯\)∈s\}P\_\{s\}\\doteq\\\{\\ \\textbf\{p\}\(\\bar\{t\}\)\.\\mid\\textbf\{p\}\(\\bar\{t\}\)\\in s\\ \\\}andPAs≐\{p\(t¯\)\.∣p\(t¯\)∈As\}P\_\{A\_\{s\}\}\\doteq\\\{\\ \\textbf\{p\}\(\\bar\{t\}\)\.\\mid\\textbf\{p\}\(\\bar\{t\}\)\\in A\_\{s\}\\ \\\}\.
###### Example 1\(Blocks World\)\.
Consider a33\-blocks world setting\(Slaney and Thiébaux[2001](https://arxiv.org/html/2605.31444#bib.bib80)\)as illustrated in Fig\.[1](https://arxiv.org/html/2605.31444#F1)\. A state is described using the predicateson\(B,L\)\\textbf\{on\}\(B,L\)to denote that blockBBis on top of locationLL; andgoal\(B,L\)\\textbf\{goal\}\(B,L\)to denote the agent’s task, i\.e\., a state must be reached in whichBBis on top ofLL\. All actions are based onmove\(B,L\)\\textbf\{move\}\(B,L\), denoting the move of blockBBto some new locationLL\. The admissible actions correspond to executable moves, e\.g\.As1=\{move\(2,1\),move\(1,2\),move\(1,table\)\}A\_\{s\_\{1\}\}=\\\{\\textbf\{move\}\(2,1\),\\textbf\{move\}\(1,2\),\\textbf\{move\}\(1,table\)\\\}\. For this particular RMDP, the effects ofmoveactions are deterministic and cause the insertion and deletion ofonatoms according to the standard blocks world dynamics\. Thegoalatoms are part of every state and are unaffected by moves\. The task of the RMDP is encoded in the reward structure: the reward is9999when a state is reached in which all blocks are stacked in ascending order \(i\.e\.,s2s\_\{2\}\) and is−1\-1otherwise\. The discount rate isγ=1\\gamma=1\.

s1\\displaystyle s\_\{1\}=\{on\(0,table\),on\(1,0\),on\(2,table\),goal\(0,table\),goal\(1,0\),goal\(2,1\)\}\\displaystyle=\\left\\\{\\begin\{array\}\[\]\{l\}\\textbf\{on\}\(0,table\),\\textbf\{on\}\(1,0\),\\textbf\{on\}\(2,table\),\\\\ \\textbf\{goal\}\(0,table\),\\textbf\{goal\}\(1,0\),\\textbf\{goal\}\(2,1\)\\end\{array\}\\right\\\}a1\\displaystyle a\_\{1\}=move\(2,1\)\\displaystyle=\\textbf\{move\}\(2,1\)s2\\displaystyle s\_\{2\}=\{on\(0,table\),on\(1,0\),on\(2,1\),goal\(0,table\),goal\(1,0\),goal\(2,1\)\}\\displaystyle=\\left\\\{\\begin\{array\}\[\]\{l\}\\textbf\{on\}\(0,table\),\\textbf\{on\}\(1,0\),\\textbf\{on\}\(2,1\),\\\\ \\textbf\{goal\}\(0,table\),\\textbf\{goal\}\(1,0\),\\textbf\{goal\}\(2,1\)\\end\{array\}\\right\\\}
Figure 1:A33\-blocks world RMDP\. Taking actiona1a\_\{1\}in states1s\_\{1\}causes a transition tos2s\_\{2\}\.A*policy*π\(a∣s\)\\pi\(a\\mid s\)defines the probability of taking actiona∈Asa\\in A\_\{s\}in statess, i\.e\.𝖠t∼π\(⋅∣𝖲t\)\\mathsf\{A\}\_\{t\}\\sim\\pi\(\\ \\cdot\\mid\\mathsf\{S\}\_\{t\}\)\. The goal is to find an optimal policyπ⋆\\pi\_\{\\star\}that maximises the expected return for all states, that is𝔼p,π⋆\[𝖦t∣𝖲t=s\]≥𝔼p,π\[𝖦t∣𝖲t=s\]\\mathbb\{E\}\_\{p,\\pi\_\{\\star\}\}\\left\[\\mathsf\{G\}\_\{t\}\\mid\\mathsf\{S\}\_\{t\}=s\\,\\right\]\\geq\\mathbb\{E\}\_\{p,\\pi\}\\left\[\\mathsf\{G\}\_\{t\}\\mid\\mathsf\{S\}\_\{t\}=s\\,\\right\]for allπ,s,t\\pi,s,t\. A classic algorithm to findπ⋆\\pi\_\{\\star\}without prior knowledge ofppis*QQ\-learning*\(Watkins and Dayan[1992](https://arxiv.org/html/2605.31444#bib.bib125)\)\. Given a step\-size parameterα∈\(0,1\]\\alpha\\in\(0,1\]and an*action\-value function*𝖰:Ψ→ℝ\\mathsf\{Q\}:\\Psi\\to\\mathbb\{R\}, the version we use\(Sutton and Barto[2018](https://arxiv.org/html/2605.31444#bib.bib37), Chapter 6\.5\)is governed by the update
𝖰\(𝖲t,𝖠t\)←𝖰\(𝖲t,𝖠t\)\+α⋅\(𝖱t\+1\+γ⋅max𝑎\{𝖰\(𝖲t\+1,a\)\}−𝖰\(𝖲t,𝖠t\)\)\.\\mathsf\{Q\}\(\\mathsf\{S\}\_\{t\},\\mathsf\{A\}\_\{t\}\)\\leftarrow\\mathsf\{Q\}\(\\mathsf\{S\}\_\{t\},\\mathsf\{A\}\_\{t\}\)\+\\alpha\\cdot\\left\(\\mathsf\{R\}\_\{t\+1\}\+\\gamma\\cdot\\underset\{a\}\{\\max\}\\bigl\\\{\\mathsf\{Q\}\(\\mathsf\{S\}\_\{t\+1\},a\)\\bigr\\\}\-\\mathsf\{Q\}\(\\mathsf\{S\}\_\{t\},\\mathsf\{A\}\_\{t\}\)\\right\)\.An approximation ofπ⋆\\pi\_\{\\star\}is obtained by selecting actions based onargmax\{𝖰\(s,a\)∣a∈As\}\{\\arg\\max\}\\\{\\mathsf\{Q\}\(s,a\)\\mid a\\in A\_\{s\}\\\}\.
### 2\.3CARCASS Abstractions
We now introduce the CARCASS framework\(van Otterlo[2009](https://arxiv.org/html/2605.31444#bib.bib113), Chapter 5\), discussing its syntax, semantics and finally an abstract version ofQQ\-learning\.
LetΣ\\Sigmabe a PL1\-signature andVara set of variables\.
- •An*abstract state*s^\{\\hat\{s\}\}has the forms^≐l1,…,lk\{\\hat\{s\}\}\\doteq l\_\{1\},\\ldots,l\_\{k\},k≥1k\\geq 1, with alllil\_\{i\}naf\-literals overΣ\\SigmaandVar\.
- •An*abstract action*a^\{\\hat\{a\}\}is a single predicate atoma^≐p\(u¯\)\{\\hat\{a\}\}\\doteq\\textbf\{p\}\(\\bar\{u\}\)overΣ\\SigmaandVar\.
- •A*CARCASS*C^=⟨\(s^1,A^s^1\),…,\(s^n,A^s^n\)⟩\\hat\{C\}=\\langle\(\{\\hat\{s\}\}\_\{1\},\{\\hat\{A\}\}\_\{\{\\hat\{s\}\}\_\{1\}\}\),\\ldots,\(\{\\hat\{s\}\}\_\{n\},\{\\hat\{A\}\}\_\{\{\\hat\{s\}\}\_\{n\}\}\)\\rangleis a finite list ofn≥1n\\geq 1elements, such that everys^i\{\\hat\{s\}\}\_\{i\}is an abstract state and everyAs^i=\{a^i,1,…,a^i,mi\}A\_\{\{\\hat\{s\}\}\_\{i\}\}=\\\{\{\\hat\{a\}\}\_\{i,1\},\\ldots,\{\\hat\{a\}\}\_\{i,m\_\{i\}\}\\\}is a set ofmi≥1m\_\{i\}\\geq 1abstract actions that are*admissible*ins^i\{\\hat\{s\}\}\_\{i\}\. The variables occurring inAs^iA\_\{\{\\hat\{s\}\}\_\{i\}\}must also occur ins^i\{\\hat\{s\}\}\_\{i\}\. Furthermore, we letS^=\{s^i∣1≤i≤n\}\\hat\{S\}=\\\{\{\\hat\{s\}\}\_\{i\}\\mid 1\\leq i\\leq n\\\}andΨ^=\{\(s^i,a^i,j\)∣1≤i≤n,1≤j≤mi\}\\hat\{\\Psi\}=\\\{\(\{\\hat\{s\}\}\_\{i\},\{\\hat\{a\}\}\_\{i,j\}\)\\mid 1\\leq i\\leq n,1\\leq j\\leq m\_\{i\}\\\}\.
The semantics of a CARCASSC^\\hat\{C\}is defined via a covering relationCov, which uses SLDNF\-resolution\. Intuitively, to choose an abstract state\-action pair for some concrete pair\(s,a\)\(s,a\),C^\\hat\{C\}is interpreted as a*decision list*that is traversed until the firsts^i\{\\hat\{s\}\}\_\{i\}is found which coversss\. Then, somea^i,j\{\\hat\{a\}\}\_\{i,j\}with\(s^i,a^i,j\)\(\{\\hat\{s\}\}\_\{i\},\{\\hat\{a\}\}\_\{i,j\}\)covers\(s,a\)\(s,a\)is chosen fromA^s^i\{\\hat\{A\}\}\_\{\{\\hat\{s\}\}\_\{i\}\}\. Formally, given an RMDPM=\(S,A,R,Ψ,p,γ\)M=\(S,A,R,\\Psi,p,\\gamma\)andC^\\hat\{C\}, for everys^i∈S^\{\\hat\{s\}\}\_\{i\}\\in\\hat\{S\}and\(s^i,a^i,j\)∈Ψ^\(\{\\hat\{s\}\}\_\{i\},\{\\hat\{a\}\}\_\{i,j\}\)\\in\\hat\{\\Psi\}:
Cov\(s^i\)\\displaystyle\\textit\{Cov\}\(\{\\hat\{s\}\}\_\{i\}\)≐\{s∈S∣Ps⊢SLDNFs^i\},\\displaystyle\\doteq\\\{\\ s\\in S\\mid P\_\{s\}\\vdash\_\{\\text\{SLDNF\}\}\{\\hat\{s\}\}\_\{i\}\\ \\\},Cov\(s^i,a^i,j\)\\displaystyle\\textit\{Cov\}\(\{\\hat\{s\}\}\_\{i\},\{\\hat\{a\}\}\_\{i,j\}\)≐\{\(s,a\)∈Ψ∣Ps⊢SLDNFs^iθanda=a^i,jθ\},\\displaystyle\\doteq\\\{\\ \(s,a\)\\in\\Psi\\mid P\_\{s\}\\vdash\_\{\\text\{SLDNF\}\}\{\\hat\{s\}\}\_\{i\}\\theta\\text\{ and \}a=\{\\hat\{a\}\}\_\{i,j\}\\theta\\ \\\},⟦s^i⟧C^\\displaystyle\{\\llbracket\{\{\\hat\{s\}\}\_\{i\}\}\\rrbracket\_\{\\hat\{C\}\}\}≐\{s∈Cov\(s^i\)∣s∉Cov\(s^k\)for all1≤k<i\},\\displaystyle\\doteq\\\{\\ s\\in\\textit\{Cov\}\(\{\\hat\{s\}\}\_\{i\}\)\\mid s\\not\\in\\textit\{Cov\}\(\{\\hat\{s\}\}\_\{k\}\)\\text\{ for all \}1\\leq k<i\\ \\\},⟦s^i,a^i,j⟧C^\\displaystyle\{\\llbracket\{\{\\hat\{s\}\}\_\{i\},\{\\hat\{a\}\}\_\{i,j\}\}\\rrbracket\_\{\\hat\{C\}\}\}≐\{\(s,a\)∈Cov\(s^i,a^i,j\)∣s∉Cov\(s^k\)for all1≤k<i\},\\displaystyle\\doteq\\\{\\ \(s,a\)\\in\\textit\{Cov\}\(\{\\hat\{s\}\}\_\{i\},\{\\hat\{a\}\}\_\{i,j\}\)\\mid s\\not\\in\\textit\{Cov\}\(\{\\hat\{s\}\}\_\{k\}\)\\text\{ for all \}1\\leq k<i\\ \\\},⟦s^i,a^i,j⟧C^s\\displaystyle\{\\llbracket\{\{\\hat\{s\}\}\_\{i\},\{\\hat\{a\}\}\_\{i,j\}\}\\rrbracket\_\{\\hat\{C\}\}^\{s\}\}≐\{a∣\(s,a\)∈⟦s^i,a^i,j⟧C^\}\.\\displaystyle\\doteq\\\{\\ a\\mid\(s,a\)\\in\{\\llbracket\{\{\\hat\{s\}\}\_\{i\},\{\\hat\{a\}\}\_\{i,j\}\}\\rrbracket\_\{\\hat\{C\}\}\}\\ \\\}\.
###### Example 2\(Blocks World cont’d\)\.
An example CARCASS for a33\-blocks world is given in Table[1](https://arxiv.org/html/2605.31444#T1)\. Intuitively,s^1\{\\hat\{s\}\}\_\{1\}captures all states with a tower of two blocks,s^2\{\\hat\{s\}\}\_\{2\}captures all states with all blocks on the table, ands^3\{\\hat\{s\}\}\_\{3\}captures all states with a tower of three blocks\.
Table 1:Example CARCASS\(van Otterlo[2009](https://arxiv.org/html/2605.31444#bib.bib113), p\. 253\)Algorithm[1](https://arxiv.org/html/2605.31444#algorithm1)illustrates howQQ\-learning can operate on the abstract representations of a CARCASS\. The algorithm assumes that the task environment is explored in a series of episodes, each with a history𝖧=\(𝖲0,𝖠0,𝖱1,𝖲1,…\)\\mathsf\{H\}=\(\\mathsf\{S\}\_\{0\},\\mathsf\{A\}\_\{0\},\\mathsf\{R\}\_\{1\},\\mathsf\{S\}\_\{1\},\\ldots\)\. The CARCASS semantics is used to derive an abstract series of interactions𝖧^=\(𝖲^0,𝖠^0,𝖱1,𝖲^1,…\)\\hat\{\\mathsf\{H\}\}=\(\\hat\{\\mathsf\{S\}\}\_\{0\},\\hat\{\\mathsf\{A\}\}\_\{0\},\\mathsf\{R\}\_\{1\},\\hat\{\\mathsf\{S\}\}\_\{1\},\\ldots\), to whichQQ\-learning is applied, working with abstract policies and value functions\.
Input:RMDPM=\(S,A,R,Ψ,p,γ\)M=\(S,A,R,\\Psi,p,\\gamma\), CARCASSC^=⟨\(s^1,A^s^1\),…,\(s^n,A^s^n\)⟩\\hat\{C\}=\\langle\(\{\\hat\{s\}\}\_\{1\},\{\\hat\{A\}\}\_\{\{\\hat\{s\}\}\_\{1\}\}\),\\ldots,\(\{\\hat\{s\}\}\_\{n\},\{\\hat\{A\}\}\_\{\{\\hat\{s\}\}\_\{n\}\}\)\\rangle, initial abstract action\-value function𝖰^:Ψ^→ℝ\\hat\{\\mathsf\{Q\}\}:\\hat\{\\Psi\}\\to\\mathbb\{R\}, step sizeα∈\(0,1\]\\alpha\\in\(0,1\]
Output:Learned abstract action\-value function
𝖰^\\hat\{\\mathsf\{Q\}\}\(ideally
𝖰^≈q⋆\\hat\{\\mathsf\{Q\}\}\\approx q\_\{\\star\}\)
1foreach**episode**do
2Initialize the starting state
𝖲0\\mathsf\{S\}\_\{0\}
3Find
𝖲^0∈S^\\hat\{\\mathsf\{S\}\}\_\{0\}\\in\\hat\{S\}for which
𝖲0∈⟦𝖲^0⟧C^\\mathsf\{S\}\_\{0\}\\in\{\\llbracket\{\\hat\{\\mathsf\{S\}\}\_\{0\}\}\\rrbracket\_\{\\hat\{C\}\}\}
4repeatfor every time point
t=0,1,2,…t=0,1,2,\\ldots
5
6Choose
𝖠^t∈A^𝖲^t\\hat\{\\mathsf\{A\}\}\_\{t\}\\in\\hat\{A\}\_\{\\hat\{\\mathsf\{S\}\}\_\{t\}\}using an abstract behaviour policy derived from
𝖰^\\hat\{\\mathsf\{Q\}\}
7Choose a random action
𝖠t∈⟦𝖲^t,𝖠^t⟧C^𝖲t\\mathsf\{A\}\_\{t\}\\in\{\\llbracket\{\\hat\{\\mathsf\{S\}\}\_\{t\},\\hat\{\\mathsf\{A\}\}\_\{t\}\}\\rrbracket\_\{\\hat\{C\}\}^\{\\mathsf\{S\}\_\{t\}\}\}
8Take action
𝖠t\\mathsf\{A\}\_\{t\}, observe
𝖲t\+1\\mathsf\{S\}\_\{t\+1\},
𝖱t\+1\\mathsf\{R\}\_\{t\+1\}
9Find
𝖲^t\+1∈S^\\hat\{\\mathsf\{S\}\}\_\{t\+1\}\\in\\hat\{S\}for which
𝖲t\+1∈⟦𝖲^t\+1⟧C^\\mathsf\{S\}\_\{t\+1\}\\in\{\\llbracket\{\\hat\{\\mathsf\{S\}\}\_\{t\+1\}\}\\rrbracket\_\{\\hat\{C\}\}\}
10
𝖰^\(𝖲^t,𝖠^t\)←𝖰^\(𝖲^t,𝖠^t\)\+α⋅\(𝖱t\+1\+γ⋅maxa^\{𝖰^\(𝖲^t\+1,a^\)\}−𝖰^\(𝖲^t,𝖠^t\)\)\\hat\{\\mathsf\{Q\}\}\\left\(\\hat\{\\mathsf\{S\}\}\_\{t\},\\hat\{\\mathsf\{A\}\}\_\{t\}\\right\)\\leftarrow\\hat\{\\mathsf\{Q\}\}\\left\(\\hat\{\\mathsf\{S\}\}\_\{t\},\\hat\{\\mathsf\{A\}\}\_\{t\}\\right\)\+\\alpha\\cdot\\biggl\(\\mathsf\{R\}\_\{t\+1\}\+\\gamma\\cdot\\underset\{\{\\hat\{a\}\}\}\{\\max\}\\left\\\{\\hat\{\\mathsf\{Q\}\}\\left\(\\hat\{\\mathsf\{S\}\}\_\{t\+1\},\{\\hat\{a\}\}\\right\)\\right\\\}\-\\hat\{\\mathsf\{Q\}\}\\left\(\\hat\{\\mathsf\{S\}\}\_\{t\},\\hat\{\\mathsf\{A\}\}\_\{t\}\\right\)\\biggr\)
11
12until*end of episode*or*maximum number of steps reached*
13end foreach
return
𝖰^\\hat\{\\mathsf\{Q\}\}
Algorithm 1QQ\-learning for CARCASSs\(van Otterlo[2009](https://arxiv.org/html/2605.31444#bib.bib113), p\. 258\)
## 3ASP\-Based Abstractions
We provide a method to encode CARCASS abstractions in ASP and show how it can be used in an online learning setting\. We also extend the method to allow for additional background knowledge\.
### 3\.1ASP Encoding
LetC^=⟨\(s^1,A^s^1\),…,\(s^n,A^s^n\)⟩\\hat\{C\}=\\langle\(\{\\hat\{s\}\}\_\{1\},\{\\hat\{A\}\}\_\{\{\\hat\{s\}\}\_\{1\}\}\),\\ldots,\(\{\\hat\{s\}\}\_\{n\},\{\\hat\{A\}\}\_\{\{\\hat\{s\}\}\_\{n\}\}\)\\ranglebe a CARCASS andM=\(S,A,R,Ψ,p,γ\)M=\(S,A,R,\\Psi,p,\\gamma\)an RMDP with a shared signatureΣ=\(Func,PredS∪PredA\)\\Sigma=\(\\textit\{Func\},\\textit\{Pred\}\_\{S\}\\cup\\textit\{Pred\}\_\{A\}\)\. We assume that none of the function\- and predicate names introduced below occur inΣ\\Sigma\.
###### Definition 1\(Labelling Function\)\.
To reason about abstract states and state\-action pairs as objects, we define a*labelling function*as a one\-to\-one mappingλ:S^∪Ψ^↦Term\\lambda:\\hat\{S\}\\cup\\hat\{\\Psi\}\\mapsto\\textit\{Term\}from abstract states and state\-action pairs to a set of ground terms\.
We now present our CARCASS encoding\. To this end, we introduce several new atoms:
- •sIdx\(λ\(s^i\),i\)\\textbf\{sIdx\}\(\\lambda\(\{\\hat\{s\}\}\_\{i\}\),i\)states an abstract states^i\{\\hat\{s\}\}\_\{i\}\(identified by its label\) is at indexiiin the decision list\.
- •sCov\(λ\(s^i\),\(T¯\)\)\\textbf\{sCov\}\(\\lambda\(\{\\hat\{s\}\}\_\{i\}\),\(\\bar\{T\}\)\)states a covering relation betweens^i\{\\hat\{s\}\}\_\{i\}and some statessexists\. Denoting all free variables occurring ins^i\{\\hat\{s\}\}\_\{i\}byV¯=V1,…,Vm\\bar\{V\}=V\_\{1\},\\ldots,V\_\{m\},\(T¯\)\(\\bar\{T\}\)represents a ground substitutionV¯θ=T¯\\bar\{V\}\\theta=\\bar\{T\}, reminiscent of the originalCov, wherePs⊢SLDNFs^iθP\_\{s\}\\vdash\_\{\\text\{SLDNF\}\}\{\\hat\{s\}\}\_\{i\}\\theta\.
- •sChoice\(λ\(s^i\)\)\\textbf\{sChoice\}\(\\lambda\(\{\\hat\{s\}\}\_\{i\}\)\)statess^i\{\\hat\{s\}\}\_\{i\}as “chosen”, such that all abstract states covering some concrete state can be enumerated as answer sets\.
- •aCov\(λ\(s^i,a^i,j\),p\(t¯\)\)\\textbf\{aCov\}\(\\lambda\(\{\\hat\{s\}\}\_\{i\},\{\\hat\{a\}\}\_\{i,j\}\),p\(\\bar\{t\}\)\)states a covering relation between\(s^i,a^i,j\)\(\{\\hat\{s\}\}\_\{i\},\{\\hat\{a\}\}\_\{i,j\}\)\(identified by its label\) and a concrete pair\(s,a\)\(s,a\)exists in each answer set wheres^i\{\\hat\{s\}\}\_\{i\}is “chosen”\. Note that we use two representations for the covered actionaa: as atoma=p\(t¯\)a=\\textbf\{p\}\(\\bar\{t\}\), as originally defined; and as the termp\(t¯\)p\(\\bar\{t\}\)inaCov\.
Armed with these auxiliary predicates, we develop the CARCASS encoding by combining several programs that encode different CARCASS components\.
We shall use a programPs^iP\_\{\{\\hat\{s\}\}\_\{i\}\}that encodes a single abstract state, a programPs^i,a^i,jP\_\{\{\\hat\{s\}\}\_\{i\},\{\\hat\{a\}\}\_\{i,j\}\}that encodes the covering relation for abstract state\-action pairs, and programsPCovP\_\{Cov\}andPC^P\_\{\\hat\{C\}\}that translate the decision list semantics following the guess\-and\-check methodology\. ProgramPCovP\_\{Cov\}produces solution candidates using a choice rule, andPC^P\_\{\\hat\{C\}\}adds a constraint to eliminate unwanted candidates\. Given a concrete statess, represented by programPsP\_\{s\}, and its setAsA\_\{s\}of admissible concrete actions, represented byPAsP\_\{A\_\{s\}\},Ps∪PAs∪PCovP\_\{s\}\\cup P\_\{A\_\{s\}\}\\cup P\_\{Cov\}yields one solution candidate per covering abstract state\. Furthermore,Ps∪PAs∪PC^P\_\{s\}\\cup P\_\{A\_\{s\}\}\\cup P\_\{\\hat\{C\}\}eliminates all but one such candidate, corresponding to the selected abstract state via the CARCASS semantics\.
###### Definition 2\(CARCASS ASP Encoding\)\.
Lets^i∈S^\{\\hat\{s\}\}\_\{i\}\\in\\hat\{S\}be an abstract state and letV¯=V1,…,Vm\\bar\{V\}=V\_\{1\},\\ldots,V\_\{m\}denote all the variables occurring ins^i\{\\hat\{s\}\}\_\{i\}\. Then, we define programs as follows:
- •Ps^iP\_\{\{\\hat\{s\}\}\_\{i\}\}encodes the CARCASS covering relation fors^i\{\\hat\{s\}\}\_\{i\}by Ps^i\\displaystyle P\_\{\{\\hat\{s\}\}\_\{i\}\}≐\{sIdx\(λ\(s^i\),i\)\.sCov\(λ\(s^i\),\(V¯\)\)←s^i\.\}\.\\displaystyle\\doteq\\left\\\{\\begin\{array\}\[\]\{l@\{~\}l@\{~\}l\}\\textbf\{sIdx\}\(\\lambda\(\{\\hat\{s\}\}\_\{i\}\),i\)\.\\qquad\\textbf\{sCov\}\(\\lambda\(\{\\hat\{s\}\}\_\{i\}\),\(\\bar\{V\}\)\)&\\leftarrow&\{\\hat\{s\}\}\_\{i\}\.\\end\{array\}\\right\\\}\.The first rule is a fact definingsIdxas just described\. The head of the second rule definessCov; its body consists of the literals ins^i\{\\hat\{s\}\}\_\{i\}\. The ground instances of this rule reflect all of the possible ground substitutions for the variables ins^i\{\\hat\{s\}\}\_\{i\}\.
- •Ps^i,a^i,jP\_\{\{\\hat\{s\}\}\_\{i\},\{\\hat\{a\}\}\_\{i,j\}\}encodes the CARCASS covering relation for\(s^i,a^i,j\)∈Ψ^\(\{\\hat\{s\}\}\_\{i\},\{\\hat\{a\}\}\_\{i,j\}\)\\in\\hat\{\\Psi\}, witha^i,j=p\(u¯\)\{\\hat\{a\}\}\_\{i,j\}=\\textbf\{p\}\(\\bar\{u\}\), by Ps^i,a^i,j\\displaystyle P\_\{\{\\hat\{s\}\}\_\{i\},\{\\hat\{a\}\}\_\{i,j\}\}≐\{aCov\(λ\(s^i,a^i,j\),p\(u¯\)\)←sChoice\(λ\(s^i\)\),sCov\(λ\(s^i\),\(V¯\)\),p\(u¯\)\.\},\\displaystyle\\doteq\\left\\\{\\begin\{array\}\[\]\{lll\}\\textbf\{aCov\}\(\\lambda\(\{\\hat\{s\}\}\_\{i\},\{\\hat\{a\}\}\_\{i,j\}\),\\text\{p\}\(\\bar\{u\}\)\)&\\leftarrow&\\textbf\{sChoice\}\(\\lambda\(\{\\hat\{s\}\}\_\{i\}\)\),\\textbf\{sCov\}\(\\lambda\(\{\\hat\{s\}\}\_\{i\}\),\(\\bar\{V\}\)\),\\textbf\{p\}\(\\bar\{u\}\)\.\\end\{array\}\\right\\\},wherep/np/nis a fresh function name of the same arity asp/n\\textbf\{p\}/n\. The head of this rule definesaCov; its body reflects the definition ofCov\(s^i,a^i,j\)\\textit\{Cov\}\(\{\\hat\{s\}\}\_\{i\},\{\\hat\{a\}\}\_\{i,j\}\), namely the coverage ofs^iθ\{\\hat\{s\}\}\_\{i\}\\theta\(bysCov\) under the possible ground substitutionsθ\\thetaof the rule \(in particular the ground substitutionsV¯θ\\bar\{V\}\\thetaofV¯\\bar\{V\}\), and the admissibility of concrete actionsa=p\(u¯\)θa=\\textbf\{p\}\(\\bar\{u\}\)\\thetathat are ground instances ofa^i,j\{\\hat\{a\}\}\_\{i,j\}under said substitutions\.
- •PCovP\_\{\\textit\{Cov\}\}encodes the covering relations for all abstract states and state\-action pairs by PCov\\displaystyle P\_\{\\textit\{Cov\}\}≐⋃1≤i≤n\(Ps^i∪⋃1≤j≤miPs^i,a^i,j\)∪\{1=\{sChoice\(R\):sCov\(R,\_\)\}\.\}\.\\displaystyle\\doteq\\bigcup\_\{1\\leq i\\leq n\}\\Big\(P\_\{\{\\hat\{s\}\}\_\{i\}\}\\cup\\bigcup\_\{1\\leq j\\leq m\_\{i\}\}P\_\{\{\\hat\{s\}\}\_\{i\},\{\\hat\{a\}\}\_\{i,j\}\}\\Big\)\\cup\\Big\\\{1=\\\{\\textbf\{sChoice\}\(R\):\\textbf\{sCov\}\(R,\\\_\)\\ \\\}\.\\Big\\\}\.It includes a choice rule, in which the labels of all covering abstract states are made available for the choice, but only one can be chosen per answer set\.
- •PC^P\_\{\\hat\{C\}\}encodes the full CARCASS semantics by PC^\\displaystyle P\_\{\\hat\{C\}\}≐PCov∪\{←sChoice\(R1\),sCov\(R2,\_\),sIdx\(R1,I1\),sIdx\(R2,I2\),I2<I1\.\}\.\\displaystyle\\doteq P\_\{\\textit\{Cov\}\}\\cup\\left\\\{\\begin\{array\}\[\]\{l\}\\leftarrow\\textbf\{sChoice\}\(R1\),\\textbf\{sCov\}\(R2,\\\_\),\\textbf\{sIdx\}\(R1,I1\),\\textbf\{sIdx\}\(R2,I2\),I2<I1\.\\end\{array\}\\right\\\}\.The constraint added toPCovP\_\{\\textit\{Cov\}\}eliminates all solution candidates except for the one corresponding to the minimal covering state with respect to the decision list ordering\.
###### Example 3\(Blocks World cont’d\)\.
To illustrate the ASP encoding, consider the CARCASS from Example[2](https://arxiv.org/html/2605.31444#Thmexample2), in particular the abstract states^2\{\\hat\{s\}\}\_\{2\}and the abstract pair\(s^2,a^2,1\)\(\{\\hat\{s\}\}\_\{2\},\{\\hat\{a\}\}\_\{2,1\}\)\. We choose descriptive names as labels, e\.g\.λ\(s^2\)="3towers"\\lambda\(\{\\hat\{s\}\}\_\{2\}\)=\\texttt\{"3towers"\}andλ\(\(s^2,a^2,1\)\)="3t\-mAB"\\lambda\(\(\{\\hat\{s\}\}\_\{2\},\{\\hat\{a\}\}\_\{2,1\}\)\)=\\texttt\{"3t\-mAB"\}\. Then:
Ps^2=\{sIdx\("3towers",2\)\.sCov\("3towers",\(A,B,C\)\)←on\(A,table\),on\(B,table\),on\(C,table\),A≠B,B≠C,A≠C\.\},P\_\{\{\\hat\{s\}\}\_\{2\}\}=\\left\\\{\\begin\{array\}\[\]\{lll\}\\textbf\{sIdx\}\(\\texttt\{"3towers"\},2\)\.&&\\\\ \\textbf\{sCov\}\(\\texttt\{"3towers"\},\(A,B,C\)\)&\\leftarrow&\\textbf\{on\}\(A,\\textit\{table\}\),\\textbf\{on\}\(B,\\textit\{table\}\),\\textbf\{on\}\(C,\\textit\{table\}\),\\\\ &&A\\not=B,B\\not=C,A\\not=C\.\\end\{array\}\\right\\\},Ps^2,a^2,1=\{aCov\("3t\-mAB",move\(A,B\)\)←sChoice\("3towers"\),sCov\("3towers",\(A,B,C\)\),move\(A,B\)\.\}\.P\_\{\{\\hat\{s\}\}\_\{2\},\{\\hat\{a\}\}\_\{2,1\}\}=\\left\\\{\\begin\{array\}\[\]\{l@\{~\}l\}\\textbf\{aCov\}\(\\texttt\{"3t\-mAB"\},\\textit\{move\}\(A,B\)\)\\leftarrow&\\ \\textbf\{sChoice\}\(\\texttt\{"3towers"\}\),\\\\ &\\ \\textbf\{sCov\}\(\\texttt\{"3towers"\},\(A,B,C\)\),\\textbf\{move\}\(A,B\)\.\\end\{array\}\\right\\\}\.The encodings for the other abstract statess^1\{\\hat\{s\}\}\_\{1\}ands^3\{\\hat\{s\}\}\_\{3\}, and their corresponding abstract state\-action pairs follow the same pattern\. The encoding forPCovP\_\{\\textit\{Cov\}\}is the union of the ones for all abstract states and state\-action pairs with the choice rule added as defined above, i\.e\.PCov=\(Ps^1∪Ps^1,a^1,1∪Ps^1,a^1,2∪Ps^1,a^1,3\)∪\(Ps^2∪Ps^2,a^2,1∪…∪Ps^2,a^2,6\)∪\(Ps^3∪Ps^3,a^3,1\)∪\{1=\{sChoice\(R\):sCov\(R,\_\)\}\.\}P\_\{\\textit\{Cov\}\}=\(P\_\{\{\\hat\{s\}\}\_\{1\}\}\\cup P\_\{\{\\hat\{s\}\}\_\{1\},\{\\hat\{a\}\}\_\{1,1\}\}\\cup P\_\{\{\\hat\{s\}\}\_\{1\},\{\\hat\{a\}\}\_\{1,2\}\}\\cup P\_\{\{\\hat\{s\}\}\_\{1\},\{\\hat\{a\}\}\_\{1,3\}\}\)\\cup\(P\_\{\{\\hat\{s\}\}\_\{2\}\}\\cup P\_\{\{\\hat\{s\}\}\_\{2\},\{\\hat\{a\}\}\_\{2,1\}\}\\cup\\ldots\\cup P\_\{\{\\hat\{s\}\}\_\{2\},\{\\hat\{a\}\}\_\{2,6\}\}\)\\cup\(P\_\{\{\\hat\{s\}\}\_\{3\}\}\\cup P\_\{\{\\hat\{s\}\}\_\{3\},\{\\hat\{a\}\}\_\{3,1\}\}\)\\cup\\left\\\{1=\\\{\\textbf\{sChoice\}\(R\):\\textbf\{sCov\}\(R,\\\_\)\\ \\\}\.\\right\\\}\. Finally,PC^P\_\{\\hat\{C\}\}is defined as above\. The complete encoding is available in the supplementary material\.
### 3\.2Correctness
As for the correctness of the encoding, we establish the following result \(proof in the supplementary material\)\.
###### Proposition 1\.
Given an RMDPM=\(S,A,R,Ψ,p,γ\)M=\(S,A,R,\\Psi,p,\\gamma\), a CARCASSC^\\hat\{C\}, and labelsλ\\lambda, lets∈Ss\\in S,P=Ps∪PAs∪PCovP=P\_\{s\}\\cup P\_\{A\_\{s\}\}\\cup P\_\{\\textit\{Cov\}\},I∈𝒜𝒮\(P\)I\\in\\mathcal\{AS\}\(P\), ands^i=l1,…,lk∈S^\{\\hat\{s\}\}\_\{i\}=l\_\{1\},\\ldots,l\_\{k\}\\in\\hat\{S\}with variablesV1,…,VmV\_\{1\},\\ldots,V\_\{m\}\. Furthermore, letθ\\thetabe a substitution such thats^iθ\{\\hat\{s\}\}\_\{i\}\\thetais a ground instance ofs^i\{\\hat\{s\}\}\_\{i\}\. Then,Ps⊢SLDNFs^iθP\_\{s\}\\vdash\_\{\\text\{SLDNF\}\}\{\\hat\{s\}\}\_\{i\}\\thetaiffsCov\(λ\(s^i\),\(V1θ,…,Vmθ\)\)∈I\\textbf\{sCov\}\(\\lambda\(\{\\hat\{s\}\}\_\{i\}\),\(V\_\{1\}\\theta,\\ldots,V\_\{m\}\\theta\)\)\\in I\.
Based on this result, desired properties of the encoding can be concluded\.
###### Corollary 1\.
ForMM,C^\{\\hat\{C\}\},λ\\lambda,ss, ands^\{\\hat\{s\}\}as in Prop\.[1](https://arxiv.org/html/2605.31444#Thmproposition1), the following properties \(P1\)\-\(P3\) hold:
1. \(P1\)𝒜𝒮\(Ps∪PAs∪PCov\)\\mathcal\{AS\}\(P\_\{s\}\\cup P\_\{A\_\{s\}\}\\cup P\_\{\\textit\{Cov\}\}\)corresponds one\-to\-one with the abstract states that covers∈Ss\\in S\.
2. \(P2\)EachI∈𝒜𝒮\(Ps∪PAs∪PCov\)I\\in\\mathcal\{AS\}\(P\_\{s\}\\cup P\_\{A\_\{s\}\}\\cup P\_\{\\textit\{Cov\}\}\), corresponding with the chosen abstract states^i\{\\hat\{s\}\}\_\{i\}, contains a listing of the admissible abstract actionsa^i,j∈A^s^i\{\\hat\{a\}\}\_\{i,j\}\\in\\hat\{A\}\_\{\{\\hat\{s\}\}\_\{i\}\}and their covered admissible concrete actionsp\(t¯\)=a∈As\\textbf\{p\}\(\\bar\{t\}\)=a\\in A\_\{s\}in the form ofaCov\(λ\(s^i,a^i,j\),p\(t¯\)\)\\textbf\{aCov\}\(\\lambda\(\{\\hat\{s\}\}\_\{i\},\{\\hat\{a\}\}\_\{i,j\}\),\\text\{p\}\(\\bar\{t\}\)\)atoms\.
3. \(P3\)𝒜𝒮\(Ps∪PAs∪PC^\)\\mathcal\{AS\}\(P\_\{s\}\\cup P\_\{A\_\{s\}\}\\cup P\_\{\\hat\{C\}\}\)contains at most one such answer set, corresponding to the covering abstract state with the lowest index\.
Property \(P1\) follows from the choice rule inPCovP\_\{\\textit\{Cov\}\}\. To verify \(P2\), consider the rule
aCov\(λ\(s^i,a^i,j\),p\(u¯\)\)←sChoice\(λ\(s^i\)\),sCov\(λ\(s^i\),\(V¯\)\),p\(u¯\)\.\\textbf\{aCov\}\(\\lambda\(\{\\hat\{s\}\}\_\{i\},\{\\hat\{a\}\}\_\{i,j\}\),\\text\{p\}\(\\bar\{u\}\)\)\\leftarrow\\textbf\{sChoice\}\(\\lambda\(\{\\hat\{s\}\}\_\{i\}\)\),\\textbf\{sCov\}\(\\lambda\(\{\\hat\{s\}\}\_\{i\}\),\(\\bar\{V\}\)\),\\textbf\{p\}\(\\bar\{u\}\)\.inPs^i,a^i,jP\_\{\{\\hat\{s\}\}\_\{i\},\{\\hat\{a\}\}\_\{i,j\}\}, which has a ground instance for every occurrence ofsCovinI∈𝒜𝒮\(Ps∪PAs∪PCov\)I\\in\\mathcal\{AS\}\(P\_\{s\}\\cup P\_\{A\_\{s\}\}\\cup P\_\{\\textit\{Cov\}\}\)\. The body of this ground instance will be true if \(i\)s^i\{\\hat\{s\}\}\_\{i\}is “chosen” inII\(by the choice rule inPCovP\_\{\\textit\{Cov\}\}\), \(ii\) the substitutionθ\\thetarelated with the ground instance is such thatPs⊢SLDNFs^iθP\_\{s\}\\vdash\_\{\\text\{SLDNF\}\}\{\\hat\{s\}\}\_\{i\}\\theta, and \(iii\) there exists an admissible concrete actionp\(u¯\)θ=a∈As\\textbf\{p\}\(\\bar\{u\}\)\\theta=a\\in A\_\{s\}\. These criteria correspond with the ones in the definition ofCov\(s^i,a^i,j\)\\textit\{Cov\}\(\{\\hat\{s\}\}\_\{i\},\{\\hat\{a\}\}\_\{i,j\}\)and the rule head rightly asserts that\(s^i,a^i,j\)\(\{\\hat\{s\}\}\_\{i\},\{\\hat\{a\}\}\_\{i,j\}\)covers\(s,a\)\(s,a\)\. Generalising this argument, we can see howPs^i,a^i,jP\_\{\{\\hat\{s\}\}\_\{i\},\{\\hat\{a\}\}\_\{i,j\}\}identifies all admissible actions for which the covering relation holds\. To verify \(P3\), observe thatPC^P\_\{\\hat\{C\}\}is justPCovP\_\{\\textit\{Cov\}\}plus a constraint, eliminating all answer sets except the one where the index of the chosen covering abstract state is minimal\.
### 3\.3Online Learning for ASP\-Based CARCASS Abstractions
Algorithm[2](https://arxiv.org/html/2605.31444#algorithm2)illustrates how the ASP\-encoding of a CARCASS can be used in an online learning setting such as Algorithm[1](https://arxiv.org/html/2605.31444#algorithm1)\. The ASP encoding used here includes \(i\) the programs discussed so\-far, \(ii\) an optional programPBP\_\{B\}with background knowledge, and \(iii\) the program
Ps^∞≐\{cState\("true",\#sup\)\.cStateCovers\("true",\(\)\)\.\}P\_\{\{\\hat\{s\}\}\_\{\\infty\}\}\\doteq\\left\\\{\\ \\textbf\{cState\}\(\\texttt\{"true"\},\\\#sup\)\.\\quad\\textbf\{cStateCovers\}\(\\texttt\{"true"\},\(\)\)\.\\ \\right\\\}\(where\#sup\\\#supstands for the*supremum*, or\+∞\+\\infty\) representing an empty abstract states^∞=⊤\{\\hat\{s\}\}\_\{\\infty\}=\\topthat trivially covers every concrete state and always has the highest index\. The admissible actions areA^s^∞≐\{a^∞,1\}\\hat\{A\}\_\{\{\\hat\{s\}\}\_\{\\infty\}\}\\doteq\\\{\{\\hat\{a\}\}\_\{\\infty,1\}\\\}, such that⟦s^∞,a^∞,1⟧C^s≐As\{\\llbracket\{\{\\hat\{s\}\}\_\{\\infty\},\{\\hat\{a\}\}\_\{\\infty,1\}\}\\rrbracket\_\{\\hat\{C\}\}^\{s\}\}\\doteq A\_\{s\}covers all admissible concrete actions\.
Input:Current statesswith admissible actionsAsA\_\{s\}, background knowledgePBP\_\{B\}, and a CARCASSC^=⟨\(s^1,A^s^1\),…,\(s^n,A^s^n\)⟩\\hat\{C\}=\\langle\(\{\\hat\{s\}\}\_\{1\},\{\\hat\{A\}\}\_\{\{\\hat\{s\}\}\_\{1\}\}\),\\ldots,\(\{\\hat\{s\}\}\_\{n\},\{\\hat\{A\}\}\_\{\{\\hat\{s\}\}\_\{n\}\}\)\\ranglewith labelsλ\\lambdaand encodingPC^P\_\{\\hat\{C\}\}
Output:Chosen abstract state
s^\{\\hat\{s\}\}, abstract actions
A^s^′⊆A^s^\\hat\{A\}^\{\\prime\}\_\{\{\\hat\{s\}\}\}\\subseteq\\hat\{A\}\_\{\{\\hat\{s\}\}\}, and
⟦s^,a^⟧C^s\{\\llbracket\{\{\\hat\{s\}\},\{\\hat\{a\}\}\}\\rrbracket\_\{\\hat\{C\}\}^\{s\}\}for all
a^∈A^s^′\{\\hat\{a\}\}\\in\\hat\{A\}^\{\\prime\}\_\{\{\\hat\{s\}\}\}\.
1
I←An arbitrarily chosen optimal answer setI∈𝒜𝒮\(PC^∪Ps∪PAs∪Ps^∞∪PB\)I\\leftarrow\\text\{ An arbitrarily chosen optimal answer set \}I\\in\\mathcal\{AS\}\(P\_\{\\hat\{C\}\}\\cup P\_\{s\}\\cup P\_\{A\_\{s\}\}\\cup P\_\{\{\\hat\{s\}\}\_\{\\infty\}\}\\cup P\_\{B\}\)
2
s^←λ−1\(l\)for the onlysChoice\(l\)∈I\{\\hat\{s\}\}\\leftarrow\\lambda^\{\-1\}\(l\)\\text\{ for the only \}\\textbf\{sChoice\}\(l\)\\in I
3if*s^=s^∞\{\\hat\{s\}\}=\{\\hat\{s\}\}\_\{\\infty\}*then
4
A^s^′←\{a^∞,1\}\\hat\{A\}^\{\\prime\}\_\{\\hat\{s\}\}\\leftarrow\\\{\{\\hat\{a\}\}\_\{\\infty,1\}\\\}
5
⟦s^,a^∞,1⟧C^s←As\{\\llbracket\{\{\\hat\{s\}\},\{\\hat\{a\}\}\_\{\\infty,1\}\}\\rrbracket\_\{\\hat\{C\}\}^\{s\}\}\\leftarrow A\_\{s\}
6
7else
8
A^s^′←\{a^∣aCov\(l,p\(t¯\)\)∈I∧λ\(s^,a^\)=l\}\\hat\{A\}^\{\\prime\}\_\{\\hat\{s\}\}\\leftarrow\\left\\\{\{\\hat\{a\}\}\\mid\\textbf\{aCov\}\(l,p\(\\bar\{t\}\)\)\\in I\\land\\lambda\(\{\\hat\{s\}\},\{\\hat\{a\}\}\)=l\\right\\\}
9for*a^∈A^s^′\{\\hat\{a\}\}\\in\\hat\{A\}^\{\\prime\}\_\{\\hat\{s\}\}*do
10
⟦s^,a^⟧C^s←\{p\(t¯\)∣aCov\(l,p\(t¯\)\)∈I∧λ\(s^,a^\)=l\}\{\\llbracket\{\{\\hat\{s\}\},\{\\hat\{a\}\}\}\\rrbracket\_\{\\hat\{C\}\}^\{s\}\}\\leftarrow\\left\\\{\\textbf\{p\}\(\\bar\{t\}\)\\mid\\textbf\{aCov\}\(l,\\textit\{p\}\(\\bar\{t\}\)\)\\in I\\land\\lambda\(\{\\hat\{s\}\},\{\\hat\{a\}\}\)=l\\right\\\}
11
12end for
13
14end if
return*s^\{\\hat\{s\}\},A^s^′\\hat\{A\}^\{\\prime\}\_\{\{\\hat\{s\}\}\},\{⟦s^,a^⟧C^s\}a^∈A^s^′\\left\\\{\{\\llbracket\{\{\\hat\{s\}\},\{\\hat\{a\}\}\}\\rrbracket\_\{\\hat\{C\}\}^\{s\}\}\\right\\\}\_\{\{\\hat\{a\}\}\\in\\hat\{A\}^\{\\prime\}\_\{\{\\hat\{s\}\}\}\}*
Algorithm 2Online interaction for ASP\-encoded CARCASSsAs for \(ii\), the CARCASS encoding can be combined with background knowledge wherePBP\_\{B\}may include weak constraints\. However, ifPBP\_\{B\}leads to multiple \(optimal\) answer sets, one needs to be careful that they agree on the chosen abstract state\. This can be ensured by adding the weak constraint:∼sChoice\(R\),sIdx\(R,I\)\.\[I@1,R\]:\\sim\\textbf\{sChoice\}\(R\),\\textbf\{sIdx\}\(R,I\)\.\\;\[I@1,R\]for tie\-breaking\.
In line 1 of Alg\.[2](https://arxiv.org/html/2605.31444#algorithm2), the augmented ASP encoding is handed to an ASP solver, which is expected to return an optimal answer set\. IfPB=∅P\_\{B\}=\\emptyset, a single answer set corresponding to the covering abstract state with the smallest index exists, as previously described\. In case of multiple optimal answer sets, the solver chooses one to return\. Lines 3–5 are then concerned with the special case in whichs^∞\{\\hat\{s\}\}\_\{\\infty\}was chosen, and lines 7–10 deal with the extraction of information from the returned answer set if some others^∈S^\{\\hat\{s\}\}\\in\\hat\{S\}was chosen\. The algorithm returns the chosen abstract states^\{\\hat\{s\}\}, a set of abstract actionsA^s^′⊆A^s^\\hat\{A\}^\{\\prime\}\_\{\{\\hat\{s\}\}\}\\subseteq\\hat\{A\}\_\{\{\\hat\{s\}\}\}that cover at least one admissible concrete action—to be used instead ofA^s^\\hat\{A\}\_\{\{\\hat\{s\}\}\}in the definition of abstract policies—and the sets⟦s^,a^⟧C^s\{\\llbracket\{\{\\hat\{s\}\},\{\\hat\{a\}\}\}\\rrbracket\_\{\\hat\{C\}\}^\{s\}\}for alla^∈A^s^′\{\\hat\{a\}\}\\in\\hat\{A\}^\{\\prime\}\_\{\{\\hat\{s\}\}\}\.
## 4Case Study: Blocks World
As a first application, we discussnn\-blocks world stacking tasks and present the encoding of a state\-action pair abstraction, inspired by the algorithms GN1 and GN2\(Slaney and Thiébaux[2001](https://arxiv.org/html/2605.31444#bib.bib80)\), where the size of the abstract state\-action space is constant innn\. As a trade\-off, the abstraction does not preserve the optimal policy for the \(NP\-hard\) blocks world planning problem\. We only glance over some of the details here; the full CARCASS encoding is available in the supplementary material\.
The abstraction was designed with a generalised version of the RMDP from Example[1](https://arxiv.org/html/2605.31444#Thmexample1)in mind\. We assume annn\-blocks world RMDP with an arbitrary number ofn≥1n\\geq 1blocks and an arbitrary stacking task, encoded in the reward structure and communicated to the agent via the set ofgoalatoms in the state descriptions\.
Several predicates were defined to support the abstraction, including the following:clear\(L\)\\textbf\{clear\}\(L\)denotes that locationLLis clear, i\.e\., no blocks are located on top, orLLis the table;above/2\\textbf\{above\}/2denotes the transitive closure ofon/2\\textbf\{on\}/2; andincomplete\(L,B\)\\textbf\{incomplete\}\(L,B\)denotes the existence of an*incomplete tower*in its final position, with locationLL, topmost blockBB, and other misplaced blocks allowed to occur on top ofBB; incompleteness means that some further block\(s\) must be added on top ofBBto fulfil the stacking task\. Note that the definition ofincompletedepends on bothonandgoal\.
###### Example 4\(Example[1](https://arxiv.org/html/2605.31444#Thmexample1)cont’d\)\.
We augments1s\_\{1\}with the factsclear\(table\)\\textbf\{clear\}\(\\textit\{table\}\),clear\(1\)\\textbf\{clear\}\(1\),clear\(2\)\\textbf\{clear\}\(2\),above\(0,table\)\\textbf\{above\}\(0,\\textit\{table\}\),above\(1,0\)\\textbf\{above\}\(1,0\),above\(1,table\)\\textbf\{above\}\(1,\\textit\{table\}\),above\(2,table\)\\textbf\{above\}\(2,\\textit\{table\}\), andincomplete\(table,1\)\\textbf\{incomplete\}\(\\textit\{table\},1\)\. To further illustrate incomplete towers, consider a different55\-blocks world RMDP with a \(partial\) stacking task where blocks0–33need to form a tower in ascending order and the position of block44is irrelevant\. The corresponding state
s3=\{on\(4,table\),on\(0,4\),on\(1,0\),on\(2,1\),on\(3,table\),goal\(1,0\),goal\(2,1\),goal\(3,2\)\}s\_\{3\}=\\left\\\{\\textbf\{on\}\(4,\\textit\{table\}\),\\textbf\{on\}\(0,4\),\\textbf\{on\}\(1,0\),\\textbf\{on\}\(2,1\),\\textbf\{on\}\(3,\\textit\{table\}\),\\textbf\{goal\}\(1,0\),\\textbf\{goal\}\(2,1\),\\textbf\{goal\}\(3,2\)\\\\ \\right\\\}is augmented withincomplete\(0,2\)\\textbf\{incomplete\}\(0,2\), reflecting the fact that blocks0,11,22, and44are in their final position, but33must still be added on top of22\. For states3∪\{goal\(4,3\)\}s\_\{3\}\\cup\\\{\\textbf\{goal\}\(4,3\)\\\}in yet another55\-blocks world RMDP with the task to stack all five blocks in ascending order, no goal tower exists since no blocks are in their final position\.
There are six abstract states, covering the following cases:
- •s^0\{\\hat\{s\}\}\_\{0\}: an incomplete tower exists that is clear and the next missing block is clear;
- •s^1\{\\hat\{s\}\}\_\{1\}: an incomplete tower exists but has other misplaced blocks on top;
- •s^2\{\\hat\{s\}\}\_\{2\}: an incomplete tower that is clear exists, but the next missing block is not clear;
- •s^3\{\\hat\{s\}\}\_\{3\}: no incomplete tower exists but the first block to construct one is not clear;
- •s^4\{\\hat\{s\}\}\_\{4\}: no incomplete tower exists and the first block to construct one is clear; and
- •s^∞\{\\hat\{s\}\}\_\{\\infty\}: all other states, i\.e\. goal states\.
To illustrate our encoding, we discusss^0\{\\hat\{s\}\}\_\{0\}ands^1\{\\hat\{s\}\}\_\{1\}in detail, including their admissible abstract actions\. The abstract states^0\{\\hat\{s\}\}\_\{0\}is encoded as
Ps^0=\{sIdx\(r0,0\)\.sCov\(r0,\(T,N\)\)←incomplete\(\_,T\),clear\(T\),goal\(N,T\),clear\(N\)\.\}\.P\_\{\{\\hat\{s\}\}\_\{0\}\}=\\left\\\{\\begin\{array\}\[\]\{lll\}\\textbf\{sIdx\}\(\\textit\{r0\},0\)\.&&\\\\ \\textbf\{sCov\}\(\\textit\{r0\},\(T,N\)\)&\\leftarrow&\\textbf\{incomplete\}\(\\\_,T\),\\textbf\{clear\}\(T\),\\textbf\{goal\}\(N,T\),\\textbf\{clear\}\(N\)\.\\end\{array\}\\right\\\}\.The next move towards fulfilling the stacking task is to place the next missing blockNNon top of the incomplete towerTT, which is captured by the abstract actiona^0,0\{\\hat\{a\}\}\_\{0,0\}, encoded as
Ps^0,a^0,0=\{aCov\(move\(n,t\),move\(N,T\)\)←sChoice\(r0\),sCov\(r0,\(T,N\)\)\.\}\.P\_\{\{\\hat\{s\}\}\_\{0\},\{\\hat\{a\}\}\_\{0,0\}\}=\\left\\\{\\begin\{array\}\[\]\{lll\}\\textbf\{aCov\}\(\\textit\{move\}\(n,t\),\\textit\{move\}\(N,T\)\)&\\leftarrow&\\textbf\{sChoice\}\(\\textit\{r0\}\),\\textbf\{sCov\}\(\\textit\{r0\},\(T,N\)\)\.\\end\{array\}\\right\\\}\.We also definea^i,∞\{\\hat\{a\}\}\_\{i,\\infty\}as a catch\-all abstract action, covering actions not covered bya^0,0\{\\hat\{a\}\}\_\{0,0\}, encoded as
Ps^i,a^i,∞=\{aCov\(others,move\(X,Y\)\)←clear\(X\),clear\(Y\),X≠Y,X≠table,notaCov\(move\(\_,\_\),move\(X,Y\)\),noton\(X,Y\)\.\},P\_\{\{\\hat\{s\}\}\_\{i\},\{\\hat\{a\}\}\_\{i,\\infty\}\}=\\left\\\{\\begin\{array\}\[\]\{l\}\\textbf\{aCov\}\(\\textit\{others\},\\textit\{move\}\(X,Y\)\)\\leftarrow\\textbf\{clear\}\(X\),\\textbf\{clear\}\(Y\),X\\neq Y,X\\neq\\textit\{table\},\\\\ \\qquad\\qquad\\qquad\\qquad\\textit\{not\}\\ \\textbf\{aCov\}\(move\(\\\_,\\\_\),move\(X,Y\)\),\\textit\{not\}\\ \\textbf\{on\}\(X,Y\)\.\\end\{array\}\\right\\\},such that it can be reused also for the other abstract states\. Next,s^1\{\\hat\{s\}\}\_\{1\}is encoded as
Ps^1=\{sIdx\(r1,1\)\.sCov\(r1,\(L,T,B\)\)←incomplete\(L,T\),notclear\(T\),above\(B,T\),clear\(B\)\.\}P\_\{\{\\hat\{s\}\}\_\{1\}\}=\\left\\\{\\begin\{array\}\[\]\{lll\}\\textbf\{sIdx\}\(\\textit\{r1\},1\)\.&&\\\\ \\textbf\{sCov\}\(\\textit\{r1\},\(L,T,B\)\)&\\leftarrow&\\textbf\{incomplete\}\(L,T\),\\textit\{not\}\\ \\textbf\{clear\}\(T\),\\textbf\{above\}\(B,T\),\\textbf\{clear\}\(B\)\.\\end\{array\}\\right\\\}with the admissible abstract actionsa^1,0\{\\hat\{a\}\}\_\{1,0\}anda^1,1\{\\hat\{a\}\}\_\{1,1\}:
Ps^1,a^1,0\\displaystyle P\_\{\{\\hat\{s\}\}\_\{1\},\{\\hat\{a\}\}\_\{1,0\}\}=\{aCov\(move\(b,table\),move\(B,table\)\)←sChoice\(r1\),sCov\(r1,\(\_,\_,B\)\)\.\}\\displaystyle=\\left\\\{\\begin\{array\}\[\]\{lll\}\\textbf\{aCov\}\(\\textit\{move\}\(b,\\textit\{table\}\),\\textit\{move\}\(B,\\textit\{table\}\)\)&\\leftarrow&\\textbf\{sChoice\}\(\\textit\{r1\}\),\\textbf\{sCov\}\(\\textit\{r1\},\(\\\_,\\\_,B\)\)\.\\end\{array\}\\right\\\}Ps^1,a^1,1\\displaystyle P\_\{\{\\hat\{s\}\}\_\{1\},\{\\hat\{a\}\}\_\{1,1\}\}=\{aCov\(move\(b,o\),move\(B,O\)\)←sChoice\(r1\),sCov\(r1,\(L,T,B\)\),clear\(O\),O≠B,O≠table\.\}\\displaystyle=\\left\\\{\\begin\{array\}\[\]\{lll\}\\textbf\{aCov\}\(\\textit\{move\}\(b,o\),\\textit\{move\}\(B,O\)\)&\\leftarrow&\\textbf\{sChoice\}\(\\textit\{r1\}\),\\textbf\{sCov\}\(\\textit\{r1\},\(L,T,B\)\),\\\\ &&\\textbf\{clear\}\(O\),O\\not=B,O\\not=\\textit\{table\}\.\\end\{array\}\\right\\\}To advances^1\{\\hat\{s\}\}\_\{1\}towards fulfilling the stacking task, all blocks on top ofTTmust be removed\. The topmost blockBBcan be placed either on the table \(a^1,0\{\\hat\{a\}\}\_\{1,0\}\) or on top of some other free block \(a^1,1\{\\hat\{a\}\}\_\{1,1\}\)\. All other actions are again covered bya^i,∞\{\\hat\{a\}\}\_\{i,\\infty\}\.
The intuition behind this abstraction relates back to the algorithms GN1 and GN2\(Slaney and Thiébaux[2001](https://arxiv.org/html/2605.31444#bib.bib80)\)\. The abstract states^0\{\\hat\{s\}\}\_\{0\}covers case \(2\) of the GN1 algorithm and moves covered bya^0,0\{\\hat\{a\}\}\_\{0,0\}are constructive moves\. The abstract statess^1\{\\hat\{s\}\}\_\{1\}ands^2\{\\hat\{s\}\}\_\{2\}cover states that are deadlocked and relate to the two cases in the definition of the functionδπ\\delta\_\{\\pi\}\.
## 5Case Study: MiniGrid
Next, we consider environments from the MiniGrid simulation library\(Chevalier\-Boisvertet al\.[2023](https://arxiv.org/html/2605.31444#bib.bib76)\)\. Solving such an environment typically requires the agent to navigate through a fixed layout of connected, rectangular rooms to reach the location of a goal tile as fast as possible, solving simple puzzles and avoiding hazards along the way\.
###### Example 5\(Door Key Environment\)\.
In states1s\_\{1\}\(Fig[2](https://arxiv.org/html/2605.31444#F2)\), the agent and a key are located in one room and the goal tile in the other, separated by a locked door\. The room layout changes between episodes, which makes the learning task more difficult\.

s1=\{obj\(goal,\(6,6\)\)obj\(door\(yellow,locked\),\(5,3\)\)obj\(key\(yellow\),\(3,5\)\)obj\(agent\(east\),\(3,1\)\)obj\(wall\(grey\),\(0,0\)\)obj\(wall\(grey\),\(0,1\)\)…\}s\_\{1\}=\\left\\\{\\begin\{array\}\[\]\{l\}\\textbf\{obj\}\(\\textit\{goal\},\(6,6\)\)\\\\ \\textbf\{obj\}\(\\textit\{door\}\(\\textit\{yellow\},\\textit\{locked\}\),\(5,3\)\)\\\\ \\textbf\{obj\}\(\\textit\{key\}\(\\textit\{yellow\}\),\(3,5\)\)\\\\ \\textbf\{obj\}\(\\textit\{agent\}\(\\textit\{east\}\),\(3,1\)\)\\\\ \\textbf\{obj\}\(\\textit\{wall\}\(\\textit\{grey\}\),\(0,0\)\)\\\\ \\textbf\{obj\}\(\\textit\{wall\}\(\\textit\{grey\}\),\(0,1\)\)\\\\ \\ldots\\end\{array\}\\right\\\}

Figure 2:Example Minigrid state: The left image shows a state, its relational representation is in the center, and a graph of the room layout, inferred using ASP, is to the right\.We present the encoding of a state abstraction that is applicable to a variety of such environments\. Background knowledge is used to infer a room layout from the raw state description, to build a graph based on the location of objects, and to compute the shortest path on that graph from the agent to the goal tile, representing an ordered set of subtasks that the agent must accomplish\. Paths can be further constrained, for example to ensure that the key is visited before the door\. The abstract state is constructed such that it contains only information relevant to the first node on the path\. The full encoding of the abstraction is available in the supplementary material\.
We assume a MiniGrid configuration in which the current state is fully observable, resembling a birds\-eye view as in Fig[2](https://arxiv.org/html/2605.31444#F2), and replace the default state description \(a matrix𝐀∈ℕ0w⋅h⋅3\\mathbf\{A\}\\in\\mathbb\{N\}\_\{0\}^\{w\\cdot h\\cdot 3\}, wherew×hw\\times his the size of the grid world\) with a relational state description, using the following atoms:obj\(O,\(X,Y\)\)\\textbf\{obj\}\(O,\(X,Y\)\)asserts that objectOOis located at the grid coordinates\(X,Y\)\(X,Y\);carries\(O\)\\textbf\{carries\}\(O\)asserts that the agent carries objectOO; andterminaldenotes the end of an episode\. The actions areright,left,forward,pickup,drop,toggle, anddone\.
The key components of our abstraction, inferred using background knowledge, are represented with the following atoms:oriented\(O\)\\textbf\{oriented\}\(O\)asserts the orientation of the agent;fronting\(\(X,Y\)\)\\textbf\{fronting\}\(\(X,Y\)\)that\(X,Y\)\(X,Y\)is directly in front of the agent;dangerous\(\(X,Y\)\)\\textbf\{dangerous\}\(\(X,Y\)\)that a dangerous object is at\(X,Y\)\(X,Y\);xdir\(X\)\\textbf\{xdir\}\(X\)andydir\(Y\)\\textbf\{ydir\}\(Y\)assert the general horizontal and vertical direction, respectively, of the next object on the path in relation to the agent;touching\(T\)\\textbf\{touching\}\(T\)asserts that the next objectTTon the path is directly in front of the agent; andingap\(G\)\\textbf\{ingap\}\(G\)that the agent is in a gap \(e\.g\. between rooms\)\.
###### Example 6\.
The computed path ins1s\_\{1\}\(Fig\.[2](https://arxiv.org/html/2605.31444#F2)\) is\(3,1\)−\(3,5\)−\(5,3\)−\(6,6\)\(3,1\)\-\(3,5\)\-\(5,3\)\-\(6,6\)\. Based on the next node on that path,\(3,5\)\(3,5\), the state is augmented with the factsoriented\(east\)\\textbf\{oriented\}\(\\textit\{east\}\),fronting\(\(4,1\)\)\\textbf\{fronting\}\(\(4,1\)\),xdir\(onAxis\)\\textbf\{xdir\}\(\\textit\{onAxis\}\),ydir\(south\)\\textbf\{ydir\}\(\\textit\{south\}\),touching\(none\)\\textbf\{touching\}\(\\textit\{none\}\), andingap\(none\)\\textbf\{ingap\}\(\\textit\{none\}\)\.
We use 757 abstract states \(excludings^∞\{\\hat\{s\}\}\_\{\\infty\}\):s^1\{\\hat\{s\}\}\_\{1\}covers potentially dangerous situations, encoded as
Ps^1=\{sIdx\(danger,1\)\.sCov\(danger,\(\)\)←fronting\(T\),dangerous\(T\)\.\},P\_\{\{\\hat\{s\}\}\_\{1\}\}=\\left\\\{\\begin\{array\}\[\]\{lll\}\\textbf\{sIdx\}\(\\textit\{danger\},1\)\.&&\\\\ \\textbf\{sCov\}\(\\textit\{danger\},\(\)\)&\\leftarrow&\\textbf\{fronting\}\(T\),\\textbf\{dangerous\}\(T\)\.\\end\{array\}\\right\\\},while the statess^2\{\\hat\{s\}\}\_\{2\}–s^757\{\\hat\{s\}\}\_\{757\}, representing all possible combinations of ground instances fororiented,xdir,ydir,touching, andingap, are compactly encoded as
Ps^2−757=\{label\(\(oriented\(O\),xdir\(X\),ydir\(Y\),touching\(T\),ingap\(G\)\)\)←oriented\(O\),xdir\(X\),ydir\(Y\),touching\(T\),ingap\(G\)\.sIdx\(S,2\)←label\(S\)\.sCov\(S,\(\)\)←label\(S\)\.\}\.P\_\{\{\\hat\{s\}\}\_\{2\-757\}\}=\\left\\\{\\begin\{array\}\[\]\{l\}\\textbf\{label\}\(\(\\textit\{oriented\}\(O\),\\textit\{xdir\}\(X\),\\textit\{ydir\}\(Y\),\\textit\{touching\}\(T\),\\textit\{ingap\}\(G\)\)\)\\\\ \\quad\\leftarrow\\textbf\{oriented\}\(O\),\\textbf\{xdir\}\(X\),\\textbf\{ydir\}\(Y\),\\textbf\{touching\}\(T\),\\textbf\{ingap\}\(G\)\.\\\\ \\textbf\{sIdx\}\(S,2\)\\leftarrow\\textbf\{label\}\(S\)\.\\\\ \\textbf\{sCov\}\(S,\(\)\)\\leftarrow\\textbf\{label\}\(S\)\.\\\\ \\end\{array\}\\right\\\}\.The background knowledge is designed such that every answer set can have at most one instance oflabel, which means that the order restriction can be relaxed in favour of a more compact encoding, such that all statess^2\{\\hat\{s\}\}\_\{2\}–s^757\{\\hat\{s\}\}\_\{757\}have the same index in the decision list\.
While actions can simply be stated as facts, e\.g\.,aCov\(left,left\)\.\\textbf\{aCov\}\(\\textit\{left\},\\textit\{left\}\)\., we can prune actions that are admissible in the concrete representation but have no effect \(i\.e\., do not lead to state changes\)\. This can be done by omitting such actions \(i\.e\., making them inadmissible\) in the abstract representation\. For example,aCov\(pickup,pickup\)←fronting\(T\),obj\(key\(\_\),T\)\\textbf\{aCov\}\(\\textit\{pickup\},\\textit\{pickup\}\)\\leftarrow\\textbf\{fronting\}\(T\),\\textbf\{obj\}\(\\textit\{key\}\(\\\_\),T\)ensures that thepickupaction is admissible in the abstract representation only if there is a key to be picked up\.
## 6Empirical Evaluation
We next investigate whether abstractions make large\-scale RMDPs tractable by reducing space and sample needs of the learning process\. Notably, there is no guarantee of convergence of abstractQQ\-learning in general nor that a learned abstract policy is optimal\. Our experiments are thus designed to shed light on\(E1\)the*stability of the learning process*,\(E2\)the*quality of the learned policy*, and\(E3\)the*differences in sample efficiency*between concrete and abstractQQ\-learning\.
Both concrete and abstractQQ\-learning were tested in a2020\-blocks world with the goal to stack all blocks in order, and in three MiniGrid environments\. Anϵ\\epsilon\-greedy behaviour policy was used for exploration; for full details, including platform, software, and parameter settings see the supplementary material\.
As for\(E1\), we observe the return of episodes across the learning process\. Concerning\(E2\), we measure for every realised episodehih\_\{i\}the solution quality by the*return*𝖦0\(hi\)\\mathsf\{G\}\_\{0\}\(h\_\{i\}\)at time 0\. We call an episode*successful*if it achieved a return above a baseline, viz\.𝖦0\(hi\)\>0\\mathsf\{G\}\_\{0\}\(h\_\{i\}\)\>0for MiniGrid and𝖦0\(hi\)≥62\\mathsf\{G\}\_\{0\}\(h\_\{i\}\)\\geq 62for Blocks World \(i\.e\., the worst\-case return of a naive unstack\-stack strategy\)\. For measuring\(E3\)sample efficiency, we use the*mean return*1n⋅∑i=1n𝖦0\(hi\)\\frac\{1\}\{n\}\\cdot\\sum\_\{i=1\}^\{n\}\\mathsf\{G\}\_\{0\}\(h\_\{i\}\)and count the number of successful episodes overnnrealised episodes\.
We repeated experiments 20 times\. The collected data is summarised using the first quartileQ1Q\_\{1\}, the medianQ2Q\_\{2\}, the third quartileQ3Q\_\{3\}, and the interquartile rangeIQR\. Fig\.[3](https://arxiv.org/html/2605.31444#F3)shows the learning curves of abstractQQ\-learning for the tested environments\. For the2020\-blocks world, the first positive median return is observed at episode 376 and after episode 2000 for merely 10 instancesQ1Q\_\{1\}dropped below the baseline of 62, and for none below 60\. For Door Key, Multi Room, and Four Rooms, the first positive median returns are observed after episodes 68, 133, and 287, respectively, all recorded median returns are positive after episodes 195, 1012, and 3137, respectively, and all recordedQ1Q\_\{1\}returns are positive after episodes 2048, 1962, and 4649, respectively\.
Figure 3:The learning curves of abstractQQ\-learning in four task environments\.Table[2](https://arxiv.org/html/2605.31444#T2)shows \(from left to right\) the mean return over all realised episodes and the percentage of successful episodes over all realised, resp\., the last 500 episodes, for abstract and concreteQQ\-learning in the four tested environments\. Clear differences can be observed in all environments when comparing abstract and concreteQQ\-learning\.
Table 2:Comparison of abstract and concreteQQ\-learning \(QL\) in four task environments\.\(E1\)Regarding*stability*, after initial learning, the returns for abstractQQ\-learning stay consistently high in all environments, with no visible fluctuations in the median or the interquartile range\. Therefore, if stability issues are present, the majority of episodes realised by abstractQQ\-learning are unaffected\. Still, stability issues may be present in less than 25% of experiment repetitions\.
\(E2\)Regarding*quality*, the high success rates of episodes over the last 500 episodes suggest that episodes of acceptable quality can be learned with high consistency\. We cannot make statements about the optimality of the learned policies\. However, for the2020\-blocks world, there are still episodes where the policy does not match the worst\-case return of a naive unstack\-stack policy\. This may be caused by the exploration policy, by other factors \(e\.g\., choice ofα\\alpha, more training needed\), or by the abstraction itself\.
\(E3\)For*sample efficiency*, abstractQQ\-learning improves on sample efficiency over concreteQQ\-learning bases on the differences in the mean returns and the percentages of successful episodes over the entire learning processes\.
In conclusion, abstractQQ\-learning reliably produced policies of acceptable quality and, moreover,*in a smaller number of episodes*as compared to concreteQQ\-learning\.
### 6\.1Comparison with Prolog\-based Abstractions
We also developed Prolog encodings \(to be found in the supplementary material\) to contrast the modelling styles encouraged by the two formalisms for the considered relational decision\-making problems\.
The Prolog encoding of the Blocks World CARCASS was relatively simple to obtain, as it is almost identical to the ASP encoding\. The differences in the modelling approaches become more apparent in the Minigrid abstraction, which can be viewed as a shortest\-path problem under relational constraints\. In the ASP encoding, admissible paths and optimality criteria are specified declaratively using choice rules, constraints, and optimisation statements, resulting in a compact encoding that closely follows the problem description\. In the Prolog encoding, solving the same abstraction requires an explicit implementation of the underlying search process \(in our case, a breadth\-first search\), leading to a more algorithmic representation in which aspects of control and problem specification are more explicitly represented\. Further details are provided in the supplementary material\.
## 7Related Work
Research on relational reinforcement learning has long explored ways of representing state\-action abstractions in first\-order logic\. Prominent examples include*logical Markov decision programs*\(LOMDPs\)\(Kersting and De Raedt[2004](https://arxiv.org/html/2605.31444#bib.bib89)\), which describe abstract states as logical rules and transitions as probabilistic effects of actions, and*relationalQQ\-learning*\(Morales[2003](https://arxiv.org/html/2605.31444#bib.bib104)\), which groups states and actions into logical templates that define when an abstract action can be applied\. Other extensions include*probabilistic relational models*\(PRMs\)\(Guestrin[2003](https://arxiv.org/html/2605.31444#bib.bib126)\), which decompose value functions across object classes, work on*stochastic policies for relational POMDPs*\(Itoh[2004](https://arxiv.org/html/2605.31444#bib.bib127)\), where policies are expressed as decision lists with probabilities, and a learning classifier system in first\-order logic \(FOXCS\)\(Mellor[2007](https://arxiv.org/html/2605.31444#bib.bib131)\)\. All of these approaches share with*CARCASS*\(van Otterlo[2009](https://arxiv.org/html/2605.31444#bib.bib113)\)the goal of reducing the state explosion problem through logical abstraction, but they differ in their underlying mechanisms: LOMDPs match concrete states against general logical rules,𝑅𝑄\\mathit\{RQ\}\-learning defines partitions of the state\-action space through conjunctions of relations, and PRMs assume additive decompositions over objects\. CARCASS instead represents abstractions as logical rules with background knowledge, implemented in Prolog using SLDNF resolution\. Like CARCASS, FOXCS represents policies via sets of rules, implemented in Prolog, that can generalise over states and actions\. Unlike CARCASS, the head of each rule consists of exactly one abstract action \(not multiple\) and the set of rules is not interpreted as a decision list, but individual rules advocate for their covered actions in a weighted voting process\. Our work continues the CARCASS line of research but replaces Prolog with a fully declarative ASP encoding, which allows richer forms of reasoning, including nonmonotonic inference and optimisation, within the abstraction process\.
Several works also combine*ASP and reinforcement learning*but quite differently\. For instance, relational meta\-policies have been encoded in ASP for hierarchical option selection\(Mitcheneret al\.[2022](https://arxiv.org/html/2605.31444#bib.bib56)\), and other works use ASP to restrict exploration by pruning infeasible actions, e\.g\.\(Ferreiraet al\.[2017](https://arxiv.org/html/2605.31444#bib.bib46)\)\. ASP was also used to encode entire relational MDPs, including states and value functions\(Saad[2011](https://arxiv.org/html/2605.31444#bib.bib53)\)\. However, none of these works proposed a CARCASS\-style general ASP\-based representation of*full state\-action abstractions*, which is our core contribution\.
## 8Conclusion
We investigated ASP for modelling state\-action pair abstractions in relational MDPs, building on the CARCASS framework\. We presented a general encoding method and evaluated it on Blocks World and MiniGrid tasks\. In both domains, abstract representations enabledQQ\-learning to learn high\-quality policies consistently with fewer samples than concrete representations, showing its potential as a knowledge\-driven framework for relational RL abstractions that supports elaboration\-tolerant policy learning\. ASP\-based abstractions support generalisation, action restrictions, and problem decomposition by allowing declarative encoding of domain knowledge\. They also support nonmonotonic reasoning, optimisation, and reasoning with incomplete knowledge, providing a flexible alternative to Prolog\-based CARCASS abstractions\.
Future work includes automatic generation and refinement of ASP abstractions\(Saribaturet al\.[2021](https://arxiv.org/html/2605.31444#bib.bib5)\), leveraging nondeterminism and optimisation to guide exploration, and testing transfer of abstract policies to larger or previously unseen tasks\. Additional directions include exploring elaboration\-tolerant design patterns, integrating ASP abstractions with function approximation methods, and evaluating the benefits of symbolic priors for RL in richer relational settings\.
## Acknowledgements
This work has been supported by funding from the Knowledge Foundation \(KK\-stiftelsen\) within the synergy project*Efficient and Trustworthy Industrial AI*\(ETIAI\)\.
## Competing interests
The authors declare none\.
## References
- G\. Brewka, T\. Eiter, and M\. Truszczyński \(2011\)Answer set programming at a glance\.Commun\. ACM54\(12\),pp\. 92–103\.External Links:[Document](https://dx.doi.org/10.1145/2043174.2043195)Cited by:[§1](https://arxiv.org/html/2605.31444#S1.p5.1)\.
- F\. Calimeri, W\. Faber, M\. Gebser, G\. Ianni, R\. Kaminski, T\. Krennwallner, N\. Leone, M\. Maratea, F\. Ricca, and T\. Schaub \(2020\)ASP\-Core\-2 input language format\.Theory Pract\. Log\. Program\.20\(2\),pp\. 294–309\.External Links:[Document](https://dx.doi.org/10.1017/S1471068419000450)Cited by:[§2\.1](https://arxiv.org/html/2605.31444#S2.SS1.p4.2)\.
- M\. Chevalier\-Boisvert, B\. Dai, M\. Towers, R\. de Lazcano, L\. Willems, S\. Lahlou, S\. Pal, P\. S\. Castro, and J\. Terry \(2023\)Minigrid & miniworld: modular & customizable reinforcement learning environments for goal\-oriented tasks\.CoRRabs/2306\.13831\.External Links:[Document](https://dx.doi.org/10.48550/ARXIV.2306.13831),2306\.13831Cited by:[§5](https://arxiv.org/html/2605.31444#S5.p1.1)\.
- L\. A\. Ferreira, R\. A\. C\. Bianchi, P\. E\. Santos, and R\. L\. de Mántaras \(2017\)Answer set programming for non\-stationary markov decision processes\.Appl\. Intell\.47\(4\),pp\. 993–1007\.External Links:[Document](https://dx.doi.org/10.1007/S10489-017-0988-Y)Cited by:[§7](https://arxiv.org/html/2605.31444#S7.p2.1)\.
- C\. E\. Guestrin \(2003\)Planning under uncertainty in complex structured environments\.Stanford University\.Cited by:[§7](https://arxiv.org/html/2605.31444#S7.p1.2)\.
- H\. Itoh \(2004\)Towards learning to learn and plan by relational reinforcement learning\.InProceedings of the Workshop on Relational Reinforcement Learning at the Twenty\-First International Conference on Machine Learning \(RRL at ICML\-2004\),pp\. 34–39\.Cited by:[§7](https://arxiv.org/html/2605.31444#S7.p1.2)\.
- K\. Kersting and L\. De Raedt \(2004\)Logical markov decision programs and the convergence of logical td\(lambda\)\.InInductive Logic Programming, 14th International Conference, ILP 2004, Porto, Portugal, September 6\-8, 2004, Proceedings,R\. Camacho, R\. D\. King, and A\. Srinivasan \(Eds\.\),Lecture Notes in Computer Science, Vol\.3194,pp\. 180–197\.External Links:[Document](https://dx.doi.org/10.1007/978-3-540-30109-7%5F16)Cited by:[§7](https://arxiv.org/html/2605.31444#S7.p1.2)\.
- D\. Mellor \(2007\)A learning classifier system approach to relational reinforcement learning\.InLearning Classifier Systems, 10th International Workshop, IWLCS 2006, Seattle, MA, USA, July 8, 2006 and 11th International Workshop, IWLCS 2007, London, UK, July 8, 2007, Revised Selected Papers,J\. Bacardit, E\. Bernadó\-Mansilla, M\. V\. Butz, T\. Kovacs, X\. Llorà, and K\. Takadama \(Eds\.\),Lecture Notes in Computer Science,pp\. 169–188\.Cited by:[§7](https://arxiv.org/html/2605.31444#S7.p1.2)\.
- L\. Mitchener, D\. Tuckey, M\. Crosby, and A\. Russo \(2022\)Detect, understand, act: A neuro\-symbolic hierarchical reinforcement learning framework\.Mach\. Learn\.111\(4\),pp\. 1523–1549\.External Links:[Document](https://dx.doi.org/10.1007/S10994-022-06142-7)Cited by:[§7](https://arxiv.org/html/2605.31444#S7.p2.1)\.
- V\. Mnih, K\. Kavukcuoglu, D\. Silver, A\. A\. Rusu, J\. Veness, M\. G\. Bellemare, A\. Graves, M\. A\. Riedmiller, A\. Fidjeland, G\. Ostrovski, S\. Petersen, C\. Beattie, A\. Sadik, I\. Antonoglou, H\. King, D\. Kumaran, D\. Wierstra, S\. Legg, and D\. Hassabis \(2015\)Human\-level control through deep reinforcement learning\.Nat\.518\(7540\),pp\. 529–533\.Cited by:[§1](https://arxiv.org/html/2605.31444#S1.p2.1)\.
- E\. F\. Morales \(2003\)Scaling up reinforcement learning with a relational representation\.InProceedings of the Workshop on Adaptability in Multi\-Agent Systems at AORC 2003, Sydney, Australia,Cited by:[§7](https://arxiv.org/html/2605.31444#S7.p1.2)\.
- S\. Nienhuys\-Cheng and R\. de Wolf \(1997\)Foundations of inductive logic programming\.Lecture Notes in Computer Science, Vol\.1228,Springer\.External Links:[Document](https://dx.doi.org/10.1007/3-540-62927-0),ISBN 3\-540\-62927\-0Cited by:[§2\.1](https://arxiv.org/html/2605.31444#S2.SS1.p2.11)\.
- E\. Saad \(2011\)Bridging the gap between reinforcement learning and knowledge representation: A logical off\- and on\-policy framework\.InSymbolic and Quantitative Approaches to Reasoning with Uncertainty \- 11th European Conference, ECSQARU 2011, Belfast, UK, June 29\-July 1, 2011\. Proceedings,W\. Liu \(Ed\.\),LNCS, Vol\.6717,pp\. 472–484\.External Links:[Document](https://dx.doi.org/10.1007/978-3-642-22152-1%5F40)Cited by:[§7](https://arxiv.org/html/2605.31444#S7.p2.1)\.
- Z\. G\. Saribatur, T\. Eiter, and P\. Schüller \(2021\)Abstraction for non\-ground answer set programs\.Artif\. Intell\.300,pp\. 103563\.External Links:[Document](https://dx.doi.org/10.1016/J.ARTINT.2021.103563)Cited by:[§8](https://arxiv.org/html/2605.31444#S8.p2.1)\.
- J\. K\. Slaney and S\. Thiébaux \(2001\)Blocks world revisited\.Artif\. Intell\.125\(1\-2\),pp\. 119–153\.External Links:[Document](https://dx.doi.org/10.1016/S0004-3702%2800%2900079-5)Cited by:[§4](https://arxiv.org/html/2605.31444#S4.p1.2),[§4](https://arxiv.org/html/2605.31444#S4.p5.5),[Example 1](https://arxiv.org/html/2605.31444#Thmexample1.p1.18)\.
- R\. S\. Sutton and A\. G\. Barto \(2018\)Reinforcement learning \- an introduction, 2nd edition\.MIT Press\.Cited by:[§1](https://arxiv.org/html/2605.31444#S1.p1.1),[§1](https://arxiv.org/html/2605.31444#S1.p2.1),[§2\.2](https://arxiv.org/html/2605.31444#S2.SS2.p1.8),[§2\.2](https://arxiv.org/html/2605.31444#S2.SS2.p3.12)\.
- M\. van Otterlo \(2004\)Reinforcement learning for relational MDPs\.InProceedings of the Annual Machine Learning Conference of Belgium and the Netherlands \(BeNeLearn 2004\), January 8\-9, 2004,A\. Nowé, T\. Lenaerts, and K\. Steenhaut \(Eds\.\),pp\. 138–145\.Cited by:[§1](https://arxiv.org/html/2605.31444#S1.p4.3)\.
- M\. van Otterlo \(2005\)A survey of reinforcement learning in relational domains\.Technical reportTechnical Report05\-31,CTIT Technical Report Series,Centre for Telematics and Information Technology \(CTIT\), Univ\. Twente, Enschede, The Netherlands\.Note:Accessed: 2025\-08\-20Cited by:[§1](https://arxiv.org/html/2605.31444#S1.p3.1)\.
- M\. van Otterlo \(2009\)The logic of adaptive behavior: knowledge representation and algorithms for adaptive sequential decision making under uncertainty in first\-order and relational domains\.Vol\.192,IOS Press\.Cited by:[§1](https://arxiv.org/html/2605.31444#S1.p4.6),[§2\.2](https://arxiv.org/html/2605.31444#S2.SS2.p2.17),[§2\.3](https://arxiv.org/html/2605.31444#S2.SS3.p1.1),[§7](https://arxiv.org/html/2605.31444#S7.p1.2),[Table 1](https://arxiv.org/html/2605.31444#T1),[1](https://arxiv.org/html/2605.31444#algorithm1)\.
- C\. J\. C\. H\. Watkins and P\. Dayan \(1992\)Q\-learning\.Machine Learning8\(3\),pp\. 279–292\.External Links:[Document](https://dx.doi.org/10.1007/BF00992698),ISSN 1573\-0565Cited by:[§1](https://arxiv.org/html/2605.31444#S1.p4.6),[§2\.2](https://arxiv.org/html/2605.31444#S2.SS2.p3.12)\.Similar Articles
ARES: Automated Rubric Synthesis for Scalable LLM Reinforcement Learning
ARES proposes a framework for automatically constructing rubric-based RL data from pretraining documents, generating question-answer pairs and weighted rubrics to enable instance-level reward supervision for open-ended LLM responses, outperforming existing methods on multi-dimensional open-ended tasks.
Exploiting Local Dynamics Regularity for Reusable Skills in Offline Hierarchical RL
This paper introduces CARL, a method for offline hierarchical reinforcement learning that exploits local dynamics regularity to learn reusable skills. The approach clusters state-goal pairs requiring similar action sequences, enabling more effective skill reuse and improved performance on complex humanoid tasks.
AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration
AutoResearchClaw is a multi-agent autonomous research system that improves scientific discovery through structured debate, self-healing execution, and human collaboration, outperforming previous systems on the ARC-Bench benchmark by 54.7%.
SPADER: Step-wise Peer Advantage with Diversity-Aware Exploration Rewards for Multi-Answer Question Answering
This paper presents SPADER, a reinforcement learning framework for multi-answer QA that uses step-wise peer advantage for credit assignment and diversity-aware exploration rewards to improve recall of long-tail entities, achieving better performance on several benchmarks.
AI from concrete to abstract: demystifying artificial intelligence to the general public
This paper presents AIcon2abs, a methodology combining visual programming and WiSARD weightless neural networks to help general audiences, including children, understand AI concepts through hands-on learning activities. The approach integrates training and classification as first-class programming constructs to make the distinction between learning machines and conventional programs more intuitive.