Neuro-Symbolic Injection of LTLf Constraints in Autoregressive Reinforcement Learning Policies

arXiv cs.AI 06/09/26, 04:00 AM Papers
reinforcement-learning neurosymbolic linear-temporal-logic safety offline-rl constraints transformer
Summary
Introduces a neurosymbolic framework that injects LTLf constraints into transformer-based reinforcement learning policies via differentiable automaton representations and a logic-based loss, improving constraint satisfaction while maintaining competitive returns.
arXiv:2606.08312v1 Announce Type: new
# Neuro-Symbolic Injection of LTLf Constraints in Autoregressive Reinforcement Learning Policies
Source: [https://arxiv.org/html/2606.08312](https://arxiv.org/html/2606.08312)
Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Joint Workshop on Statistics and Knowledge Integration for Logic, Learning, Ethical Decisions, and LLMs (SKILLED-LLMs 2026)

[orcid=0009-0005-9247-1786, email=ansarifard.1970082@studenti.uniroma1.it]

[orcid=0009-0004-9547-7115, email=mancanelli@diag.uniroma1.it]

[orcid=0000-0002-5639-6038, email=umili@diag.uniroma1.it]

[orcid=0000-0002-9116-251X, email=patrizi@diag.uniroma1.it]

Matteo Mancanelli, Elena Umili, Fabio Patrizi

\(2022\)

###### Abstract

In this work we study offline reinforcement learning (RL) under temporally extended task constraints expressed in Linear Temporal Logic over finite traces (LTL$_f$). Recently, transformer-based approaches such as Trajectory Transformers and Decision Transformers have been adopted to address RL as a sequence modeling problem. However, these methods optimize purely for reward and do not account for high-level temporal requirements. Here, we introduce a neurosymbolic framework that injects LTL$_f$ background knowledge into such transformer-based RL policies. Our approach compiles LTL$_f$ formulas into deterministic finite automata (DFAs) and integrates them into the learning process through a differentiable representation and a logic-based loss function. In particular, we derive differentiable satisfaction signals from DFA progression and use them as a regularization term during training. The resulting method is architecture-agnostic across different models. We evaluate the proposed framework on navigation environments with specification suites covering combinations of safety and reachability temporal properties. Experimental results show that incorporating background knowledge not only improves constraint satisfaction, but also maintains competitive return compared to vanilla baselines.

###### keywords:

Safe Reinforcement Learning, Neurosymbolic Knowledge Injection, Linear Temporal Logic

## 1 Introduction

Safety is a central concern in reinforcement learning (RL) [sutton1998reinforcement], particularly when agents are deployed in real-world or safety-critical settings such as robotics, autonomous navigation, and decision-support systems, where undesirable behaviors may incur irreversible damage, financial loss, or human harm [safe_rl_survey]. Ensuring that learned policies respect user-defined constraints is therefore crucial for system correctness and reliability. This challenge becomes even more pronounced in offline RL [offline_rl_survey], where policies are learned exclusively from pre-collected data and cannot rely on online interaction to detect and correct unsafe behavior.

Safe decision making often requires constraints that are *temporally extended*, rather than purely local or instantaneous [prob_shielding_AAAI25]. Many requirements cannot be verified from a single state-action pair, but only over entire trajectories. For instance, consider the specification: *"eventually reach the goal while always avoiding hazards"*. This requirement combines two temporal modalities: (i) a *liveness* condition (the goal must be reached at some point before the episode ends), and (ii) a *safety* condition (no hazardous state may be visited at any preceding time step). A policy that briefly steps on a hazard violates the specification, even if it later reaches the goal. A policy that remains safe but never reaches the goal also fails. Satisfaction therefore depends on the *entire trace*, not on local reward maximization alone.

In *offline reinforcement learning (RL)*, enforcing such constraints is particularly challenging. Since no online interaction is allowed, constraint violations cannot be corrected via trial-and-error exploration. Sequence-model-based policies, such as Trajectory Transformers (TTs) [trajectory_transformers] and Decision Transformers (DTs) [decision_transformers], are appealing in this setting because they generate trajectories autoregressively and naturally model long-range dependencies. However, they operate as probabilistic sequence generators, while task and safety requirements are typically expressed as formal logical specifications. This creates a gap between generative modeling and symbolic correctness guarantees.

In this paper, we study *symbolic constraint injection* [UmiliPMAI24,mezini2025neuro] for offline transformer policies using Linear Temporal Logic over finite traces (LTL$_f$) [LTLf] as the specification language. LTL$_f$ provides a natural language for specifying unambiguous temporal requirements over episodic tasks. We compile LTL$_f$ formulas into deterministic finite automata (DFAs) and integrate the resulting automata into the policy pipeline through training-time regularization, via differentiable soft-satisfaction signals derived from automaton progression. Our framework supports different transformer-based RL architectures, such as TT and DT, through a shared interface that abstracts away architectural differences, and it is therefore architecture-agnostic within the class of autoregressive sequence models. It would also be possible to extend our approach with test-time automaton-aware constrained decoding, where generation is restricted to action sequences consistent with the logical specification, but we leave that for future research.

We evaluate the proposed approach on temporally extended tasks in a navigation environment, ColourBomb [prob_shielding_AAAI25]. The considered specifications span invariant safety, reachability, and combined reach-while-safe objectives. The results section is structured to systematically expose the behavior of the method: the effects of logic regularization in TT and DT, cross-benchmark trade-offs between return and logical satisfaction, and runtime overhead. To summarize, we make the following contributions:

- A unified neurosymbolic offline sequence-RL framework for injecting LTL$_f$ constraints into transformer-based policies through automata-based modules that are architecture-agnostic.
- A canonical finite-trace evaluation stack ensuring consistent end-of-episode semantics across (i) dataset serialization, (ii) token-to-symbol mappings, and (iii) logical evaluation, thereby preventing semantic mismatches between model training and temporal-logic assessment.
- An empirical evaluation against diverse temporal specifications in navigation domains, highlighting trade-offs between performance and constraint satisfaction.

## 2 Background

### 2.1 Linear Temporal Logic over finite traces

Linear Temporal Logic (LTL) [LTL] is a formal language that extends traditional propositional logic with modal operators, allowing the specification of rules that must hold *through time*. Here, we use LTL over finite traces (LTL$_f$) [LTLf], a popular variant of LTL which models finite, but length-unbounded, traces of executions, making it suitable for finite-horizon problems. Given a finite set $\Sigma$ of atomic propositions, the set of LTL$_f$ formulas $\varphi$ is inductively defined as follows:

$$\varphi ::= \top \mid \bot \mid a \mid \lnot\varphi \mid \varphi_1 \land \varphi_2 \mid X\varphi \mid \varphi_1 U \varphi_2,$$

where $a \in \Sigma$. We use $\top$ and $\bot$ to denote $\mathit{True}$ and $\mathit{False}$, respectively. $X$ (Strong Next) and $U$ (Until) are temporal operators. Other temporal operators are $N$ (Weak Next) and $R$ (Release), defined as $N\varphi \equiv \neg X \neg\varphi$ and $\varphi_1 R \varphi_2 \equiv \neg(\neg\varphi_1 U \neg\varphi_2)$; $G$ (globally), defined as $G\varphi \equiv \bot R \varphi$; and $F$ (eventually), defined as $F\varphi \equiv \top U \varphi$.

A trace $\boldsymbol{\sigma} = (\sigma_1, \sigma_2, \dots, \sigma_T)$ is a sequence of propositional assignments to the propositions in $\Sigma$, where $\sigma_t \in 2^{\Sigma}$ is the set of all and only propositions that are true at instant $t$. Additionally, $|\boldsymbol{\sigma}| = T$ denotes the length of the trace. Since every trace is finite, $|\boldsymbol{\sigma}| < \infty$; $\epsilon$ denotes the empty trace. Note that, if the propositional symbols in $\Sigma$ are all *mutually exclusive*, i.e., the domain produces exactly one symbol true at each step, then we have $\sigma_t \in \Sigma$. Let $last = |\boldsymbol{\sigma}| - 1$. In the semantics below, instants are counted from 0, so instant $i$ refers to the $(i+1)$-th element of the trace. We inductively define when an LTL$_f$ formula $\varphi$ is true at an instant $i$ (for $0 \leq i \leq last$), written $\boldsymbol{\sigma}, i \models \varphi$, as follows:

- $\boldsymbol{\sigma}, i \models a$ iff $a \in \sigma_i$, for $a \in \Sigma$;
- $\boldsymbol{\sigma}, i \models \neg\varphi$ iff $\boldsymbol{\sigma}, i \not\models \varphi$;
- $\boldsymbol{\sigma}, i \models \varphi_1 \land \varphi_2$ iff $\boldsymbol{\sigma}, i \models \varphi_1$ and $\boldsymbol{\sigma}, i \models \varphi_2$;
- $\boldsymbol{\sigma}, i \models X\varphi$ iff $i < last$ and $\boldsymbol{\sigma}, i+1 \models \varphi$;
- $\boldsymbol{\sigma}, i \models \varphi_1 U \varphi_2$ iff for some $j$ such that $i \leq j \leq last$, we have $\boldsymbol{\sigma}, j \models \varphi_2$, and for all $k$ such that $i \leq k < j$, we have $\boldsymbol{\sigma}, k \models \varphi_1$.

We say that $\boldsymbol{\sigma}$ satisfies $\varphi$, written $\boldsymbol{\sigma} \models \varphi$, if $\boldsymbol{\sigma}, 0 \models \varphi$. A formula $\varphi$ is *satisfiable* if it is true with respect to some trace $\boldsymbol{\sigma}$, and *valid* if it is true with respect to every trace $\boldsymbol{\sigma}$.
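As an illustration of these semantics, the following is a minimal Python sketch of a recursive evaluator, not code from the paper. The tuple-based formula encoding and the names `holds`, `satisfies`, `eventually`, and `globally` are assumptions made for this example, and instants are indexed from 0. Returning `False` for the empty trace is also an assumption of the sketch.

```python
from typing import FrozenSet, Sequence, Tuple, Union

# A formula is an atom name (str) or a tagged tuple (hypothetical encoding):
# ("true",), ("not", f), ("and", f, g), ("X", f), ("U", f, g)
Formula = Union[str, Tuple]
Trace = Sequence[FrozenSet[str]]


def holds(trace: Trace, i: int, phi: Formula) -> bool:
    """Return True iff (trace, i) |= phi, with instants counted from 0."""
    last = len(trace) - 1
    if isinstance(phi, str):                 # atomic proposition
        return phi in trace[i]
    op = phi[0]
    if op == "true":
        return True
    if op == "not":
        return not holds(trace, i, phi[1])
    if op == "and":
        return holds(trace, i, phi[1]) and holds(trace, i, phi[2])
    if op == "X":                            # strong next: a successor must exist
        return i < last and holds(trace, i + 1, phi[1])
    if op == "U":                            # phi[1] must hold until phi[2] does
        for j in range(i, last + 1):
            if holds(trace, j, phi[2]):
                return True
            if not holds(trace, j, phi[1]):
                return False
        return False
    raise ValueError(f"unknown operator: {op!r}")


def eventually(phi: Formula) -> Formula:
    return ("U", ("true",), phi)                   # F phi = true U phi


def globally(phi: Formula) -> Formula:
    return ("not", eventually(("not", phi)))       # G phi = not F not phi


def satisfies(trace: Trace, phi: Formula) -> bool:
    """trace |= phi iff phi holds at instant 0 (empty traces are rejected here)."""
    return len(trace) > 0 and holds(trace, 0, phi)


# "Eventually reach the goal while always avoiding hazards."
spec = ("and", eventually("goal"), globally(("not", "hazard")))
safe_and_reached = [frozenset(), frozenset({"goal"})]
stepped_on_hazard = [frozenset({"hazard"}), frozenset({"goal"})]
never_reached = [frozenset(), frozenset()]
print(satisfies(safe_and_reached, spec))    # True
print(satisfies(stepped_on_hazard, spec))   # False
print(satisfies(never_reached, spec))       # False
```

The three example traces mirror the cases discussed in the introduction: only the trace that reaches the goal without touching a hazard satisfies the specification.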

Any LTL$_f$ formula $\varphi$ can be translated into a Deterministic Finite Automaton (DFA) [LTLf] $A_\varphi = (2^{\Sigma}, Q, q_0, \delta, F)$, where $2^{\Sigma}$ is the automaton alphabet, $Q$ is the finite set of states, $q_0 \in Q$ is the initial state, $\delta: Q \times 2^{\Sigma} \rightarrow Q$ is the transition function, and $F \subseteq Q$ is the set of final states. Additionally, we recursively define the extended transition function over traces $\delta^{*}: Q \times (2^{\Sigma})^{*} \rightarrow Q$ as:

$$\begin{array}{l}\delta^{*}(q, \epsilon) = q \\ \delta^{*}(q, \sigma \cdot \boldsymbol{x}) = \delta^{*}(\delta(q, \sigma), \boldsymbol{x}),\end{array} \tag{1}$$

where $\sigma \in 2^{\Sigma}$ is a symbol and $\boldsymbol{x} \in (2^{\Sigma})^{*}$ is a trace. The automaton accepts the trace $\boldsymbol{\sigma}$ iff $\delta^{*}(q_0, \boldsymbol{\sigma}) \in F$, and in that case we say that $\boldsymbol{\sigma}$ belongs to the language of the automaton, denoted as $L(A_\varphi)$. We have that $\varphi$ and $A_\varphi$ are equivalent because, for any trace $\boldsymbol{\sigma} \in (2^{\Sigma})^{*}$:

$$\boldsymbol{\sigma} \in L(A_\varphi) \iff \boldsymbol{\sigma} \vDash \varphi. \tag{2}$$
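To make the acceptance condition concrete, here is a small hand-written DFA for "F goal and G not hazard", run with an iterative form of the extended transition function. This is an illustrative sketch only: the automaton was written by hand rather than produced by a translation tool, the symbol names are hypothetical, and the propositions are assumed to be mutually exclusive so that each step carries exactly one symbol.

```python
from typing import Dict, Iterable, Set, Tuple

State = int
Symbol = str

# State 0: safe, goal not reached yet; 1: safe, goal reached; 2: hazard visited (sink).
DELTA: Dict[Tuple[State, Symbol], State] = {
    (0, "empty"): 0, (0, "goal"): 1, (0, "hazard"): 2,
    (1, "empty"): 1, (1, "goal"): 1, (1, "hazard"): 2,
    (2, "empty"): 2, (2, "goal"): 2, (2, "hazard"): 2,
}
Q0: State = 0
FINAL: Set[State] = {1}


def delta_star(q: State, trace: Iterable[Symbol]) -> State:
    """Iterative form of the extended transition function: fold delta over the trace."""
    for symbol in trace:
        q = DELTA[(q, symbol)]
    return q


def accepts(trace: Iterable[Symbol]) -> bool:
    """A trace is accepted iff the run from the initial state ends in a final state."""
    return delta_star(Q0, trace) in FINAL


print(accepts(["empty", "goal", "empty"]))   # True
print(accepts(["hazard", "goal"]))           # False
```

Because the hazard state is a sink, a single violation can never be undone by later symbols, which is how the automaton encodes the safety part of the formula.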

### 2.2 Offline Reinforcement Learning

In Reinforcement Learning (RL), the interaction between an agent and the environment is commonly modeled as a Markov Decision Process (MDP) [sutton1998reinforcement], defined by the tuple $(S, A, t, r, \gamma)$ where $S$ is the set of states, $A$ the set of actions, $t: S \times A \times S \rightarrow [0,1]$ the transition function s.t. $\sum_{s' \in S} t(s, a, s') = 1$, $r: S \times A \rightarrow \mathbb{R}$ the reward function, and $\gamma \in [0,1]$ the discount factor. A policy $\pi: S \rightarrow A$ maps states to actions, while the value function $V^{\pi}(s)$ denotes the expected discounted return obtained by following $\pi$ from state $s$. The goal of the RL agent is to learn the optimal policy $\pi^{*}$ maximizing expected discounted return.

Traditional RL methods assume direct interaction with the environment during training. The agent collects experience by executing actions, observing state transitions and rewards, and updating its policy accordingly. This online interaction paradigm enables iterative improvement but may be costly, unsafe, or even infeasible in real-world applications. This motivates the study of offline RL [offline_rl_survey], where the agent is given a fixed dataset $\mathcal{D}$ of trajectories generated by some (unknown) behavior policy, and must learn a policy without additional environment interaction. To prevent issues with out-of-distribution data, many offline RL algorithms incorporate mechanisms that constrain learned policies to remain close to the support of the data distribution [kumar2020conservative].

### 2.3 Reinforcement Learning as Sequence Modeling

Some works [trajectory_transformers,decision_transformers] propose an alternative perspective on RL, viewing it as a sequence modeling problem rather than a value estimation problem. In [trajectory_transformers], a trajectory is treated as a sequence of tokens and an autoregressive model is used to learn the next token in the sequence. Suppose we have $N$-dimensional states and $M$-dimensional actions. A trajectory $\boldsymbol{\tau}$ of horizon $T$ can be represented as

$$\boldsymbol{\tau} = (s_1^1, s_1^2, \ldots, s_1^N, a_1^1, \ldots, a_1^M, r_1, s_2^1, \ldots, r_T) \tag{3}$$

Subscripts on all tokens denote timestep and superscripts on states and actions denote dimension (i.e., $s_t^i$ is the $i^{\text{th}}$ dimension of the state at time $t$). An autoregressive model is capable of estimating the probability of the $t^{\text{th}}$ token $x_t$ given the set of all past tokens $\boldsymbol{x}_{<t}$:

$$P(x_t \mid x_1, x_2, \dots, x_{t-1}) = P(x_t \mid \boldsymbol{x}_{<t}). \tag{4}$$

Instead of learning value functions or policies, one can learn the joint distribution

$$P(\boldsymbol{x}) = \prod_{i=1}^{|\tau|} P(x_i \mid \boldsymbol{x}_{<i}). \tag{5}$$
At each generation step, a token must be *sampled* from the next-token probability distribution. A common way of selecting the next token to feed into the autoregressor is to *greedily* choose it by maximizing the next-token probability at each step, as follows:

$$x_t = \operatorname*{argmax}_{x} P(x_t \mid \boldsymbol{x}_{<t}), \quad \forall t < |\tau|. \tag{6}$$

In general, this greedy search strategy is sub-optimal, because it may not produce the trace that maximizes the joint probability distribution in Equation [5](https://arxiv.org/html/2606.08312#S2.E5). Other sub-optimal sampling strategies commonly used for this task include Beam Search, Random Sampling, and Temperature Sampling [ramamaneiro24].
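The sub-optimality of greedy decoding can be seen on a toy next-token model. The sketch below is purely illustrative: the lookup-table model `MODEL`, its probabilities, and the helper names are invented for this example and do not correspond to any model in the paper.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

# Toy autoregressive model: maps a prefix to a next-token distribution (made-up numbers).
MODEL: Dict[Tuple[str, ...], Dict[str, float]] = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"C": 0.55, "D": 0.45},
    ("B",): {"C": 0.9, "D": 0.1},
}
VOCAB = ("A", "B", "C", "D")


def next_token_probs(prefix: Sequence[str]) -> Dict[str, float]:
    """Return P(. | prefix); prefixes unknown to the toy model get an empty distribution."""
    return MODEL.get(tuple(prefix), {})


def joint_prob(seq: Sequence[str]) -> float:
    """Chain rule: product of the next-token probabilities along the sequence."""
    p = 1.0
    for t in range(len(seq)):
        p *= next_token_probs(seq[:t]).get(seq[t], 0.0)
    return p


def greedy_decode(length: int) -> List[str]:
    """Pick the most likely next token at every step."""
    seq: List[str] = []
    for _ in range(length):
        probs = next_token_probs(seq)
        seq.append(max(probs, key=probs.get))
    return seq


def exhaustive_decode(length: int) -> Tuple[str, ...]:
    """Enumerate every sequence and keep the one with the highest joint probability."""
    return max(product(VOCAB, repeat=length), key=joint_prob)


greedy = greedy_decode(2)
best = exhaustive_decode(2)
print(greedy, joint_prob(greedy))
print(best, joint_prob(best))
```

Here greedy decoding commits to `A` because it is the most likely first token, but the sequence starting with `B` has the higher joint probability (about 0.36 against about 0.33). Exhaustive enumeration is only feasible because the toy vocabulary and horizon are tiny.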

Trajectory Transformer (TT) models are transformer-based autoregressive architectures well suited for addressing RL as sequence modeling. In the offline setting, such models can be trained via maximum likelihood using teacher forcing, by maximizing the log-probability of trajectories in the dataset. At inference time, planning can be performed by sampling or beam search, optionally biasing candidate trajectories according to cumulative reward or reward-to-go estimates. This paradigm unifies dynamics modeling, policy learning, and behavior regularization within a single sequence model. Note that this approach focuses exclusively on maximizing return and modeling the data distribution, so the generated trajectories may violate the constraints that emerge from (or that we want to impose on) the domain. This motivates our neurosymbolic framework, which guides the learned trajectory distribution toward constraint satisfaction.

## 3 Related Work

### 3.1 Offline RL as sequence modeling

Offline Reinforcement Learning [offline_rl_survey] is frequently addressed by techniques based on value functions [kumar2020conservative,kostrikov2021offline], goal-conditioned behavior cloning [nair2018visual,ding2019goal], and reward-conditioned behavior cloning [kumar2019reward,srivastava2019training]. As previously explained, recent works show how this problem can be tackled as a sequence modeling problem. In [emmons2021rvs], RL via supervised learning is analyzed, showing that pure supervised learning is sometimes competitive with more complex architectures, given careful tuning of the policy capacity. In contrast, other recent works have adopted Transformers [transformers] as sequence prediction models, providing a promising direction for offline RL that is also flexible enough to be combined with the techniques mentioned above.

In particular, Trajectory Transformers (TTs) [trajectory_transformers] and Decision Transformers (DTs) [decision_transformers] follow this idea by using beam-search planning and reward conditioning, respectively. Notably, these architectures have been further studied in combination with techniques based on Dynamic Programming (such as Q-learning) [yamagata2023q], online fine-tuning [zheng2022online], few-shot policy generalization [xu2022prompting], and automatically generated waypoints [badrinath2023waypoint]. There are also works that do not build directly on TTs or DTs, such as [huang2024diffusion], where Transformers are used to overcome inefficiencies of Diffusion Models.

### 3.2 Safe Reinforcement Learning

Safe RL focuses on methods for ensuring that learned policies satisfy given safety constraints. Online RL techniques [achiam2017constrained,tessler2018reward] typically model safety requirements as constraints on the expected cost, and enforce safety through constrained optimization or Lagrangian formulations during policy learning. Recently, offline Safe RL has also been gaining popularity, including techniques for both soft constraints [xu2022constraints] and hard constraints [zheng2024safe]. Adaptations of DTs and TTs are also studied in the context of offline Safe RL [liu2023constrained,wang2024safe,guo2024temporal,zhang2023saformer], confirming the growing interest in these models in recent years. These works share a common formulation grounded in Constrained MDPs, where safety is enforced by conditioning the policy on cost signals and constraint thresholds. Despite their differences, all these approaches operate by modifying the conditioning signals or relabeling trajectory returns to steer generation toward safe behavior. In contrast, our method leaves the trajectory data and its reward/cost structure untouched and instead regularizes the training objective with a differentiable logic loss derived from DFA progression, injecting temporal-logic knowledge directly into the learning process rather than into the data representation. Recent work [tian2023reinforcement] also addresses RL using TTs and LTL$_f$ constraints, but their approach differs from ours as it relies on reward shaping derived from DFA states to imitate trajectories.

An alternative line of work enforces safety through shielding [alshiekh2018safe,jansen2018safe,belardinelli2025probabilistic], where a runtime monitor prevents the agent from executing actions that may lead to unsafe states. While shielding provides strong safety guarantees, it typically assumes the availability of an explicit environment model, it is primarily designed for on-policy learning, and it is structurally limited to the "safety fragment" of formal specifications. In contrast, our method is well suited for offline RL, requires only knowledge of the constraints without assuming any other knowledge of the environment, and can be used to impose any LTL$_f$ formula as a constraint, including those outside the safety fragment.

### 3.3 Temporally-constrained sequence generation

Recently, there has been increasing interest in constraining autoregressive sequence generation through logical knowledge. Applications span Cyber-Physical Systems [STLnet], Business Process Management [UmiliPMAI24,DiFrancescomarino17], and natural language generation with Large Language Models (LLMs) [trident,llm_beam_search_1,llm_beam_search_2,llm_sampling1,montecarlo_llm,pseudosemantic_loss]. Most prior work incorporates constraints at *test time*, guiding suffix generation via constrained beam search [DiFrancescomarino17,trident,llm_beam_search_1,llm_beam_search_2], auxiliary tractable models [llm_aux_mod_1,llm_aux_mod_2], or conditioned sampling [llm_sampling1,montecarlo_llm]. In contrast, we enforce constraints at *training time* through an auxiliary logical loss. Only a few approaches follow this direction [STLnet,UmiliPMAI24,pseudosemantic_loss]. Among them, [UmiliPMAI24,pseudosemantic_loss] estimate the probability of satisfying a formula via pseudoprobabilities or Monte Carlo, while STLnet [STLnet] relies on a student-teacher scheme that is difficult to apply in discrete domains [UmiliPMAI24].

A key limitation of existing methods is that they target *homogeneous* sequences (e.g., text), assuming that logical formulas are directly expressed over the generated tokens. While reasonable for language, this assumption is limiting in RL settings. Here, we consider trajectories of states, actions, and rewards (as formalized in Eq. [3](https://arxiv.org/html/2606.08312#S2.E3)), where constraints are naturally specified over abstract, high-level *symbols* that must be grounded in subsequences representing states or state-action pairs. Our approach builds on [UmiliPMAI24,mezini2025neuro], which approximate LTL$_f$ satisfaction via Monte Carlo and use it as a training loss. We extend this framework to Decision and Trajectory Transformers for safe offline RL applications.

## 4 Method

### 4.1 Problem Formulation

In offline RL, the agent is provided with a fixed dataset $\mathcal{D} = \{\boldsymbol{\tau}^{(i)}\}_{i=1}^{N}$ of previously collected trajectories. Each trajectory is a finite sequence of transitions followed by an *end-of-trace* symbol `EOT`, i.e. $\boldsymbol{\tau} = (y^{1}, \ldots, y^{T}, \texttt{EOT})$, where each transition token group $y^{j}$ encodes state, action, reward, and optional environment-specific auxiliary fields (e.g., safety cost). Here, we also consider some prior knowledge about the task, expressed as LTL$_f$ constraints, to capture high-level behavioral requirements. Given an LTL$_f$ formula $\varphi$, our objective is to learn a trajectory generation policy that maximizes both the expected return of the policy and the probability that generated trajectories satisfy $\varphi$. We define atomic propositions through an environment-specific extractor

$$\Pi(y^{t}) \in \{0,1\}^{K}$$

and the LTL$_f$ formula $\varphi$ is defined over these $K$ propositional symbols. Note that, in our experiments, proposition extraction is *state-centric*, i.e. $\varphi$ is defined only over state dimensions, but our approach is fully general and also supports imposing constraints on actions, state-action pairs, and auxiliary terms.

A logic adapter is used to define the authoritative token schema and the unique end token identifier shared by dataset builders, token-to-symbol mapping, DFA rollout, and logic-loss evaluation. Satisfaction evaluation follows the standard LTL$_f$ semantics (see Sec. [2.1](https://arxiv.org/html/2606.08312#S2.SS1)) and is therefore defined on complete finite traces; traces without an end marker are treated as incomplete and unsatisfied. This design avoids inconsistent termination handling across modules and ensures agreement between hard DFA evaluation and differentiable soft evaluation.
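One way to realize the "incomplete means unsatisfied" rule is to extend a DFA with an explicit end-of-trace transition. The sketch below is an assumption-laden illustration, not the paper's logic adapter: the token name `EOT`, the helper `extend_with_eot`, the two extra states, and the choice to reject any token that follows the end marker are all hypothetical.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple

EOT = "EOT"  # hypothetical name for the end-of-trace token

# Base DFA for "F goal": state 0 = goal not seen yet, state 1 = goal seen (final).
SYMBOLS = ("empty", "goal")
BASE_DELTA: Dict[Tuple[Hashable, str], Hashable] = {
    (0, "empty"): 0, (0, "goal"): 1,
    (1, "empty"): 1, (1, "goal"): 1,
}
BASE_FINAL: Set[Hashable] = {1}


def extend_with_eot(delta, final, symbols):
    """Return (delta', final') such that only traces closed by EOT can be accepted."""
    states = {q for q, _ in delta}
    done_ok, done_bad = "done_ok", "done_bad"
    new_delta = dict(delta)
    for q in states:
        # Reading EOT freezes the verdict of the base automaton.
        new_delta[(q, EOT)] = done_ok if q in final else done_bad
    for s in tuple(symbols) + (EOT,):
        # Assumption of this sketch: anything after EOT makes the trace invalid.
        new_delta[(done_ok, s)] = done_bad
        new_delta[(done_bad, s)] = done_bad
    return new_delta, {done_ok}


def accepts(delta, final, trace: Iterable[str], q0: Hashable = 0) -> bool:
    q = q0
    for symbol in trace:
        q = delta[(q, symbol)]
    return q in final


DELTA, FINAL = extend_with_eot(BASE_DELTA, BASE_FINAL, SYMBOLS)
print(accepts(DELTA, FINAL, ["empty", "goal"]))        # False: no end marker
print(accepts(DELTA, FINAL, ["empty", "goal", EOT]))   # True
print(accepts(DELTA, FINAL, ["empty", EOT]))           # False: goal never reached
```

With this construction the original final states are no longer accepting on their own, so a truncated trace is rejected even if its prefix already satisfies the formula.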

As explained in the background, we assume an autoregressive neural model $f_\theta$ with trainable parameters $\theta$ that estimates an approximation $P_\theta$ of the probability of the next token $\tau_t$ given a trace of previous tokens $\boldsymbol{\tau}_{<t}$ (Equation [4](https://arxiv.org/html/2606.08312#S2.E4)):

$$\begin{array}{l}\tilde{y}_t = f_\theta(\boldsymbol{\tau}_{<t}) \\ P(\tau_t = v_i \mid \boldsymbol{\tau}_{<t}) \approx \tilde{y}_t[i],\end{array} \tag{7}$$

where $\tau_t$ denotes the token at position $t$, and $v_i$ is an element of the token vocabulary. Note that we do not make any assumptions about the neural model, except that it can estimate the probability of the next token given a sequence of previous ones. As a result, our approach is entirely *model-agnostic* and can be readily applied to any autoregressive model (provided that the logic adapter is changed consistently). The model parameters are typically trained using a supervised loss $L_{\mathcal{D}}$, evaluated on a dataset $\mathcal{D}$ of ground-truth traces obtained by observing the process. The loss for a trace $\boldsymbol{\tau} \in \mathcal{D}$ of length $T$ is defined as follows:

$$L_{\mathcal{D}}(\boldsymbol{\tau}) = \frac{1}{T} \sum_{t=1}^{T} \text{cross-entropy}(f_\theta(\boldsymbol{\tau}_{<t}), \tau_t). \tag{8}$$
Here, the goal is to maximize the probability $P_{\theta \vDash \varphi}$ that traces $\boldsymbol{\tau} \sim P_\theta$, sampled from the autoregressor, satisfy the specification:

$$P_{\theta \vDash \varphi} = \mathbb{E}_{\boldsymbol{\tau} \sim P_\theta}[\boldsymbol{\tau} \vDash \varphi] = \sum_{\boldsymbol{\tau}} P_\theta(\boldsymbol{\tau}) \, \mathbb{I}\{\boldsymbol{\tau} \vDash \varphi\}. \tag{9}$$

Computing this probability exactly is infeasible, since it would require enumerating all possible traces of maximum length $T$. In [UmiliPMAI24,mezini2025neuro], a differentiable procedure is designed to approximate this probability at each optimization step of the autoregressor. In particular, they introduce a logic loss function $L_\varphi$, and compute the training objective as a linear combination of the logic and the supervised loss:

$$L = \alpha L_\varphi + (1 - \alpha) L_{\mathcal{D}}, \tag{10}$$

with $\alpha$ being a trade-off parameter between 0 and 1 that balances the influence of each loss on the training process. Here we adapt the logic loss computation to offline RL domains and apply it at training time, together with other techniques at test time to maximize adherence to the constraints.
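The training objective can be sketched in a few lines of plain Python. This is a numerical illustration only, not the authors' training code: the function names, the example distributions, and the placeholder value used for $L_\varphi$ are assumptions made for the example, and a real implementation would operate on framework tensors so that gradients can flow.

```python
import math
from typing import Sequence


def supervised_loss(pred_dists: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Mean cross-entropy of the ground-truth tokens under the predicted distributions."""
    losses = [-math.log(dist[target]) for dist, target in zip(pred_dists, targets)]
    return sum(losses) / len(losses)


def combined_loss(logic_loss: float, data_loss: float, alpha: float) -> float:
    """Convex combination alpha * L_phi + (1 - alpha) * L_D."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * logic_loss + (1.0 - alpha) * data_loss


# Two prediction steps over a two-token vocabulary (made-up numbers).
pred = [[0.5, 0.5], [0.25, 0.75]]
targets = [0, 1]
l_data = supervised_loss(pred, targets)
l_logic = 1.2  # placeholder value standing in for the logic loss
for alpha in (0.0, 0.3, 1.0):
    print(alpha, combined_loss(l_logic, l_data, alpha))
```

Setting `alpha = 0` recovers pure maximum-likelihood training, while `alpha = 1` optimizes only for constraint satisfaction and ignores the dataset.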

### 4.2 Automata-Based Satisfaction and Logic Loss

Now, we briefly discuss how to approximate the target probability in Equation [9](https://arxiv.org/html/2606.08312#S4.E9). Specifically, [mezini2025neuro] uses a Monte Carlo estimate, sampling a set of complete traces $\{\boldsymbol{\tau}^{(1)}, \boldsymbol{\tau}^{(2)}, \ldots, \boldsymbol{\tau}^{(N)}\} \sim P_\theta$ according to the distribution learned by the autoregressor, and computes an approximation of the target probability as follows:

$$\hat{P}_{\theta \vDash \varphi} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\{\boldsymbol{\tau}^{(i)} \vDash \varphi\}. \tag{11}$$

Note that the sampled traces $\boldsymbol{\tau}^{(i)}$ are generated by a neural network that returns a probability distribution at each time step. Since neither discrete sampling nor the indicator function is differentiable, the traces are replaced by a probabilistic counterpart $\tilde{\boldsymbol{\tau}}^{(i)}$, where symbols are sampled from that probability distribution. Moreover, the indicator function $\mathbb{I}$ in Equation [11](https://arxiv.org/html/2606.08312#S4.E11) is replaced by a method that computes the (probabilistic) compliance of a probabilistic trace with the knowledge. To achieve this, we can leverage *(i)* Gumbel-Softmax sampling, which generates differentiable, near one-hot suffixes during training, and *(ii)* DeepDFA [deepdfa_ecai2024], which encodes temporal logic properties as a recurrent layer, enabling efficient and differentiable evaluation of logical constraints.
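The hard Monte Carlo estimate can be checked against exact enumeration on a toy model. The sketch below assumes a deliberately simple i.i.d. symbol distribution (the probabilities, horizon, and the specification "G not hazard" are chosen for the example) so that the exact value is known; a real autoregressor would condition each step on the prefix.

```python
import itertools
import random
from typing import Sequence

SYMBOLS = ("empty", "goal", "hazard")
PROBS = (0.7, 0.2, 0.1)  # toy i.i.d. next-symbol distribution (an assumption)
HORIZON = 3


def satisfies(trace: Sequence[str]) -> bool:
    """Hard check of 'G not hazard' on a complete trace."""
    return "hazard" not in trace


def exact_probability() -> float:
    """Sum P(trace) over all satisfying traces; feasible only for tiny horizons."""
    total = 0.0
    for indices in itertools.product(range(len(SYMBOLS)), repeat=HORIZON):
        p = 1.0
        for idx in indices:
            p *= PROBS[idx]
        if satisfies([SYMBOLS[idx] for idx in indices]):
            total += p
    return total


def monte_carlo_estimate(n_samples: int, seed: int = 0) -> float:
    """Fraction of sampled traces that satisfy the specification."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        trace = rng.choices(SYMBOLS, weights=PROBS, k=HORIZON)
        hits += satisfies(trace)
    return hits / n_samples


print(exact_probability())          # 0.9 ** 3, up to floating-point rounding
print(monte_carlo_estimate(20000))  # a sample-based approximation of the same value
```

Enumeration grows exponentially with the horizon and the vocabulary size, which is why sampling is used instead. The estimate above is not differentiable with respect to model parameters, which motivates the relaxations described next.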

#### Gumbel-Softmax sampling.

To sample from the probability distribution returned by a neural network while maintaining differentiability, we can use Gumbel-Softmax sampling, which produces approximations of discrete symbols as one-hot-like vectors. Given the predicted token distribution $\tilde{y}_t$, we get

$$\tilde{x}_t = \text{softmax}\left(\frac{\log(\tilde{y}_t) + G}{temp}\right) \tag{12}$$

where $G$ is a Gumbel noise vector and $temp$ is a temperature parameter controlling the sharpness of the distribution. These soft tokens are then fed into the differentiable DFA, which computes the probability that the sampled trajectory satisfies the LTL$_f$ specification. As $temp \rightarrow 0$, the output approaches a discrete one-hot vector, while for $temp = 1$, it remains close to the original continuous probabilities in $\tilde{y}_t$. Since the next token is only *probabilistically* grounded, we denote it as $\tilde{\tau}_t$.
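A dependency-free sketch of this sampling step is given below. It only illustrates the forward computation: the function names and the clamping constant are assumptions of the example, and in practice one would use a tensor library so that gradients propagate through the soft sample.

```python
import math
import random
from typing import List, Sequence

EPS = 1e-12  # clamp to avoid log(0); an implementation detail of this sketch


def sample_gumbel(rng: random.Random) -> float:
    """Draw standard Gumbel noise via the inverse-CDF transform -log(-log(U))."""
    u = min(max(rng.random(), EPS), 1.0 - EPS)
    return -math.log(-math.log(u))


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax(probs: Sequence[float], temp: float, rng: random.Random) -> List[float]:
    """Soft, one-hot-like sample: softmax((log(probs) + G) / temp)."""
    perturbed = [(math.log(max(p, EPS)) + sample_gumbel(rng)) / temp for p in probs]
    return softmax(perturbed)


rng = random.Random(0)
token_dist = [0.7, 0.2, 0.1]
print(gumbel_softmax(token_dist, temp=1.0, rng=rng))   # smoother sample
print(gumbel_softmax(token_dist, temp=0.05, rng=rng))  # typically much closer to one-hot
```

The index of the largest entry does not depend on the temperature, and it is distributed according to `token_dist` (the Gumbel-max trick); the temperature only controls how sharply the remaining mass is suppressed.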

#### DeepDFA\.

As explained in Sec. [2.1](https://arxiv.org/html/2606.08312#S2.SS1), each $\textsc{LTL}_f$ constraint $\varphi$ can be compiled into an equivalent DFA $A_{\varphi}$. We extend the DFA to handle the special EOT symbol, such that only traces including EOT are accepted. There exist automatic translation tools, such as ltlf2DFA [fuggitti-ltlf2dfa], that can be used off-the-shelf. Many alternatives exist in the literature to evaluate continuous satisfaction of temporal constraints over sequences of soft (probabilistic or fuzzy) symbol assignments [umili_kr23, deepdfa_ecai2024, DonadelloFIMM25, NesyA]. In our work we used DeepDFA [deepdfa_ecai2024], following what has been done in [UmiliPMAI24, mezini2025neuro]. DeepDFA is a neural, probabilistic relaxation of a standard DFA, where the automaton is represented in matrix form and the input symbols, states, and outputs are probabilistically grounded. This allows us to compute the probabilistic compliance $P_{\mathrm{DDFA}}(\tilde{\boldsymbol{\tau}}^{(i)}\vDash\varphi)$ of a sampled trace $\tilde{\boldsymbol{\tau}}^{(i)}$ with the knowledge $\varphi$. Therefore, Equation [11](https://arxiv.org/html/2606.08312#S4.E11) becomes:

$$\hat{P}_{\theta\vDash\varphi}=\frac{1}{N}\sum_{i=1}^{N}P_{\mathrm{DDFA}}(\tilde{\boldsymbol{\tau}}^{(i)}\vDash\varphi). \tag{13}$$

Finally, we are able to compute the logic loss $L_{\varphi}$, which enforces the satisfaction of prior knowledge over entire traces:

$$L_{\varphi}=-\log\left(\hat{P}_{\theta\vDash\varphi}\right). \tag{14}$$
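To make Equations 13 and 14 concrete, the sketch below propagates a belief over DFA states through a sequence of soft symbols and then takes the negative log of the average acceptance probability. It is a simplified stand-in written in the spirit of a probabilistic DFA, not the DeepDFA implementation. The three-symbol alphabet, the state numbering, the transition table `DELTA`, and the example traces are all assumptions chosen for illustration, using the safety formula $G(\neg bomb)$ extended with EOT.

```python
import math

# Hypothetical alphabet: index 0 = safe step, 1 = bomb, 2 = EOT.
# Hypothetical DFA for G(!bomb) extended with EOT.
# States: 0 = safe so far (initial), 1 = violated (sink), 2 = accepted after EOT.
# DELTA[state][symbol] gives the next state.
DELTA = [
    [0, 1, 2],
    [1, 1, 1],
    [2, 2, 2],
]
ACCEPTING = {2}

def soft_acceptance(soft_trace):
    """Probability mass on accepting states after reading a trace of soft symbols."""
    belief = [1.0, 0.0, 0.0]                  # all mass on the initial state
    for x in soft_trace:                       # x is a distribution over symbols
        nxt = [0.0] * len(DELTA)
        for q, mass in enumerate(belief):
            for a, p in enumerate(x):
                nxt[DELTA[q][a]] += mass * p   # move mass along transition (q, a)
        belief = nxt
    return sum(belief[q] for q in ACCEPTING)

def logic_loss(soft_traces):
    """Negative log of the mean acceptance probability (Eqs. 13 and 14)."""
    p_hat = sum(soft_acceptance(t) for t in soft_traces) / len(soft_traces)
    return -math.log(p_hat + 1e-12)            # epsilon avoids log(0)

# Two made-up traces: one mostly safe, one that surely hits a bomb.
trace_a = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [0.0, 0.0, 1.0]]
trace_b = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(soft_acceptance(trace_a), soft_acceptance(trace_b))
print(logic_loss([trace_a, trace_b]))
```

For `trace_a` the acceptance probability is the product of the per-step safe probabilities (0.9 × 0.8), while `trace_b` receives zero because all mass ends in the sink state. The loss decreases as the model shifts probability mass away from bomb symbols.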

## 5Experimental Evaluation

To evaluate the proposed framework, we conduct experiments (the code is publicly available at https://github.com/ashkanans/nesy_rl_test) on a controlled navigation benchmark designed to simulate safety-critical decision making under temporally extended constraints. The experiments focus on understanding how the proposed logic regularization influences the behavior of transformer-based policies and how different levels of logical guidance affect the trade-off between task performance and constraint satisfaction.

### Environment

We evaluate our approach on the ColourBomb environment [prob_shielding_AAAI25], a grid-based navigation domain illustrated in Figure [1](https://arxiv.org/html/2606.08312#S5.F1). The environment contains:

- Start cell (S): the initial location of the agent.
- Goal cells (P/Y/U): terminal states (identified by different colours) that provide positive reward and end the episode.
- Bomb cells (B): hazardous states that terminate the episode with negative reward.
- Walls (W): non-traversable cells that block movement.

![Refer to caption](https://arxiv.org/html/2606.08312v1/figures/colourbomb/cb_environment_layout.png)

Figure 1: ColourBomb environment structure.

The agent must navigate from the start location toward one of the goal cells while avoiding bombs and obstacles. The environment is episodic with a finite horizon. Each step incurs a small penalty to encourage efficient navigation, while reaching a goal yields positive reward and entering a bomb cell yields negative reward. This environment is well suited to testing our framework, since its flexibility enables the construction of different $\textsc{LTL}_f$ constraints with increasing levels of complexity.

### Temporal Specifications

We evaluate the system using two representative families of $\textsc{LTL}_f$ specifications:

- Safety: requires the agent to avoid bomb cells throughout the entire trajectory, $G(\neg bomb)$.
- Reach-while-Safe: requires the agent to eventually reach a goal while always avoiding bombs, $G(\neg bomb)\land F(goal)$.

Note that $F(goal)$ is a liveness property in temporal logic terminology. Note also that, since we assume *(i)* goal is terminal, *(ii)* goal is not a bomb, and *(iii)* traces stop when goal is reached, we could rewrite the constraint as $(\neg bomb)\,U\,goal$, which ensures that the agent remains in safe states until a goal is reached. We show the automata for these two specifications in Fig. [2](https://arxiv.org/html/2606.08312#S5.F2).
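The claimed rewriting can be checked by brute force on short traces. The sketch below is an illustrative check, not part of the paper's code: the label names, the helper functions, and the maximum trace length of four are assumptions. It evaluates both formulas directly on finite label sequences and compares them on every trace in which only the last step may be a goal or a bomb, which reflects the fact that both cell types end the episode.

```python
from itertools import product

# Hypothetical per-step labels; each step carries exactly one of them,
# so a goal cell is never a bomb cell.
LABELS = ("empty", "goal", "bomb")

def globally_safe_and_eventually_goal(trace):
    """G(!bomb) & F(goal) on a finite trace."""
    return all(s != "bomb" for s in trace) and any(s == "goal" for s in trace)

def safe_until_goal(trace):
    """(!bomb) U goal on a finite trace."""
    for s in trace:
        if s == "goal":
            return True       # goal reached with no bomb before it
        if s == "bomb":
            return False      # bomb seen before any goal
    return False              # goal never reached

def stops_at_terminal(trace):
    """Goal and bomb are terminal, so they may only appear as the last step."""
    return all(s == "empty" for s in trace[:-1])

terminal_traces = [
    t for n in range(1, 5) for t in product(LABELS, repeat=n) if stops_at_terminal(t)
]
agree = all(
    globally_safe_and_eventually_goal(t) == safe_until_goal(t) for t in terminal_traces
)
print("formulas agree on terminal traces:", agree)

# Without the termination assumption the two formulas differ.
counterexample = ("goal", "bomb")
print(globally_safe_and_eventually_goal(counterexample), safe_until_goal(counterexample))
```

The counterexample shows why the termination assumption matters: if the trace could continue after a goal, a later bomb would violate $G(\neg bomb)$ while still satisfying the until formula.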

These specifications allow us to analyze the behavior of the method under both purely safety-oriented and mixed safety-and-performance objectives. One could also consider more complex formulas, for instance requiring the agent to visit coloured states in a given order. Notably, our framework goes beyond most papers on Safe RL, allowing the specification of arbitrary temporal constraints. In future work, we aim to explore the use of combined formulas drawn from Declare patterns [declare2006].

![Refer to caption](https://arxiv.org/html/2606.08312v1/figures/safe_dfa2.png)

![Refer to caption](https://arxiv.org/html/2606.08312v1/figures/reach_while_safe_dfa.png)

Figure 2: DFAs for Safety and Reach-while-Safe constraints.

### Methods Compared

We evaluate variants of the Trajectory Transformer (TT) and Decision Transformer (DT) architectures with different levels of logical regularization. We consider vanilla TT and DT as the baseline models, without any logical regularization. This corresponds to setting $\alpha=0$ in Equation [10](https://arxiv.org/html/2606.08312#S4.E10). Then, we impose logic regularization, that is, the training objective combines the standard sequence-prediction loss with the proposed logic loss computed using differentiable DFAs. In order to control the weight of logic regularization and study how it affects return and constraint satisfaction, we vary the trade-off parameter $\alpha\in\{0.01,0.05,0.1,0.2,0.4,0.6,0.8\}$. All models use greedy decoding at inference time to isolate the effect of the logic regularization during training.

### Evaluation Metrics

We evaluate policies using several metrics capturing both task performance and logical compliance:

- Return: the average episodic reward obtained by the agent.
- Satisfaction Rate: the fraction of trajectories that satisfy the $\textsc{LTL}_f$ specification when evaluated using the compiled DFA.
- Goal Rate: the proportion of episodes in which the agent successfully reaches a goal state.
- Bomb Hit Rate: the fraction of trajectories that terminate by entering a bomb cell.
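The four metrics above can be computed from logged episodes along the following lines. This is a sketch under assumed conventions rather than the evaluation script used in the paper: the episode format (a dictionary with `rewards` and `labels`), the reward values, and the crisp `satisfies_safety` checker standing in for the compiled DFA are all hypothetical.

```python
def satisfies_safety(labels):
    """Crisp check of G(!bomb); stands in for running the compiled DFA."""
    return all(s != "bomb" for s in labels)

def evaluate(episodes, satisfies):
    """Aggregate return, satisfaction rate, goal rate and bomb-hit rate."""
    n = len(episodes)
    return {
        # Return: episodic reward (sum over steps), averaged over episodes.
        "return": sum(sum(ep["rewards"]) for ep in episodes) / n,
        # Satisfaction rate: fraction of episodes accepted by the checker.
        "satisfaction_rate": sum(satisfies(ep["labels"]) for ep in episodes) / n,
        # Goal rate: fraction of episodes that reach a goal state.
        "goal_rate": sum("goal" in ep["labels"] for ep in episodes) / n,
        # Bomb-hit rate: fraction of episodes that end in a bomb cell.
        "bomb_hit_rate": sum(ep["labels"][-1] == "bomb" for ep in episodes) / n,
    }

# Three made-up episodes with made-up rewards (not values from the paper).
episodes = [
    {"labels": ["empty", "goal"], "rewards": [-0.1, 1.0]},
    {"labels": ["empty", "bomb"], "rewards": [-0.1, -1.0]},
    {"labels": ["empty", "empty"], "rewards": [-0.1, -0.1]},
]
metrics = evaluate(episodes, satisfies_safety)
print(metrics)
```

Note that under the safety specification an episode that times out without hitting a bomb counts as satisfied but not as a goal, which is why satisfaction rate and goal rate can diverge.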

### Experiments on Decision Transformers

Here we report results obtained by adding the logic loss to Decision Transformers\. The goal of these experiments is to assess the effectiveness of logic regularization on both safety and task performance\.

#### Invariant Safety\.

We report Decision Transformer (DT) alpha-sweep results with greedy decoding. For the invariant safety specification $G(\neg bomb)$, low-to-mid regularization strengths ($\alpha\leq 0.4$) do not improve behavior over vanilla DT: the agent still hits bombs and never reaches the goal. At higher regularization, both $\alpha=0.6$ and $\alpha=0.8$ achieve full safety satisfaction and zero bomb-hit rate. However, $\alpha=0.6$ is overly conservative (goal rate $=0$), while $\alpha=0.8$ preserves safety and also reaches the goal, yielding the strongest return.

![Refer to caption](https://arxiv.org/html/2606.08312v1/figures/colourbomb/cb_dt_alpha_sweep_avoid_bombs.png)

Figure 3: DT results for the invariant safety constraint as a function of $\alpha$. Plots report the mean and standard deviation over three runs.

#### Reach\-while\-Safe\.

For the conjunctive specification $G(\neg bomb)\land F(goal)$, the setting is harder because safety and goal achievement must hold together. As in the invariant case, $\alpha\leq 0.4$ does not change the baseline failure mode. At $\alpha=0.6$, DT removes bomb hits but still fails to satisfy the full formula because it never reaches the goal. Only $\alpha=0.8$ jointly achieves safety and goal attainment, giving the best satisfaction and return. These results are consistent with the findings in the invariant safety case.

![Refer to caption](https://arxiv.org/html/2606.08312v1/figures/colourbomb/cb_dt_alpha_sweep_reach_goal_while_safe.png)

Figure 4: DT results for the Reach-while-Safe constraint as a function of $\alpha$. Plots report the mean and standard deviation over three runs.

#### Quantitative Summary\.

Table [1](https://arxiv.org/html/2606.08312#S5.T1) highlights a sharp transition in DT behavior under logic regularization. While vanilla DT completely fails, yielding zero satisfaction and always hitting bomb states, high logical regularization ($\alpha=0.8$) significantly improves the behavior of the model. In this setting, the policy simultaneously achieves perfect satisfaction, zero bomb-hit rate, and consistent goal reaching across both specifications, raising the return from -1.460 to -0.780.

| Spec | Setting | Satisfaction | Goal | Bomb Hit | Return |
|---|---|---|---|---|---|
| safety | vanilla ($\alpha=0$) | 0.000 | 0.000 | 1.000 | -1.460 |
| safety | logic ($\alpha=0.8$) | 1.000 | 1.000 | 0.000 | -0.780 |
| reach-while-safe | vanilla ($\alpha=0$) | 0.000 | 0.000 | 1.000 | -1.460 |
| reach-while-safe | logic ($\alpha=0.8$) | 1.000 | 1.000 | 0.000 | -0.780 |

Table 1: Decision Transformer key results.

### Experiments on Trajectory Transformers

We now present preliminary results obtained by our implementation based on Trajectory Transformers\.

#### Invariant Safety\.

We first analyze the invariant safety specification $G(\neg bomb)$. Figure [5](https://arxiv.org/html/2606.08312#S5.F5) shows that small values of $\alpha$ do not improve safety compared to the baseline, while larger values of the logic regularization provide measurable benefits. Setting $\alpha=0.8$ achieves the highest satisfaction rate. In general, the logic-regularized policy reduces the probability of entering bomb cells and improves overall compliance with the specification. At the same time, it does not degrade task performance, for which our method maintains comparable results. Notably, setting $\alpha=0.8$ also improves the return with respect to the baseline.

![Refer to caption](https://arxiv.org/html/2606.08312v1/figures/colourbomb/cb_alpha_sweep_avoid_bombs.png)

Figure 5: Results under the safety constraint as a function of the loss weighting parameter $\alpha$. Plots report the mean and standard deviation over three runs.

#### Reach\-while\-Safe\.

We next include the goal state in the $\textsc{LTL}_f$ constraint through the specification $G(\neg bomb)\land F(goal)$, which requires the agent to reach a goal while remaining safe throughout the trajectory. Figure [6](https://arxiv.org/html/2606.08312#S5.F6) summarizes the results for this specification. As expected, satisfying the constraint becomes significantly harder, and the satisfaction rate is substantially lower than in the invariant safety case. Nevertheless, our approach outperforms the baseline for all values of $\alpha\geq 0.1$, with $\alpha=0.1$ providing the best results. This shows that logic regularization still leads to improvements for more complex specifications. It is worth highlighting that, also in this case, the return associated with values $\alpha\geq 0.1$ is slightly better than what we observe for the baseline.

![Refer to caption](https://arxiv.org/html/2606.08312v1/figures/colourbomb/cb_alpha_sweep_reach_goal_while_safe.png)

Figure 6: Results under the Reach-while-Safe constraint as a function of the loss weighting parameter $\alpha$. Plots report the mean and standard deviation over three runs.

#### Cross\-Specification Comparison\.

Table [2](https://arxiv.org/html/2606.08312#S5.T2) compares the best configurations for each specification. The invariant safety objective can be improved primarily by reducing bomb-hit events, which directly increases the probability of satisfying the safety constraint. In contrast, the Reach-while-Safe objective remains limited by the difficulty of achieving the goal while maintaining safety simultaneously. In many trajectories, the agent avoids bombs successfully but fails to reach the goal within the episode horizon. This result highlights the intrinsic trade-off between safety and task completion in environments where safe paths towards the goal may be longer or harder to discover.

#### Quantitative Summary\.

Table [2](https://arxiv.org/html/2606.08312#S5.T2) reports representative configurations illustrating the trade-offs observed in the experiments. For the invariant safety task, the logic-regularized model with $\alpha=0.8$ increases the satisfaction rate, the goal rate, and the return. For the Reach-while-Safe specification, the best configuration is $\alpha=0.1$, where our approach again improves both the satisfaction rate and the return. It is worth noting that including the reachability of the goal state in the trace constraint improves the goal rate itself (from 0.018 to 0.047), while slightly degrading the bomb-hit rate (from 0.647 to 0.650). These results confirm that incorporating logical knowledge into the training objective can improve safety-related metrics while preserving competitive performance in both DT and TT. In both cases, tuning the $\alpha$ parameter remains crucial for achieving high performance. However, compared to DT, the improvements observed in the TT experiments are more moderate. We leave to future work a deeper investigation into the reasons behind this difference.

| Spec | Setting | Satisfaction | Goal | Bomb Hit | Return |
|---|---|---|---|---|---|
| safety | vanilla ($\alpha=0$) | 0.353 | 0.018 | 0.647 | -0.929 |
| safety | logic ($\alpha=0.8$) | 0.499 | 0.027 | 0.470 | -0.792 |
| reach-while-safe | vanilla ($\alpha=0$) | 0.018 | 0.018 | 0.647 | -0.929 |
| reach-while-safe | logic ($\alpha=0.1$) | 0.047 | 0.047 | 0.650 | -0.903 |

Table 2: Trajectory Transformer key results.

## 6Conclusion

We presented a neurosymbolic framework for injecting $\textsc{LTL}_f$ constraints into transformer-based policies for offline RL. Our approach aims at constructing automaton-aware decision mechanisms that guide trajectory generation, and it relies on two features: a differentiable representation of DFAs and a differentiable logic loss that regularizes the training objective. The proposed framework is architecture-agnostic and can be applied to different models such as Trajectory Transformers and Decision Transformers. Unlike related popular approaches, we do not need to engineer the underlying reward function, nor do we need to restrict ourselves to a small subset of temporal specifications. Preliminary experiments on the ColourBomb navigation benchmark show that logic regularization can improve constraint satisfaction while maintaining competitive task performance. These findings suggest that integrating symbolic background knowledge into offline sequence-based RL is a promising direction for improving policy reliability in safety-critical domains. Future work will extend the empirical evaluation to additional benchmarks, model architectures, temporal specifications, and decoding mechanisms. It would also be valuable to perform an extensive end-to-end comparison with well-established frameworks for Safe RL.

###### Acknowledgements\.

This work has been supported by the PNRR MUR project FAIR (No. PE0000013), and the Italian National Ph.D. on Artificial Intelligence at Sapienza University of Rome.

## Declaration on Generative AI

During the preparation of this work, the authors used ChatGPT and Claude in order to: Improve writing style\. After using these services, the authors reviewed and edited the content as needed and take full responsibility for the publication’s content\.

## References