Runtime Monitoring of Perception-Based Autonomous Systems via Embedding Temporal Logic

arXiv cs.LG 05/14/26, 04:00 AM Papers
Summary
This paper proposes Embedding Temporal Logic (ETL), a temporal logic that monitors perception-based autonomous systems directly in learned embedding spaces, enabling specification of high-level perceptual concepts and achieving strong empirical agreement with ground-truth semantics.
arXiv:2605.12651v1
# Runtime Monitoring of Perception-Based Autonomous Systems via Embedding Temporal Logic
Source: [https://arxiv.org/html/2605.12651](https://arxiv.org/html/2605.12651)
Parv Kapoor (Software and Societal Systems Department, Carnegie Mellon University, parvk@andrew.cmu.edu); Abigail Hammer∗ (Software and Societal Systems Department, Carnegie Mellon University, arhammer@andrew.cmu.edu); Ashish Kapoor (Scaled Foundations, ashish@generalrobotics.company); Karen Leung (Aeronautics and Astronautics Department, University of Washington, kymleung@uw.edu); Eunsuk Kang (Software and Societal Systems Department, Carnegie Mellon University, eunsukk@andrew.cmu.edu)

###### Abstract

Runtime monitoring of autonomous systems traditionally relies on mapping continuous sensor observations to discrete logical propositions defined over low-dimensional state variables. This abstraction breaks down in perception-driven settings, where such mappings require additional learned modules that are often computationally expensive, brittle, and semantically misaligned. In this work, we propose *Embedding Temporal Logic* (ETL), a temporal logic that performs monitoring directly in learned embedding spaces. ETL defines predicates through distances between observed embeddings and target embeddings derived from reference observations. This formulation allows specifications to capture high-level perceptual concepts, such as similarity to visual goals or avoidance of semantic regions, that are difficult or impossible to express using traditional predicates. By composing these predicates with temporal operators, ETL naturally expresses temporally extended and sequential perceptual behaviors. We introduce ETL monitors for evaluating specifications over bounded embedding traces, along with a conformal calibration procedure that provides reliable and safety-oriented predicate evaluation. We evaluate our approach across multiple manipulation environments to show that ETL achieves strong empirical agreement with ground-truth semantics, including accurate monitoring of temporally composed behaviors.

## 1 Introduction

Modern autonomous systems, from self-driving vehicles to robotic manipulators, increasingly rely on learned representations for perception, prediction, and decision making (hafner2020dreamer; tdmpc2; Zhou et al., [2025](https://arxiv.org/html/2605.12651#bib.bib42); kim24openvla; intelligence2025pi06vlalearnsexperience; ye2026worldactionmodelszeroshot). These representations allow autonomous systems to overcome the challenges of operating on explicit state space representations (such as object poses and velocities), which often require state estimation pipelines and auxiliary localization modules. We refer to systems that operate on learned representations as *perception-based systems*. Perception-based systems map high-dimensional sensor streams such as images, video, or lidar to compact latent representations, which are then consumed by downstream policies, planners, or world models (nvidia2026worldsimulationvideofoundation; baniodeh2025scalinglawsmotionforecasting).

A promising approach to achieving high-assurance autonomy involves formally specifying desired properties of a system and applying techniques such as formal verification (verification-survey2019) and runtime monitoring (Maler and Nickovic, [2004](https://arxiv.org/html/2605.12651#bib.bib10)) to check whether the system satisfies these properties (Seshia et al., [2018](https://arxiv.org/html/2605.12651#bib.bib3)). In particular, *runtime monitoring* has attracted considerable interest as it can be deployed *online* to provide rigorous guarantees about the system behavior without incurring the cost of an exhaustive offline analysis. A runtime monitor periodically evaluates the execution of a system and raises an alert when the system exhibits undesirable behavior (Colombo and Pace, [2022](https://arxiv.org/html/2605.12651#bib.bib86)). This ability to provide lightweight but rigorous online assurance has led to successful applications in domains such as autonomous vehicles (schon2026spatiotemporal), drones (gu2023successful), and robotic manipulators (8836933).

Runtime monitoring relies on the availability of *formal specifications* that capture the desired properties of a system. For systems with low-dimensional state representations, specification notations such as signal temporal logic (STL) (Maler and Nickovic, [2004](https://arxiv.org/html/2605.12651#bib.bib10)) provide an expressive formalism to specify behavioral properties over logical *predicates*. Each predicate encodes a condition over a state variable that can be evaluated as true or false at each step of execution, e.g., whether a position, velocity, or force is below or above a given threshold.

However, for perception-based systems, writing such formal specifications over learned representations remains an open challenge (Seshia et al., [2018](https://arxiv.org/html/2605.12651#bib.bib3)). For these systems, low-dimensional state representations are often unavailable or require specialized, ad-hoc perception modules. For example, translating a property such as "the robot is near the obstacle" or "the gripper is holding the object" into predicates requires either an additional classifier or detector, or a handcrafted feature extractor tailored to the task (e.g., to detect whether the concept of the gripper "holding" an object is present in the current scene) (Hekmatnejad et al., [2024](https://arxiv.org/html/2605.12651#bib.bib89)). Adding these modules can introduce new sources of brittleness, calibration errors, and domain dependence into the system. Worse, whenever the vocabulary of concepts for specification evolves (e.g., to also be able to express properties about the gripper "dropping" an object), it may be necessary to augment the existing perception modules or add new ones to support the new concepts. Overall, there is a *fundamental mismatch* between (i) the latent space over which typical perception systems operate and (ii) the low-dimensional state space over which specifications in existing temporal logic notations are expressed.

In this paper, we propose a new approach for formally specifying and monitoring the behavioral properties of perception-based autonomous systems. The key idea is to employ *embeddings*, pretrained vector representations of observations, as a first-class concept in specifications, and express a property in terms of distances between a *target embedding* (an ideal representation of real-world concepts that the system interacts with) and an *observed embedding* (a representation generated by an encoder from a sensor observation during system execution). The key insight is that *pretrained encoders already embed semantic proximity in geometry* (Radford et al., [2021](https://arxiv.org/html/2605.12651#bib.bib11); Oquab et al., [2024](https://arxiv.org/html/2605.12651#bib.bib95)): observations of semantically similar scenes map to nearby vectors in latent space. This makes perceptual properties directly expressible as geometric predicates; for instance, "being near the obstacle" can be represented as "$\|z_t - z_{\mathrm{obstacle}}\|_2$ is small," where $z_{\mathrm{obstacle}}$ is an encoder representation of a reference image of that obstacle and $z_t$ is an encoding of the current scene image. An expressive temporal logic specification can then be constructed by combining multiple embedding-based predicates and used as part of a runtime monitor to ensure that the system satisfies its desired property (e.g., "If the gripper is holding an object, it will not drop the object until it is moved to a deposit box").

Although this idea is conceptually simple, making it useful for formal specification poses multiple challenges\. First, there is the question of how target embeddings are generated: they can come from a reference image, a demonstration, or a set of both, and different choices can induce meaningfully different predicates\. Additionally, embeddings are learned, continuous, and model\-dependent representations, and geometric proximity is not guaranteed to align perfectly with the logical distinctions required for monitoring\. A central challenge, therefore, is to turn embedding\-space similarity into a well\-defined specification primitive: one must decide which geometric relationships correspond to predicate satisfaction, how to calibrate decision thresholds, and how these predicates are composed to create system specifications\. These issues make embedding\-based specifications substantially more challenging than simply reusing learned features inside an existing monitor\.

In this paper, we make the following four contributions: (i) we introduce *Embedding Temporal Logic* (ETL), a temporal logic for specifying perceptual behaviors directly over observations (Section [3.1](https://arxiv.org/html/2605.12651#S3.SS1)); (ii) we formally define Boolean satisfaction semantics over bounded embedding traces, yielding an online monitor for perceptual specifications (Sections [3.1](https://arxiv.org/html/2605.12651#S3.SS1) and [3.2](https://arxiv.org/html/2605.12651#S3.SS2)); (iii) we propose data-driven methods for calibrating embedding predicate thresholds, making them feasible for safety-oriented monitoring (Section [4](https://arxiv.org/html/2605.12651#S4)); and (iv) we evaluate ETL-based monitors across navigation and manipulation domains, showing that they can faithfully monitor atomic and sequential perceptual behaviors across diverse environments (Section [5](https://arxiv.org/html/2605.12651#S5)).

## 2 Background and Related Work

#### Formal Specifications for Robotic Systems

Temporal logics such as Linear Temporal Logic (LTL), STL, and Metric Temporal Logic (MTL) have been used to formally verify complex behaviors in cyber-physical and robotic systems. These logics have been used for trajectory planning (Kress-Gazit et al., [2009](https://arxiv.org/html/2605.12651#bib.bib79); Sun et al., [2022](https://arxiv.org/html/2605.12651#bib.bib63); Leung et al., [2023](https://arxiv.org/html/2605.12651#bib.bib52)), reinforcement learning (Aksaray et al., [2016](https://arxiv.org/html/2605.12651#bib.bib66); Alur et al., [2023](https://arxiv.org/html/2605.12651#bib.bib82); Aloor et al., [2023](https://arxiv.org/html/2605.12651#bib.bib73)), runtime monitoring (Bartocci et al., [2018](https://arxiv.org/html/2605.12651#bib.bib49)), and adaptive control (Raman et al., [2014](https://arxiv.org/html/2605.12651#bib.bib37); Belta and Sadraddini, [2019](https://arxiv.org/html/2605.12651#bib.bib57); Lindemann and Dimarogonas, [2019](https://arxiv.org/html/2605.12651#bib.bib68); Kapoor et al., [2025](https://arxiv.org/html/2605.12651#bib.bib96)). However, they can struggle with systems that rely on ML for perception, where input data can have a variable number of objects in frame and evolving bounding boxes. Recently, Spatiotemporal Perception Logic (STPL) (Hekmatnejad et al., [2024](https://arxiv.org/html/2605.12651#bib.bib89)) was introduced; it combines Timed Quality Temporal Logic (Dokhanchi et al., [2018](https://arxiv.org/html/2605.12651#bib.bib88)) with spatial logic and allows quantification over objects, as well as 2D and 3D spatial reasoning.

#### Pretrained Vision Encoders

Recent advances in representation learning have produced pretrained vision encoders that are sufficiently expressive to serve as general-purpose perceptual representations across a wide range of visual domains (Oquab et al., [2024](https://arxiv.org/html/2605.12651#bib.bib95)). As a result, these models provide a practical foundation for extending temporal logic beyond low-dimensional state space representations. Pretrained vision encoders such as CLIP (Radford et al., [2021](https://arxiv.org/html/2605.12651#bib.bib11)) and DINOv2 (Oquab et al., [2024](https://arxiv.org/html/2605.12651#bib.bib95)) provide a shared embedding space in which perceptual similarity can be measured, making them a natural choice for defining specification predicates over observations.

#### Specification-Based Runtime Monitoring

Given a logical specification that is well-defined over bounded traces and encodes a desired system property, a runtime monitor performs an online evaluation at each time step of an execution to assess whether the execution satisfies the given specification (Maler and Nickovic, [2004](https://arxiv.org/html/2605.12651#bib.bib10); Bartocci et al., [2018](https://arxiv.org/html/2605.12651#bib.bib49)). In practice, runtime monitoring often goes beyond Boolean verdicts and employs quantitative semantics, such as *robustness* measures in temporal logic, which provide a real-valued signal indicating how strongly a trace satisfies or violates the specification (Fainekos and Pappas, [2009](https://arxiv.org/html/2605.12651#bib.bib9)). These quantitative monitors are particularly useful in continuous and stochastic systems, as they enable graded feedback and can be integrated with optimization or control algorithms for real-time decision making.

#### Conformal Prediction

Conformal prediction (10.5555/1062391) is a distribution-free calibration framework that turns a held-out calibration set into a finite-sample statistical guarantee under only an *exchangeability* assumption. Exchangeability means that the joint distribution of the calibration and test samples is invariant to their ordering; in particular, the samples are identically distributed. At a high level, it is a method for calibrating a model so that its predictions on unseen examples come with a reliability guarantee. We employ conformal prediction theory for calibrating thresholds for ETL predicates in Section [4](https://arxiv.org/html/2605.12651#S4).

## 3 Embedding Temporal Logic

![Refer to caption](https://arxiv.org/html/2605.12651v1/x1.png)

Figure 1: Overview of embedding-based runtime monitoring. *Top:* Target observations $O_G = \{o_{g1}, \dots, o_{gn}\}$ and the online observation $o_t$ are encoded by a pretrained vision encoder into target embeddings $z_G$ and the current embedding $z_t$. An embedding predicate then evaluates whether $z_t$ reaches the target set. *Bottom left:* The embedding trajectory is projected onto its first two principal components, showing its evolution relative to the target embeddings. *Bottom right:* Runtime monitoring computes the distance from $z_t$ to $z_G$ over time, thresholds it using $\epsilon$, and evaluates the temporal logic specification.

### 3.1 Syntax and Semantics

In our approach, a perception\-based system is assumed to make an observation about the real world through a sensor \(e\.g\., camera\) at each step in its execution\. This observation is then passed through an encoder that translates the observation into an embedding\. We formally model such a system as an*Embedding Temporal Structure*\.

###### Definition 1 (Embedding Temporal Structure).

An *Embedding Temporal Structure* is a tuple

$$\mathcal{M} \equiv (\mathcal{S}, \mathcal{O}, \mathcal{Z}, \phi_{obs}, \psi_{enc}, D_{\mathcal{Z}}, AP_z),$$

where $\mathcal{S}$, $\mathcal{O}$, and $\mathcal{Z}$ denote the ground-truth state space, the observation space, and the embedding space, respectively; $\phi_{obs}: \mathcal{S} \to \mathcal{O}$ is the observation function that maps ground-truth states to the observations available through the given sensor(s); $\psi_{enc}: \mathcal{O} \to \mathcal{Z}$ is the embedding function that converts an observation into an embedding; $D_{\mathcal{Z}}$ is a set of admissible distance/similarity functions $d: \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}_{\geq 0}$; and $AP_z$ is the set of embedding predicates (Definition [5](https://arxiv.org/html/2605.12651#Thmdefinition5)).

Conceptually, $z$ is an approximation of a latent variable; i.e., the state of the world that is only indirectly observable by the system. The *organization* of latent representations of observations within the embedding space $\mathcal{Z}$ is induced by the encoder $\psi_{enc}$ and the downstream models that subsequently consume this embedding for further tasks (e.g., object identification and prediction) and shape the embedding organization as part of their training. This organization of embeddings in turn has a pronounced effect on the distances between embeddings.

To reason about a system’s temporal behavior, we must consider not only individual states but sequences of states over time\. We therefore define an execution of the system and the corresponding embedding\-space representation induced by the observation and encoder mappings\.

###### Definition 2 (Execution).

An execution of the system $\mathcal{M}$ is a finite or infinite sequence of states $\varsigma = s_0, s_1, s_2, \ldots$ such that each state $s_t$ belongs to the state space $\mathcal{S}$ for every time index $t \in \mathbb{N}$.

Each state in an execution can be mapped through $\phi_{obs}$ and $\psi_{enc}$, yielding a corresponding representation in the embedding space.

###### Definition 3 (Representation Map).

The mapping from the ground-truth states to embeddings is the *representation map* $\eta \equiv \psi_{enc} \circ \phi_{obs}: \mathcal{S} \to \mathcal{Z}$.

###### Definition 4 (Trace).

Given an execution $\varsigma = (s_i)_{i \in \mathbb{N}}$, the associated *trace* is the sequence $\sigma = (z_i)_{i \in \mathbb{N}}$, where $z_i = \eta(s_i)$ for all $i \in \mathbb{N}$.

#### Embedding Predicates

Unlike classical temporal logics where atomic propositions are Boolean predicates over $\mathcal{S}$, ETL predicates are defined over the embedding space $\mathcal{Z}$.

###### Definition 5 (Embedding Predicate).

An *embedding predicate* $ap \in AP_z$ is a tuple $ap \equiv (\mathcal{Z}_{target}, d, \epsilon, \bowtie, a)$ where: $\mathcal{Z}_{target} \subseteq \mathcal{Z}$ is a set of target embeddings; $d \in D_{\mathcal{Z}}$, where $d: \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}_{\geq 0}$, is a distance function; $\epsilon \in \mathbb{R}_{\geq 0}$ is a threshold; $\bowtie$ is a comparison operator (e.g., $\leq$) relating the aggregated distance to the threshold; and $a$ is an aggregation operator (e.g., $\min$, $\max$).

The set of target embeddings identifies sensor inputs that the system should either reach or avoid in order to be considered to have reached a goal or to be safe, respectively. Additional details for specifying target embeddings are discussed in Appendix [A.1](https://arxiv.org/html/2605.12651#A1.SS1).

We now define how an Embedding Predicate can be evaluated\.

###### Definition 6 (Predicate Satisfaction).

For a given embedding predicate $ap \in AP_z$ and embedding $z$, the evaluation of an embedding predicate is defined as:

$$\delta_{ap}(z) = a(\{d(z, z_g) \mid z_g \in \mathcal{Z}_{target}\})$$

Then, the satisfaction of the predicate is the Boolean value determined by $\delta_{ap}(z) \bowtie \epsilon$.
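As a minimal sketch of Definitions 5 and 6, the following Python snippet represents an embedding predicate as a small data class and evaluates it on toy two-dimensional vectors. The class name `EmbeddingPredicate`, its field names, and the helper `l2_distance` are hypothetical choices made for illustration; they are not taken from the authors' artifacts, and real embeddings would come from a pretrained encoder rather than hand-written tuples.

```python
import math
import operator
from dataclasses import dataclass
from typing import Callable, Sequence

Vector = Sequence[float]


def l2_distance(z: Vector, z_g: Vector) -> float:
    """Euclidean distance between two embeddings of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z, z_g)))


@dataclass(frozen=True)
class EmbeddingPredicate:
    """Hypothetical container mirroring the tuple (Z_target, d, eps, cmp, a)."""
    targets: Sequence[Vector]                      # Z_target
    distance: Callable[[Vector, Vector], float]    # d
    epsilon: float                                 # threshold
    compare: Callable[[float, float], bool]        # comparison operator
    aggregate: Callable[[Sequence[float]], float]  # a, e.g. min or max

    def delta(self, z: Vector) -> float:
        """Aggregated distance from z to the target set (delta_ap)."""
        return self.aggregate([self.distance(z, z_g) for z_g in self.targets])

    def holds(self, z: Vector) -> bool:
        """Boolean satisfaction: delta_ap(z) compared against epsilon."""
        return self.compare(self.delta(z), self.epsilon)


# Toy 2-D "embeddings" (an assumption; real encoders output far more dimensions).
near_goal = EmbeddingPredicate(
    targets=[(0.0, 0.0), (1.0, 1.0)],
    distance=l2_distance,
    epsilon=0.5,
    compare=operator.le,
    aggregate=min,
)

print(near_goal.holds((0.9, 1.0)))  # close to the second target
print(near_goal.holds((3.0, 3.0)))  # far from every target
```

Passing `aggregate=max` with `compare=operator.le` would instead require the observation to be close to every target, which illustrates how the aggregation operator changes the meaning of the predicate.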

#### Example 1.

Consider an embedding predicate $ap$ as illustrated in Figure [1](https://arxiv.org/html/2605.12651#S3.F1), where the target images depict a desired visual concept (e.g., object not grasped) and are encoded into a target set $\mathcal{Z}_{target}$. Given the embedding $z_t$ of the current observation, predicate evaluation measures the distance between $z_t$ and the target set. Let $d$ be the L2 distance, $a = \min$, and $\bowtie\ =\ \leq$. Then,

$$\delta_{ap}(z_t) = \min_{z_g \in \mathcal{Z}_{target}} d(z_t, z_g).$$

Using $a = \min$, we ensure that $\delta_{ap}(z_t)$ is the distance from the current observation to the closest target in the embedding space. Then, the predicate is satisfied if $\delta_{ap}(z_t) \leq \epsilon$, meaning that the current observation embedding lies sufficiently close to at least one target embedding, indicating that the desired visual concept is present.

Now that we have defined embedding predicates, we next define the syntax and semantics of ETL to reason over sequences of embeddings over time\.

###### Definition 7 (ETL Syntax).

The *ETL syntax* of formula $\varphi$ is recursively defined as follows:

$$\varphi ::= ap \mid \neg\varphi \mid \varphi_1 \land \varphi_2 \mid \varphi_1 \mathbf{U} \varphi_2$$

As in LTL, the until operator $\mathbf{U}$ can be used to express the *eventually* ($\mathbf{F}$) operator and *always* ($\mathbf{G}$) operator: $\mathbf{F}\varphi = \texttt{True}\ \mathbf{U}\ \varphi$ and $\mathbf{G}\varphi = \neg\mathbf{F}\neg\varphi$.

###### Definition 8 (ETL Semantics).

For a given trace $\sigma$, the *ETL semantics* at timestep $i$ are defined similarly to those of LTL, but over embedding traces:

$$
\begin{aligned}
\sigma, i \models ap &\iff \delta_{ap}(z_i) \bowtie \epsilon \\
\sigma, i \models \neg\varphi &\iff \sigma, i \not\models \varphi \\
\sigma, i \models \varphi_1 \land \varphi_2 &\iff \sigma, i \models \varphi_1 \text{ and } \sigma, i \models \varphi_2 \\
\sigma, i \models \varphi_1 \mathbf{U} \varphi_2 &\iff \exists j \geq i \text{ such that } \sigma, j \models \varphi_2 \text{ and } \forall k \in [i, j),\ \sigma, k \models \varphi_1
\end{aligned}
$$
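The recursive semantics can be sketched as a small interpreter over a finite trace. In the snippet below, the nested-tuple encoding of formulas, the function `sat`, and the helpers `eventually` and `always` are illustrative assumptions rather than the authors' implementation. The existential quantifier in the until case is restricted to indices inside the finite trace, which is one common bounded-trace reading, and the toy trace uses scalar "embeddings" for brevity.

```python
# Formula encoding (an assumption made for this sketch):
#   ("ap", predicate)        predicate maps one embedding to a bool
#   ("not", phi)
#   ("and", phi1, phi2)
#   ("until", phi1, phi2)


def sat(formula, trace, i=0):
    """Return True iff `trace, i |= formula` on a finite trace."""
    op = formula[0]
    if op == "ap":
        return formula[1](trace[i])
    if op == "not":
        return not sat(formula[1], trace, i)
    if op == "and":
        return sat(formula[1], trace, i) and sat(formula[2], trace, i)
    if op == "until":
        for j in range(i, len(trace)):
            if sat(formula[2], trace, j):
                return True   # phi2 holds at j and phi1 held on [i, j)
            if not sat(formula[1], trace, j):
                return False  # phi1 broke before phi2 was reached
        return False          # phi2 never held inside the trace
    raise ValueError(f"unknown operator: {op!r}")


TRUE = ("ap", lambda z: True)


def eventually(phi):
    """F phi = True U phi."""
    return ("until", TRUE, phi)


def always(phi):
    """G phi = not F not phi."""
    return ("not", eventually(("not", phi)))


# Toy trace of scalar distances standing in for embeddings.
trace = [0.9, 0.6, 0.2]
near = ("ap", lambda z: z <= 0.3)
far = ("ap", lambda z: z > 0.3)

print(sat(eventually(near), trace))      # the target is reached at index 2
print(sat(always(near), trace))          # not near at indices 0 and 1
print(sat(("until", far, near), trace))  # far holds until near holds
```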

#### Robustness of ETL

Certain types of temporal logics, such as STL (Maler and Nickovic, [2004](https://arxiv.org/html/2605.12651#bib.bib10)), allow for a quantitative notion of satisfaction, called *robustness* (Fainekos and Pappas, [2009](https://arxiv.org/html/2605.12651#bib.bib9)), that represents the degree to which the system satisfies or violates a specification. To enable ETL to be used for quantitative monitoring, we also introduce a notion of robustness for ETL. For the sake of brevity, the formal semantics for robustness are provided in Appendix [B](https://arxiv.org/html/2605.12651#A2).

#### Example 2.

Reusing the embedding predicate $ap$ as defined in Example 1, consider the ETL specification $\varphi = \mathbf{G}(ap)$, which requires that the visual concept encoded by predicate $ap$ is always observed. Suppose the system generates the embedding trace $\sigma = z_{12}, z_{13}, z_{14}, z_{15}$, with minimum distances to $\mathcal{Z}_{target}$ as $[\delta_{ap}(z_t)]_{t=12}^{15} = [0.327,\ 0.374,\ 0.403,\ 0.427]$.

Using the robustness semantics for predicates, $\rho(ap, \sigma, t, 3) = \epsilon - \delta_{ap}(z_t)$, and setting $\epsilon = \epsilon_g = 0.409$, the predicate robustness at each timestep is $[0.082,\ 0.035,\ 0.006,\ -0.019]$. The robustness of the temporal specification is then $\rho(\varphi, \sigma, 0, 3) = \min_{t \in [0,3]} \rho(ap, \sigma, t, 3) = -0.019$, where window positions $t \in [0,3]$ correspond to frames 12 through 15. Intuitively, although the system remains close to the object not being grasped for most of the window, the specification is violated because at timestep $15$ the embedding moves outside the threshold $\epsilon$, corresponding to the frame $t = 15$ in Figure [1](https://arxiv.org/html/2605.12651#S3.F1) where the object is grasped.
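The arithmetic in Example 2 can be reproduced with a few lines of Python. This is only a sketch: the function names are hypothetical, and the min-based robustness of $\mathbf{G}$ follows the formula quoted in the example rather than the full semantics of Appendix B. With the rounded distances printed in the example, the last margin evaluates to about $-0.018$ rather than the reported $-0.019$; the gap is presumably a rounding effect in the reported values, and the sign (and therefore the verdict) is unaffected.

```python
def predicate_robustness(delta: float, epsilon: float) -> float:
    """Robustness of a `<=` predicate: positive iff delta is within epsilon."""
    return epsilon - delta


def always_robustness(deltas, epsilon):
    """Robustness of G(ap) over a finite window: the worst-case margin."""
    return min(predicate_robustness(d, epsilon) for d in deltas)


deltas = [0.327, 0.374, 0.403, 0.427]  # frames 12..15 from Example 2
epsilon = 0.409

margins = [predicate_robustness(d, epsilon) for d in deltas]
rho = always_robustness(deltas, epsilon)

# Map window positions back to frame numbers (the window starts at frame 12).
violating_frames = [12 + k for k, m in enumerate(margins) if m < 0]

print([round(m, 3) for m in margins])
print(round(rho, 3), violating_frames)
```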

### 3.2 ETL Specification Monitors

An ETL monitor evaluates the satisfaction of a specification over an embedding trace for a given system\. ETL monitors are defined only over the finite trace of a system in order to be evaluated during the execution of the system, and thus are only defined for specifications that are well\-defined over finite traces\.

###### Definition 9 (Finite Trace).

For a given trace $\sigma$, a *finite trace* $\sigma_{\leq t}$ is a trace consisting of the first $t+1$ embeddings of $\sigma$.

We refer to a finite trace as a trace when clear from context\. An ETL monitor is formally defined over a finite trace of embeddings by the following definition\.

###### Definition 10 (ETL Monitor).

For a trace $\sigma_{\leq t}$, system $\mathcal{M}$, and specification $\varphi$, an *ETL Monitor* is $M_{\varphi}(\sigma_{\leq t}) = (r^{\varphi}_0, \ldots, r^{\varphi}_t) \in \{-1, +1\}^{t+1}$, where $r^{\varphi}_i = \operatorname{sgn}(\rho(\varphi, \sigma_{\leq i}, 0, i))$ for each $0 \leq i \leq t$, with $\operatorname{sgn}(x) = +1$ if $x \geq 0$ and $-1$ otherwise.

Intuitively, when the monitor is positive for a timestep, the system satisfies the ETL specification at that timestep. However, when the robustness of the system becomes negative at timestep $i$, the monitor raises an alert to indicate that the observed execution violates $\varphi$ over the prefix $[0, i]$.
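A minimal sketch of Definition 10 is shown below. The monitor takes a caller-supplied robustness function over a prefix, because the full robustness semantics live in Appendix B and are not reproduced here; the names `etl_monitor` and `g_ap_robustness` are hypothetical, and the example instantiates the robustness of $\mathbf{G}(ap)$ with the distances from Example 2 (already reduced to scalar $\delta_{ap}$ values).

```python
from typing import Callable, List, Sequence


def sgn(x: float) -> int:
    """Sign convention from Definition 10: +1 if x >= 0, otherwise -1."""
    return 1 if x >= 0 else -1


def etl_monitor(trace: Sequence[float],
                robustness: Callable[[Sequence[float]], float]) -> List[int]:
    """Return (r_0, ..., r_t), one verdict per prefix sigma_{<=i}."""
    return [sgn(robustness(trace[: i + 1])) for i in range(len(trace))]


EPSILON = 0.409


def g_ap_robustness(prefix: Sequence[float]) -> float:
    """Robustness of G(ap) on a prefix of precomputed distances delta_ap."""
    return min(EPSILON - delta for delta in prefix)


distances = [0.327, 0.374, 0.403, 0.427]
verdicts = etl_monitor(distances, g_ap_robustness)
first_alert = verdicts.index(-1) if -1 in verdicts else None

print(verdicts, first_alert)
```

This version recomputes the robustness of every prefix from scratch, which is quadratic in the trace length; for operators such as $\mathbf{G}$ a running minimum would give the same verdicts incrementally.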

#### Semantic Correctness

We call an ETL monitor *semantically correct* iff its verdicts are equivalent to those obtained by evaluating the specification over ground-truth executions. A formal definition of semantic correctness is provided in Appendix [C](https://arxiv.org/html/2605.12651#A3). This notion of semantic correctness provides the basis for evaluating ETL monitors in practice. In our experiments (Section [5](https://arxiv.org/html/2605.12651#S5)), we approximate semantic correctness by comparing the outputs of the ETL monitor against ground-truth monitors derived from state-based specifications.

### 3.3 Constructing ETL Specifications in Practice

Using the formal semantics presented in Section [3.1](https://arxiv.org/html/2605.12651#S3.SS1) requires several design choices for concrete instantiation: one must determine how target embeddings are specified, how observations are mapped into the representation space, which distance function is used to compare embeddings, and how predicate thresholds are calibrated in order to align with the intended semantic concept. In practice, target embeddings should be selected by the engineer in the same format as the sensor input; e.g., for a camera, a target would be provided as a reference image. The distance function selection depends on the training objective of the encoder and ideally should align with the organization of the embedding space. Additional discussion on these design decisions can be found in Appendix [A](https://arxiv.org/html/2605.12651#A1).

## 4 Threshold Calibration for ETL Predicates

The preceding section described how ETL predicates are instantiated in practice through target embeddings, encoders, and distance functions. We now turn to threshold calibration. Recall from Definition [5](https://arxiv.org/html/2605.12651#Thmdefinition5) that ETL predicates are evaluated by comparing embedding distances against a threshold $\epsilon$. These thresholds play a critical role: they determine when a continuous similarity measure corresponds to predicate satisfaction. As a result, ETL specifications are not only parameterized by the encoder and distance function, but also by calibrated thresholds that define when a predicate holds. We propose two approaches to calibrate thresholds. For formal definitions of the thresholds, we refer readers to Appendix [D](https://arxiv.org/html/2605.12651#A4).

### 4.1 F1-Optimal Threshold

We calibrate each predicate threshold on a held-out set, $\mathcal{D}_{cal}$, of trajectories with ground-truth labels per timestep. For each timestep $t$, we compute the embedding distance $d_t$ to the target embedding and, for a candidate threshold $\epsilon_1$, predict that the predicate holds whenever $d_t \leq \epsilon_1$. We then search over possible threshold values based on the observed calibration distances and select the threshold, $\epsilon_{F1}$, that maximizes the F1 score with respect to the ground-truth labels. Intuitively, this chooses the threshold that best matches the intended concept by balancing false positives and false negatives on the calibration data.
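The threshold search described above can be sketched as follows. The function names, the choice of candidate set (every distinct calibration distance), and the tie-breaking rule (smallest threshold among ties) are assumptions of this sketch; the formal definition is in Appendix D and may differ in these details. The calibration data are made up for illustration.

```python
def f1_score(distances, labels, epsilon):
    """F1 of the rule `predict True iff distance <= epsilon` against labels."""
    tp = sum(1 for d, y in zip(distances, labels) if d <= epsilon and y)
    fp = sum(1 for d, y in zip(distances, labels) if d <= epsilon and not y)
    fn = sum(1 for d, y in zip(distances, labels) if d > epsilon and y)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_optimal_threshold(distances, labels):
    """Pick the observed distance that maximizes F1 (smallest one on ties)."""
    candidates = sorted(set(distances))
    return max(candidates, key=lambda eps: f1_score(distances, labels, eps))


# Toy per-timestep calibration data: distance to target and ground-truth label.
cal_distances = [0.10, 0.20, 0.30, 0.45, 0.50, 0.70]
cal_labels = [True, True, True, False, True, False]

eps_f1 = f1_optimal_threshold(cal_distances, cal_labels)
print(eps_f1, f1_score(cal_distances, cal_labels, eps_f1))
```

On this toy data the search trades one false positive (the frame at distance 0.45) for recovering the positive frame at distance 0.50, which is the balancing behavior the paragraph describes.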

### 4.2 Conformal Threshold with Recall Guarantee

In safety-critical monitoring, missing a true positive event is often more costly than raising a false alarm. To bias threshold selection toward high recall, we compute an alternative threshold, $\epsilon_{\mathrm{CP}}$, using *split conformal prediction* (lei2017distributionfreepredictiveinferenceregression; 10.1561/2200000101). We employ conformal prediction as it allows us to make no assumptions about the underlying distribution of embedding distances and provides a guarantee that transfers directly to deployment-time predicate evaluation.

In this work, we propose conformal *ETL predicate calibration* by adapting conformal prediction to select predicate thresholds from embedding distance scores. Let $\varsigma^1, \dots, \varsigma^{n_{cal}}$ be calibration demonstrations, disjoint from training, with $z_g$ denoting the target latent embedding that has a semantically corresponding ground-truth specification $\omega$. For each calibration demonstration $\varsigma^i$, a score is computed as the maximum distance to the target over the frames that satisfy $\omega$. Then, for a user-given error level $\alpha \in (0,1)$, we sort the calibration scores, compute $k = \lceil (1-\alpha)(n_{cal}+1) \rceil$, and select $\epsilon_{\mathrm{CP}}$ to be the $k$-th smallest score. The formal definition and proof for conformal recall guarantees based on calibration demonstrations can be found in Appendix [D](https://arxiv.org/html/2605.12651#A4).
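The quantile selection step can be sketched in a few lines. The snippet assumes the per-demonstration scores have already been computed, and the function name `conformal_threshold` is hypothetical. Returning infinity when $k$ exceeds $n_{cal}$ is the usual split-conformal convention for calibration sets that are too small for the requested $\alpha$; it is an assumption here, since the source text does not state how that case is handled.

```python
import math


def conformal_threshold(scores, alpha):
    """Return the k-th smallest score with k = ceil((1 - alpha) * (n + 1)).

    If k exceeds the number of calibration scores, no finite threshold
    meets the requested level, so infinity is returned (an assumed
    convention, not taken from the source).
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    n = len(scores)
    k = math.ceil((1 - alpha) * (n + 1))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]


# Toy per-demonstration scores (one score per calibration demonstration).
cal_scores = [0.31, 0.35, 0.28, 0.40, 0.33, 0.37, 0.30, 0.42, 0.36, 0.39]

print(conformal_threshold(cal_scores, alpha=0.2))   # k = 9
print(conformal_threshold(cal_scores, alpha=0.1))   # k = 10
print(conformal_threshold(cal_scores, alpha=0.05))  # k = 11 > n
```

Smaller values of $\alpha$ push the threshold toward the largest calibration score, so the predicate fires more readily and recall increases at the expense of more false alarms.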

## 5 Evaluation

We evaluate the semantic correctness of ETL monitors by comparing embedding predicate outputs against ground-truth propositions. Artifacts for reproducibility are provided at [https://github.com/ETLMonitoringAuthors/ETLMonitoring](https://github.com/ETLMonitoringAuthors/ETLMonitoring). All experiments were run on a compute cluster using two NVIDIA GeForce RTX 5090 GPUs, each with 34.2 GB of memory.

### 5.1 Simple Navigation Dubins Car

First, we evaluate ETL specification-based runtime monitoring in a two-dimensional navigation task where privileged information about the state space is available, similar to the one proposed in anysafe. This fully observable setup allows direct comparison between embedding-based predicates and ground-truth specifications.

#### Setup

The navigation task is defined for a robot that respects discrete-time Dubins car dynamics in a controlled environment. We generate $N=100$ trajectories using a feedback controller with obstacle avoidance. For each task, we construct pairs of equivalent specifications in the ground-truth state space, $\omega_i$, and in the embedding space, $\varphi_i$. We utilize the encoder from the world model used in anysafe, which is based on Dreamer (Hafner2025). The encoder produces a task-relevant representation space in which distances reflect semantic similarity between observations. We evaluate four specification patterns: *Reach* ($\textbf{F}A$), *Avoid* ($\textbf{G}\neg C$), *Reach-Avoid* ($\textbf{F}A\land\textbf{G}\neg C$), and *Sequential* ($\textbf{F}(A\land\textbf{F}B)$). $A$, $B$, and $C$ denote reaching a goal in the top-right, the top-left, and the bottom-right corners of the environment, respectively. For visualization, see the goals in Figure [2(a)](https://arxiv.org/html/2605.12651#S5.F2.sf1). ETL monitoring is evaluated by comparing the satisfaction of each embedding-based specification $\varphi$ with the satisfaction of the corresponding ground-truth specification $\omega$ on the respective latent and state traces for each timestep of each trace. We report precision, recall, and F1 to assess how closely ETL predicates match ground-truth events. For temporal specifications, we measure trajectory-level satisfaction agreement with the ground-truth monitor; for sequential specifications, we also report ordering accuracy of detected subgoals. Additional details can be found in Appendix [E.1](https://arxiv.org/html/2605.12651#A5.SS1).
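To make the four specification patterns concrete, the sketch below evaluates them over Boolean predicate traces on a bounded trace. It is an illustration only: the helper names are hypothetical, the traces are made up, and the Boolean inputs stand in for thresholded embedding distances (or ground-truth propositions).

```python
def eventually(p):
    """F p over a bounded trace: p holds at some timestep."""
    return any(p)


def always_not(p):
    """G (not p) over a bounded trace: p never holds."""
    return not any(p)


def reach_avoid(a, c):
    """F A and G (not C)."""
    return eventually(a) and always_not(c)


def sequential(a, b):
    """F (A and F B): B holds at or after some timestep where A holds."""
    return any(a[i] and any(b[i:]) for i in range(len(a)))


# Toy predicate traces, one Boolean per timestep (made up for illustration).
A = [False, False, True, False, False]
B = [False, False, False, False, True]
C = [False, False, False, False, False]

results = {
    "reach": eventually(A),
    "avoid": always_not(C),
    "reach_avoid": reach_avoid(A, C),
    "sequential_A_then_B": sequential(A, B),
    "sequential_B_then_A": sequential(B, A),
}
```

Swapping the arguments of `sequential` shows that the pattern is order-sensitive: the same two events satisfy "A then B" but not "B then A".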

#### Results

We now discuss our findings; a tabularized version of results can be found in Appendix[E\.2](https://arxiv.org/html/2605.12651#A5.SS2)\.

##### Can ETL monitors achieve semantic correctness as defined in Definition 15 with respect to ground\-truth specifications in a controlled environment?

ETL shows strong semantic correctness for atomic predicates, visualized in Figure [2(a)](https://arxiv.org/html/2605.12651#S5.F2.sf1). Across reach, avoid, and reach-avoid specifications, the F1-optimal monitor achieves F1 scores of $0.80$–$0.85$ with agreement above $96\%$, indicating that embedding predicates closely match the corresponding state-based events. This implies that embedding-space predicates can serve as faithful proxies for state-based propositions. Additionally, for the sequential specification $A\rightarrow B$, the monitor achieves $100\%$ precision, recall, and ordering agreement at the episode level. Thus, ETL is not limited to detecting isolated semantic events; *ETL can also correctly track their ordering over time.*

##### Does threshold calibration produce the intended precision–recall tradeoff for safety\-oriented monitoring?

Relative to the F1-optimal threshold, $\epsilon_{CP}$ increases recall for reach (from $0.83$ to $0.93$) while incurring only a modest drop in precision (from $0.87$ to $0.79$), with essentially no change in agreement. This shows that conformal calibration provides a practical safety-oriented operating point: it makes the monitor more conservative against missed detections without substantially changing overall semantic alignment.

Overall, the Dubins results show that ETL monitors align closely with state\-based specification monitors, support monitoring of temporally composed behaviors, and exhibit a clear precision–recall tradeoff under threshold calibration\.

![Refer to caption](https://arxiv.org/html/2605.12651v1/x2.png) (a) Boolean predicate traces on the Dubins car. Goals $A$, $B$, and $C$ are visualized along the top of the image; the graph plots the L2 distance between embeddings along with the F1-optimal thresholds. The bars along the bottom show when the latent and ground-truth predicates are satisfied (darker bars); two observation states are displayed on the left of the graph.
![Refer to caption](https://arxiv.org/html/2605.12651v1/x3.png) (b) Predicate-monitoring F1 scores across domains.
![Refer to caption](https://arxiv.org/html/2605.12651v1/x4.png) (c) Comparison of ETL and Qwen2-VL-2B on phasic DROID episodes.

Figure 2: Qualitative and quantitative ETL results.

### 5.2 Simulated Manipulation Tasks

To assess whether ETL remains effective beyond the controlled navigation domain, we evaluate it on simulated contact-rich manipulation tasks with richer visual scenes, complex object interactions, and additional distractor objects that make perceptual monitoring harder. We consider two environments with complementary task structure: D3IL (jia2024towards) and MetaWorld (Hansen2025Newt). For D3IL we consider two complex tasks: Sorting requires a Franka robot to push two blocks into their corresponding color-matching target boxes, while Stacking requires the robot to arrange colored blocks in a target region. For MetaWorld, we evaluate ETL on robotic-arm pick-and-place tasks with sequential grasp and place subgoals. Unlike single-goal manipulation tasks, these benchmarks expose phase-like progress structures, making them a natural testbed for evaluating whether ETL predicates can monitor task-relevant semantic milestones in rich manipulation settings.

#### Setup

We compare our ETL-based monitoring approach against recent embedding-based monitoring baselines: PCA-kmeans (liu2024multitaskinteractiverobotfleet) and logpZO (xu2023faildetect). PCA-kmeans clusters principal components of successful observation embeddings and scores a new observation by its distance to the nearest cluster center, while logpZO fits a flow-matching density model over observation embeddings and uses the inferred latent norm as an uncertainty score. These baselines are the most direct comparison for ETL because they operate on observation embeddings and detect distributional deviations. We evaluate monitoring performance by comparing predicted satisfaction or violation labels against ground-truth labels derived from simulator rewards and state variables. Additional details regarding our experimental setup are provided in Appendix [F.1](https://arxiv.org/html/2605.12651#A6.SS1).

#### Results: Can ETL outperform embedding\-based monitoring baselines on simulated manipulation tasks?

Figure [2(b)](https://arxiv.org/html/2605.12651#S5.F2.sf2) shows that ETL matches or exceeds the observation embedding baselines on average across the three simulated manipulation tasks. Using $\epsilon_{F1}$, ETL achieves the highest average F1 score of $0.817$, compared with $0.790$ for logpZO and $0.751$ for PCA-kmeans. ETL improves over the strongest baseline on Stacking ($0.897$ vs. $0.852$) and slightly outperforms PCA-kmeans on Sorting ($0.593$ vs. $0.585$). Sorting is the hardest setting for all methods, suggesting that its out-of-distribution (OOD) shifts induce more ambiguous changes in the embedding space. On MetaWorld pick-place-wall, logpZO reaches near-saturated performance ($0.992$ F1), indicating that the predicate boundary aligns closely with the support of successful embeddings. ETL remains competitive at $0.961$ F1, while PCA-kmeans lags behind at $0.815$.

### 5.3 Real World Evaluation via DROID

Finally, we evaluate ETL on real-world robot data to assess its effectiveness for monitoring in-the-wild temporally extended manipulation behaviors. To this end, we utilize the manipulation dataset DROID (khazatsky2024droid), which contains large-scale, diverse, and compositional manipulation demonstrations in more realistic settings.

#### Setup

We analyze manipulation episodes with compositional, multi-phase structure. We instantiate ETL specifications over phase-based predicates, where each phase corresponds to a contiguous interaction window in the demonstration. This setup allows us to evaluate how ETL monitoring performs over real-world manipulation episodes in comparison to a Vision Language Model (VLM)-based baseline monitor. For evaluation, we compute F1 score, agreement, and sequential ordering accuracy over phase-based predicates. For our baseline, we use a state-of-the-art VLM: Qwen2-VL (wang2024qwen2vlenhancingvisionlanguagemodels). For each phase, the VLM is prompted with the ground-truth desired behavior and phase frame to determine if a task was completed; the response is distilled to a Boolean value, which is then used to compute frame-level F1, agreement, and sequential ordering accuracy for comparison with the ETL monitor.

#### Results: Can ETL accurately monitor perceptual specifications on real\-world manipulation data in comparison to a VLM\-based baseline for monitoring phasic manipulation behaviors?

ETL achieves a mean F1 score of $0.813$ and mean agreement of $0.940$, whereas the Qwen2-VL baseline reaches a mean F1 score of $0.390$ and mean agreement of $0.567$. Though ETL and Qwen2-VL perform competitively on one of the five tasks, these results demonstrate that ETL generalizes to diverse, unstructured manipulation scenes, and is markedly more reliable overall, especially on phases that are visually similar but semantically distinct. Additionally, ETL correctly identifies the sequential phase ordering in $4/5$ episodes (Figure [2(c)](https://arxiv.org/html/2605.12651#S5.F2.sf3)), indicating it can monitor compositional task structure beyond isolated phase detection. The single failure occurs in an episode with low inter-phase separability, suggesting that monitoring performance degrades when distinct phases occupy overlapping regions of the latent space. Tabularized results can be found in Appendix [G.1](https://arxiv.org/html/2605.12651#A7.SS1).

Overall, these results show that ETL extends beyond simulation to real\-world manipulation data\. ETL successfully monitors temporal and sequential task structure across demonstrations\. Moreover, compared to a VLM\-based baseline, ETL provides substantially more reliable monitoring of phasic behaviors without requiring explicit language supervision\.

## 6 Conclusion and Limitations

In this work we propose a novel temporal logic \(ETL\) defined over the embedding space of pretrained encoders for perception\-based autonomous systems\. Our experiments support three main conclusions\. First, embedding predicates can faithfully approximate ground\-truth propositions across controlled navigation and manipulation tasks\. Second, ETL supports monitoring temporally extended properties: sequential specifications are accurately tracked across all evaluated environments\. Third, threshold calibration provides a meaningful operating tradeoff: the F1\-optimal threshold yields the strongest overall alignment with ground truth, while the conformal threshold prioritizes high recall with a distribution\-free guarantee\.

The current approach has two main limitations: latent predicates are not yet fully interpretable in human-understandable terms, and monitoring performance depends on whether task-relevant semantic concepts are well separated in the encoder’s representation. These limitations point to promising directions for future work, including more transparent predicate explanations, encoder selection and adaptation for improved semantic separation (zarlenga2022conceptembeddingmodelsaccuracyexplainability), temporal abstraction over subtask boundaries (wang2026temporalstraighteninglatentplanning), and adaptive online thresholding (areces2025online).

## References

- D. Aksaray, A. Jones, Z. Kong, M. Schwager, and C. Belta (2016) Q-learning for robust satisfaction of signal temporal logic specifications. External Links: 1609.07409. Cited by: [§2](https://arxiv.org/html/2605.12651#S2.SS0.SSS0.Px1.p1.1).
- J. J. Aloor, J. Patrikar, P. Kapoor, J. Oh, and S. Scherer (2023) Follow the rules: online signal temporal logic tree search for guided imitation learning in stochastic domains. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 1320–1326. External Links: [Document](https://dx.doi.org/10.1109/ICRA48891.2023.10160953). Cited by: [§2](https://arxiv.org/html/2605.12651#S2.SS0.SSS0.Px1.p1.1).
- R. Alur, O. Bastani, K. Jothimurugan, M. Perez, F. Somenzi, and A. Trivedi (2023) Policy synthesis and reinforcement learning for discounted LTL. In International Conference on Computer Aided Verification, pp. 415–435. Cited by: [§2](https://arxiv.org/html/2605.12651#S2.SS0.SSS0.Px1.p1.1).
- E. Bartocci, J. Deshmukh, A. Donzé, G. Fainekos, O. Maler, D. Ničković, and S. Sankaranarayanan (2018) Specification-Based Monitoring of Cyber-Physical Systems: A Survey on Theory, Tools and Applications. In Lectures on Runtime Verification: Introductory and Advanced Topics, E. Bartocci and Y. Falcone (Eds.), Lecture Notes in Computer Science, pp. 135–175. External Links: [Document](https://dx.doi.org/10.1007/978-3-319-75632-5%5F5), ISBN 978-3-319-75632-5. Cited by: [§2](https://arxiv.org/html/2605.12651#S2.SS0.SSS0.Px1.p1.1), [§2](https://arxiv.org/html/2605.12651#S2.SS0.SSS0.Px3.p1.1).
- C. Belta and S. Sadraddini (2019) Formal methods for control synthesis: an optimization perspective. Annual Review of Control, Robotics, and Autonomous Systems 2(1), pp. 115–140. External Links: [Document](https://dx.doi.org/10.1146/annurev-control-053018-023717), [Link](https://doi.org/10.1146/annurev-control-053018-023717). Cited by: [§2](https://arxiv.org/html/2605.12651#S2.SS0.SSS0.Px1.p1.1).
- C. Colombo and G. J. Pace (2022) What is runtime verification. In Runtime Verification: A Hands-On Approach in Java, pp. 9–15. Cited by: [§1](https://arxiv.org/html/2605.12651#S1.p2.1).
- A. Dokhanchi, H. B. Amor, J. V. Deshmukh, and G. Fainekos (2018) Evaluating perception systems for autonomous vehicles using quality temporal logic. In Runtime Verification: 18th International Conference, RV 2018, Limassol, Cyprus, November 10–13, 2018, Proceedings 18, pp. 409–416. Cited by: [§2](https://arxiv.org/html/2605.12651#S2.SS0.SSS0.Px1.p1.1).
- G. E. Fainekos and G. J. Pappas (2009) Robustness of temporal logic specifications for continuous-time signals. Theoretical Computer Science 410(42), pp. 4262–4291. Cited by: [§2](https://arxiv.org/html/2605.12651#S2.SS0.SSS0.Px3.p1.1), [§3.1](https://arxiv.org/html/2605.12651#S3.SS1.SSS0.Px3.p1.1), [Definition 11](https://arxiv.org/html/2605.12651#Thmdefinition11.p1.6.4).
- M. Hekmatnejad, B. Hoxha, J. V. Deshmukh, Y. Yang, and G. Fainekos (2024) Formalizing and evaluating requirements of perception systems for automated vehicles using spatio-temporal perception logic. The International Journal of Robotics Research 43(2), pp. 203–238. External Links: [Document](https://dx.doi.org/10.1177/02783649231223546), [Link](https://doi.org/10.1177/02783649231223546). Cited by: [§1](https://arxiv.org/html/2605.12651#S1.p4.1), [§2](https://arxiv.org/html/2605.12651#S2.SS0.SSS0.Px1.p1.1).
- P. Kapoor, K. Mizuta, E. Kang, and K. Leung (2025) STLCG++: a masking approach for differentiable signal temporal logic specification. IEEE Robotics and Automation Letters 10(9), pp. 9240–9247. Cited by: [§2](https://arxiv.org/html/2605.12651#S2.SS0.SSS0.Px1.p1.1).
- H. Kress-Gazit, G. E. Fainekos, and G. J. Pappas (2009) Temporal-logic-based reactive mission and motion planning. IEEE Transactions on Robotics 25(6), pp. 1370–1381. Cited by: [§2](https://arxiv.org/html/2605.12651#S2.SS0.SSS0.Px1.p1.1).
- K. Leung, N. Aréchiga, and M. Pavone (2023) Backpropagation through signal temporal logic specifications: infusing logical structure into gradient-based methods. The International Journal of Robotics Research, pp. 02783649221082115. Cited by: [§2](https://arxiv.org/html/2605.12651#S2.SS0.SSS0.Px1.p1.1).
- L. Lindemann and D. V. Dimarogonas (2019) Control barrier functions for signal temporal logic tasks. IEEE Control Systems Letters 3, pp. 96–101. External Links: [Link](https://api.semanticscholar.org/CorpusID:50767137). Cited by: [§2](https://arxiv.org/html/2605.12651#S2.SS0.SSS0.Px1.p1.1).
- O. Maler and D. Nickovic (2004) Monitoring temporal properties of continuous signals. In FORMATS/FTRTFT. External Links: [Link](https://api.semanticscholar.org/CorpusID:15642684). Cited by: [§1](https://arxiv.org/html/2605.12651#S1.p2.1), [§1](https://arxiv.org/html/2605.12651#S1.p3.1), [§2](https://arxiv.org/html/2605.12651#S2.SS0.SSS0.Px3.p1.1), [§3.1](https://arxiv.org/html/2605.12651#S3.SS1.SSS0.Px3.p1.1).
- M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P. Huang, S. Li, I. Misra, M. Rabbat, V. Sharma, G. Synnaeve, H. Xu, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski (2024) DINOv2: learning robust visual features without supervision. External Links: [Link](https://arxiv.org/abs/2304.07193), 2304.07193. Cited by: [§1](https://arxiv.org/html/2605.12651#S1.p5.3), [§2](https://arxiv.org/html/2605.12651#S2.SS0.SSS0.Px2.p1.1).
- A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. (2021) Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763. Cited by: [§1](https://arxiv.org/html/2605.12651#S1.p5.3), [§2](https://arxiv.org/html/2605.12651#S2.SS0.SSS0.Px2.p1.1).
- V. Raman, A. Donze, M. Maasoumy, R. M. Murray, A. Sangiovanni-Vincentelli, and S. Seshia (2014) Model predictive control with signal temporal logic specifications. In Conference on Decision and Control. Cited by: [§2](https://arxiv.org/html/2605.12651#S2.SS0.SSS0.Px1.p1.1).
- S. A. Seshia, A. Desai, T. Dreossi, D. J. Fremont, S. Ghosh, E. Kim, S. Shivakumar, M. Vazquez-Chanlatte, and X. Yue (2018) Formal specification for deep neural networks. In Automated Technology for Verification and Analysis, pp. 20–34. Cited by: [§1](https://arxiv.org/html/2605.12651#S1.p2.1), [§1](https://arxiv.org/html/2605.12651#S1.p4.1).
- D. Sun, J. Chen, S. Mitra, and C. Fan (2022) Multi-agent motion planning from signal temporal logic specifications. IEEE Robotics and Automation Letters PP, pp. 1–1. External Links: [Link](https://api.semanticscholar.org/CorpusID:245986629). Cited by: [§2](https://arxiv.org/html/2605.12651#S2.SS0.SSS0.Px1.p1.1).
- G. Zhou, H. Pan, Y. LeCun, and L. Pinto (2025) DINO-wm: world models on pre-trained visual features enable zero-shot planning. External Links: [Link](https://arxiv.org/abs/2411.04983), 2411.04983. Cited by: [§1](https://arxiv.org/html/2605.12651#S1.p1.1).

## Appendix A Constructing ETL Specifications in Practice

The formal semantics proposed in the preceding section define ETL as a specification language over embedding traces\. To use ETL in practice, however, several design choices must be instantiated concretely: one must determine how target embeddings are specified, how observations are mapped into the representation space, which distance function is used to compare embeddings, and how predicate thresholds are calibrated so that they align with the intended semantic concept\. In this section, we first outline how target embeddings are specified\. Then, we highlight the different choices of encoders and distance functions\.

### A.1 Specifying Target Embeddings

ETL is specified using predicates over *target embeddings* that correspond to parts of the physical world against which current observations are evaluated. In practice, the specifier (e.g., a system engineer) would not directly specify the mathematical representations of embeddings. Instead, the targets would be provided in the same format as the sensor’s input. For a camera, for example, a target would be provided as a reference image, which is then translated into the target embedding via the pretrained encoder. This allows specifications such as *“eventually reach a state similar to this image”* or *“always avoid states resembling fire”* without requiring explicit symbolic labels. By enabling specification directly over learned representations rather than explicit symbolic labels, our approach removes the need to manually define a finite predicate vocabulary from observations, a process that has traditionally depended on substantial expert knowledge and task-specific engineering in robotics.

### A.2 Choice of Encoders and Distance Functions

The choice of distance function is another key practical consideration in evaluating ETL satisfaction\. Distances over pretrained image embeddings capture perceptual similarity, but lack temporal context because these encoders are trained on static observations\. World models mitigate this limitation by mapping observations into a latent space that evolves as a function of past states and actions\. When trained with reconstruction\-based objectives, such models often preserve much of the geometry of the original embedding space while enriching it with temporal structure\. Consequently, similarity relationships defined over observation embeddings can often be carried over to the latent space in a meaningful way\.

Accordingly, the choice of $d$ should align with the geometry induced by the representation. For instance, cosine distance is a natural choice for contrastive embeddings, whereas L2 distance is often more appropriate for reconstruction-based latent representations. More generally, the selected distance should be one that best reflects semantic similarity in the underlying space. We empirically study the effect of this design choice in Section [5](https://arxiv.org/html/2605.12651#S5).
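The effect of the distance choice on a single predicate can be seen in a small sketch. The function names, vectors, and threshold below are hypothetical and chosen only for illustration; the embeddings are plain Python lists rather than outputs of any particular encoder.

```python
import math


def l2_distance(u, v):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """One minus cosine similarity; ignores embedding magnitude."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def predicate_holds(z, target, eps, dist):
    """Distance-based predicate: holds when dist(z, target) <= eps."""
    return dist(z, target) <= eps


# Same direction, different magnitude (made-up embeddings).
z = [2.0, 0.0]
target = [1.0, 0.0]
holds_l2 = predicate_holds(z, target, 0.5, l2_distance)
holds_cos = predicate_holds(z, target, 0.5, cosine_distance)
```

With the same threshold, the observation counts as "near the target" under cosine distance but not under L2, which is why the threshold has to be calibrated for the specific encoder and distance pair.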

## Appendix B Quantitative semantics of ETL

###### Definition 11 (ETL Robustness).

For a trace $\sigma$, the *robustness of an ETL formula* at timestep $i$ is defined as follows:

$$
\begin{aligned}
\rho(ap,\sigma,i,\textbf{b}) &= \begin{cases}\epsilon-\delta_{ap}(z_i) & \bowtie\in\{\leq,<\}\\ \delta_{ap}(z_i)-\epsilon & \bowtie\in\{\geq,>\}\end{cases}\\
\rho(\neg\varphi,\sigma,i,\textbf{b}) &= -\rho(\varphi,\sigma,i,\textbf{b})\\
\rho(\varphi_1\land\varphi_2,\sigma,i,\textbf{b}) &= \min\big(\rho(\varphi_1,\sigma,i,\textbf{b}),\rho(\varphi_2,\sigma,i,\textbf{b})\big)\\
\rho(\textbf{G}\varphi,\sigma,i,\textbf{b}) &= \inf_{k\in[i,\textbf{b}]}\rho(\varphi,\sigma,k,\textbf{b})\\
\rho(\textbf{F}\varphi,\sigma,i,\textbf{b}) &= \sup_{k\in[i,\textbf{b}]}\rho(\varphi,\sigma,k,\textbf{b})
\end{aligned}
$$

where $\inf$ and $\sup$ are the infimum and supremum operators, respectively. Note that the computation of the satisfaction score is restricted to the subsequence of $\sigma$ from $i$ to $\textbf{b}$ (i.e., $z_i,z_{i+1},\dots,z_{\textbf{b}}$). The bound $\textbf{b}$ may be determined by, for example, the planning horizon used by a planner (i.e., the length of the action sequence used by the planner for behavioral prediction). For brevity, the definition for the until ($\textbf{U}$) operator is omitted, but it is similar to the one described in Fainekos and Pappas [[2009](https://arxiv.org/html/2605.12651#bib.bib9)].
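The recursive robustness definition translates directly into a small evaluator over a bounded trace. The sketch below is illustrative and not the authors' monitor: the nested-tuple formula encoding, the `robustness` function, and the toy distances are all assumptions, and the until operator is omitted as in the definition.

```python
def robustness(formula, dists, i, b):
    """Robustness of `formula` at timestep i, using timesteps i..b only.

    `dists[name][k]` is the precomputed distance delta_ap(z_k) for predicate `name`.
    """
    op = formula[0]
    if op == "ap":
        _, name, cmp, eps = formula
        d = dists[name][i]
        # Positive robustness means the predicate holds with that margin.
        return eps - d if cmp in ("<=", "<") else d - eps
    if op == "not":
        return -robustness(formula[1], dists, i, b)
    if op == "and":
        return min(robustness(formula[1], dists, i, b),
                   robustness(formula[2], dists, i, b))
    if op == "G":
        return min(robustness(formula[1], dists, k, b) for k in range(i, b + 1))
    if op == "F":
        return max(robustness(formula[1], dists, k, b) for k in range(i, b + 1))
    raise ValueError(f"unknown operator: {op}")


# Toy distances to targets A and C at timesteps 0..2 (made up for illustration).
dists = {"A": [1.0, 0.5, 0.125], "C": [1.0, 0.75, 0.5]}
reach = ("F", ("ap", "A", "<=", 0.25))
avoid = ("G", ("not", ("ap", "C", "<=", 0.25)))
reach_avoid = ("and", reach, avoid)

rho_reach = robustness(reach, dists, 0, 2)
rho_avoid = robustness(avoid, dists, 0, 2)
rho_reach_avoid = robustness(reach_avoid, dists, 0, 2)
```

On a finite trace the infimum and supremum reduce to `min` and `max`; a positive result indicates satisfaction and its magnitude indicates the margin.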

## Appendix C Semantic Correctness of ETL Monitors

To compare ETL monitors to ground\-truth executions, we define a ground\-truth monitor over finite executions\.

###### Definition 12 (Finite Execution).

For a given execution $\varsigma$, a *finite execution* $\varsigma_{\leq t}$ is an execution consisting of the first $t+1$ states of $\varsigma$.

We refer to a finite execution as an execution when clear from context\.

###### Definition 13 (Ground-Truth Monitor).

Let $\omega$ be a temporal logic formula over *symbolic* predicates, and let $\varsigma_{\leq t}$ be a finite execution prefix. The *ground-truth monitor* for $\omega$ on $\varsigma_{\leq t}$ is the binary-valued trace

$$GT_{\omega}(\varsigma_{\leq t})=\bigl(b^{\omega}_{0},\,b^{\omega}_{1},\,\dots,\,b^{\omega}_{t}\bigr)\in\{-1,+1\}^{t+1},$$

where, for each prefix $\varsigma_{\leq i}$ with $0\leq i\leq t$,

$$b^{\omega}_{i}=\begin{cases}+1&\text{if }\varsigma_{\leq i}\models\omega,\\ -1&\text{otherwise.}\end{cases}$$

Here, satisfaction $\varsigma_{\leq i}\models\omega$ is defined by the semantics of the underlying temporal logic over symbolic predicates.
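As an illustration of Definition 13, the sketch below builds the $\{-1,+1\}$ trace by checking every prefix of an execution. The function name, the example states, and the example specification ("eventually the x-coordinate reaches 1.0") are hypothetical and stand in for whatever symbolic predicates a given task provides.

```python
def ground_truth_monitor(states, satisfies):
    """Return (b_0, ..., b_t): +1 if the prefix states[:i+1] satisfies omega, else -1."""
    return [1 if satisfies(states[: i + 1]) else -1 for i in range(len(states))]


def eventually_reaches_goal(prefix):
    """Example omega: some state in the prefix has x >= 1.0 (a reach property)."""
    return any(x >= 1.0 for x, _y in prefix)


# Toy 2D execution (made up for illustration).
execution = [(0.0, 0.0), (0.5, 0.2), (1.2, 0.3), (0.8, 0.1)]
gt_trace = ground_truth_monitor(execution, eventually_reaches_goal)
```

Because the monitor evaluates prefixes, a reach property stays satisfied once it has been met, even if the robot later leaves the goal region.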

We define the semantic correctness of monitor $M$ using the following definitions.

###### Definition 14 (Semantic Correspondence).

Let $M$ be an embedding temporal structure with representation map $\eta=\psi_{\mathrm{enc}}\circ\phi_{\mathrm{obs}}$. Let $\omega$ be a specification over ground-truth state trajectories $\varsigma_{\leq t}$ and let $\varphi$ be an ETL specification over finite embedding traces $\sigma_{\leq t}$. We say that $\varphi$ *corresponds semantically* to $\omega$ with respect to $M$ if, for every finite execution prefix $\varsigma_{\leq t}=(s_{0},\dots,s_{t})$, letting $\sigma_{\leq t}=(\eta(s_{0}),\dots,\eta(s_{t}))$, we have $\sigma_{\leq t}\models\varphi\iff\varsigma_{\leq t}\models\omega$.

Intuitively, a latent-space specification $\varphi$ *semantically corresponds* with the ground-truth specification $\omega$ if both are intended to capture the same behavioral property, with $\omega$ defined over state-based predicates and $\varphi$ defined over embedding-space predicates.

###### Definition 15 (Semantic Correctness of an ETL Monitor).

Let $\omega$ be a ground-truth specification and let $\varphi$ be an ETL specification that corresponds semantically to $\omega$. An ETL monitor $M$ is *semantically correct* with respect to $(\varphi,\omega)$ if, for every execution prefix, $M_{\varphi}(\sigma_{\leq t})=GT_{\omega}(\varsigma_{\leq t})$.
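Definition 15 demands exact equality of the two traces; in experiments this is relaxed to agreement statistics. The following sketch shows one way such statistics could be computed from a monitor trace and a ground-truth trace over $\{-1,+1\}$. The function name and the example traces are hypothetical, and returning 0.0 for an undefined ratio is an assumed convention.

```python
def compare_traces(monitor, ground_truth):
    """Frame-level agreement, precision, recall, and F1 between two +/-1 traces."""
    pairs = list(zip(monitor, ground_truth))
    tp = sum(1 for m, g in pairs if m == 1 and g == 1)
    fp = sum(1 for m, g in pairs if m == 1 and g == -1)
    fn = sum(1 for m, g in pairs if m == -1 and g == 1)
    agree = sum(1 for m, g in pairs if m == g)
    return {
        "agreement": agree / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "f1": 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0,
    }


# Toy traces (made up): one early false alarm and one late miss.
monitor_trace = [-1, 1, 1, 1, -1]
gt_trace = [-1, -1, 1, 1, 1]
metrics = compare_traces(monitor_trace, gt_trace)
```

A monitor is semantically correct on a given prefix, in the sense of the definition, exactly when `agreement` equals 1.0.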

## Appendix D Formal Definitions of Threshold Calibration

We provide additional details and intuitive understanding of thresholds and their calibration for ETL monitoring\.

Intuitively, $\epsilon$ defines the *boundary of a concept* in the representation space (e.g., how close an observation must be to a goal embedding to count as “reached”). Unlike state-based predicates, thresholds in embedding spaces depend on the geometry induced by the encoder and the choice of distance function. Poorly chosen thresholds can lead to high false negatives (overly strict predicates) or false positives (overly permissive predicates), making calibration essential.

We propose a data-driven approach to calibrate thresholds for ETL predicates. We assume access to a dataset of trajectories (e.g., simulator rollouts or demonstrations) with ground-truth signals $\varsigma_{\leq t}$ that satisfy a TL specification $\omega$ at timestep $t$ in the underlying state space. Here, $\omega$ is a TL specification that semantically corresponds to the ETL specification being instantiated. Then, each trajectory is mapped to an embedding trace using the representation map $\eta$, yielding embeddings $z_{t}=\eta(s_{t})$. From these embeddings, we compute distances to the target set as $d_{t}=d_{u}(z_{t},T_{u})$. We then select thresholds using a held-out calibration set of size $n_{\text{cal}}$.

### D.1 F1-Optimal Threshold

###### Definition 16 (F1-Optimal Threshold).

Let $\mathcal{D}_{\mathrm{cal}}=\{(d_{t},y_{t})\}_{t=1}^{N}$ be the calibration set, where $d_{t}\in\mathbb{R}_{\geq 0}$ is the embedding distance at timestep $t$ and $y_{t}\in\{0,1\}$ is the corresponding ground-truth predicate label. For any candidate threshold $\epsilon\in\mathbb{R}_{\geq 0}$, define the induced prediction

$$\hat{y}_{t}(\epsilon)=\mathbf{1}[d_{t}\leq\epsilon].$$

Let

$$\mathrm{F1}(\epsilon)=\frac{2\,\mathrm{TP}(\epsilon)}{2\,\mathrm{TP}(\epsilon)+\mathrm{FP}(\epsilon)+\mathrm{FN}(\epsilon)},$$

where $\mathrm{TP}(\epsilon)$, $\mathrm{FP}(\epsilon)$, and $\mathrm{FN}(\epsilon)$ denote the numbers of true positives, false positives, and false negatives, respectively, obtained by comparing $\hat{y}_{t}(\epsilon)$ against $y_{t}$ over $\mathcal{D}_{\mathrm{cal}}$.

The *F1-optimal threshold* is defined as

$$\epsilon_{F1}\in\arg\max_{\epsilon\in\{d_{1},\dots,d_{N}\}}\mathrm{F1}(\epsilon).$$

If multiple thresholds achieve the same maximum F1 score, ties may be broken arbitrarily.

### D.2 Conformal Threshold with Recall Guarantee

In this work, we propose conformal *ETL predicate calibration* by adapting conformal prediction to select predicate thresholds from embedding distance scores. Here, the goal is not to produce a prediction interval around a model output, but to calibrate the threshold used for ETL predicate evaluation. We do this in two stages. First, within each calibration trajectory, we compute a single score from all satisfying timesteps by taking the largest embedding distance among the frames that satisfy the predicate. This score represents the *hardest positive case* in that trajectory: any threshold smaller than this value would miss at least one true positive frame in that same trajectory. Second, across trajectories, we collect one such score from each calibration trajectory and apply split conformal prediction to these held-out scores. The resulting threshold $\epsilon_{\mathrm{CP}}$ is therefore calibrated at the level of trajectories rather than individual frames. As a result, for a new trajectory drawn from the same distribution, with probability at least $1-\alpha$, the threshold is large enough to cover that trajectory’s hardest positive case, and hence will detect every ground-truth positive timestep in that trajectory. Here, $\alpha\in(0,1)$ is a user-specified error level.

#### Setup

Letς1,…,ςncal\\varsigma^\{1\},\\ldots,\\varsigma^\{n\_\{\\mathrm\{cal\}\}\}bencaln\_\{\\mathrm\{cal\}\}calibration demonstrations, disjoint from training, and letzgz\_\{g\}denote the target latent embedding that semantically corresponds to a ground\-truth specificationω\\omega\(see Definition[14](https://arxiv.org/html/2605.12651#Thmdefinition14)\)\. For each calibration demonstrationςi\\varsigma^\{i\}, define the nonconformity score

$$score_i = \max_{t \in [0, |\varsigma^{i}|]} \begin{cases} d(z_t^{(i)}, z_g) & \text{if } \delta^{GT}_{\omega}(s_t^{(i)}) = 1 \\ -\infty & \text{otherwise} \end{cases}$$

for a distance function $d \in D_{\mathcal{Z}}$, where $\delta^{GT}_{\omega}(s_t^{(i)}) = 1$ indicates that the ground-truth state at time $t$ satisfies predicate $\omega$. Intuitively, $score_i$ is the most difficult positive timestep in demonstration $\varsigma^{i}$: any threshold below $score_i$ would fail to classify at least one ground-truth positive frame in that demonstration. Each calibration demonstration contains at least one ground-truth positive timestep.

###### Definition 17\(Conformal Calibration Threshold\)\.

For a calibration set of size $n_{\mathrm{cal}}$ and a target mis-recall level $\alpha \in (0,1)$, a *conformal calibration threshold* $\epsilon_{\mathrm{CP}}$ is a value in $\mathbb{R}_{\geq 0}$ defined as follows:

$$\epsilon_{\mathrm{CP}} = score_{(k)}, \qquad k = \left\lceil (1-\alpha)(n_{\mathrm{cal}}+1) \right\rceil,$$

where $score_{(k)}$ denotes the $k$-th smallest value among $\{score_i\}_{i=1}^{n_{\mathrm{cal}}}$.
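The two-stage calibration can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names and the toy scores are hypothetical, and returning an infinite threshold when $k > n_{\mathrm{cal}}$ is an assumption of this sketch, since the definition does not cover that case.

```python
import math


def trajectory_score(distances, gt_positive):
    """Largest embedding distance among ground-truth positive timesteps."""
    positives = [d for d, y in zip(distances, gt_positive) if y]
    if not positives:
        raise ValueError("each calibration trajectory needs a positive timestep")
    return max(positives)


def conformal_threshold(scores, alpha):
    """k-th smallest score with k = ceil((1 - alpha) * (n_cal + 1))."""
    n_cal = len(scores)
    k = math.ceil((1 - alpha) * (n_cal + 1))
    if k > n_cal:
        # Assumption of this sketch: too few calibration trajectories for the
        # requested alpha, so no finite order statistic exists.
        return math.inf
    return sorted(scores)[k - 1]


# One trajectory: frames 1-3 are ground-truth positive, so the score is 0.5.
example_score = trajectory_score([0.9, 0.5, 0.2, 0.3], [0, 1, 1, 1])

# Ten made-up per-trajectory scores.
scores = [0.42, 0.31, 0.55, 0.47, 0.39, 0.61, 0.36, 0.50, 0.44, 0.58]
eps_cp = conformal_threshold(scores, alpha=0.25)  # k = ceil(0.75 * 11) = 9
```

With ten scores and $\alpha = 0.25$ the rule picks the ninth smallest score; smaller values of $\alpha$ push $k$ toward the largest score and eventually past $n_{\mathrm{cal}}$.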

###### Theorem 1\(Conformal Recall Guarantee\)\.

Assume the calibration demonstrations and a future test demonstration are exchangeable\. Then

$$\mathbb{P}\left(\forall t:\; \delta^{GT}_{\omega}(s_t^{(i)}) = 1 \;\Rightarrow\; d(z_t, z_\omega) \leq \epsilon_{\mathrm{CP}}\right) \geq 1-\alpha.$$

Equivalently, with probability at least $1-\alpha$ over a newly drawn test demonstration, the embedding predicate achieves perfect per-demonstration recall, i.e., it detects every ground-truth positive timestep.

###### Proof sketch\.

By exchangeability, the augmented set of scores

$$score_1, \ldots, score_{n_{\mathrm{cal}}}, score_{\mathrm{test}}$$

is exchangeable, where

$$score_{\mathrm{test}} = \max_{\substack{t \in \varsigma_{\mathrm{test}} \\ \delta^{GT}_{\omega}(s_t) = 1}} d\left(z_t, z_\omega\right).$$

The standard split conformal order-statistic argument implies

$$\mathbb{P}\left(score_{\mathrm{test}} \leq score_{(k)}\right) \geq \frac{k}{n_{\mathrm{cal}}+1} \geq 1-\alpha.$$

By construction, the event $score_{\mathrm{test}} \leq \epsilon_{\mathrm{CP}}$ is exactly the event that every positive timestep in the test demonstration satisfies $d(z_t, z_\omega) \leq \epsilon_{\mathrm{CP}}$, which proves the recall guarantee. ∎
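The order-statistic bound can be checked numerically. The sketch below is a hypothetical sanity check, not an experiment from the paper: it draws i.i.d. uniform scores (one simple instance of exchangeability), computes the $k$-th smallest calibration score, and counts how often a fresh test score falls at or below it. The function name, the uniform score distribution, and the trial count are all assumptions made for illustration.

```python
import math
import random


def coverage_estimate(n_cal, alpha, trials, seed=0):
    """Monte Carlo estimate of P(score_test <= score_(k)) for i.i.d. scores."""
    rng = random.Random(seed)
    k = math.ceil((1 - alpha) * (n_cal + 1))
    if k > n_cal:
        raise ValueError("n_cal is too small for the requested alpha")
    covered = 0
    for _ in range(trials):
        cal_scores = sorted(rng.random() for _ in range(n_cal))
        test_score = rng.random()
        if test_score <= cal_scores[k - 1]:  # threshold is the k-th smallest
            covered += 1
    return covered / trials


# n_cal = 40 and alpha = 0.10 give k = ceil(0.9 * 41) = 37, so the bound
# k / (n_cal + 1) = 37 / 41 is slightly above 1 - alpha = 0.9.
estimate = coverage_estimate(n_cal=40, alpha=0.10, trials=5000)
```

For continuous scores the coverage equals $k/(n_{\mathrm{cal}}+1)$ exactly, so the estimate should land close to $37/41$ up to Monte Carlo noise.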

At deployment, the predicate is evaluated solely by checking whether

$$d(z_t, z_\omega) \leq \epsilon_{\mathrm{CP}}.$$

The guarantee holds for any embedding function and any underlying data distribution, provided the calibration and test demonstrations remain exchangeable.

## Appendix E Additional Evaluation Details for Dubins Car

### E.1 Details for Dubins Car Setup

The navigation task is defined for a robot that respects discrete-time Dubins car dynamics. A state is defined as $s = [p^x, p^y, \theta]$, where $s_{t+1} = s_t + \Delta t\,[v\cos(\theta_t), v\sin(\theta_t), a_t]$ with continuous angular velocity $a_t \in A = [-a_{\max}, a_{\max}]$. We fix $a_{\max} = 1.25$ rad/s and $v = 1$ m/s, with discretization at $\Delta t = 0.05$ s. We generate $N = 100$ trajectories using a feedback controller with obstacle avoidance. For each task, we construct pairs of equivalent specifications in the ground-truth state space and the embedding space. The ground-truth specifications, $\omega_i$, are defined over the state space of the car (e.g., the position $p^x, p^y$, angular velocity $y$, etc.). In parallel, the ETL specifications, $\varphi_i$, are defined over the embedded images observed during simulation; target embeddings are obtained from the simulated observations corresponding to selected goal states. We utilize the encoder from the world model used in anysafe, which is based on Dreamer [Hafner2025] with a Recurrent State Space Model (RSSM). The encoder produces a task-relevant representation space in which distances reflect semantic similarity between observations.
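For concreteness, one step of these dynamics can be written as the short Python sketch below. The function name is hypothetical, and clipping the control to $[-a_{\max}, a_{\max}]$ is an assumption of this sketch about how the constraint $a_t \in A$ is enforced; the feedback controller with obstacle avoidance is not reproduced here.

```python
import math


def dubins_step(state, a, v=1.0, dt=0.05, a_max=1.25):
    """One update s_{t+1} = s_t + dt * [v cos(theta), v sin(theta), a]."""
    px, py, theta = state
    a = max(-a_max, min(a_max, a))  # assumed: clip the control into A
    return (
        px + dt * v * math.cos(theta),
        py + dt * v * math.sin(theta),
        theta + dt * a,
    )


# Drive straight along the x-axis for 20 steps (1 second of simulated time).
state = (0.0, 0.0, 0.0)
for _ in range(20):
    state = dubins_step(state, a=0.0)

# A control request above a_max is clipped, so the heading changes by dt * a_max.
turned = dubins_step((0.0, 0.0, 0.0), a=5.0)
```

With $v = 1$ m/s and $\Delta t = 0.05$ s, twenty straight steps move the car about one metre along the x-axis.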

We evaluate four specification patterns: *Reach* ($\mathbf{F}A$), *Avoid* ($\mathbf{G}\neg C$), *Reach-Avoid* ($\mathbf{F}A \land \mathbf{G}\neg C$), and *Sequential* ($\mathbf{F}(A \land \mathbf{F}B)$). $A$, $B$, and $C$ denote reaching the top-right goal $|p_t - (0.8, 0.8)| < 0.25$, reaching the top-left goal $|p_t - (-0.8, 0.8)| < 0.25$, and entering the obstacle proximity zone $|p_t - (0.8, -0.8)| < 0.5$, respectively. This setup enables direct comparison between embedding-based satisfaction and ground-truth logical semantics. Thresholds are calibrated on a 40/60 calibration/test split using both F1-optimal and conformal procedures. In all experiments, we use $\alpha = 0.10$ for computing $\epsilon_{\mathrm{CP}}$.
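The four patterns can be evaluated over a bounded trace of ground-truth positions as in the following Python sketch. It is illustrative only: the helper names and the toy trajectory are hypothetical, $|\cdot|$ is assumed to be the Euclidean norm, and the inner $\mathbf{F}B$ is evaluated from the current timestep onward, following the usual finite-trace reading of the operator.

```python
import math


def near(p, center, radius):
    """Ground-truth predicate |p - center| < radius (Euclidean norm assumed)."""
    return math.dist(p, center) < radius


def eventually(trace):
    """F: the predicate holds at some timestep of the bounded trace."""
    return any(trace)


def always(trace):
    """G: the predicate holds at every timestep of the bounded trace."""
    return all(trace)


def sequential(first, second):
    """F(first and F second): first holds at some t, second at some t' >= t."""
    return any(a and any(second[t:]) for t, a in enumerate(first))


# Toy trajectory (made up): visits the top-left goal, then the top-right goal.
positions = [(0.0, 0.0), (-0.8, 0.7), (0.0, 0.9), (0.75, 0.8)]
A = [near(p, (0.8, 0.8), 0.25) for p in positions]   # top-right goal
B = [near(p, (-0.8, 0.8), 0.25) for p in positions]  # top-left goal
C = [near(p, (0.8, -0.8), 0.5) for p in positions]   # obstacle proximity zone

reach = eventually(A)
avoid = always([not c for c in C])
reach_avoid = reach and avoid
seq_ab = sequential(A, B)  # A then B: violated, B happens before A here
seq_ba = sequential(B, A)  # B then A: satisfied
```

The same monitors apply unchanged to embedding-based predicate traces, since they only consume per-timestep Booleans.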

### E.2 Results for Dubins Car

The results for the experiments with the Dubins car can be found in Table [1](https://arxiv.org/html/2605.12651#A5.T1), including F1, Precision, Recall, and Agreement scores for both F1-Optimal and Conformal Prediction thresholds.

Table 1: ETL predicate evaluation on the Dubins car.

## Appendix F Additional Details for Manipulation Tasks

We provide qualitative visualizations of ETL predicate behavior on manipulation tasks\. These plots illustrate how embedding distances evolve over time and how predicate satisfaction aligns with ground\-truth signals\.

### F.1 Tabularized Results

We present tabularized results from the experiments in Section [5.2](https://arxiv.org/html/2605.12651#S5.SS2) in Table [2](https://arxiv.org/html/2605.12651#A6.T2) and Table [3](https://arxiv.org/html/2605.12651#A6.T3).

Table 2: ETL predicate evaluation on MetaWorld (test split). Frame-level metrics are reported for all predicates.

Table 3: F1 score for failure / predicate detection across three environments. All methods use the F1-optimal threshold; ETL (CP) additionally reports the class-conditional split-CP threshold. Per-column best in bold.

### F.2 Sequential Predicate Trace

Figure [3](https://arxiv.org/html/2605.12651#A6.F3) illustrates dual-predicate traces for the pick-place-wall task. The plots show distances to both subgoal embeddings ($z_A$ for grasp and $z_B$ for place), along with corresponding predicate activations.

![Refer to caption](https://arxiv.org/html/2605.12651v1/x5.png)

Figure 3: Dual-predicate Boolean timelines for mw-pick-place-wall. Top panels show distances to grasp ($z_A$) and place ($z_B$) embeddings with thresholds $\epsilon^{*}$ and $\epsilon_{CP}$. Lower panels show ETL and ground-truth predicate activations.

The embedding distances to $z_A$ and $z_B$ decrease at distinct timesteps corresponding to grasp and placement events. This temporal separation is preserved in the embedding space, allowing ETL to correctly identify the ordering of subgoals across all trajectories.

## Appendix G Additional DROID Evaluation Details

### G.1 Tabularized Results

We present tabularized results from the experiments in Section [5.3](https://arxiv.org/html/2605.12651#S5.SS3) in Table [4](https://arxiv.org/html/2605.12651#A7.T4).

Table 4: Comparison of ETL and Qwen2-VL-2B on phasic DROID episodes.

### G.2 Ground Truth Construction

Unlike simulation environments, DROID does not provide explicit task\-phase annotations\. We derive ground\-truth predicates directly from proprioceptive signals:

$$\begin{aligned} \pi_{\text{hold}}(t) &: \text{gripper}(t) > 0.5 \\ \pi_{\text{release}}(t) &: \text{gripper}(t) < 0.15 \;\wedge\; \exists\, t' < t,\ \pi_{\text{hold}}(t') \end{aligned}$$

The sequential specification is defined as:

$$\exists\, t_1 < t_2 : \pi_{\text{hold}}(t_1) \wedge \pi_{\text{release}}(t_2),$$

corresponding to a pick-and-place interaction. For tasks where proprioception is not sufficient to identify the task phase, we manually annotate the videos to generate the ground-truth predicates.
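The proprioceptive labeling rule can be sketched as follows. The function names and the toy gripper signal are hypothetical, and the sketch assumes the gripper reading is a per-frame scalar for which larger values mean a more closed gripper; only the two thresholds come from the definitions above.

```python
def hold_release_labels(gripper, hold_thresh=0.5, release_thresh=0.15):
    """Per-frame pi_hold and pi_release from a scalar gripper signal."""
    hold, release = [], []
    seen_hold = False  # True once pi_hold was satisfied at some t' < t
    for g in gripper:
        hold.append(g > hold_thresh)
        release.append(g < release_thresh and seen_hold)
        seen_hold = seen_hold or g > hold_thresh
    return hold, release


def sequential_spec(hold, release):
    """exists t1 < t2 with pi_hold(t1) and pi_release(t2)."""
    seen_hold = False
    for h, r in zip(hold, release):
        if r and seen_hold:
            return True
        seen_hold = seen_hold or h
    return False


# Made-up episode: open gripper, grasp for three frames, then release.
gripper = [0.0, 0.1, 0.6, 0.8, 0.7, 0.1, 0.0]
hold, release = hold_release_labels(gripper)
is_pick_and_place = sequential_spec(hold, release)

# An episode that never closes the gripper has no release and fails the spec.
no_grasp = sequential_spec(*hold_release_labels([0.0, 0.05, 0.1]))
```

Note that the open-gripper frames before the grasp are not labeled as release, because the definition requires an earlier hold.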

### G.3 Encoder and Representation

We use the SVD VAE from Ctrl-World [ctrlworld], a video diffusion model trained directly on DROID data. Each frame is encoded into a $4 \times 24 \times 40$ latent tensor, flattened to a 3,840-dimensional vector. Distances are computed using cosine similarity. This encoder is domain-matched to DROID and provides stronger geometric separation than general-purpose encoders such as DINOv2.
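The flattening and distance computation can be illustrated with a small pure-Python sketch. The helper names and the placeholder latent are hypothetical, no real encoder is called, and defining the distance as one minus cosine similarity is an assumption of this sketch, since the text only states that cosine similarity is used.

```python
import math


def flatten(latent):
    """Flatten a nested C x H x W list into a single flat list."""
    return [x for channel in latent for row in channel for x in row]


def cosine_distance(u, v):
    """1 - cosine similarity (assumed form of the embedding distance)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


# Placeholder 4 x 24 x 40 latent standing in for an encoded frame.
latent = [[[float(c + 1)] * 40 for _ in range(24)] for c in range(4)]
z = flatten(latent)

dim = len(z)                           # 4 * 24 * 40 entries
self_distance = cosine_distance(z, z)  # approximately zero
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
```

Under this convention a frame is closest to a target embedding when the two latent vectors point in the same direction, regardless of their magnitudes.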

### G.4 Sequential Evaluation Details

We identify 25 episodes containing valid grasp-then-release transitions, using 10 for calibration and 15 for evaluation.

Table 5: Sequential predicate evaluation on DROID (SVD VAE, wrist camera).

The release predicate exhibits low precision due to visual ambiguity: the approach phase with an open gripper is visually similar to the post-release phase. Despite this, sequential ordering is correctly identified in all episodes.

### G.5 Discussion of Failure Modes

ETL performance depends on the separability of task\-relevant concepts in the embedding space\. For manipulation tasks, success states \(e\.g\., button press, object placement\) are highly distinctive, leading to sharp transitions in embedding distance and high predicate accuracy\.

Failure modes arise when embeddings corresponding to different semantic states overlap in the latent space, which can result in delayed or premature predicate activation near decision boundaries\. However, such cases are rare in these tasks, and conformal thresholds help mitigate missed detections by prioritizing recall\.

Overall, these qualitative results reinforce that embedding distances provide a reliable and interpretable signal for detecting semantic events and temporal structure in manipulation tasks\.

We evaluate whether *Embedding Temporal Logic* (ETL) monitoring can serve as a competitive, interpretable alternative to state-of-the-art failure predictors for generative robot policies. Our central hypothesis is that explicitly encoding the *sequential milestone structure* of a task into a small number of latent spec anchors, and then asking whether each anchor was ever approached during execution, yields a sharper failure signal than methods that either model the full observation distribution or score each timestep independently against a single goal embedding.