Revealing Safety-Critical Scenarios for UTM via Transformer

arXiv cs.AI Papers

Summary

This research paper proposes a transformer-based reinforcement learning framework to automatically generate safety-critical test scenarios for Unmanned Traffic Management (UTM) systems, achieving an 8× improvement in vulnerability discovery efficiency over expert-guided testing.

arXiv:2606.31114v1 Announce Type: new Abstract: Unmanned Traffic Management (UTM) systems are cloud-based platforms designed to manage and coordinate multiple aerial vehicles remotely. UTM systems are safety-critical and cannot tolerate failures such as crashes or collisions. Revealing latent vulnerabilities is difficult because there are neither optimal failure-exposing demonstrations nor clear reward signals. Additionally, UTM's self-healing capability introduces the "long-tail effect" of critical failures. We propose framing UTM vulnerability discovery as a sequence modeling problem amenable to transformer-based RL architectures. Our approach leverages attention mechanisms to directly model the relationships among system states and to predict optimal actions. Our framework introduces a Policy Model that generates targeted test scenarios and an Action Sampler that enforces domain constraints. We use a risk-based reward function to guide exploration. Through extensive evaluation in a 700-hour simulation study, we demonstrate an 8$\times$ improvement in vulnerability discovery efficiency compared to expert-guided testing. The framework also discovers critical edge cases that traditional methods have missed.

Cached at: 07/01/26, 05:37 AM

# Revealing Safety-Critical Scenarios for UTM via Transformer
Source: [https://arxiv.org/html/2606.31114](https://arxiv.org/html/2606.31114)
Huaze Tang (Tsinghua University), Bill Zeng (Meituan), Chao Wang (Tsinghua University), Zhenpeng Shi (Tsinghua University), Qian Zhang (Meituan), Wenbo Ding (Tsinghua University, corresponding author, ding.wenbo@sz.tsinghua.edu.cn)

Footnote: Smoke testing refers to the basic functionality testing of the UTM system. It is conducted as the initial testing after a new build or version of the UTM system.

###### Abstract

Unmanned Traffic Management (UTM) systems are cloud-based platforms designed to manage and coordinate multiple aerial vehicles remotely. UTM systems are safety-critical and cannot tolerate failures such as crashes or collisions. Revealing latent vulnerabilities is difficult because there are neither optimal failure-exposing demonstrations nor clear reward signals. Additionally, UTM's self-healing capability introduces the "long-tail effect" of critical failures. We propose framing UTM vulnerability discovery as a sequence modeling problem amenable to transformer-based RL architectures. Our approach leverages attention mechanisms to directly model the relationships among system states and to predict optimal actions. Our framework introduces a Policy Model that generates targeted test scenarios and an Action Sampler that enforces domain constraints. We use a risk-based reward function to guide exploration. Through extensive evaluation in a 700-hour simulation study, we demonstrate an 8× improvement in vulnerability discovery efficiency compared to expert-guided testing. The framework also discovers critical edge cases that traditional methods have missed.

*Keywords:* Unmanned Traffic Management · Safety-Critical Testing · Decision Transformer · Offline Reinforcement Learning · Scenario Generation

## 1 Introduction

The rapid emergence of the low-altitude economy has necessitated robust Unmanned Traffic Management (UTM) systems (FAA, [2023](https://arxiv.org/html/2606.31114#bib.bib97)). UTM performs centralized traffic control among aerial vehicles. Most system failures in UTM, such as crashes, collisions, and airspace violations, are intolerable (Kopardekar, [2014](https://arxiv.org/html/2606.31114#bib.bib100); Kopardekar et al., [2016](https://arxiv.org/html/2606.31114#bib.bib88)), making it vital to discover potential vulnerability scenarios during UTM iteration and deployment. In this context, a scenario is defined as a temporal segment of the operational trajectories and environmental information of the intelligent agents managed by the UTM system (Tian et al., [2022](https://arxiv.org/html/2606.31114#bib.bib50); Zhong et al., [2021](https://arxiv.org/html/2606.31114#bib.bib40)).

![Refer to caption](https://arxiv.org/html/2606.31114v1/figures/Overview_of_UAV_system_v5.png)

Figure 1: Overview of the operational environment of the Unmanned aircraft system Traffic Management (UTM) System-Under-Test (SUT). The UTM system operates in a variety of environments, including urban, suburban, and rural areas. Each setting poses distinct challenges, such as high-density air traffic in urban regions and limited infrastructure in rural areas, requiring strong management and coordination strategies. Rapid fault detection across these diverse scenarios is essential for maintaining safety and preventing catastrophic failures in real-world deployments.

UTMs are developed with self-healing functionality (L. Gladence et al., [2021](https://arxiv.org/html/2606.31114#bib.bib61)). This feature helps prevent system failures automatically. However, it creates a challenge for testing: most historical data contains only safe operational trajectories, which introduces the "long-tail effect". Moreover, these edge cases often emerge from subtle multi-agent runtime interactions (Wedad Alawad et al., [2023](https://arxiv.org/html/2606.31114#bib.bib63)) rather than simple component failures, and they lack optimal demonstrations or clear reward signals.

To address these challenges, we reformulate UTM vulnerability discovery as a sequence modeling problem amenable to modern transformer-based (Chen et al., [2021](https://arxiv.org/html/2606.31114#bib.bib1)) reinforcement learning (RL) architectures, as detailed in Section [3.5](https://arxiv.org/html/2606.31114#S3.SS5). Our approach leverages the transformer's attention mechanism (Vaswani et al., [2017](https://arxiv.org/html/2606.31114#bib.bib45)) to directly model the relationship between system states, actions (anomaly perturbations), and outcomes in latent space (Lee et al., [2022](https://arxiv.org/html/2606.31114#bib.bib14)), and to directly predict actions. This allows us to generate targeted test scenarios while avoiding both distributional barriers (Fu et al., [2024](https://arxiv.org/html/2606.31114#bib.bib30)) and the shortage of expert demonstration data (Bhargava et al., [2023a](https://arxiv.org/html/2606.31114#bib.bib116)).

Our framework consists of two key components: a Policy Model (PM) that learns to generate targeted test scenarios based on both historical operational data and real-time system states, and an Action Sampler (AS) that enforces domain constraints to ensure physical plausibility. The PM captures complex temporal dependencies and inter-agent interactions through its attention mechanism, enabling it to identify patterns that may lead to system vulnerabilities. We evaluate this approach in a large-scale simulation environment spanning 700 hours of testing across diverse operational contexts. Results demonstrate that our framework achieves over 8 times higher efficiency in vulnerability discovery compared to expert-guided testing, while also uncovering critical edge cases that traditional methods failed to detect. This improvement in testing efficiency and effectiveness showcases the potential of transformer-based approaches for fortifying mission-critical systems through intelligent scenario generation.

## 2 Related Works

### 2.1 Mission Critical System Testing

Testing autonomous systems, especially those operating in complex and dynamic environments like UTM systems and autonomous vehicles, requires comprehensive evaluation under diverse and challenging scenarios\. While these domains may differ in their specific applications, they share common challenges in scenario generation: the need to efficiently explore vast parameter spaces, identify safety\-critical cases, and maintain scenario plausibility\.

In the field of autonomous driving scenario generation, researchers have proposed various approaches to generate challenging test scenarios. O'Kelly et al. ([2019](https://arxiv.org/html/2606.31114#bib.bib109)) presented a scalable testing framework for rare-event simulation based on GAIL and cross-entropy sampling. Ding et al. ([2020](https://arxiv.org/html/2606.31114#bib.bib106)) formulated scenario generation as a reinforcement learning problem, proposing an adaptive editing framework that constructs safety-critical scenarios through sequential editing operations (e.g., adding or modifying agents). In their subsequent work, Ding et al. ([2024](https://arxiv.org/html/2606.31114#bib.bib107)) further explored a retrieval-augmented generation (RAG) approach, which generates new scenarios by retrieving and combining features from existing scenarios, achieving better controllability. Liu et al. ([2024](https://arxiv.org/html/2606.31114#bib.bib108)) proposed a reinforcement learning-based editing framework that supports more flexible scenario construction.

### 2.2 Scenario Generation via Reinforcement Learning

Learning to generate plausible trajectories from historical data is fundamental to both behavior prediction and scenario generation. Early works in inverse reinforcement learning (IRL) (Ng and Russell, [2000](https://arxiv.org/html/2606.31114#bib.bib114)) aim to recover the underlying reward function from expert demonstrations, with maximum entropy IRL (Ziebart et al., [2008](https://arxiv.org/html/2606.31114#bib.bib115)) providing a probabilistic framework that addresses the ambiguity in expert behaviors. To scale to high-dimensional problems, GAIL (Ho and Ermon, [2016](https://arxiv.org/html/2606.31114#bib.bib112)) and its variants like AIRL (Fu et al., [2018](https://arxiv.org/html/2606.31114#bib.bib111)) employ adversarial training to directly learn policies through explicit reward recovery. Recently, transformer-based approaches have shown promising results by reformulating sequential decision making as a sequence modeling problem. Decision Transformer (Chen et al., [2021](https://arxiv.org/html/2606.31114#bib.bib1)) demonstrates that trajectory generation can be achieved through autoregressive prediction conditioned on desired returns, while Trajectory Transformer (Janner et al., [2021](https://arxiv.org/html/2606.31114#bib.bib113)) treats both states and actions as tokens in a unified sequence model. These approaches provide different perspectives on trajectory modeling: IRL methods focus on understanding the underlying decision-making process (Arora and Doshi, [2020](https://arxiv.org/html/2606.31114#bib.bib110)), while transformer-based methods leverage attention mechanisms to capture long-range dependencies in behavioral patterns.

Their complementary strengths suggest potential benefits in combining both frameworks for more effective trajectory and scenario generation. In this work, we combine the IRL idea of learning the characteristics of the world rather than expert behavior with the latent-space modeling capability of Transformer architectures, in order to extend the boundary of UTM testing scenario generation.

## 3 Problem Analysis

### 3.1 UTM Characteristics

UTM is a system that ensures the safe and efficient operation of multiple Unmanned Aerial Vehicles (UAVs) in shared airspace (Kopardekar, [2014](https://arxiv.org/html/2606.31114#bib.bib100)) without requiring human air traffic controllers to manage each UAV directly. Detailed concepts are further illustrated in Appendix [A](https://arxiv.org/html/2606.31114#A1). The UTM system must process high-dimensional inputs from multiple UAVs while simultaneously performing collision detection and route optimization.

The objective of modern UTM vulnerability discovery demands active generation of failure\-triggering sequences\. However, learning must occur from a dataset dominated by safe trajectories, with only sporadic and potentially suboptimal examples of failures\. This presents a unique imbalanced learning challenge\.

Given $N$ UAVs, the UTM system evolves according to its internal protocols, generating operational trajectories $\boldsymbol{\tau}=\{\mathbf{s}_t,\mathbf{a}_t\}_{t=1}^{T}$, where each state $\mathbf{s}_t=\{\mathbf{o}_i^t\}_{i=1}^{N}$ encapsulates the system observations, including UAV states, and each action $\mathbf{a}_t=\{a_i^t\}_{i=1}^{N}$ denotes the anomalous signals periodically injected by classical stress testing methods. Let $\mathcal{P}_{\text{safe}}$ denote the distribution of normal operations and $\mathcal{P}_{\text{crit}}$ the distribution of critical failures. Due to its built-in self-healing functionalities, the tested UTM system usually maintains $\mathbb{P}(\boldsymbol{\tau}\in\mathcal{P}_{\text{safe}})\gg\mathbb{P}(\boldsymbol{\tau}\in\mathcal{P}_{\text{crit}})$.

### 3.2 Scenario Definition

In the context of UTM operations, we define a scenario as a finite slice of a system trajectory that captures both the temporal evolution and the spatial interactions of managed entities. Formally, a scenario $\xi$ can be represented as a subsequence of state-action pairs within trajectory $\boldsymbol{\tau}$, such that $\xi=\{\mathbf{s}_t,\mathbf{a}_t\}_{t=k}^{k+\Delta}$, where $k$ denotes the starting time step and $\Delta$ represents the scenario duration.
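As a minimal sketch of this definition, the snippet below slices a scenario out of a trajectory stored as a list of state-action pairs. The `Step` alias, the function name `extract_scenario`, and the toy string values are illustrative assumptions and do not come from the paper.

```python
from typing import List, Tuple

# Hypothetical representation: one (state, action) pair per time step.
Step = Tuple[str, str]


def extract_scenario(trajectory: List[Step], k: int, delta: int) -> List[Step]:
    """Return the scenario xi = {(s_t, a_t)} for t = k, ..., k + delta.

    Time steps are 1-indexed as in the text, so step t lives at list index t - 1.
    """
    if k < 1 or delta < 0 or k + delta > len(trajectory):
        raise ValueError("scenario window falls outside the trajectory")
    return trajectory[k - 1 : k + delta]


# Toy trajectory with T = 5 steps.
trajectory = [(f"s{t}", f"a{t}") for t in range(1, 6)]
scenario = extract_scenario(trajectory, k=2, delta=2)
```

Under this inclusive indexing a scenario of duration $\Delta$ contains $\Delta+1$ state-action pairs.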

### 3.3 UTM Scenario Discovery: An MDP

For scenario generation, three main prior approaches exist: policy-based dynamic evolution (O'Kelly et al., [2019](https://arxiv.org/html/2606.31114#bib.bib109)), iteration-based editing (Ding et al., [2020](https://arxiv.org/html/2606.31114#bib.bib106); Liu et al., [2024](https://arxiv.org/html/2606.31114#bib.bib108)), and template-based one-shot generation (Ding et al., [2024](https://arxiv.org/html/2606.31114#bib.bib107)). Each approach has distinct trade-offs: dynamic evolution methods produce natural interactions but offer limited control, iterative editing provides precise scenario control but may result in mechanical generation processes, and template-based methods trade off generation efficiency against diversity. Given the time-invariant nature of UTM systems, the state transitions naturally form a Markov Decision Process (MDP). The transition dynamics between states are therefore the primary object of study.

In this framework, the discovery of vulnerability-inducing scenarios can be formulated as an MDP $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{P},r,\gamma)$, where the state space $\mathcal{S}$ captures system configurations, the action space $\mathcal{A}$ represents possible perturbations, $\mathcal{P}$ denotes the transition dynamics, $r$ is the reward function, and $\gamma$ is the discount factor. This formulation, however, presents unique challenges stemming from both UTM system characteristics and technical constraints.

First, the centralized nature of UTM systems introduces high-dimensional state spaces that encompass complex interdependencies among multiple UAVs. The system's complexity is further compounded by the need to maintain global consistency while managing local interactions, resulting in a combinatorial explosion of possible scenarios.

Moreover, the technical intrinsics of scenario discovery introduce additional challenges. The distribution of critical scenarios exhibits a pronounced long-tail effect, where vulnerability-inducing states occupy only a small fraction of the state space, i.e., $\mathbb{P}(\mathbf{s}\in\mathcal{S}_{\text{crit}})\ll\mathbb{P}(\mathbf{s}\in\mathcal{S}_{\text{safe}})$. This distributional imbalance is exacerbated by the lack of expert behavior data in critical scenarios, as most operational data comes from safe system states. Consequently, traditional imitation learning approaches or straightforward policy optimization methods become insufficient for effective scenario discovery.

The transition dynamics $\mathcal{P}$ in this MDP need to capture not only the physical evolution of the system but also the likelihood of transitioning into critical scenarios. Formally, for any state $\mathbf{s}_t$ and action $\mathbf{a}_t$, we define the transition probability as:

$$\mathcal{P}(\mathbf{s}_{t+1}\mid\mathbf{s}_t,\mathbf{a}_t)=\begin{cases}p_{\text{safe}}(\mathbf{s}_{t+1}\mid\mathbf{s}_t,\mathbf{a}_t)&\text{if }\mathbf{s}_{t+1}\in\mathcal{S}_{\text{safe}}\\ p_{\text{crit}}(\mathbf{s}_{t+1}\mid\mathbf{s}_t,\mathbf{a}_t)&\text{if }\mathbf{s}_{t+1}\in\mathcal{S}_{\text{crit}}\end{cases}\tag{1}$$

where $p_{\text{safe}}$ and $p_{\text{crit}}$ represent the transition dynamics in safe and critical regions respectively.

### 3.4 Reinforcement Learning Formulation

In our vulnerability discovery task, scenarios inducing system instability are assigned higher rewards, guiding the exploration towards potential failure modes. Formally, we aim to find a policy $\pi$ that maximizes the expected cumulative reward:

$$J(\pi)=\mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{T}\gamma^{t}r(\mathbf{s}_t,\mathbf{a}_t)\right]\tag{2}$$

where the reward function $r(\mathbf{s}_t,\mathbf{a}_t)$ is designed to favor transitions that push the system towards unstable states, as detailed in the following paragraphs.
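To make the expected cumulative reward concrete, the sketch below estimates $J(\pi)$ by averaging discounted returns over sampled rollouts. It assumes the per-step rewards of each rollout have already been collected; the function names and the toy numbers are illustrative only.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Compute sum_{t=0}^{T} gamma^t * r_t for one sampled trajectory."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


def estimate_objective(rollouts: Sequence[Sequence[float]], gamma: float) -> float:
    """Monte Carlo estimate of J(pi): the average discounted return over rollouts."""
    return sum(discounted_return(r, gamma) for r in rollouts) / len(rollouts)


# Two toy rollouts of per-step rewards (hypothetical values).
rollouts = [[0.0, 0.0, 1.0], [0.0, 2.0, 0.0]]
j_hat = estimate_objective(rollouts, gamma=0.5)
```

With $\gamma=0.5$ the two rollouts contribute $0.25$ and $1.0$, so the estimate is their mean, $0.625$.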

Given the inherently high-dimensional state space and complex dynamics of UTM systems, explicitly modeling the state transition probability distribution $\mathcal{P}_a(\mathbf{s},\mathbf{s}')$ faces computational scalability challenges and, more fundamentally, modeling complete system dynamics remains largely infeasible (Ha and Schmidhuber, [2018](https://arxiv.org/html/2606.31114#bib.bib118)). This limitation is particularly evident in traditional test case design approaches (Lee et al., [2015](https://arxiv.org/html/2606.31114#bib.bib119)). Attempts to capture system physics and constraints through rule-based modeling (Liu and Ozay, [2014](https://arxiv.org/html/2606.31114#bib.bib120)) inevitably result in insufficient test coverage or inadequate test quality. As a workaround, we propose a policy-centric alternative that learns action selection strategies rather than directly modeling action prediction, thereby implicitly capturing the MDP dynamics of state transitions through strategic exploitation (Silver et al., [2018](https://arxiv.org/html/2606.31114#bib.bib117)) and circumventing both the theoretical and practical challenges of explicit world modeling.

Specifically, our optimization objective becomes:

$$\max_{\pi}\ \mathbb{P}\big(\mathbf{s}_{t+1}\in\mathcal{S}_{\text{crit}}\mid\mathbf{s}_t,\ \mathbf{a}_t\sim\pi(\cdot\mid\mathbf{s}_t)\big)\tag{3}$$

#### State

The state space of the testing framework is a concatenation of temporal, spatial, and mission information across all UAVs. We define $o_i^t\in\mathcal{O}\subset\mathbb{R}^{d}$ as a vector of $d$ relevant features that encapsulates the observable information for the $i$-th UAV at the $t$-th timestep, including kinetic information (position, velocity, and acceleration of all UAVs), environmental data (obstacles, weather conditions, and airspace restrictions), and mission-specific details (battery levels, payload capacity, and route destinations). To capture both cross-UAV and temporal dependencies, we represent the state $\mathbf{s}\in\mathcal{S}$ as a sequence of observations of all $N$ UAVs over a fixed time window $T$, namely $\mathbf{s}=\{(o_1^1,\dots,o_N^1),\dots,(o_1^T,\dots,o_N^T)\}$.

#### Action

The action space $\mathcal{A}$ comprises a discrete set of all possible injection operations, each targeting a specific component or aspect of the SUT. We define $\mathcal{A}$ as the Cartesian product of two sets, $\mathcal{A}=\mathcal{D}'\times\mathcal{F}$, where $\mathcal{D}'$ represents the set of $N$ targetable UAVs and $\mathcal{F}$ is the set of $m$ applicable disturbance injections. All possible injection types are listed in Table [7](https://arxiv.org/html/2606.31114#A7.T7) in Appendix [G](https://arxiv.org/html/2606.31114#A7.SS0.SSS0.Px2).
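The Cartesian-product structure can be illustrated with a few lines of Python. The UAV identifiers and disturbance names below are hypothetical placeholders; the actual injection types are the ones listed in the paper's appendix.

```python
from itertools import product
from typing import Dict, List, Tuple

# Hypothetical identifiers, chosen only for illustration.
uavs: List[str] = ["uav_0", "uav_1", "uav_2"]   # D': N targetable UAVs
faults: List[str] = ["gps_drift", "link_loss"]  # F: m disturbance injections

# A = D' x F: every (target UAV, disturbance) pair is one discrete action.
action_space: List[Tuple[str, str]] = list(product(uavs, faults))

# Map each action to an integer index, as a policy head over logits would need.
action_index: Dict[Tuple[str, str], int] = {
    action: i for i, action in enumerate(action_space)
}
```

The action space therefore has $N\times m$ elements, six in this toy case.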

#### Reward

The reward function $r$ is designed to capture the system's safety and operational efficiency as $r(s_t,a_t)=\sum_{i=1}^{K}\alpha_i r_i(s_t,a_t)$, denoting the reward at timestep $t$, where $r_i$ are individual reward components (e.g., collision avoidance, mission completion, system stability) and $\alpha_i$ are their respective weights. The return-to-go is the cumulative sum of rewards from the current time $t$ to the ending time $T$, $R_{t:T}=\sum_{i=t}^{T}r_i$, where in this sum $r_i$ denotes the total reward at timestep $i$ rather than an individual component.
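A small sketch of both quantities follows. It assumes the individual reward components have already been evaluated for a given step; the component values and weights are made up for illustration.

```python
from typing import List, Sequence


def weighted_reward(components: Sequence[float], weights: Sequence[float]) -> float:
    """r(s_t, a_t) = sum_i alpha_i * r_i(s_t, a_t) for already-evaluated components."""
    if len(components) != len(weights):
        raise ValueError("one weight per reward component is required")
    return sum(a * r for a, r in zip(weights, components))


def returns_to_go(rewards: Sequence[float]) -> List[float]:
    """R_{t:T} for every t, computed with a single backward pass (undiscounted)."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running += rewards[t]
        rtg[t] = running
    return rtg


# Hypothetical component values and weights for one timestep.
step_reward = weighted_reward(components=[1.0, 0.0, 0.5], weights=[2.0, 1.0, 4.0])
# Hypothetical per-step rewards for a four-step episode.
rtg = returns_to_go([1.0, 0.0, 2.0, 3.0])
```

The backward pass avoids recomputing each suffix sum, so all $T$ returns-to-go cost $O(T)$ in total.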

### 3.5 Choice of Decision Transformer

Given UTM's safety-critical nature and the biased data distribution discussed earlier, we adopt an offline reinforcement learning approach to discover vulnerability-inducing scenarios. We leverage the Decision Transformer (DT) architecture to model the MDP dynamics, with DT's return-to-go design providing natural guidance.

The self-attention mechanism also interleaves information across different heads, which favors modeling agent-wise interactions. We use the Transformer to (1) model the complex system mechanism by learning rewards and returns, and (2) generate targeted and valuable actions based on this knowledge of the world. Formally, given a sequence of state-return pairs $(s_1,R_1),\dots,(s_T,R_T)$, the decoder-only Transformer processes this information end-to-end into a set of predictions $\{(\hat{R}_1,\hat{a}_1),\dots,(\hat{R}_T,\hat{a}_T)\}$, with $\hat{R}$ as a regressive model of the world and $\hat{a}$ as the action decision, where $\hat{R}_t=f_\theta(s_1,R_1,\dots,s_t)$ and $\hat{a}_t=f_\theta(s_1,R_1,\dots,s_t,R_t)$, with $f_\theta$ denoting the Transformer decoding function with parameters $\theta$. By conditioning action generation on desired future returns, we can steer the exploration towards potentially vulnerable states while respecting the constraints of the offline setting. Formally, we learn a sequence prediction model that bridges the distributional gap through return-conditioned generation:

$$p_\theta(\mathbf{a}_t\mid\mathbf{s}_t,R_{t:T},\mathcal{H}_t)=\text{Transformer}([\mathbf{x}_{t-l},\dots,\mathbf{x}_t])\tag{4}$$

where $\mathbf{s}_t$ represents the current system state, $R_{t:T}$ denotes the return-to-go, and $\mathcal{H}_t$ captures the history of previous states and actions. Each token $\mathbf{x}_i$ in the input sequence combines state, action, and return-to-go information through learned embeddings over a context length $l$.

Considering complexity and heterogeneity of UTM run\-time states, the transformer’s attention mechanism and latent space encoding capabilities significantly enhance exploration efficiency\. Self\-attention mechanisms capture complex patterns in historical data that correlate with different system outcomes, from safe operations to potential failures\. We train the model by optimizing a dual objective that encompasses both action prediction and return estimation:

$$\begin{aligned}\mathcal{L}(\theta)&=\mathbb{E}_{(\mathbf{s}_t,\mathbf{a}_t,R_{t:T})\sim\mathcal{D}}\big[-\log p_\theta(\mathbf{a}_t\mid\mathbf{s}_t,R_{t:T},\mathcal{H}_t)\big]\\&\quad+\mathbb{E}_{(\mathbf{s}_t,\mathbf{a}_t,R_{t:T})\sim\mathcal{D}}\big[-\log p_\theta(R_{t:T}\mid\mathbf{s}_t,\mathcal{H}_t)\big]\end{aligned}\tag{5}$$

To further enhance the efficiency of offline data utilization and address the challenges of imbalanced data distribution and limited expert demonstrations, we augment the transformer architecture with advanced sampling mechanisms. These mechanisms help mitigate the scarcity of expert data and the inherent imbalance in sample distribution; their details are elaborated in subsequent sections.
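The dual objective in Eq. (5) can be sketched in plain Python as the sum of an action negative log-likelihood and a return negative log-likelihood. The paper does not state the form of the return distribution, so the unit-variance Gaussian used here (which reduces to a squared error up to a constant) is an assumption, as are the function names and the toy batch.

```python
import math
from typing import List, Sequence, Tuple

# One sample: (action logits, target action index, predicted RTG, observed RTG).
Sample = Tuple[Sequence[float], int, float, float]


def action_nll(logits: Sequence[float], target: int) -> float:
    """-log softmax(logits)[target], computed with the log-sum-exp trick."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def return_nll(predicted: float, observed: float) -> float:
    """-log N(observed; predicted, 1) up to an additive constant (assumed form)."""
    return 0.5 * (observed - predicted) ** 2


def dual_loss(batch: List[Sample]) -> float:
    """Batch average of action NLL plus return NLL."""
    total = 0.0
    for logits, target, predicted_rtg, observed_rtg in batch:
        total += action_nll(logits, target) + return_nll(predicted_rtg, observed_rtg)
    return total / len(batch)


batch: List[Sample] = [
    ([0.0, 0.0], 0, 1.0, 1.0),  # uniform logits, exact return prediction
    ([0.0, 0.0], 1, 0.0, 2.0),  # uniform logits, return prediction off by 2
]
loss = dual_loss(batch)
```

In a real training loop both terms would come from heads of the same Transformer and be minimized jointly by gradient descent; this sketch only evaluates the loss value.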

## 4 Framework Architecture

![Refer to caption](https://arxiv.org/html/2606.31114v1/figures/SysOverview_v5.png)

Figure 2: Architecture overview of the proposed scenario-oriented testing framework. The framework consists of two primary modules: (1) a Transformer-based Policy Model (PM) for generating fault scenarios based on real-time and historical SUT data, and (2) an Action Sampler (AS) that enforces predefined safety rules and filters out undesirable actions. The validated scenarios are then injected into the System-Under-Test (SUT) for evaluation. This architecture narrows the search space to high-risk scenarios, improving fault detection efficiency and reducing unnecessary exploration of low-risk cases.

In this section, we present a framework that generates complex testing scenarios while incorporating domain knowledge and expert preferences. As shown in Fig. [2](https://arxiv.org/html/2606.31114#S4.F2), we use a Transformer-based model as the policy model (PM) to generate candidate actions conditioned on system states. According to the action space defined in Section [3.4](https://arxiv.org/html/2606.31114#S3.SS4), generated actions can be interpreted as potential fault injections to be applied to typical victim drones in UTM. Subsequently, these actions are passed through a domain-specific action sampler (AS). The AS serves two purposes: (1) to ensure that the PM-generated actions are available within the specific UTM context, and (2) to leverage human expert knowledge to re-sample actions with a balanced preference bias over the chosen actions and agents. Only sampled actions are injected into the system-under-test (SUT). Once the SUT finishes execution, a new system state is generated and fed back to the PM, along with the evaluation of the actions (the reward). The PM iteratively explores the scenario space to expose potential vulnerabilities.

### 4.1 Policy Model

In this subsection, we describe the design of the PM, following the RL formulation defined in Section [3.4](https://arxiv.org/html/2606.31114#S3.SS4). The Policy Model serves as the generative engine of our framework, using a Transformer architecture to capture complex temporal dependencies and system dynamics. During training, the PM models trajectory sequences from the UTM and learns the intrinsic regularities of the offline dataset. During inference, the PM processes the real-time UTM context and proposes fault injection actions to the AS.

![Refer to caption](https://arxiv.org/html/2606.31114v1/figures/transformer_v7.png)

Figure 3: Architecture of the Policy Model (PM). The PM utilizes a Transformer-based reinforcement learning framework, taking both historical and real-time SUT states as input tokens to capture temporal dependencies and system dynamics. The model generates action sequences that include both environmental manipulations (e.g., placing obstacles) and internal state changes (e.g., network degradation).

#### Time sequence and action modeling

Expanding on previous work that used Transformers for decision-making (Chen et al., [2021](https://arxiv.org/html/2606.31114#bib.bib1)), we unify temporal sequences by projecting observations $o$, actions $a$, returns $R$, and rewards $r$ into a homogeneous space. We aggregate rewards $r$ into summary tokens, similar to the construction of the return-to-go token $R$ (summarizing $T$ incoming timesteps). The input sequence thus carries data from $3\times T$ timesteps in total while focusing on the central $T$ *current* timesteps. Tokens are then arranged as an array of $\left\langle\mathbf{O},\mathbf{R},\mathbf{A}\right\rangle$ tuples spanning $T$ timesteps. To respect temporal dependency in decision making, we mask out the $\mathbf{R}$ and $\mathbf{A}$ tokens except those in the last timestep. The model thus uses $T\times N$ observation tokens to predict the current return-to-go token $\hat{R}$ so that it fits the ground-truth return $R$, which amounts to learning the implicit nature of the system. An intermediate *mask* token is introduced to mask out invalid action choices, which models the system's capability in its current state.

#### Embedding and Causality

To enhance the modeling of causal dependencies within the policy model, we employ a multi-faceted approach. We augment the sequentially sampled multi-agent drone observation data with positional embeddings. Additionally, as shown in Fig. [3](https://arxiv.org/html/2606.31114#S4.F3), the input sequence is augmented with different classification (CLS) tokens that act as discriminators to reduce the ambiguity of prediction targets. Inspired by insights from Shaw et al. ([2018](https://arxiv.org/html/2606.31114#bib.bib54)), we prioritize the most recent observations by placing them closest to the CLS token, ensuring that the model pays particular attention to the latest information when making decisions. This aligns with the principle that recent events often carry more causal relevance than distant ones.

To capture long-range dependencies, we employ a self-attention mechanism among tokens together with a semi-lower-triangular, agent-wise causal mask in the attention calculation to preserve decision causality. Observation tokens $o$ at the same timestep are all visible to each other. However, the $\hat{R}$ tokens are predicted with only *observation* tokens visible, before the ground-truth *return-to-go* token is fed in, and only older $\left\langle\mathbf{O},\mathbf{R},\mathbf{A}\right\rangle$ tuples are visible to newer ones. We aim to guide the model to construct a more comprehensive and nuanced understanding of the causal dynamics. Formally, we can express the prediction task sequentially as $\hat{a}_t=f_\theta(S_{-t:1},o_{1:t},R_t,M_t)$, where $S$ denotes the *summary* token aggregating the previous $T$ time steps and $f_\theta$ represents the Transformer model with parameters $\theta$.
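One possible reading of the visibility rules above is sketched below as a boolean attention mask. The token layout (N observation tokens, then one return token, then one action token per timestep) and the names `build_tokens`, `causal_mask`, and `RANK` are assumptions made for illustration; summary, CLS, and mask tokens are left out for brevity.

```python
from typing import List, Tuple

Token = Tuple[int, str]  # (timestep, kind) with kind in {"O", "R", "A"}

# Within one timestep, a token may attend to kinds of equal or lower rank.
RANK = {"O": 0, "R": 1, "A": 2}


def build_tokens(num_steps: int, num_uavs: int) -> List[Token]:
    """Lay out N observation tokens, one return token, one action token per step."""
    tokens: List[Token] = []
    for t in range(num_steps):
        tokens += [(t, "O")] * num_uavs
        tokens += [(t, "R"), (t, "A")]
    return tokens


def causal_mask(tokens: List[Token]) -> List[List[bool]]:
    """mask[i][j] is True when token i is allowed to attend to token j."""
    n = len(tokens)
    mask = [[False] * n for _ in range(n)]
    for i, (ti, ki) in enumerate(tokens):
        for j, (tj, kj) in enumerate(tokens):
            if tj < ti:
                mask[i][j] = True  # older tuples are visible to newer ones
            elif tj == ti:
                mask[i][j] = RANK[kj] <= RANK[ki]  # O sees O; R sees O, R; A sees all
    return mask


tokens = build_tokens(num_steps=2, num_uavs=2)
mask = causal_mask(tokens)
```

The result is lower-triangular at the level of timesteps but has full blocks among same-step observation tokens, which is one way to interpret "semi-lower-triangular".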

### 4.2 Action Sampler

Inductive bias and limited generality are key drawbacks of traditional offline RL methods. We design a set of sampling strategies as a workaround. In this subsection, we first introduce preference bias as a notion of human feedback. We then describe how the action sampler operates between the PM and the SUT.

![Refer to caption](https://arxiv.org/html/2606.31114v1/figures/action_sampler_v3.png)Figure 4:Pipeline of the Action Sampler \(AS\)\.The AS enforces safety constraints and domain\-specific rules, filtering out irrelevant actions generated by the Policy Model \(PM\) before injecting them into the System\-Under\-Test \(SUT\), ensuring the integrity of the testing process\.#### Preference Bias

In offline RL with autoregressive models, the collected training data often exhibits biased distributions, which limits the sampling of rare events. Meanwhile, the more complex the system under test, the more insidious its vulnerabilities and the more pronounced the long-tail effect. In this work, the training dataset is collected through traditional stress testing, where unpredictable inductive bias is common in production systems.

We introduce Preference Bias, extended from popularity bias (Klimashevskaia et al., [2024](https://arxiv.org/html/2606.31114#bib.bib91)) with additional domain expert knowledge, to unify the imbalance in model predictions and the gap relative to prior human preference. Preference bias carries an expected distribution of $\left\langle\mathrm{UAV},\mathrm{Action}\right\rangle$ tuples. The output of the offline-trained PM is augmented with a compensation term dynamically calculated from the distance between recent historical trajectories and the given distribution.

#### Action Candidate Sampling

As shown in Fig. [4](https://arxiv.org/html/2606.31114#S4.F4), we adjust the PM’s predicted action logits using the preference distribution. To address the long-tail effect and improve fairness (Menon et al., [2020](https://arxiv.org/html/2606.31114#bib.bib79)), Top-K sampling is applied after this augmentation to maintain variance. To maintain physical feasibility, an immediate action mask then filters out intolerable action candidates. The final action is drawn uniformly from the candidates that remain after masking.
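The sampling pipeline above (logit compensation, Top-K selection, feasibility masking, uniform draw) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the compensation is taken to be a simple scaled difference between the preferred distribution and the recent empirical action share, and the names `sample_action`, `strength`, and the fallback behaviour when every Top-K candidate is masked are hypothetical.

```python
import random
from typing import List, Optional


def sample_action(
    logits: List[float],
    preferred: List[float],
    recent_counts: List[int],
    feasible: List[bool],
    k: int = 3,
    strength: float = 1.0,
    rng: Optional[random.Random] = None,
) -> int:
    """Pick one action index from the PM logits (illustrative sketch)."""
    rng = rng or random.Random()
    n = len(logits)
    total = sum(recent_counts)
    # Empirical share of each action in the recent trajectory window.
    recent = [c / total if total else 1.0 / n for c in recent_counts]
    # Assumed compensation: boost actions that are under-represented
    # relative to the preferred distribution, penalise over-used ones.
    adjusted = [l + strength * (p - r) for l, p, r in zip(logits, preferred, recent)]
    # Top-K candidates after augmentation.
    top_k = sorted(range(n), key=lambda a: adjusted[a], reverse=True)[:k]
    # Immediate action mask: drop physically infeasible candidates.
    candidates = [a for a in top_k if feasible[a]]
    if not candidates:
        # Hypothetical fallback: best feasible action outside the Top-K.
        feasible_ids = [a for a in range(n) if feasible[a]]
        if not feasible_ids:
            raise ValueError("no feasible action available")
        return max(feasible_ids, key=lambda a: adjusted[a])
    # Uniform draw among the surviving candidates.
    return rng.choice(candidates)


if __name__ == "__main__":
    action = sample_action(
        logits=[2.0, 1.0, 0.5, -1.0],
        preferred=[0.25, 0.25, 0.25, 0.25],
        recent_counts=[10, 0, 0, 0],
        feasible=[True, False, True, True],
        k=2,
        strength=2.0,
    )
    print(action)
```

Drawing uniformly from the masked Top-K set, rather than taking the arg-max, is what keeps less likely but still plausible actions in play.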

## 5 Evaluation

We train the proposed framework on a large-scale offline dataset of around 17B tokens collected from stress testing and evaluate it on an industry-level simulator. As summarised in Table [8](https://arxiv.org/html/2606.31114#A9.T8) in Appendix [I](https://arxiv.org/html/2606.31114#A9), the training set consists of seven distinct regions and online testing includes two regions. The training dataset covers diverse geographical and operational characteristics, including a mix of rural (12.2%), suburban (39.0%), and urban areas (48.8%), each with varying numbers of UAVs, airports, and flight lines. The dataset is balanced to represent the typical distribution of scenarios encountered in real-world UTM systems. For testing, two regions (TR1 and TR2) are excluded from the training set to evaluate generalization capabilities.

We design two models of different sizes, with 1.2 billion and 2 billion parameters (referred to as PM-1.2B and PM-2B, respectively). We train each model on 16 NVIDIA A100 GPUs, each equipped with 80GB of memory. Training uses PyTorch’s Distributed Data Parallel (DDP) to distribute the workload across the GPUs. The dataset is divided into smaller slices of 3B tokens that are loaded sequentially during training.

We evaluate the proposed model through both offline and online evaluations. In Section [5.1](https://arxiv.org/html/2606.31114#S5.SS1), we focus on the offline evaluation of the PM’s behavior during training, where we analyze the evolution of action accuracy and return-to-go loss. In Section [5.2](https://arxiv.org/html/2606.31114#S5.SS2), the online evaluation measures the model’s performance in a deployed real-world environment, where we collect and analyze a range of key metrics. This dual evaluation covers the model’s efficacy both during training and in practical applications. As shown in Fig. [5](https://arxiv.org/html/2606.31114#S5.F5), our framework reveals many more potential vulnerabilities than the conventional hand-crafted strategy does.

### 5.1 Offline Evaluation

![Refer to caption](https://arxiv.org/html/2606.31114v1/figures/UTM_fault_demo_withoutlogo.png)

Figure 5: Examples of detected UTM fault scenarios, where yellow marks indicate fault cases revealed by the traditional expert method and red marks indicate those revealed by the framework.

For offline evaluation, we focus on the impact of model size on action accuracy and return-to-go loss during training. In particular, we report top-k action accuracy because, in our framework, actions are sampled from the top-k predictions rather than solely the top-1. We further discuss the impact of top-1 accuracy in Appendix [A](https://arxiv.org/html/2606.31114#A1). The results in Fig. [6](https://arxiv.org/html/2606.31114#S5.F6) illustrate that larger models consistently perform better on both action accuracy (highest) and return-to-go loss (lowest). This indicates that larger models have a better capacity to capture the underlying structure in the offline data, achieving more accurate action selections with fewer training tokens. Fig. [6](https://arxiv.org/html/2606.31114#S5.F6) also reveals that the PM-2B model begins to overfit much later than the smaller PM-10M and PM-100M models. This suggests that larger models not only perform better in terms of action accuracy but also generalize better, allowing them to continue learning effectively with more data before overfitting sets in. This behavior is a hallmark of the scaling effect, where larger models benefit from increased capacity and more robust training dynamics, making them more resistant to overfitting than smaller models.
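For readers unfamiliar with the metric, top-k action accuracy counts a prediction as correct when the logged action appears among the k highest-scoring actions. The sketch below uses a hypothetical helper `top_k_accuracy` and toy scores; it only illustrates the metric and is not taken from the paper's evaluation code.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]], targets: Sequence[int], k: int
) -> float:
    """Fraction of samples whose target action is among the k best-scored actions."""
    hits = 0
    for row, target in zip(scores, targets):
        # Indices of the k highest-scoring actions for this sample.
        top = sorted(range(len(row)), key=lambda a: row[a], reverse=True)[:k]
        hits += target in top
    return hits / len(targets)


if __name__ == "__main__":
    toy_scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
    toy_targets = [2, 0, 1]
    print(top_k_accuracy(toy_scores, toy_targets, k=1))
    print(top_k_accuracy(toy_scores, toy_targets, k=2))
```

On this toy input the top-1 value is lower than the top-2 value, which mirrors why a sampler that draws from the top-k set can tolerate modest top-1 accuracy.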

![Refer to caption](https://arxiv.org/html/2606.31114v1/figures/Training_Tokens.png)\(a\)Action accuracy of PM;
![Refer to caption](https://arxiv.org/html/2606.31114v1/figures/Training_Tokens_top23.png)\(b\)Top 2/3 action accuracy;
![Refer to caption](https://arxiv.org/html/2606.31114v1/figures/RTG_Loss.png)\(c\)Return\-to\-go loss of PM;

Figure 6: Offline evaluation results on validation sets during training. The action accuracy and return-to-go loss of the models (PM-10M, PM-100M, and PM-2B) are measured over increasing training tokens on validation sets. All models show an initial increase in accuracy, followed by a decline, indicating overfitting. Similarly, all models eventually show an increase in return-to-go loss, signaling overfitting. Larger models demonstrate a clear advantage, achieving significantly higher accuracy and lower return-to-go loss than the smaller models. The peak action accuracy for each curve is highlighted with a star.

### 5.2 Online Evaluation

To evaluate the effectiveness of the proposed framework in unseen environments, we select several key metrics that capture the preference and effectiveness of the PM, as well as the quality of its actions, as shown in Table [5](https://arxiv.org/html/2606.31114#A6.T5). For a detailed explanation of each metric, we refer to Appendix [F](https://arxiv.org/html/2606.31114#A6).

| Metrics | PM-2B (TR1) | PM-2B (TR2) | PM-1.2B (TR1) | PM-1.2B (TR2) | Expert-Guided Exploitation (TR1) | Expert-Guided Exploitation (TR2) | Smoke Test¹ (TR1) | Smoke Test¹ (TR2) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| APO (%) | 20.0 | 31.5 | 55.3 | 38.3 | 72.0 | 83.3 | 100 | 100 |
| APD (%) | 26/34/21/19 | 46/32/11/11 | 28/27/22/23 | 30/29/20/21 | 25/25/25/25 | 25/25/25/25 | N/A | N/A |
| HAR (%) | 10.8 | 4.9 | 6.7 | 4.2 | 3.6 | 1.7 | N/A | N/A |
| CAR (%) | 29.7 | 64.1 | 4.0 | 4.5 | 4.1 | 3.9 | N/A | N/A |

| Metrics (one value per method) | PM-2B | PM-1.2B | Expert-Guided Exploitation | Smoke Test¹ |
| --- | --- | --- | --- | --- |
| SPM | 50.5 | 17.6 | 5.8 | N/A |
| FPM | 7.6 | 2.2 | <1.0⁷ | <1.0⁷ |

¹ Smoke testing refers to the basic functionality testing of the UTM system. It is conducted as the initial testing after a new build or version of the UTM system.

⁷ The FPMs are below 1.0 because the two baseline tests have already been thoroughly used to identify existing bugs and improve UTM in advance, while our method is focused on discovering new bugs in the updated version of the UTM system after the baselines have reached their detection limits.

Table 1: Performance metrics of the proposed framework in online environments of unseen regions. This table shows the online results in the out-of-distribution regions TR1 and TR2. Results of PM models are reported over more than 700 hours of testing in total, with around 100M records for each model in each region. The detailed definitions of the metrics can be found in Table [5](https://arxiv.org/html/2606.31114#A6.T5) and Appendix [F](https://arxiv.org/html/2606.31114#A6).

From the results shown in Table [1](https://arxiv.org/html/2606.31114#S5.T1), we can conclude that the proposed PM-2B model significantly outperforms both the expert-guided testing and smoke test baselines across all key metrics. Specifically, PM-2B generates high-risk scenarios eight times faster than smoke testing, and is able to discover bugs where the expert-guided testing method fails to. This indicates that the proposed framework is more effective in identifying critical scenarios and potential failures. Furthermore, compared with the smaller PM-1.2B model, PM-2B performs significantly better in action quality and efficiency. This suggests a scaling effect between model size and online performance in discovering critical cases and efficiently covering high-risk regions. Interestingly, the PM-2B model detected failure modes (SPM and FPM) that the smoke test completely missed. This emergent capability shows that the PM framework can find faults beyond traditional rule-based methods, demonstrating its utility for uncovering rare bugs. Considering both the scaling effect and emergent abilities, our framework shows significant promise for scaling up model sizes, and has the potential to become a breakthrough in the testing field. However, PM models fail to balance the distribution of different action types, which could lead to under-exploration of less frequent action spaces. This suggests a need for better action sampling strategies.

## 6 Discussion

#### Why don’t we use inverse RL?

One might consider using inverse reinforcement learning \(IRL\) to infer the underlying reward function from the system’s operational trajectories and then optimize for its opposite to discover vulnerabilities\. However, IRL faces several fundamental limitations in UTM testing\. The primary issue lies in the inherent ambiguity of the observed system behaviors\. Unlike traditional IRL settings where expert demonstrations represent optimal or near\-optimal policies, our historical trajectories mostly consist of operations that finally result in “safe” states, making it difficult to reliably invert the system’s true safety objectives\. Furthermore, there exists a fundamental tension in the learning objectives: strictly imitating the suboptimal aspects of historical operations might perpetuate existing blindspots in testing coverage, while over\-idealizing the system’s intended behavior could lead to unrealistic vulnerability scenarios\. This inherent ambiguity, combined with the need for active failure exploration, makes IRL less suitable than our direct policy learning approach\.

#### Why does the proposed framework exceed the performance of human experts?

Although trained with expert-guided exploitation data, the PM ultimately surpasses the performance of human experts. We attribute this to the PM’s use of offline RL, which can be viewed as an implicit filter of low-quality actions (Prudencio et al., [2023](https://arxiv.org/html/2606.31114#bib.bib92)), making it less susceptible to distraction during the search for long-tail scenarios.

We can illustrate this by analyzing the hazard action ratio per observation, which is obtained by multiplying HAR and APO, and the constant-pressure action ratio per observation, calculated by multiplying CAR and APO. For PM-2B, PM-1.2B, and human experts, the hazard action ratio per observation is consistently around 2%. This shows that all methods are similarly effective in identifying high-risk actions. However, the key difference is that the PM models demonstrate a significantly higher constant-pressure action ratio per observation, indicating that they maintain a more sustained level of high-risk actions over time. This ability to constantly pose challenges and maintain pressure highlights the advantage of the PM models in exploring complex, high-risk scenarios more thoroughly, thereby leading to superior fault detection and scenario coverage.
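The two per-observation ratios can be recomputed directly from Table 1. The sketch below transcribes the APO, HAR, and CAR values from that table and multiplies them as described above; treating both quantities as percentages (hence the division by 100) and the helper names are assumptions of this illustration.

```python
from typing import Dict, Tuple

# (APO %, HAR %, CAR %) per method and region, transcribed from Table 1.
TABLE_1: Dict[str, Dict[str, Tuple[float, float, float]]] = {
    "PM-2B": {"TR1": (20.0, 10.8, 29.7), "TR2": (31.5, 4.9, 64.1)},
    "PM-1.2B": {"TR1": (55.3, 6.7, 4.0), "TR2": (38.3, 4.2, 4.5)},
    "Expert-Guided": {"TR1": (72.0, 3.6, 4.1), "TR2": (83.3, 1.7, 3.9)},
}


def per_observation(ratio_pct: float, apo_pct: float) -> float:
    """Product of two percentages, expressed again as a percentage."""
    return ratio_pct * apo_pct / 100.0


def summarize(table: Dict[str, Dict[str, Tuple[float, float, float]]]):
    """Map (method, region) to (hazard, constant-pressure) ratios per observation."""
    rows = {}
    for method, regions in table.items():
        for region, (apo, har, car) in regions.items():
            rows[(method, region)] = (
                per_observation(har, apo),
                per_observation(car, apo),
            )
    return rows


if __name__ == "__main__":
    for (method, region), (hazard, pressure) in summarize(TABLE_1).items():
        print(f"{method:14s} {region}: hazard={hazard:.2f}%  constant-pressure={pressure:.2f}%")
```

Running it prints one line per method and region, so each statement in the paragraph above can be checked against the table entry it rests on.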

## 7 Conclusion

We propose a novel scenario-oriented testing framework for vulnerability detection in mission-critical systems, specifically applied to UTM. Our approach leverages a Transformer-based policy model to tackle the long-tail effect and the efficiency challenge in fault detection. Context utilization in the policy model improves generality in unseen regions. Our results highlight the potential of hybrid approaches that combine learning with expert knowledge in fortifying mission-critical systems. This work opens new directions for end-to-end auto-regressive learning in safety-critical system testing. Future work could explore the application of this framework to other mission-critical domains beyond UTM, such as autonomous vehicles or industrial control systems.

## Impact Statement

This work aims to enhance the safety and reliability of Unmanned Traffic Management systems, which have significant societal implications as aerial mobility becomes increasingly important in urban environments\. While our framework’s ability to discover system vulnerabilities advances testing capabilities, we acknowledge that such knowledge could potentially be misused, and have therefore designed our methodology to be accessible only to authorized system developers and testers, with appropriate safeguards for responsible disclosure\. The improved efficiency in detecting critical scenarios could accelerate UTM deployment, potentially affecting traditional air traffic management roles while creating new opportunities in system development and maintenance\. Furthermore, our approach may have broader applications in other safety\-critical domains such as autonomous vehicles, medical devices, and industrial control systems, underscoring the importance of maintaining rigorous ethical standards and careful consideration of failure consequences as the technology is adapted to new contexts\. We are committed to open dialogue with stakeholders and the research community about these implications as the technology continues to evolve\.

## References

- S\. Arora and P\. Doshi \(2020\)A Survey of Inverse Reinforcement Learning: Challenges, Methods and Progress\.arXiv\.External Links:1806\.06877,[Document](https://dx.doi.org/10.48550/arXiv.1806.06877)Cited by:[§2\.2](https://arxiv.org/html/2606.31114#S2.SS2.p1.1)\.
- P\. Bhargava, R\. Chitnis, A\. Geramifard, S\. Sodhani, and A\. Zhang \(2023a\)When should we prefer Decision Transformers for Offline Reinforcement Learning?\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=vpV7fOFQy4)Cited by:[§1](https://arxiv.org/html/2606.31114#S1.p3.1)\.
- P\. Bhargava, R\. Chitnis, A\. Geramifard, S\. Sodhani, and A\. Zhang \(2023b\)When should we prefer decision transformers for offline reinforcement learning?\.InThe Twelfth International Conference on Learning Representations,Cited by:[Appendix G](https://arxiv.org/html/2606.31114#A7.SS0.SSS0.Px3.p1.1)\.
- T\. B\. Brown \(2020\)Language models are few\-shot learners\.arXiv preprint arXiv:2005\.14165\.Cited by:[Appendix D](https://arxiv.org/html/2606.31114#A4.p1.1)\.
- Y\. Chebotar, Q\. Vuong, K\. Hausman, F\. Xia, Y\. Lu, A\. Irpan, A\. Kumar, T\. Yu, A\. Herzog, K\. Pertsch,et al\.\(2023\)Q\-transformer: scalable offline reinforcement learning via autoregressive q\-functions\.InConference on Robot Learning,pp\. 3909–3928\.Cited by:[Appendix D](https://arxiv.org/html/2606.31114#A4.p1.1)\.
- L\. Chen, K\. Lu, A\. Rajeswaran, K\. Lee, A\. Grover, M\. Laskin, P\. Abbeel, A\. Srinivas, and I\. Mordatch \(2021\)Decision transformer: reinforcement learning via sequence modeling\.arXiv preprint arXiv:2106\.01345\.Cited by:[§1](https://arxiv.org/html/2606.31114#S1.p3.1),[§2\.2](https://arxiv.org/html/2606.31114#S2.SS2.p1.1),[§4\.1](https://arxiv.org/html/2606.31114#S4.SS1.SSS0.Px1.p1.16)\.
- J\. Devlin \(2018\)Bert: pre\-training of deep bidirectional transformers for language understanding\.arXiv preprint arXiv:1810\.04805\.Cited by:[1st item](https://arxiv.org/html/2606.31114#A4.I1.i1.p1.1)\.
- W\. Ding, Y\. Cao, D\. Zhao, C\. Xiao, and M\. Pavone \(2024\)RealGen: Retrieval Augmented Generation for Controllable Traffic Scenarios\.arXiv\.External Links:2312\.13303,[Document](https://dx.doi.org/10.48550/arXiv.2312.13303)Cited by:[§2\.1](https://arxiv.org/html/2606.31114#S2.SS1.p2.1),[§3\.3](https://arxiv.org/html/2606.31114#S3.SS3.p1.1)\.
- W\. Ding, B\. Chen, M\. Xu, and D\. Zhao \(2020\)Learning to Collide: An Adaptive Safety\-Critical Scenarios Generating Method\.arXiv\.External Links:2003\.01197,[Document](https://dx.doi.org/10.48550/arXiv.2003.01197)Cited by:[§2\.1](https://arxiv.org/html/2606.31114#S2.SS1.p2.1),[§3\.3](https://arxiv.org/html/2606.31114#S3.SS3.p1.1)\.
- FAA \(2023\)UTM field test \(UFT\) final report\.Federal Aviation Administration\.External Links:[Link](https://www.faa.gov/uas/advanced_operations/traffic_management/UFT-Final-Report.pdf)Cited by:[Appendix A](https://arxiv.org/html/2606.31114#A1.SS0.SSS0.Px3.p1.1),[§1](https://arxiv.org/html/2606.31114#S1.p1.1)\.
- J\. Fu, K\. Luo, and S\. Levine \(2018\)Learning Robust Rewards with Adversarial Inverse Reinforcement Learning\.arXiv\.External Links:1710\.11248,[Document](https://dx.doi.org/10.48550/arXiv.1710.11248)Cited by:[§2\.2](https://arxiv.org/html/2606.31114#S2.SS2.p1.1)\.
- L\. Fu, H\. Huang, G\. Datta, L\. Y\. Chen, W\. C\. Panitch, F\. Liu, H\. Li, and K\. Goldberg \(2024\)In\-context imitation learning via next\-token prediction\.arXiv\.External Links:2408\.15980,[Document](https://dx.doi.org/10.48550/arXiv.2408.15980)Cited by:[§1](https://arxiv.org/html/2606.31114#S1.p3.1)\.
- D\. Ha and J\. Schmidhuber \(2018\)World Models\.External Links:1803\.10122,[Document](https://dx.doi.org/10.5281/zenodo.1207631),[Link](http://arxiv.org/abs/1803.10122)Cited by:[§3\.4](https://arxiv.org/html/2606.31114#S3.SS4.p2.1)\.
- A\. Hamissi and A\. Dhraief \(2023\)A survey on the unmanned aircraft system traffic management\.ACM Computing Surveys56\(3\),pp\. 1–37\.Cited by:[Figure 7](https://arxiv.org/html/2606.31114#A2.F7)\.
- J\. Ho and S\. Ermon \(2016\)Generative Adversarial Imitation Learning\.InAdvances in Neural Information Processing Systems,Vol\.29\.Cited by:[§2\.2](https://arxiv.org/html/2606.31114#S2.SS2.p1.1)\.
- M\. Janner, Q\. Li, and S\. Levine \(2021\)Offline Reinforcement Learning as One Big Sequence Modeling Problem\.arXiv\.External Links:2106\.02039,[Document](https://dx.doi.org/10.48550/arXiv.2106.02039)Cited by:[§2\.2](https://arxiv.org/html/2606.31114#S2.SS2.p1.1)\.
- A\. Klimashevskaia, D\. Jannach, M\. Elahi, and C\. Trattner \(2024\)A survey on popularity bias in recommender systems\.User Modeling and User\-Adapted Interaction\.External Links:ISSN 1573\-1391,[Link](http://dx.doi.org/10.1007/s11257-024-09406-0),[Document](https://dx.doi.org/10.1007/s11257-024-09406-0)Cited by:[§4\.2](https://arxiv.org/html/2606.31114#S4.SS2.SSS0.Px1.p2.1)\.
- P\. H\. Kopardekar \(2014\)Unmanned aerial system \(UAS\) traffic management \(UTM\): enabling low\-altitude airspace and UAS operations\.Technical reportNational Aeronautics and Space Administration\.Cited by:[Appendix A](https://arxiv.org/html/2606.31114#A1.SS0.SSS0.Px1.p1.1),[Appendix A](https://arxiv.org/html/2606.31114#A1.SS0.SSS0.Px2.p2.1),[§1](https://arxiv.org/html/2606.31114#S1.p1.1),[§3\.1](https://arxiv.org/html/2606.31114#S3.SS1.p1.1)\.
- P\. Kopardekar, J\. Rios, T\. Prevot, M\. Johnson, J\. Jung, and J\. E\. Robinson \(2016\)Unmanned aircraft system traffic management \(UTM\) concept of operations\.InAIAA AVIATION Forum and Exposition,Cited by:[Appendix A](https://arxiv.org/html/2606.31114#A1.SS0.SSS0.Px1.p1.1),[Appendix A](https://arxiv.org/html/2606.31114#A1.SS0.SSS0.Px2.p2.1),[§1](https://arxiv.org/html/2606.31114#S1.p1.1)\.
- L\. Gladence, V\. Anu, A\. Anderson, Immanuel Stanley, Jithin Abhishek Fernando J, and S\. Revathy \(2021\)Swarm Intelligence in Disaster Recovery\.2021 5th International Conference on Intelligent Computing and Control Systems \(ICICCS\),pp\. 1–8\.External Links:[Document](https://dx.doi.org/10.1109/ICICCS51141.2021.9432146)Cited by:[§1](https://arxiv.org/html/2606.31114#S1.p2.1)\.
- K\. Lee, O\. Nachum, M\. S\. Yang, L\. Lee, D\. Freeman, S\. Guadarrama, I\. Fischer, W\. Xu, E\. Jang, H\. Michalewski,et al\.\(2022\)Multi\-game decision transformers\.Advances in Neural Information Processing Systems35,pp\. 27921–27936\.Cited by:[§1](https://arxiv.org/html/2606.31114#S1.p3.1)\.
- R\. Lee, M\. J\. Kochenderfer, O\. J\. Mengshoel, G\. P\. Brat, and M\. P\. Owen \(2015\)Adaptive stress testing of airborne collision avoidance systems\.In2015 IEEE/AIAA 34th Digital Avionics Systems Conference \(DASC\),pp\. 6C2–1–6C2–13\.External Links:ISSN 2155\-7209,[Document](https://dx.doi.org/10.1109/DASC.2015.7311450),[Link](https://ieeexplore.ieee.org/document/7311450)Cited by:[§3\.4](https://arxiv.org/html/2606.31114#S3.SS4.p2.1)\.
- S\. Levine, A\. Kumar, G\. Tucker, and J\. Fu \(2020\)Offline reinforcement learning: tutorial, review, and perspectives on open problems\.arXiv preprint arXiv:2005\.01643\.Cited by:[3rd item](https://arxiv.org/html/2606.31114#A4.I1.i3.p1.1)\.
- H\. Liu, L\. Zhang, S\. K\. S\. Hari, and J\. Zhao \(2024\)Safety\-Critical Scenario Generation Via Reinforcement Learning Based Editing\.arXiv\.External Links:2306\.14131,[Document](https://dx.doi.org/10.48550/arXiv.2306.14131)Cited by:[§2\.1](https://arxiv.org/html/2606.31114#S2.SS1.p2.1),[§3\.3](https://arxiv.org/html/2606.31114#S3.SS3.p1.1)\.
- J\. Liu and N\. Ozay \(2014\)Abstraction, discretization, and robustness in temporal logic control of dynamical systems\.InProceedings of the 17th International Conference on Hybrid Systems: Computation and Control,HSCC ’14,pp\. 293–302\.External Links:[Document](https://dx.doi.org/10.1145/2562059.2562137),[Link](https://doi.org/10.1145/2562059.2562137),ISBN 978\-1\-4503\-2732\-9Cited by:[§3\.4](https://arxiv.org/html/2606.31114#S3.SS4.p2.1)\.
- A\. K\. Menon, S\. Jayasumana, A\. S\. Rawat, H\. Jain, A\. Veit, and S\. Kumar \(2020\)Long\-tail learning via logit adjustment\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=37nvvqkCo5)Cited by:[§4\.2](https://arxiv.org/html/2606.31114#S4.SS2.SSS0.Px2.p1.1)\.
- A\. Y\. Ng and S\. J\. Russell \(2000\)Algorithms for Inverse Reinforcement Learning\.InProceedings of the Seventeenth International Conference on Machine Learning,ICML ’00,San Francisco, CA, USA,pp\. 663–670\.External Links:ISBN 978\-1\-55860\-707\-1Cited by:[§2\.2](https://arxiv.org/html/2606.31114#S2.SS2.p1.1)\.
- M\. O’Kelly, A\. Sinha, H\. Namkoong, J\. Duchi, and R\. Tedrake \(2019\)Scalable End\-to\-End Autonomous Vehicle Testing via Rare\-event Simulation\.arXiv\.External Links:1811\.00145,[Document](https://dx.doi.org/10.48550/arXiv.1811.00145)Cited by:[§2\.1](https://arxiv.org/html/2606.31114#S2.SS1.p2.1),[§3\.3](https://arxiv.org/html/2606.31114#S3.SS3.p1.1)\.
- Z\. Pan, B\. Zhuang, J\. Liu, H\. He, and J\. Cai \(2021\)Scalable vision transformers with hierarchical pooling\.InProceedings of the IEEE/cvf international conference on computer vision,pp\. 377–386\.Cited by:[Appendix D](https://arxiv.org/html/2606.31114#A4.p1.1)\.
- R\. F\. Prudencio, M\. R\. Maximo, and E\. L\. Colombini \(2023\)A survey on offline reinforcement learning: taxonomy, review, and open problems\.IEEE Transactions on Neural Networks and Learning Systems\.Cited by:[§6](https://arxiv.org/html/2606.31114#S6.SS0.SSS0.Px2.p1.1)\.
- J\. Rios, D\. Mulfinger, I\. Smith, P\. Venkatesan, D\. Smith, V\. Baskaran, and L\. Wang \(2017\)UTM data working group demonstration 1 final report\.Moffett Field, CA\.Cited by:[Appendix A](https://arxiv.org/html/2606.31114#A1.SS0.SSS0.Px3.p1.1)\.
- P\. Shaw, J\. Uszkoreit, and A\. Vaswani \(2018\)Self\-Attention with Relative Position Representations\.InProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 \(Short Papers\),M\. Walker, H\. Ji, and A\. Stent \(Eds\.\),pp\. 464–468\.External Links:[Document](https://dx.doi.org/10.18653/v1/N18-2074)Cited by:[§4\.1](https://arxiv.org/html/2606.31114#S4.SS1.SSS0.Px2.p1.1)\.
- D\. Silver, T\. Hubert, J\. Schrittwieser, I\. Antonoglou, M\. Lai, A\. Guez, M\. Lanctot, L\. Sifre, D\. Kumaran, T\. Graepel, T\. Lillicrap, K\. Simonyan, and D\. Hassabis \(2018\)A general reinforcement learning algorithm that masters chess, shogi, and Go through self\-play\.Science362\(6419\),pp\. 1140–1144\.External Links:[Document](https://dx.doi.org/10.1126/science.aar6404),[Link](https://www.science.org/doi/10.1126/science.aar6404)Cited by:[§3\.4](https://arxiv.org/html/2606.31114#S3.SS4.p2.1)\.
- K\. Spalas \(2024\)Towards the unmanned aerial vehicle traffic management systems \(utms\): security risks and challenges\.arXiv preprint arXiv:2408\.11125\.Cited by:[Figure 7](https://arxiv.org/html/2606.31114#A2.F7)\.
- H\. Tian, Y\. Jiang, G\. Wu, J\. Yan, J\. Wei, W\. Chen, S\. Li, and D\. Ye \(2022\)MOSAT: finding safety violations of autonomous driving systems using multi\-objective genetic algorithm\.InProceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering,ESEC/FSE 2022,New York, NY, USA,pp\. 94–106\.External Links:[Document](https://dx.doi.org/10.1145/3540250.3549100),ISBN 978\-1\-4503\-9413\-0Cited by:[§1](https://arxiv.org/html/2606.31114#S1.p1.1)\.
- A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin \(2017\)Attention is all you need\.InAdvances in Neural Information Processing Systems,Vol\.30\.Cited by:[§1](https://arxiv.org/html/2606.31114#S1.p3.1)\.
- Wedad Alawad, Nadhir Ben Halima, and Layla Aziz \(2023\)An Unmanned Aerial Vehicle \(UAV\) System for Disaster and Crisis Management in Smart Cities\.Electronics\.External Links:[Document](https://dx.doi.org/10.3390/electronics12041051)Cited by:[§1](https://arxiv.org/html/2606.31114#S1.p2.1)\.
- Z\. Zhong, Y\. Tang, Y\. Zhou, V\. d\. O\. Neves, Y\. Liu, and B\. Ray \(2021\)A survey on scenario\-based testing for automated driving systems in high\-fidelity simulation\.arXiv\.External Links:2112\.00964,[Document](https://dx.doi.org/10.48550/arXiv.2112.00964)Cited by:[§1](https://arxiv.org/html/2606.31114#S1.p1.1)\.
- B\. D\. Ziebart, A\. Maas, J\. A\. Bagnell, and A\. K\. Dey \(2008\)Maximum Entropy Inverse Reinforcement Learning\.InProceedings of the Twenty\-Third AAAI Conference on Artificial Intelligence \(2008\),1433\-1438\.Cited by:[§2\.2](https://arxiv.org/html/2606.31114#S2.SS2.p1.1)\.

## Appendix A UTM System Architecture and Testing Pipeline

#### What is the Unmanned Aircraft Traffic Management (UTM) system?

The Unmanned Aircraft Traffic Management (UTM) system, as introduced by the National Aeronautics and Space Administration (NASA) [Kopardekar, [2014](https://arxiv.org/html/2606.31114#bib.bib100), Kopardekar et al., [2016](https://arxiv.org/html/2606.31114#bib.bib88)], is designed to ensure safe and efficient operation of multiple Unmanned Aerial Vehicles (UAVs) in shared airspace. The UTM concept is developed to support the integration of UAVs into airspace without requiring human air traffic controllers to manage every UAV directly. Instead, UTM emphasizes the use of automated systems to coordinate UAV operations. This includes services like geofencing, route optimization, and deconfliction, ensuring that UAVs can safely and autonomously operate in both sparsely populated rural and densely populated urban areas or alongside manned aircraft.

UTM is typically developed as a complex system\. This is because the UTM systems should integrate a wide range of functionalities and address diverse challenges associated with managing UAV operations in dynamic and unpredictable environments\. UTM systems need to handle real\-time communication between UAVs, ground stations, and other stakeholders, while simultaneously ensuring safety, efficiency, and fairness in airspace usage\.

As shown in Figure [7](https://arxiv.org/html/2606.31114#A2.F7), the UTM serves as the central coordinator, processing dynamic information received from all UAVs and managing overall traffic flow through sophisticated decision-making algorithms. The UTM maintains continuous communication with multiple UAVs, each equipped with various sensors and control systems, and handles their flight route allocation and trajectory assignment, while simultaneously monitoring environmental conditions and potential conflicts.

#### What is fault detection in the development of UTM and why is it important?

We define the term fault detection as the process of identifying possible faults in the UTM system during the testing phase, that is, before the UTM system is deployed in real-world environments. It is typically divided into several steps, including module testing, integration testing, smoke testing (functional testing), stress testing, etc. After each testing step, the confidence in the UTM system (e.g., its reliability, fault tolerance, and compliance with regulatory standards) increases as potential faults are identified and addressed, so that the system becomes progressively more robust and reliable.

Fault detection is a critical aspect of UTM development because it directly impacts the safety, reliability, and efficiency of the development pipeline. As a mission-critical system, the UTM system should be designed to eliminate all the faults that may occur, which are usually costly or even deadly (e.g., UAV crashes, collisions with buildings, or collisions that injure people) [Kopardekar, [2014](https://arxiv.org/html/2606.31114#bib.bib100), Kopardekar et al., [2016](https://arxiv.org/html/2606.31114#bib.bib88)]. By identifying and addressing potential faults during the testing phase, fault detection ensures that the UTM system operates as intended, mitigating risks before deployment in real-world environments. This proactive approach prevents costly failures, enhances system robustness, and builds trust among stakeholders.

#### Why is fault detection challenging?

Fault detection in UTM systems is inherently challenging, particularly as testing progresses through advanced stages. While early testing steps may uncover obvious issues, the long tail of rare and hard-to-detect faults often remains persistent and elusive. This difficulty is compounded by the self-healing capabilities of modern UTM systems, which can mask subtle issues that may only emerge under specific conditions. Even after several testing steps have been conducted, faults that threaten the safety of the UTM system remain (e.g., shakedown effects found by the Federal Aviation Administration in field testing) [Rios et al., [2017](https://arxiv.org/html/2606.31114#bib.bib101), FAA, [2023](https://arxiv.org/html/2606.31114#bib.bib97)]. Based on the stepwise testing and field testing results, we estimate the share of faults found in the different steps of testing, as listed in Table [2](https://arxiv.org/html/2606.31114#A1.T2). The table shows that a small fraction of faults remains undetected after all testing steps, which can be fatal in mission-critical systems.

#### Why lower top\-1 accuracy doesn’t impact framework reliability

While the top-1 action accuracy of our framework may appear relatively low (around 20% for PM-2B in TR1), this does not compromise the framework’s reliability in vulnerability detection for several reasons. First, unlike traditional classification tasks where precise prediction is crucial, our framework operates in the context of safety testing where the primary goal is to discover diverse failure scenarios rather than to predict specific actions. The lower top-1 accuracy actually reflects the framework’s ability to explore a broader range of potential actions rather than being constrained to the most probable ones. This is evidenced by our high HAR (10.8%) and CAR (29.7%) metrics, which indicate that a significant portion of the generated actions effectively probe system vulnerabilities. Second, our framework employs a top-k sampling strategy during testing, which allows it to maintain a balance between exploration and exploitation. The high top-3 accuracy (shown in Fig. [6](https://arxiv.org/html/2606.31114#S5.F6)(b)) demonstrates that while the model may not always select the optimal action first, it consistently maintains valuable action candidates within its top choices. Furthermore, when examining the detected vulnerabilities (as shown in Table [1](https://arxiv.org/html/2606.31114#S5.T1)), our framework achieves significantly higher SPM (50.5) and FPM (7.6) compared to expert-guided testing (5.8 and <1.0 respectively), indicating that the framework’s action selection strategy, despite lower top-1 accuracy, is more effective at uncovering critical system faults. This aligns with recent findings in the reinforcement learning literature that suggest maintaining action diversity can be more valuable than maximizing prediction accuracy when exploring rare but critical events in the state space.

| Fault Types | Module Testing | Integration Testing | Smoke Testing | Stress Testing | Fault Remaining |
|---|---|---|---|---|---|
| Module Level | ∼20% | ∼10% | ∼30% | ∼40% | ∼0.1% |
| Interface Level | ∼10% | ∼20% | ∼30% | ∼40% | ∼0.1% |
| Running time | ∼10% | ∼10% | ∼40% | ∼40% | ∼0.1% |
| Scenario Complexity | Simple | Simple | Medium | Medium | High |

Table 2: Fault Types Detection during Different Steps of Testing. Module testing verifies individual components of the UTM to ensure they function correctly in isolation. Integration testing checks interactions between combined modules to detect interface issues. Smoke testing ensures that basic functionality works correctly after a new build or update, acting as a preliminary check. Stress testing evaluates system stability and performance under extreme or peak load conditions. The scenarios used for module testing and integration testing are relatively simple, while smoke testing and stress testing generate more complex scenarios. As the testing steps are conducted one by one, the software maturity of the UTM increases gradually. However, rare faults still occur in complex scenarios.

## Appendix BProposed Testing Framework

#### Testing Framework

The testing framework introduced in this work serves as a copilot to the UTM rather than being deployed on individual UAVs. It monitors the same data streams as the UTM, including UAV telemetry (position, velocity, mission status) and system state information. The UTM system provides trajectory schedules that favor system robustness, while the testing system generates adversarial disturbance actions that increase system vulnerability.

![Refer to caption](https://arxiv.org/html/2606.31114v1/figures/testing_system.png)

Figure 7: UTM System and Testing Framework Architecture. The testing framework works as a copilot of the UTM and operates on the server side. As a mission-critical system, the UTM under test is designed with a centralized architecture to ensure safety and remove potential conflicts in advance \[Spalas, [2024](https://arxiv.org/html/2606.31114#bib.bib104), Hamissi and Dhraief, [2023](https://arxiv.org/html/2606.31114#bib.bib105)\]. To align with the design of the UTM, our proposed testing framework is also centralized. The testing framework mimics natural disturbances to generate different scenarios. The testing system is designed to manipulate external disturbances to UAVs, such as wind, obstacles, and network jitter, as shown in Table [7](https://arxiv.org/html/2606.31114#A7.T7). The internal functionality and robustness of the on-device system of an individual UAV are outside the scope of this research.

#### Sim vs Real

The framework's methodology emphasizes systematic exploration of edge cases and rare failure modes that might otherwise remain undiscovered in conventional testing approaches. Real environmental disturbances are random and difficult to interpret. In this work, we use simulators that provide configurable environmental disturbances and a concrete mapping between each disturbance and the resulting operating status, which supports analysis and diagnosis. The visibility and capability of the UTM are kept strictly aligned between the simulated and real-world contexts.

The precise timing of disturbance injections is also taken into consideration. The traffic pressure on a UTM that manages complicated UAV MASs varies with time. The testing system learns to inject actions when the UTM is handling its most vulnerable cases, which increases the significance of the generated testing scenarios.

## Appendix CChallenges of Testing UTM

#### Critical Fault Distribution Imbalance

While UTM's fault-tolerant design successfully handles most anomalies through automated recovery mechanisms and redundant control strategies, this architectural resilience paradoxically makes severe failure scenarios harder to identify: intermediate failure states are often corrected automatically before they can develop into observable system failures. Critical failures, those capable of overwhelming the system's self-healing mechanisms, occupy an extremely small portion of the state-action space. They often reside in narrowly defined regions that require precise combinations of multiple adverse factors to overcome the system's multi-layer safety functionality. These regions are characterized by specific configurations of multiple elements: particular spatial arrangements of UAVs, precise timing of control actions, and specific environmental conditions. Furthermore, these failure scenarios often represent emergent behaviors arising from subtle interactions between multiple system components and their recovery attempts, rather than simple violations of individual safety constraints.

| Types | Number of Influenced UAVs | Disturbance Times within 60s | Case Example | Real-World Ratio | Complexity |
|---|---|---|---|---|---|
| Safe Flight | 0 | 0 | N/A | ∼94% | Low |
| Disturbances | 1 | 1 | Winds with exceeding magnitude | ∼5% | Medium |
| Disturbances | ≥2 | 1 (each) | Winds hit multiple UAVs | ∼1% | Medium |
| Disturbances | 1 | ≥2 | Winds hit twice with 60s interval | ∼0.1% | High |
| Disturbances | 1 | ≥2 (simultaneously) | Signal Loss when Winds hit | ∼0.01% | High |

Table 3: Real-World UTM failure distribution. In real-world UAV fleets, advanced UTM provides fundamental guarantee for safe flight, where faults with increasing risk still exist at a relatively low ratio and are increasingly hard to locate and tackle.

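To give a sense of scale for the ratios in Table 3, the sketch below estimates how many flights uniform random testing would need before a scenario class is observed at least once. It assumes, purely for illustration, that flights are independent and that each ratio can be read as a per-flight probability p, so that the chance of at least one occurrence in n flights is 1 - (1 - p)^n. Neither assumption is stated in the paper, and the 95% confidence level is an arbitrary choice.

```python
import math

def flights_needed(p, confidence=0.95):
    """Smallest n with 1 - (1 - p)**n >= confidence, for independent flights."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))

# Ratios from Table 3, read as per-flight probabilities (an assumption).
ratios = {"~5%": 0.05, "~1%": 0.01, "~0.1%": 0.001, "~0.01%": 0.0001}
for label, p in ratios.items():
    print(label, flights_needed(p))
```

Under these assumptions, the required number of flights grows roughly in inverse proportion to p, which is why the rarest rows of the table are so costly to reach without guided exploration.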
#### High\-Dimensional State\-Action Temporal Dependency

Testing of UTM systems confronts a fundamental challenge in navigating its inherent high\-dimensional state\-action coupling relationships\. The state space encompasses multiple critical dimensions: spatial coordinates and velocity vectors of each UAV, environmental conditions, and communication network states\. Each additional UAV exponentially expands this state space, creating a combinatorial explosion in the dimensions that must be considered during testing\. Unlike traditional control systems where failures often manifest through immediate state violations, UTM system failures additionally emerge from specific combinations of historical state sequences and multi\-agent coupling, as shown in Table[3](https://arxiv.org/html/2606.31114#A3.T3)\. The behavioral trajectory of each UAV is intrinsically influenced by both its historical states and the temporal evolution of other agents’ states in the shared airspace\. For instance, a seemingly safe trajectory adjustment by one UAV could create cascading effects leading to system\-wide conflicts minutes later through complex agent interactions\. Furthermore, subtle perturbations in early states can propagate through the system’s temporal dynamics to trigger critical failures in significantly later stages\. The challenge is particularly pronounced in scenarios involving dense multi\-UAV operations, where system behavior emerges from the intricate interplay of multiple agents’ temporal trajectories rather than simple state\-transition patterns\.

## Appendix DMotivation for Transformer and Comparison with Other Models

The main motivation for using the Transformer as the backbone model is that Transformer models have proven scalable across many tasks (e.g., natural language processing \[Brown, [2020](https://arxiv.org/html/2606.31114#bib.bib90)\], computer vision \[Pan et al., [2021](https://arxiv.org/html/2606.31114#bib.bib99)\], robotics \[Chebotar et al., [2023](https://arxiv.org/html/2606.31114#bib.bib98)\], etc.). Scalability is essential to the development of the testing framework for two reasons: (1) the complex temporal and inter-agent dependencies grow with the size of the UAV swarm and of the temporal context window, and (2) the long-tail effect in the fault distribution requires a sufficiently large dataset to identify faults and to train the backbone model. Leveraging the Transformer's inherent scalability in modeling extended context lengths and processing large-scale data inputs, it can effectively model complex temporal sequences and inter-agent interactions within UAV swarms of varying sizes. This capability allows the testing framework to accommodate the extensive datasets necessary for identifying rare faults due to the long-tail effect in the fault distribution. Furthermore, the Transformer's ability to handle large-scale data inputs ensures that the model remains robust and accurate as the system under test evolves (e.g., different region settings, as demonstrated in Table [4](https://arxiv.org/html/2606.31114#A5.T4)). Consequently, integrating the Transformer as the backbone model enhances the framework's capacity to detect, analyze, and predict system behaviors across diverse operational scenarios.

However, alternative backbone models such as Graph Neural Networks (GNNs), Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), and online reinforcement learning algorithms like Deep Q-Networks (DQNs) or Proximal Policy Optimization (PPO) often struggle to address the aforementioned challenges effectively. These models may lack the inherent ability to capture long-range dependencies or to scale efficiently with increasing sequence lengths and swarm sizes. Specifically,

- •RNN/LSTM: RNNs and LSTMs have difficulty modeling long temporal contexts because of issues such as vanishing gradients, which make training harder. Moreover, RNNs and LSTMs are hard to parallelize, which increases training time, especially on large datasets \[Devlin, [2018](https://arxiv.org/html/2606.31114#bib.bib102)\]. Based on our preliminary experiments, we find that for models below 10 million parameters, RNNs are 10 times slower than Transformers, which constrains the scalability of RNNs.
- •GNN: GNNs may not scale well with large and dynamic swarm networks, especially when temporal dynamics are involved\.
- •DQN/PPO: DQN and PPO require extensive online exploration and interactions\[Levineet al\.,[2020](https://arxiv.org/html/2606.31114#bib.bib103)\], making them less practical for fault detection in complex systems with long\-tail fault distributions\.

## Appendix EOnline Evaluation of Out\-of\-distribution and In\-distribution Dataset

| Test Region | APO (%) | APD (%) | HAR (%) | CAR (%) |
|---|---|---|---|---|
| TR1 (OOD) | 20.0 | 26/34/21/19 | 10.8 | 29.7 |
| TR2 (OOD) | 31.5 | 46/32/11/11 | 4.9 | 64.1 |
| R4 (ID) | 27.3 | 16/29/29/26 | 6.5 | 48.7 |

Table 4: Performance metrics of PM-2B. The metrics include Action Probability per Observation (APO), Action Probability Distribution (APD), Hazard Action Ratio (HAR), and Constant-Pressure Action Ratio (CAR). Testing was conducted in three distinct regions: TR1 (rural, out-of-distribution), TR2 (urban, out-of-distribution), and R4 (suburban, in-distribution), to evaluate the model's generalization capability across diverse environments.

As illustrated in Table [4](https://arxiv.org/html/2606.31114#A5.T4), the PM-2B model demonstrates strong generalization across different environments, maintaining high performance in both in-distribution (ID) and out-of-distribution (OOD) regions. In the OOD regions (TR1 and TR2), the model achieves performance comparable to that in the ID region in terms of APO, HAR, and CAR. In contrast, the model's APD values in the ID region (R4) are more balanced (16/29/29/26) than in the OOD regions, which could be a signal of overfitting.

## Appendix FOnline Evaluation Metric Details

In this section, we provide a detailed explanation of the metrics listed in Table [5](https://arxiv.org/html/2606.31114#A6.T5).

| Metric | Purpose |
|---|---|
| APO | Action Probability per Observation |
| APD | Action Probability Distribution |
| HAR | Hazard Action Ratio |
| CAR | Constant-Pressure Action Ratio |
| SPM | High Risk Scenarios per Million Flights |
| FPM | Faults per Million Flights |

Table 5: Metrics for online evaluation of testing performance. The metrics are categorized into three groups for a comprehensive evaluation of the proposed testing framework's capabilities, including the preference and quality of the proposed framework, as well as the final results. The detailed definitions of the metrics can be found in Appendix [F](https://arxiv.org/html/2606.31114#A6).

#### Action Probability per Observation (APO)

The definition of APO is

$$\text{APO}=\frac{\#\{\text{action generated as injected, testing method is called}\}}{\#\{\text{testing method is called}\}}\times 100\%,$$

where $\#\{\cdot\}$ denotes the number of occurrences of the specified event. APO aims to measure the percentage of times a testing method generates actions that are injected into the system, indicating how often the framework effectively targets the desired action space during testing. However, high APO may result in redundant action injections, as not all injected actions contribute to uncovering valuable information. Only critical actions that can reveal faults or vulnerabilities are truly significant for effective testing. Therefore, additional metrics about action quality and testing efficiency are necessary to evaluate the true effectiveness of the testing framework.
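As a minimal sketch of this definition, APO can be computed from a log with one entry per call of the testing method. The boolean log format and the example values are assumptions made for illustration.

```python
def apo(injected_flags):
    """APO in percent.

    `injected_flags` holds one boolean per call of the testing method:
    True if the call generated an action that was injected.
    """
    if not injected_flags:
        raise ValueError("the testing method was never called")
    return 100.0 * sum(injected_flags) / len(injected_flags)

# Hypothetical log: five calls, two of which led to an injected action.
calls = [True, False, False, True, False]
print(apo(calls))  # 40.0
```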

#### Action Probability Distribution \(APD\)

APD measures the proportion of different types of actions generated by the testing framework\. It is represented as a vector indicating the percentage of each action type\. A balanced APD ensures that the framework explores a diverse set of actions, while an unbalanced distribution may indicate bias toward specific types, potentially missing critical scenarios\. Evaluating APD helps assess whether the testing method maintains comprehensive action coverage or if certain action types are underrepresented\.
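The following sketch computes an APD vector from a list of generated actions and attaches a single balance score to it. The action labels `A` to `D` are placeholders, since this appendix does not state which APD entry corresponds to which action type, and the normalized-entropy score is an illustrative choice that is not a metric used in the paper.

```python
import math
from collections import Counter

def apd(actions, action_types):
    """Percentage of generated actions falling into each action type."""
    counts = Counter(actions)
    return [100.0 * counts[t] / len(actions) for t in action_types]

def normalized_entropy(percentages):
    """Balance score: 1.0 for a uniform vector, lower for skewed ones.

    Expects at least two entries. Illustrative only.
    """
    total = sum(percentages)
    probs = [p / total for p in percentages if p > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(percentages))

# Placeholder action log.
log = ["A", "B", "B", "C", "D", "B", "A", "C"]
print(apd(log, ["A", "B", "C", "D"]))

# APD vectors reported in Table 4.
print(normalized_entropy([16, 29, 29, 26]))  # R4 (ID)
print(normalized_entropy([46, 32, 11, 11]))  # TR2 (OOD)
```

With this score, the R4 vector comes out closer to uniform than the TR2 vector, matching the qualitative reading of Table 4.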

#### Hazard Action Ratio \(HAR\)

HAR is defined as

$$\text{HAR}=\frac{\#\{\text{actions result in return-to-go significantly raise comparing with summary}\}}{\#\{\text{injected actions}\}}\times 100\%,$$

where $\#\{\cdot\}$ denotes the number of occurrences of the specified event. In practice, we consider an action to be hazardous if the difference between return-to-go and the summary is greater than 0.4. This threshold indicates that the injected action has a substantial impact on the system, potentially leading to risky or unexpected outcomes. A high HAR reflects the framework's ability to generate high-risk scenarios, which is crucial for identifying critical vulnerabilities during testing.

#### Constant\-Pressure Action Ratio \(CAR\)

CAR is defined as

$$\text{CAR}=\frac{\#\{\text{actions result in high return-to-go when summary is also high}\}}{\#\{\text{injected actions}\}}\times 100\%,$$

where $\#\{\cdot\}$ denotes the number of occurrences of the specified event. In practice, an action is categorized as constant-pressure if both the return-to-go and the summary exceed a threshold of 0.4. This indicates that the action consistently maintains a high level of risk or pressure in an already high-risk scenario. A high CAR shows that the testing framework is able to sustain pressure over a prolonged period, making it more effective at evaluating the resilience and stability of the system under stress.
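The HAR and CAR rules can be sketched together, since both are evaluated on the same pair of values per injected action. The sketch assumes that return-to-go and summary are available on a common scale for every injected action and that the HAR "difference" is signed (return-to-go minus summary); the record format and the sample values are hypothetical.

```python
THRESHOLD = 0.4  # threshold stated in the text for both HAR and CAR

def har(records, threshold=THRESHOLD):
    """Percent of injected actions whose return-to-go exceeds the summary
    by more than `threshold`."""
    hazardous = sum(1 for rtg, summary in records if rtg - summary > threshold)
    return 100.0 * hazardous / len(records)

def car(records, threshold=THRESHOLD):
    """Percent of injected actions where return-to-go and summary both
    exceed `threshold`."""
    sustained = sum(1 for rtg, summary in records
                    if rtg > threshold and summary > threshold)
    return 100.0 * sustained / len(records)

# Hypothetical (return_to_go, summary) pairs, one per injected action.
records = [(0.9, 0.3), (0.5, 0.45), (0.2, 0.1), (0.8, 0.7)]
print(har(records))  # 25.0
print(car(records))  # 50.0
```

Only the first record counts toward HAR because it is the only one with a jump larger than 0.4, while the second and fourth count toward CAR because both values already sit above the threshold.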

#### High Risk Scenarios per Million Flights \(SPM\)

SPM measures the frequency of high\-risk scenarios detected by the testing framework for every million simulated flights\. A high SPM value indicates that the testing framework is effective in uncovering critical situations that pose potential threats to system safety\. It helps quantify the robustness of the testing methodology in identifying rare but impactful scenarios\.

#### Faults per Million Flights \(FPM\)

FPM represents the number of unique bugs identified for every million flights, where system may encounter severe failures\. It reflects the framework’s capability to discover actual system faults during testing\. A higher FPM suggests that the testing strategy is not only triggering risky scenarios but also exposing underlying system vulnerabilities that need to be addressed before deployment\.
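Both per-million metrics reduce to the same normalization, sketched below. The flight count, the scenario count, and the fault identifiers are hypothetical; the only point taken from the text is that FPM counts unique bugs, so repeated triggers of the same fault are deduplicated.

```python
def per_million(count, flights):
    """Scale a raw count to a rate per one million flights."""
    if flights <= 0:
        raise ValueError("flights must be positive")
    return count * 1_000_000 / flights

def spm(high_risk_scenarios, flights):
    """High-risk scenarios per million simulated flights."""
    return per_million(high_risk_scenarios, flights)

def fpm(fault_ids, flights):
    """Unique faults per million flights; duplicates are counted once."""
    return per_million(len(set(fault_ids)), flights)

# Hypothetical campaign: 400,000 flights, 12 high-risk scenarios,
# and three fault reports of which two refer to the same bug.
print(spm(12, 400_000))                  # 30.0
print(fpm(["F1", "F2", "F1"], 400_000))  # 5.0
```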

## Appendix GArchitecture and Training Details

#### Architectures of Policy Model

The scenario\-oriented testing framework for UTM systems consists of two main phases: training and inference \(testing\), as illustrated in Algorithms[1](https://arxiv.org/html/2606.31114#alg1)and[2](https://arxiv.org/html/2606.31114#alg2)\. Algorithm[1](https://arxiv.org/html/2606.31114#alg1)details the training phase, where the Policy Model \(PM\) learns from an offline dataset of UTM scenarios\. This phase involves iterating through epochs and batches, processing state\-action\-reward tuples, and updating the model parameters to minimize the prediction error for both actions and rewards\. The training process incorporates context augmentation to enhance the model’s ability to capture temporal dependencies\. Algorithm[2](https://arxiv.org/html/2606.31114#alg2)outlines the inference \(testing\) phase, where the trained PM is used to generate and evaluate potentially vulnerable scenarios in the System\-Under\-Test \(SUT\)\. This phase operates in a loop, continuously generating candidate actions, filtering them through an Action Sampler \(AS\), injecting selected actions into the SUT, and evaluating the outcomes\. The process accumulates detected vulnerabilities while dynamically updating the context based on observed states, actions, and rewards\. Together, these algorithms form a comprehensive approach to identifying potential faults and vulnerabilities in UTM systems, leveraging both historical data and adaptive, context\-aware scenario generation\.

Algorithm 1: Training Phase of UTM Testing Framework

- 1: Input: Offline dataset $D$, Model architecture $M$
- 2: Output: Trained Policy Model PM
- 3: Initialize PM with architecture $M$
- 4: Initialize optimizer
- 5: for each epoch do
    - 6: for each batch $B$ in $D$ do
        - 7: $s, a, r \leftarrow$ GetBatchData($B$)
        - 8: $\tilde{s} \leftarrow$ AugmentWithContext($s$)
        - 9: $\hat{a}, \hat{r} \leftarrow$ PM.Forward($\tilde{s}$)
        - 10: $L \leftarrow$ ComputeLoss($\hat{a}, a, \hat{r}, r$)
        - 11: BackpropagateAndUpdate(PM, $L$)
    - 12: end for
- 13: end for
- return PM
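The control flow of Algorithm 1 can be mirrored in a small, dependency-free Python sketch. A linear model with hand-written gradient updates stands in for the Transformer-based PM, a single synthetic trajectory stands in for the offline dataset, and the context length, learning rate, and squared-error loss are all assumptions chosen to keep the example short. It is meant to show the order of the steps, not the paper's training setup.

```python
import random

CONTEXT_LEN = 2  # number of past states prepended to each state (assumed)

def augment_with_context(states, t, k=CONTEXT_LEN):
    """Return the k previous states followed by state t, zero-padded."""
    window = states[max(0, t - k): t + 1]
    return [0.0] * (k + 1 - len(window)) + window

class TinyPolicyModel:
    """Linear stand-in for PM: predicts (action, reward) from a window."""

    def __init__(self, dim, rng):
        self.wa = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
        self.wr = [rng.uniform(-0.1, 0.1) for _ in range(dim)]

    def forward(self, x):
        a_hat = sum(w * xi for w, xi in zip(self.wa, x))
        r_hat = sum(w * xi for w, xi in zip(self.wr, x))
        return a_hat, r_hat

    def update(self, x, a, r, lr):
        """One gradient step on the summed squared errors; returns the loss."""
        a_hat, r_hat = self.forward(x)
        loss = (a_hat - a) ** 2 + (r_hat - r) ** 2
        for i, xi in enumerate(x):
            self.wa[i] -= lr * 2.0 * (a_hat - a) * xi
            self.wr[i] -= lr * 2.0 * (r_hat - r) * xi
        return loss

def train(dataset, epochs, lr, seed=0):
    model = TinyPolicyModel(CONTEXT_LEN + 1, random.Random(seed))
    history = []
    for _ in range(epochs):                           # for each epoch
        total = 0.0
        for states, actions, rewards in dataset:      # for each batch
            for t in range(len(states)):
                x = augment_with_context(states, t)   # AugmentWithContext
                total += model.update(x, actions[t], rewards[t], lr)
        history.append(total)
    return model, history

# Synthetic trajectory with linear targets so the toy model can fit it.
states = [0.1 * i for i in range(10)]
dataset = [(states, [2.0 * s for s in states], [0.5 * s for s in states])]
model, history = train(dataset, epochs=50, lr=0.05)
print(history[0], history[-1])
```

Here each trajectory plays the role of a batch and the update is applied per time step, which is a simplification of the batched update in the algorithm.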

Algorithm 2: Inference (Testing) Phase of UTM Testing Framework

- 1: Input: Trained Policy Model PM, System-Under-Test SUT, Action Sampler AS
- 2: Output: Detected vulnerabilities $V$
- 3: Initialize vulnerability set $V \leftarrow \emptyset$
- 4: Initialize context set $C \leftarrow \emptyset$
- 5: while testing budget not exhausted do
    - 6: $s \leftarrow$ GetCurrentState(SUT)
    - 7: $\tilde{s} \leftarrow [C; s]$ ▷ Augment state with context
    - 8: $R_{predicted} \leftarrow$ PM.PredictRTG($\tilde{s}$)
    - 9: $a_{candidates} \leftarrow$ PM.GenerateActions($\tilde{s}, R_{predicted}$)
    - 10: $a_{filtered} \leftarrow$ AS.FilterActions($a_{candidates}$)
    - 11: $a \leftarrow$ AS.SampleAction($a_{filtered}$)
    - 12: InjectAction(SUT, $a$)
    - 13: $r_{actual} \leftarrow$ EvaluateAction(SUT, $a$)
    - 14: if IsVulnerability($r_{actual}$) then
        - 15: $V \leftarrow V \cup \{(s, a, r_{actual})\}$
    - 16: end if
    - 17: UpdateContext($C$, $s$, $a$, $r_{actual}$)
- 18: end while
- return $V$
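The loop of Algorithm 2 can be exercised end to end with toy stand-ins for the three components. Everything below except the order of the steps is hypothetical: `ToySUT`, `ToyPolicy`, and `ToySampler` are placeholder classes, the risk threshold and the magnitude limit are invented constants, and the context is a bounded list of past (state, action, reward) tuples.

```python
import random

RISK_THRESHOLD = 0.8  # assumed cut-off used for IsVulnerability
MAX_MAGNITUDE = 1.0   # assumed domain constraint enforced by the sampler

class ToySUT:
    """Placeholder system under test whose risk grows with traffic load."""
    def __init__(self, rng):
        self.rng, self.load = rng, 0.0
    def get_current_state(self):
        self.load = self.rng.random()
        return self.load
    def inject_action(self, action):
        self.pending = action
    def evaluate_action(self, action):
        return min(1.0, self.load * action["magnitude"])

class ToyPolicy:
    """Placeholder policy model with fixed candidate actions."""
    def predict_rtg(self, s_tilde):
        return s_tilde[-1][0]  # stub: reuse the current load as the target
    def generate_actions(self, s_tilde, rtg):
        return [{"type": t, "magnitude": m}
                for t in ("wind", "obstacle", "network_jitter")
                for m in (0.5, 1.0, 1.5)]

class ToySampler:
    """Placeholder action sampler: filter by constraint, then pick one."""
    def __init__(self, rng):
        self.rng = rng
    def filter_actions(self, candidates):
        return [a for a in candidates if a["magnitude"] <= MAX_MAGNITUDE]
    def sample_action(self, filtered):
        return self.rng.choice(filtered)

def run_testing(pm, sut, sampler, budget, context_len=4):
    vulnerabilities, context = [], []
    for _ in range(budget):                        # testing budget
        s = sut.get_current_state()
        s_tilde = context + [(s, None, None)]      # augment with context
        rtg = pm.predict_rtg(s_tilde)
        candidates = pm.generate_actions(s_tilde, rtg)
        a = sampler.sample_action(sampler.filter_actions(candidates))
        sut.inject_action(a)
        r_actual = sut.evaluate_action(a)
        if r_actual >= RISK_THRESHOLD:             # IsVulnerability
            vulnerabilities.append((s, a, r_actual))
        context = (context + [(s, a, r_actual)])[-context_len:]
    return vulnerabilities

rng = random.Random(0)
found = run_testing(ToyPolicy(), ToySUT(rng), ToySampler(rng), budget=500)
print(len(found))
```

Keeping the filter separate from the sampling step reflects the split in the algorithm: the policy proposes freely, and the sampler is the only place where domain constraints are applied.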

| | PM-1.2B | PM-2B |
|---|---|---|
| Layers | 64 | 64 |
| Model Dimension | 1280 | 1600 |
| Attention Heads | 20 | 25 |
| Activation Functions | GELU | GELU |
| Positional Embeddings | Sinusoidal | Sinusoidal |
| Optimizer | AdamW | AdamW |
| Peak Learning Rate | $8\times 10^{-4}$ | $3\times 10^{-4}$ |
| Learning Rate Schedule | 1000 steps warmup & cosine decay | 1000 steps warmup & cosine decay |
| Batch Size | 512 | 256 |
| GPUs | 16 | 16 |

Table 6: Overview of the key hyperparameters of policy model. We display settings for 1.2B and 2B models.

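Two quick checks on Table 6 can be sketched in Python. The first uses the common rule of thumb that a Transformer whose feed-forward width is four times the model dimension has about 12·L·d² non-embedding parameters; whether PM uses this feed-forward ratio is an assumption. The second writes out one possible reading of "1000 steps warmup & cosine decay"; the linear warmup shape, the total step count, and the zero final rate are assumptions, since the table does not state them.

```python
import math

def approx_params(layers, d_model):
    """Rule-of-thumb non-embedding parameter count: 12 * L * d^2."""
    return 12 * layers * d_model ** 2

def lr_at(step, peak_lr, warmup_steps=1000, total_steps=100_000, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# (layers, model dimension, attention heads, peak learning rate) from Table 6.
configs = {"PM-1.2B": (64, 1280, 20, 8e-4), "PM-2B": (64, 1600, 25, 3e-4)}
for name, (layers, d_model, heads, peak_lr) in configs.items():
    print(name, approx_params(layers, d_model), d_model // heads,
          lr_at(500, peak_lr))
```

Under the stated rule of thumb the estimates come to roughly 1.26 billion and 1.97 billion parameters, in line with the model names, and both configurations have a per-head dimension of 64.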
#### Action Space

Considering implementation feasibility, we define the action space of PM with two types of actions: (1) One-time physical actions and (2) short-Duration digital actions. As shown in Table [7](https://arxiv.org/html/2606.31114#A7.T7), PM can also generate scenario configurations with different parameter settings.

| NAME | TYPE | DESCRIPTION | PARAMETERS |
|---|---|---|---|
| Wind | O | Winds with the exceeding magnitude | Speed, Direction |
| Obstacle | O | Obstacles appearing in UAVs' routes | Size, Location |
| Network Jitter | D | Temporary network disconnection | Time Duration |

Table 7: Action types of policy model. We consider three types of action for each agent. The O stands for One-time physical actions and D stands for short-Duration digital actions.

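One way to represent the three action types of Table 7 in code is sketched below, together with a feasibility check of the kind an action sampler could apply. The field names follow the table's parameter columns, but the units, the numeric limits, and the `is_feasible` helper are hypothetical; the paper does not list concrete bounds.

```python
from dataclasses import dataclass
from typing import Tuple, Union

@dataclass(frozen=True)
class Wind:
    """One-time physical action (type O)."""
    speed: float      # assumed unit: m/s
    direction: float  # assumed unit: degrees

@dataclass(frozen=True)
class Obstacle:
    """One-time physical action (type O)."""
    size: float                           # assumed unit: m
    location: Tuple[float, float, float]  # assumed (x, y, z) coordinates

@dataclass(frozen=True)
class NetworkJitter:
    """Short-duration digital action (type D)."""
    time_duration: float  # assumed unit: s

Action = Union[Wind, Obstacle, NetworkJitter]

# Hypothetical domain limits.
MAX_WIND_SPEED = 20.0
MAX_OBSTACLE_SIZE = 5.0
MAX_JITTER_DURATION = 10.0

def is_feasible(action: Action) -> bool:
    """Return True if the action respects the assumed domain limits."""
    if isinstance(action, Wind):
        return (0.0 < action.speed <= MAX_WIND_SPEED
                and 0.0 <= action.direction < 360.0)
    if isinstance(action, Obstacle):
        return 0.0 < action.size <= MAX_OBSTACLE_SIZE
    if isinstance(action, NetworkJitter):
        return 0.0 < action.time_duration <= MAX_JITTER_DURATION
    return False

candidates = [
    Wind(12.0, 270.0),
    Wind(45.0, 90.0),  # exceeds the assumed wind speed limit
    Obstacle(1.5, (10.0, 20.0, 30.0)),
    NetworkJitter(3.0),
]
feasible = [a for a in candidates if is_feasible(a)]
print(len(feasible))  # 3
```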
#### Loss function

We use a decision-transformer-style model, an approach that performs outstandingly in sparse-reward tasks \[Bhargava et al., [2023b](https://arxiv.org/html/2606.31114#bib.bib55)\]. For the regression of PM, training uses a multi-objective loss function that combines three terms with configurable weights: return-to-go, to model observation and causality; action mask, to model world background knowledge; and action, to model decisions.
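One plausible reading of this multi-objective loss is a weighted sum of the three terms. The additive form and the symbols $\lambda_{\text{rtg}}$, $\lambda_{\text{mask}}$, and $\lambda_{\text{act}}$ are assumptions made for illustration; the text only states that the terms are combined with configurable weights and does not give the form of the individual losses.

$$\mathcal{L} = \lambda_{\text{rtg}}\,\mathcal{L}_{\text{rtg}} + \lambda_{\text{mask}}\,\mathcal{L}_{\text{mask}} + \lambda_{\text{act}}\,\mathcal{L}_{\text{act}}$$

Here $\mathcal{L}_{\text{rtg}}$, $\mathcal{L}_{\text{mask}}$, and $\mathcal{L}_{\text{act}}$ denote the prediction losses on return-to-go, action mask, and action, respectively. Setting one weight to zero removes the corresponding objective from training.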

## Appendix HIndustry Level UAV Swarm Simulator

The industry level UAV swarm simulator we applied is designed to create a digital twin of drone swarms for accurate analysis of both UTM system and UAVs’ behaviors in real\-world environments and interactions between natural environment and the whole system\. Powered by a physics engine, the simulator closely replicates real\-world physics\. Additionally, the simulator incorporates hardware\-in\-the\-loop by integrating actual UAV flight control systems, which adds to the accuracy\. The simulator supports a variety of environmental configurations, including buildings, moving objects like balloons and birds, lighting conditions, and wind effects, etc\. Backed by a dedicated support team, the system’s reliability can be continuously improved\.

## Appendix IEnvironment Details

![Refer to caption](https://arxiv.org/html/2606.31114v1/figures/physic_failure.png)\(a\)Physical failures;
![Refer to caption](https://arxiv.org/html/2606.31114v1/figures/task_failure.png)\(b\)Task failures;

Figure 8: Two main types of failures in UTM. Physical failures: Failures that result from physical damage or malfunction in system components, such as structural damage, hardware breakdowns, or external impact. These failures typically require immediate attention as they compromise the safety and integrity of the UAV or surrounding environment. Task Failures: Failures related to mission objectives, such as incorrect task execution, navigation errors, etc. Task failures impact the operational success and can disrupt planned missions or lead to unexpected behavior.

| Type | Index | Area | # of Airport | # of UAV | # of Flight Line | # of Alternate Airport | Fraction |
|---|---|---|---|---|---|---|---|
| Offline Training | R1 | Rural | 6 | 16 | 12 | 2 | 12.2% |
| Offline Training | R2 | Suburb | 12 | 24 | 24 | 7 | 18.3% |
| Offline Training | R3 | Urban | 6 | 36 | 18 | 6 | 27.5% |
| Offline Training | R4 | Suburb | 10 | 15 | 10 | 2 | 11.5% |
| Offline Training | R5 | Suburb | 10 | 15 | 10 | 2 | 9.2% |
| Offline Training | R6 | Urban | 8 | 16 | 16 | 2 | 12.2% |
| Offline Training | R7 | Urban | 4 | 12 | 8 | 3 | 9.1% |
| Online Testing | TR1 | Rural | 9 | 29 | 16 | 2 | N/A |
| Online Testing | TR2 | Urban | 6 | 16 | 16 | 6 | N/A |

Table 8:Overview of training and testing regions used in the scenario\-based testing framework\.Each region is categorized by type \(rural, suburban, or urban\) and is characterized by attributes such as the number of airports, UAVs, flight lines, and alternate airports\. For training dataset, the fraction of each region is provided to reflect the distribution of different operational environments\. Each region is specifically designed to provide a representative mix of operational challenges: regions R1 and R4 emphasize low\-density rural and suburban operations, respectively, whereas regions R3 and R6 represent high\-density urban areas with increased air traffic complexity\. This distribution ensures the model learns to generalize across different environment types while prioritizing scenarios with a higher likelihood of critical interactions\. Testing regions are designed to evaluate model performance on both trained dataset and unseen scenarios, ensuring robustness and generalizability\.
