
# Safety-Oriented Evaluation of Language Understanding Systems for Air Traffic Control
Source: [https://arxiv.org/html/2605.11769](https://arxiv.org/html/2605.11769)
Yujing Chang1,∗, Yash Guleria2, Duc-Thinh Pham1,3, Nhut-Huy Pham1, Ningli Wang1, Vu N. Duong3, Sameer Alam1

###### Abstract

Air Traffic Control (ATC) is a safety-critical domain in which incorrect interpretation of instructions may lead to severe operational consequences. While large language models (LLMs) demonstrate strong general performance, their reliability in operational ATC environments remains unclear. Existing evaluation approaches, largely based on aggregate metrics such as F1 or macro accuracy, treat all errors uniformly and fail to account for the asymmetric consequences of high-risk semantic mistakes (e.g., incorrect runway identifiers or movement constraints). To address this gap, we propose a safety-oriented, consequence-aware evaluation framework tailored to ATC operations. Our results reveal that while current LLMs achieve reasonable aggregate accuracy, their operational reliability is severely limited. Evaluated on clean transcripts, the peak Risk Score reaches only 0.69, with most models scoring below 0.6 despite high macro-F1 performance. Further analysis shows that errors concentrate in high-impact entities despite relatively stable action-type classification, indicating structural grounding deficiencies. These findings highlight the necessity of consequence-aware evaluation protocols for the responsible deployment of AI-assisted ATC systems.

![Refer to caption](https://arxiv.org/html/2605.11769v1/x1.png)

Figure 1: Overview of the evaluation pipeline, contrasting conventional semantic metrics with the proposed consequence-aware evaluation framework for ATC language understanding.

## I Introduction

Air Traffic Control (ATC) is a core component of the global air transportation system, ensuring the safe and orderly movement of aircraft on runways, taxiways, and in controlled airspace. Misinterpretation of operational ATC instructions can lead to safety incidents, including runway incursions and mid-air collisions. These events represent some of the most critical safety hazards in aviation. Given the compact and context-dependent phraseology of ATC communications, accurate semantic understanding is essential for maintaining operational safety.

Recent advances in large language models (LLMs) and automatic speech recognition (ASR) systems have motivated growing interest in AI-assisted air traffic management [[12](https://arxiv.org/html/2605.11769#bib.bib4), [14](https://arxiv.org/html/2605.11769#bib.bib5)]. Spoken language understanding (SLU) research has improved robustness to ASR errors through approaches such as contrastive learning, mixture-of-experts architectures, and speech-text alignment [[2](https://arxiv.org/html/2605.11769#bib.bib10), [3](https://arxiv.org/html/2605.11769#bib.bib12), [4](https://arxiv.org/html/2605.11769#bib.bib20), [1](https://arxiv.org/html/2605.11769#bib.bib21)], while ATC-specific studies have benefited from domain corpora, ASR benchmarks, transfer learning, and retrieval-enhanced scenario modeling for structured semantic extraction and operationally realistic evaluation [[9](https://arxiv.org/html/2605.11769#bib.bib22), [20](https://arxiv.org/html/2605.11769#bib.bib23), [13](https://arxiv.org/html/2605.11769#bib.bib15), [17](https://arxiv.org/html/2605.11769#bib.bib24), [16](https://arxiv.org/html/2605.11769#bib.bib3), [8](https://arxiv.org/html/2605.11769#bib.bib25)].

Despite this progress, evaluation in both SLU and ATC remains centered on surface-level metrics such as word error rate (WER) [[6](https://arxiv.org/html/2605.11769#bib.bib14)], character error rate (CER) [[5](https://arxiv.org/html/2605.11769#bib.bib13)], intent accuracy, and slot-level F1 [[10](https://arxiv.org/html/2605.11769#bib.bib7), [11](https://arxiv.org/html/2605.11769#bib.bib8), [19](https://arxiv.org/html/2605.11769#bib.bib16)], which do not capture the operational consequences of semantic misunderstandings. This limitation is particularly acute in safety-critical settings, where misrecognizing a callsign in a high-risk taxi clearance carries far greater consequences than a minor slot omission in a routine exchange. Broader LLM safety research has similarly highlighted the heterogeneity of error severity [[18](https://arxiv.org/html/2605.11769#bib.bib26)] and the need for behavioral evaluation beyond aggregate metrics [[15](https://arxiv.org/html/2605.11769#bib.bib27)]; related safety-critical domains such as medical AI have begun to adopt risk-stratified evaluation schemes [[7](https://arxiv.org/html/2605.11769#bib.bib30)]. Nonetheless, systematic consequence-aware evaluation for real-time operational language understanding in ATC remains largely unexplored.

To address this gap, we propose a safety-oriented, risk-aware evaluation framework tailored to real-world ATC operations, as illustrated in Figure [1](https://arxiv.org/html/2605.11769#S0.F1). The framework models structured semantic actions, including action types and safety-critical slots, and associates them with operational risk levels derived from expert knowledge and regulations, enabling consequence-aware scoring beyond aggregate accuracy. Using authentic ATC communication data, we systematically evaluate LLMs under both text-based and end-to-end voice-to-semantics settings. Through quantitative analysis and case studies, we reveal high-risk failure modes that remain hidden under conventional evaluation and show that, while current models achieve reasonable overall performance, reliability degrades significantly in safety-critical scenarios.

The main contributions of this work are:

- A safety-oriented, risk-aware evaluation framework for ATC language understanding, explicitly modeling safety-critical semantic actions and associated risk levels.
- A systematic evaluation of LLMs on authentic ATC communications, including end-to-end voice-to-semantics assessment.
- Quantitative and qualitative analyses revealing safety-critical failure modes not captured by conventional evaluation metrics.

## II Dataset

### II-A Data Source and Initial Processing

The dataset used in this study builds upon real-world ATC communications collected at Singapore Changi Airport (ICAO: WSSS), originally introduced in [[16](https://arxiv.org/html/2605.11769#bib.bib3)]. Controller–pilot transmissions were recorded between 17–23 March 2025 across three ground-control frequencies (124.300 MHz, 121.850 MHz, 121.725 MHz) over two daily time windows.

The corpus focuses on surface movement instructions in ground control operations. To construct the evaluation subset, we extracted a continuous high-density traffic segment, removed utterances without identifiable aircraft callsigns, and reviewed the remaining data with a licensed controller and a commercial pilot. The final subset preserves natural operational sequences while providing approximately 1.2 hours of communication and around 1,000 annotated utterances.

TABLE I: Dataset statistics. (a) Utterance-level statistics. (b) Entity and risk statistics.

### II-B Risk-Oriented Semantic Reconstruction

To support risk-aware evaluation, each utterance was restructured into a unified action-level representation.

##### Risk-informed schema.

A structured expert survey with five licensed tower controllers from Beijing Capital Airport Tower was conducted to assess the operational relevance of semantic components in ground-control communications. Their input informed the risk-oriented schema design, including the relative safety importance assigned to different entity categories.

##### Action taxonomy.

Each utterance was mapped to one of nine canonical action types (HOLD, TAXI, GIVE_WAY, CONTACT, PUSHBACK, INFORM, GREET, STANDBY, UNKNOWN), which were further stratified into three operational risk levels (HIGH, MEDIUM, LOW) according to their potential safety consequences in ground operations.

##### Critical slots.

For each action type, a predefined set of critical slots was specified (e.g., callsign and boundary for HOLD; callsign and frequency for CONTACT). The dataset was annotated accordingly, producing structured representations containing action type, normalized slots, and associated risk level.

This reconstruction converts free-form ATC utterances into structured semantic units suitable for risk-aware evaluation.
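For concreteness, one such risk-aware semantic unit might look as follows. The field names and values are a hypothetical illustration, not the paper's actual annotation format:

```python
# Hypothetical reconstruction of one utterance into the action-level
# representation described above. Field names and the example callsign
# are illustrative only.
utterance = {
    "text": "Singapore 123, hold short of runway 02L",
    "action_type": "HOLD",      # one of the nine canonical action types
    "risk_level": "HIGH",       # HIGH / MEDIUM / LOW
    "slots": {                  # critical slots defined for HOLD
        "callsign": "Singapore 123",
        "boundary": "runway 02L",
    },
}
```

Each annotated utterance thus carries its action type, normalized slots, and an associated risk level, which is what the consequence-aware metrics in Section III operate on.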

### II-C Data Analysis

Table [I](https://arxiv.org/html/2605.11769#S2.T1) summarizes the curated evaluation subset. Here, O denotes tokens outside annotated entity spans, while non-O denotes tokens assigned to semantic entity labels. The corpus shows balanced pilot–controller interaction, with instructions and readbacks comprising most exchanges, and maintains a stratified risk composition (48% high, 26% medium, 26% low). Overall, it provides realistic surface-traffic dynamics with sufficient semantic and risk diversity for consequence-aware evaluation.

## III Evaluation Metrics

As illustrated in Figure [1](https://arxiv.org/html/2605.11769#S0.F1), treating all semantic errors uniformly fails to capture the strict safety risks inherent to ATC communications. To ensure operational grounding, our evaluation incorporates ATC regulations together with expert input grounded in licensed controllers’ assessment of real-world operational risk (Table [II](https://arxiv.org/html/2605.11769#S3.T2)). This expert-informed process guided the action taxonomy, risk stratification, and consequence-aware weighting used throughout the evaluation. Under established ATC surface-movement procedures, semantic failures are not operationally equivalent: callsign errors can misassign a clearance to the wrong aircraft, taxiway or runway reference errors can produce routing conflicts or unauthorized runway entry, and missed movement constraints can invalidate hold-short, give-way, or boundary restrictions. Because such failure modes correspond to materially different hazard consequences, our evaluation explicitly differentiates entity criticality and action context rather than treating all slot errors uniformly.

Building on this foundation, our evaluation employs a progressive four-level hierarchy: (i) speaker identification, (ii) intention recognition, (iii) risk-aware entity extraction, and (iv) action-level consequence-aware scoring. Unless otherwise stated, all scores are computed on the curated evaluation subset and macro-averaged across classes.

### III-A Speaker Evaluation

Speaker role is formulated as a binary classification task over {Pilot, Controller}. We report Macro-F1 as the primary metric and accuracy as auxiliary:

$$\mathrm{Speaker\text{-}F1}=\mathrm{MacroF1}(\textsc{Pilot},\textsc{Controller}).\tag{1}$$

### III-B Intention Evaluation

Intention classification is modeled as a four-class problem over {Greet, Inform, Instruction, Readback}. Performance is measured using Macro-F1:

$$\mathrm{Intention\text{-}F1}=\frac{1}{|\mathcal{C}|}\sum_{c\in\mathcal{C}}\mathrm{F1}_{c},\tag{2}$$

where $\mathcal{C}$ denotes the set of intention classes.
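The Macro-F1 used in Eqs. (1) and (2) is the unweighted mean of per-class F1 scores. A minimal reference sketch (the paper's own evaluation tooling is not published):

```python
def macro_f1(gold, pred, classes):
    """Unweighted mean of per-class F1: every class counts equally,
    regardless of how often it occurs."""
    scores = []
    for c in classes:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)
```

Because each class contributes equally, a model that ignores a rare class (e.g., Greet) is penalized as heavily as one that ignores a frequent class.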

### III-C Entity Evaluation (Risk-Aware)

#### III-C1 Slot-based representation

Entities are evaluated as semantic slots represented by the pair $(\textit{entity\_type},\textit{text})$.

Matching is performed in a one-to-one manner between ground-truth and predicted entities. A predicted entity is considered correct if (i) the entity type matches and (ii) the token-level overlap between the predicted text and the ground-truth text exceeds a predefined threshold (0.9 in our experiments). Each predicted entity can be matched to at most one ground-truth entity.
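The matching rule above can be sketched as follows. The overlap measure and the greedy first-fit order are assumptions; the paper specifies only the 0.9 threshold and the one-to-one constraint:

```python
def match_entities(gold, pred, threshold=0.9):
    """One-to-one matching between (entity_type, text) pairs.

    A prediction matches a gold entity if the types agree and the
    token-level overlap of their texts exceeds `threshold`. The overlap
    measure and greedy first-fit order are illustrative assumptions.
    Returns the list of matched ground-truth entities.
    """
    def overlap(a, b):
        ta, tb = a.lower().split(), b.lower().split()
        if not ta or not tb:
            return 0.0
        # multiset intersection over tokens, normalized by the longer text
        common = sum(min(ta.count(t), tb.count(t)) for t in set(ta))
        return common / max(len(ta), len(tb))

    used = set()   # gold indices already consumed (one-to-one constraint)
    hits = []
    for p_type, p_text in pred:
        for i, (g_type, g_text) in enumerate(gold):
            if i in used or p_type != g_type:
                continue
            if overlap(p_text, g_text) > threshold:
                used.add(i)
                hits.append((g_type, g_text))
                break
    return hits
```

Under this rule a near-miss such as "runway 20L" against "runway 02L" fails the overlap test, which is the intended behavior for safety-critical identifiers.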

#### III-C2 Slot-level F1

We compute entity extraction performance using Macro-F1 across entity types:

$$\mathrm{MacroF1}=\frac{1}{|\mathcal{E}|}\sum_{e\in\mathcal{E}}\mathrm{F1}_{e},\tag{3}$$

where $\mathcal{E}$ denotes the set of entity categories. This unweighted formulation ensures that frequent entities do not dominate the overall score.

#### III-C3 Risk-Weighted Entity Recall (RW-ER)

To reflect that not all missing entities carry equal operational impact, we employ a risk-weighted recall over *all ground-truth entities* present in an utterance. Each entity type $e$ is assigned an importance weight $w(e)$ derived from a questionnaire-based survey of five licensed tower controllers from Beijing Capital Airport Tower. Controllers rated the relative safety impact of omitting or misinterpreting each entity category in ground-control communications; ratings were averaged across participants and normalized to $[0,1]$, where higher values indicate greater operational criticality (Tables [II](https://arxiv.org/html/2605.11769#S3.T2) and [III](https://arxiv.org/html/2605.11769#S3.T3)).

Let $\mathcal{E}_{\mathrm{gt}}$ be the set of all ground-truth entities for an utterance and $\mathcal{E}_{\mathrm{hit}}$ be the subset that is correctly predicted. We define:

$$\mathrm{RW\text{-}ER}=\frac{\sum_{e\in\mathcal{E}_{\mathrm{hit}}}w(e)}{\sum_{e\in\mathcal{E}_{\mathrm{gt}}}w(e)}.$$

This metric penalizes omissions of safety-critical entities (e.g., Callsign, Runway/Taxiway) more heavily than low-impact entities (e.g., Greet).
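A minimal sketch of RW-ER; the weights used in the example below are illustrative, not the survey-derived values of Table III:

```python
def risk_weighted_recall(gold_entities, hit_entities, weights):
    """RW-ER: weighted fraction of ground-truth entities recovered.

    gold_entities / hit_entities: lists of (entity_type, text) pairs,
    where hits are the correctly predicted subset of gold.
    weights: entity_type -> importance w(e) in [0, 1].
    """
    total = sum(weights[t] for t, _ in gold_entities)
    hit = sum(weights[t] for t, _ in hit_entities)
    return hit / total if total else 1.0
```

With an illustrative weighting such as `{"callsign": 1.0, "greet": 0.1}`, dropping a greeting barely moves the score, while dropping the callsign collapses it, which is exactly the asymmetry the metric is designed to capture.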

TABLE II: Summary of expert survey used to derive entity weights.

TABLE III: Entity importance weights $w(e)$ used in risk-aware evaluation. Higher values indicate greater operational criticality if an entity is omitted or misunderstood. O denotes tokens outside annotated entity spans.

### III-D Action-Level Risk-Aware Scoring

#### III-D1 Action schema and risk levels

Each utterance is mapped to an action type $a_i\in\mathcal{A}$ with a predefined set of critical slots $\mathcal{S}(a_i)$, as in Table [IV](https://arxiv.org/html/2605.11769#S3.T4). Action types are associated with an operational risk level $\rho(a_i)\in\{1.0,0.6,0.2\}$ for High/Medium/Low risk, respectively (determined by domain knowledge and controller feedback).

TABLE IV: Action schema with risk levels and critical slots.

#### III-D2 Risk-weighted action score

Given an utterance $i$ with ground-truth action type $a_i$ and its critical slot set $\mathcal{S}(a_i)$, we define a consequence-aware correctness score:

$$\mathrm{Score}_{i}=r(a_{i})\cdot\frac{\sum_{s\in\mathcal{S}(a_{i})}w_{a_{i},s}\,m_{i,s}}{\sum_{s\in\mathcal{S}(a_{i})}w_{a_{i},s}}.\tag{4}$$

Here, $w_{a_i,s}$ is the importance weight of critical slot $s$ under action $a_i$ (derived from the entity weights in Table [III](https://arxiv.org/html/2605.11769#S3.T3)), and $m_{i,s}\in\{0,1\}$ is a slot match indicator (1 if slot $s$ is correctly predicted; 0 otherwise).

##### Action-type risk coefficient.

We incorporate action-type correctness with a risk coefficient:

$$r(a_{i})=\begin{cases}1.0,&\text{if the action type is predicted correctly},\\1-\rho(a_{i}),&\text{otherwise},\end{cases}\tag{5}$$

where $\rho(a_i)$ is the predefined risk level of action $a_i$:

$$\rho(a_{i})=\begin{cases}1.0,&\textsc{High risk},\\0.6,&\textsc{Medium risk},\\0.2,&\textsc{Low risk}.\end{cases}\tag{6}$$

This design penalizes action-type mistakes more aggressively for high-risk actions (e.g., movement restrictions) while allowing smaller penalties for low-risk exchanges.

##### Dataset-level aggregation.

We report the mean action-risk score over all utterances:

$$\mathrm{Action\text{-}RiskScore}=\frac{1}{N}\sum_{i=1}^{N}\mathrm{Score}_{i}.\tag{7}$$
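Equations (4)–(6) combine into a single per-utterance scoring function. The sketch below is consistent with the definitions above; the slot weights in the test are illustrative, not the paper's survey-derived values:

```python
def action_risk_score(gt_action, pred_action, slot_weights, slot_match, rho):
    """Consequence-aware score for one utterance (Eqs. 4-6).

    slot_weights: {slot: w_{a,s}} for the critical slots of gt_action.
    slot_match:   {slot: 1 if correctly predicted else 0}.
    rho:          predefined risk level of gt_action (1.0 / 0.6 / 0.2).
    """
    # Eq. (5): wrong type costs more for high-risk actions.
    r = 1.0 if pred_action == gt_action else 1.0 - rho
    # Eq. (4): importance-weighted fraction of correct critical slots.
    num = sum(slot_weights[s] * slot_match[s] for s in slot_weights)
    den = sum(slot_weights.values())
    return r * (num / den if den else 1.0)
```

Note the interaction with Eq. (6): misclassifying a High-risk action ($\rho=1.0$) zeroes the whole utterance score regardless of slot quality, whereas a Low-risk misclassification retains 80% of the slot score.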
In addition, we report the average inference time per utterance for each model as a practical deployment consideration in real-time operational environments.

Together, these metrics distinguish surface-level accuracy from true operational understanding, providing a more reliable assessment of deployment readiness under safety constraints.

## IV Experimental Setup

### IV-A Evaluation Data

All experiments are conducted on the curated evaluation subset described in Section [II](https://arxiv.org/html/2605.11769#S2). The subset contains 1,000 manually verified ATC utterances with structured annotations, including speaker role, communicative intention, entity slots, and action-level risk labels.

### IV-B LLM Evaluation Protocol

We evaluate multiple state-of-the-art commercial and open-weight LLMs via their official APIs under standardized conditions. For each utterance, only the raw text is provided as input.

The prompt specifies the predefined entity types, action types, and a fixed structured output template. Models generate structured outputs accordingly, which are then parsed and evaluated using the risk-aware metrics defined in Section [III](https://arxiv.org/html/2605.11769#S3).

No task-specific fine-tuning is applied. All models are evaluated in a strictly zero-shot setting to assess their out-of-the-box reliability. This design reflects practical constraints in safety-critical ATC environments, where large-scale, risk-calibrated annotations are scarce and sensitive. Evaluating zero-shot behavior establishes a baseline of intrinsic robustness and prevents domain adaptation from masking fundamental structural vulnerabilities.
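The paper does not publish its exact prompt or output template; a hypothetical sketch of such a structured prompting and parsing setup might look like the following, with the template text and validation logic as assumptions:

```python
import json

# Hypothetical prompt skeleton; the paper's actual wording, entity
# inventory, and output template are not published.
ACTION_TYPES = {"HOLD", "TAXI", "GIVE_WAY", "CONTACT", "PUSHBACK",
                "INFORM", "GREET", "STANDBY", "UNKNOWN"}

PROMPT_TEMPLATE = (
    "You are an ATC utterance parser.\n"
    "Allowed action types: " + ", ".join(sorted(ACTION_TYPES)) + ".\n"
    'Return JSON with keys "speaker", "intention", "action_type", "slots".\n'
    "Utterance: "
)

def parse_model_output(raw):
    """Parse the model's JSON output and validate the action type,
    so malformed or out-of-schema responses fail loudly."""
    out = json.loads(raw)
    if out.get("action_type") not in ACTION_TYPES:
        raise ValueError(f"unknown action type: {out.get('action_type')!r}")
    return out
```

Parsed outputs in this shape can then be scored directly with the risk-aware metrics of Section III.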

Inference latency is measured as the average processing time per utterance over the full evaluation set.

### IV-C End-to-End Voice Pipeline

To evaluate robustness under realistic speech input, we construct an end-to-end voice pipeline by integrating automatic speech recognition (ASR) backbones with downstream semantic parsing. Several open-source ASR models (Whisper-large, Whisper-medium, Whisper-small, Whisper-turbo, and wav2vec2) are used to transcribe audio utterances. The resulting transcripts are then processed by a fixed downstream LLM (GPT-4o-mini) using the same structured prompting protocol described above.
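The pipeline's structure can be sketched as follows. The ASR and LLM stages are injected as callables (stubs in the test), since the paper's integration code is not published; the per-stage timing mirrors the separate latency reporting described below:

```python
import time

def voice_to_semantics(audio_path, asr, llm_parse):
    """End-to-end skeleton: audio -> transcript -> structured parse.

    `asr` and `llm_parse` are injected callables (e.g. a Whisper model's
    transcription step and an LLM API call); this is a sketch of the
    pipeline shape, not the authors' implementation.
    """
    t0 = time.perf_counter()
    transcript = asr(audio_path)            # ASR backbone
    asr_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    parsed = llm_parse(transcript)          # fixed downstream LLM
    llm_s = time.perf_counter() - t0

    # Latencies are kept per stage so ASR cost can be reported separately.
    return parsed, {"asr_s": asr_s, "llm_s": llm_s}
```

Because the downstream parser is held fixed, any score difference across ASR backbones isolates the effect of transcription noise.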

Evaluation uses the same risk-aware metrics for direct comparison. ASR latency is reported separately to reflect practical considerations in real-time operational settings.

## V Empirical Analysis under Risk-Aware Evaluation

We evaluate state-of-the-art LLMs under the proposed risk-aware metrics to examine whether strong aggregate performance translates into operational safety.

### V-A Aggregate vs. Risk-Aware Gap

#### V-A1 Overall Performance Landscape

Under clean transcript input, frontier LLMs show strong structured parsing across speaker, intention, and action extraction. The Gemini series achieves the highest Risk Scores (0.6922 / 0.6884), while GPT-5.1 attains competitive performance (0.6775) and the highest strict score (0.4240). Clear scaling effects are also observed within the Qwen family.

#### V-A2 Conventional Metrics vs. Risk-Aware Evaluation

A consistent gap emerges between conventional metrics (NER F1, Action-Macro) and consequence-aware scores. For most models, the Risk Score is lower than Action-Macro, indicating that macro averaging overestimates safety performance.

Weighting high-impact entities (e.g., callsigns, runway identifiers, movement constraints) exposes hidden weaknesses. GPT-5.1 and qwen3-max remain stable under weighting (Act W/T), whereas several other models degrade, revealing sensitivity to critical-slot errors.

#### V-A3 Strict Criterion as a Safety Stress Test

Under Action-Risk-Strict, where any error in type or critical slot yields a score of zero, performance collapses across all systems. Most models fall within the 0.1–0.2 range, and even GPT-5.1 (0.4240) remains far from operational reliability. High aggregate semantic accuracy therefore does not imply safety compliance, as minor structural deviations can invalidate an entire instruction under real operational constraints.
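One reading of the strict criterion is the following fail-safe scoring function; this is an interpretation of the rule described above, not the authors' released code:

```python
def strict_score(gt_action, pred_action, slot_match):
    """Action-Risk-Strict (fail-safe reading): any error in the action
    type or in any critical slot zeroes the whole utterance.

    slot_match: {slot: 1 if correctly predicted else 0} over the
    critical slots of the ground-truth action.
    """
    if pred_action != gt_action:
        return 0.0
    return 1.0 if all(slot_match.values()) else 0.0
```

Unlike the graded score of Eq. (4), this all-or-nothing rule mirrors operational reality: a clearance with one wrong critical element is not partially usable.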

TABLE V: Model performance comparison with inference latency (1,000 utterances). Metric abbreviations: Risk Score is the proposed consequence-aware evaluation metric, where higher values indicate better model performance rather than higher operational risk; Act F1 denotes action-type classification F1; Risk-NER indicates risk-weighted entity F1; NER F1 is unweighted entity-level F1; Act Macro refers to unweighted slot-level Macro-F1; Act W/T applies slot-importance weighting without type penalty; Risk Strict denotes the fail-safe criterion requiring full correctness of action type and critical slots.

![Refer to caption](https://arxiv.org/html/2605.11769v1/picture/entities.png)

Figure 2: Entity-level accuracy across evaluated models. Identity-related entities (e.g., Callsign, Controller) achieve consistently high accuracy, whereas spatial and constraint-related entities (e.g., Runway, Gate, Condition) exhibit substantial cross-model variability.

### V-B Structural Failure Patterns

#### V-B1 Entity-Level Vulnerabilities

Figure [2](https://arxiv.org/html/2605.11769#S5.F2) presents entity-level accuracy across models. Identity-related entities such as Callsign, Controller, and Frequency exhibit consistently high accuracy across frontier systems, frequently exceeding 0.94. This suggests that surface-form recognition and role identification are largely saturated under clean transcript conditions.

In contrast, spatial and constraint-related entities demonstrate substantial instability. Runway extraction remains particularly fragile across model families, with several variants (e.g., qwen-plus, qwen3-14b) dropping below 0.15–0.20 accuracy. Similarly, Gate and contextual Condition show large cross-model performance gaps, indicating persistent weaknesses in operational grounding and constraint resolution. These results reveal that aggregate NER F1 obscures safety-critical vulnerabilities concentrated in specific entity categories rather than uniformly distributed errors.

![Refer to caption](https://arxiv.org/html/2605.11769v1/picture/level.png)

Figure 3: Risk-level robustness curves across evaluated models. Left: critical slot accuracy; right: action-type accuracy. While type performance remains comparatively stable across High-, Medium-, and Low-risk subsets, slot extraction exhibits substantially greater variability, revealing structural grounding instability under safety-critical conditions.

#### V-B2 Risk-Stratified Structural Sensitivity

To further examine structured failure modes, we conduct risk-stratified evaluation separating action-type classification from critical slot extraction. Figure [3](https://arxiv.org/html/2605.11769#S5.F3) summarizes robustness trends across High-, Medium-, and Low-risk subsets.

Action-type accuracy remains relatively stable across risk levels, typically fluctuating within a narrow band, and High-risk categories are not consistently the most challenging at the type level. In contrast, slot extraction exhibits higher risk sensitivity. While frontier models (e.g., GPT-5.1 and Gemini variants) maintain relatively stable slot performance across risk strata, several systems show pronounced degradation under high-impact conditions. In multiple cases, models achieve competitive High-risk type accuracy yet display sharp slot-level drops, revealing structural inconsistency in constraint grounding. This type–slot discrepancy explains the sharp decline under strict evaluation: a single error in a safety-critical argument can invalidate an otherwise correct instruction.

TABLE VI: Voice pipeline performance using different ASR models (LLM: GPT-4o-mini).

### V-C Deployment Fragility under ASR Noise

#### V-C1 Voice Pipeline Degradation

Table [VI](https://arxiv.org/html/2605.11769#S5.T6) reports performance under the end-to-end voice pipeline. Introducing ASR noise causes sharp degradation across all consequence-aware metrics. While GPT-4o-mini achieves a Risk Score of 0.45 on clean transcripts, performance drops to 0.04–0.07 with Whisper backbones and further to 0.01 with wav2vec2, an order-of-magnitude decline despite identical downstream inference. Action-Macro decreases more moderately than the risk-weighted metrics, indicating that coarse action-type recognition remains partially intact, whereas structured slot extraction (particularly callsigns, runway identifiers, taxiway references, and boundary constraints) is highly sensitive to even minor transcription perturbations. As a result, errors that appear minor at the lexical level propagate directly to consequence-aware scoring.

#### V-C2 Safety Collapse under Strict Criterion

The impact becomes critical under the fail-safe Action-Risk-Strict metric. Once strict structural correctness is required, performance collapses to 0.00 across all ASR configurations. No voice-based setting produces fully correct structured predictions when transcription noise is present.

These findings reveal a substantial gap between transcript-level evaluation and realistic deployment conditions. Even small recognition inconsistencies can invalidate safety-critical instructions under operational constraints. Text-only benchmarking therefore substantially overestimates real-world reliability and should be complemented by end-to-end voice evaluation.

## VI Conclusion

This work evaluates large language models for safety-critical language understanding in air traffic control (ATC) under consequence-aware metrics. We show that conventional aggregate metrics substantially overestimate operational reliability. Under risk-weighted scoring and strict structural correctness, performance drops markedly across all models.

Failures concentrate in safety-critical entities such as runway identifiers and movement constraints, even when action types are predicted correctly. Deployment under realistic voice conditions further induces an order-of-magnitude degradation, with strict correctness collapsing across ASR settings. These results indicate that high-level semantic accuracy does not guarantee structured operational safety.

Future work will investigate safety-oriented model improvement, constraint-aware decoding, domain adaptation, more extensive sensitivity analysis, and human-in-the-loop validation, together with broader cross-domain extension of the evaluation setting. More broadly, evaluation frameworks for high-risk domains must move beyond linear token-level averages toward consequence-sensitive assessment.

Reliable deployment in safety-critical environments requires not only stronger models, but stricter evaluation standards.

## References

- [1] Z. Chen, H. Huang, A. Andrusenko, O. Hrinchuk, K. C. Puvvada, J. Li, S. Ghosh, J. Balam, and B. Ginsburg (2023). SALM: speech-augmented language model with in-context learning for speech recognition and translation. arXiv preprint [arXiv:2310.09424](https://dx.doi.org/10.48550/arXiv.2310.09424).
- [2] (2023). C2A-SLU: cross and contrastive attention for improving ASR robustness in spoken language understanding. In [Proc. Interspeech 2023](https://dx.doi.org/10.21437/Interspeech.2023-93), pp. 695–699.
- [3] X. Cheng, Z. Zhu, X. Zhuang, Z. Chen, Z. Huang, and Y. Zou (2024). MoE-SLU: towards ASR-robust spoken language understanding via mixture-of-experts. In [Findings of the Association for Computational Linguistics: ACL 2024](https://dx.doi.org/10.18653/v1/2024.findings-acl.882), Bangkok, Thailand, pp. 14868–14879.
- [4] L. Dong, Z. An, P. Wu, J. Zhang, L. Lu, and Z. Ma (2023). CIF-PT: bridging speech and text representations for spoken language understanding via continuous integrate-and-fire pre-training. In [Findings of the Association for Computational Linguistics: ACL 2023](https://dx.doi.org/10.18653/v1/2023.findings-acl.566), Toronto, Canada, pp. 8894–8907.
- [5] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber (2006). Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In ICML.
- [6] A. Graves and N. Jaitly (2014). Towards end-to-end speech recognition with recurrent neural networks. In ICML.
- [7] H. Guan, P. Hou, P. Hong, L. Wang, W. Zhang, X. Du, Z. Zhou, and L. Zhou (2025). A clinically-informed framework for evaluating vision-language models in radiology report generation: taxonomy of errors and risk-aware metric. [medRxiv preprint](https://dx.doi.org/10.1101/2025.07.13.25331222).
- [8] Y. Guleria, D. Pham, A. L. K. Yun, T. N. M. Nadirsha, K. Fennedy, C. Ma, and S. Alam (2026). Text2Traffic: retrieval-enhanced in-context learning for complex air traffic scenario generation. [Journal of Aerospace Information Systems](https://doi.org/10.2514/1.D0566). Published online Mar. 16, 2026.
- [9] K. Hofbauer, S. Petrik, and H. Hering (2008). The ATCOSIM corpus of non-prompted clean air traffic control speech. In [Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)](https://aclanthology.org/L08-1507/), Marrakech, Morocco.
- [10] A. Mani, S. Palaskar, and S. Konam (2020). Towards understanding ASR error correction for medical conversations. In [Proceedings of the First Workshop on Natural Language Processing for Medical Conversations](https://aclanthology.org/2020.nlpmc-1.2/), Online, pp. 7–11.
- [11] A. Mani, S. Palaskar, N. V. Meripo, S. Konam, and F. Metze (2020). ASR error correction and domain adaptation using machine translation. [arXiv:2003.07692](https://arxiv.org/abs/2003.07692); accepted for oral presentation at ICASSP 2020.
- [12] OpenAI (2023). GPT-4 technical report. arXiv preprint arXiv:2303.08774.
- [13] R. Prabhavalkar et al. (2021). Applying large-scale weakly-supervised automatic speech recognition to air traffic control. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6249–6253.
- [14] A. Radford et al. (2023). Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356.
- [15] M. T. Ribeiro, T. Wu, C. Guestrin, and S. Singh (2020). Beyond accuracy: behavioral testing of NLP models with CheckList. In [Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics](https://dx.doi.org/10.18653/v1/2020.acl-main.442), Online, pp. 4902–4912.
- \[16\]V\. Thai, D\. Aradhya, C\. Chan, Y\. Chang, P\. Thinh, and S\. Alam\(2025\)Speech\-to\-route: learning\-based airport taxi route inference from progressive air traffic control instructions\.Note:SSRNAvailable at SSRN:\\urlhttps://ssrn\.com/abstract=5871919External Links:[Document](https://dx.doi.org/10.2139/ssrn.5871919),[Link](https://ssrn.com/abstract=5871919)Cited by:[§I](https://arxiv.org/html/2605.11769#S1.p2.1),[§II\-A](https://arxiv.org/html/2605.11769#S2.SS1.p1.1)\.
- \[17\]M\. Y\. Z\. Wee, J\. J\. H\. Wong, L\. Lim, J\. Y\. W\. Tan, P\. Gupta, D\. Lim, E\. H\. Tew, A\. K\. S\. Han, and Y\. Z\. Lim\(2025\)Adapting automatic speech recognition for accented air traffic control communications\.Note:arXiv:2502\.20311External Links:[Document](https://dx.doi.org/10.48550/arXiv.2502.20311),[Link](https://arxiv.org/abs/2502.20311)Cited by:[§I](https://arxiv.org/html/2605.11769#S1.p2.1)\.
- \[18\]W\. Xu, Y\. Tuan, Y\. Lu, M\. Saxon, L\. Li, and W\. Y\. Wang\(2022\)Not all errors are equal: learning text generation metrics using stratified error synthesis\.InFindings of the Association for Computational Linguistics: EMNLP 2022,Abu Dhabi, United Arab Emirates,pp\. 6559–6574\.External Links:[Document](https://dx.doi.org/10.18653/v1/2022.findings-emnlp.489)Cited by:[§I](https://arxiv.org/html/2605.11769#S1.p3.1)\.
- \[19\]S\. Yin, P\. Huang, J\. Chen, H\. Huang, and Y\. Xu\(2025\-07\)ECLM: entity level language model for spoken language understanding with chain of intent\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),Vienna, Austria,pp\. 21851–21862\.External Links:[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.1061),[Link](https://aclanthology.org/2025.acl-long.1061/)Cited by:[§I](https://arxiv.org/html/2605.11769#S1.p3.1)\.
- \[20\]J\. Zuluaga\-Gomez, P\. Motlicek, Q\. Zhan, K\. Vesely, and R\. Braun\(2020\)Automatic speech recognition benchmark for air\-traffic communications\.Note:arXiv:2006\.10304External Links:[Document](https://dx.doi.org/10.48550/arXiv.2006.10304),[Link](https://arxiv.org/abs/2006.10304)Cited by:[§I](https://arxiv.org/html/2605.11769#S1.p2.1)\.
