ASTRA: A Scalable Next-Generation ATCO Training Simulator with Autonomous Simpilots
Summary
ASTRA is an end-to-end training simulator for air traffic control operators that automates sim pilot roles using locally adapted speech models, achieving a significant reduction in word error rates for Singaporean-accented aviation speech and incorporating AI-assisted performance evaluation.
# ASTRA: A Scalable Next-Generation ATCO Training Simulator with Autonomous Simpilots
Source: [https://arxiv.org/html/2606.18319](https://arxiv.org/html/2606.18319)
Corresponding author: lim\_yong\_zhi1@defence\.gov\.sg

Ethan Chew\*, Enjia Wu\*, Iruss Eng Wei Yeow, Ian Weiqin Lim, Ranen Sim, Brandon Koh Ziheng, Kaleb Nim, Caden Toh Jun Yi, Wei Dong Soin, Darius Kai Keat Koh, Galen King Yu Tay, Prannaya Gupta, Jonathan Ee Fang Koong

Air Emerging Technologies High\-Speed Experimentations and Research \(AETHER\), RSAF Agile Innovation Digital \(RAiD\), Republic of Singapore Air Force, Singapore
###### Abstract
Air Traffic Control Operators \(ATCOs\) are vital in ensuring the safe, orderly, and efficient flow of air traffic, yet training capacity is constrained by reliance on specialized human trainers known as simpilots, who must role\-play both pilots and ATCOs in a simulated airspace\. Existing automated solutions rely on Western\-centric speech models that perform poorly in Singaporean operational contexts, with off\-the\-shelf systems exhibiting Word Error Rates \(WER\) of up to 107\.80% on Singaporean\-accented aviation speech\. We introduce ASTRA, an end\-to\-end training simulator that automates these simpilot roles through a pipeline that transcribes ATCO speech, interprets instructions, and generates appropriate pilot and ATCO responses using locally adapted voice models\. Our fine\-tuned Automatic Speech Recognition \(ASR\) pipeline reduces WER to 23\.45%, substantially outperforming existing approaches in this domain\. Beyond traffic simulation, ASTRA incorporates an AI\-assisted performance evaluation framework that assesses trainee radiotelephony communications across accuracy, brevity, and completeness, achieving post\-optimization scores of 91\.7%, 88\.2%, and 86\.9%, respectively\. Built on open\-source foundations such as DSPy and Unsloth, this approach enables scalable, standardized ATCO assessment while reducing instructor workload\.
## 1 Introduction
Air Traffic Control Operators \(ATCOs\) play a critical role in ensuring the safe, orderly, and efficient flow of air traffic in increasingly congested skies\. As aviation recovers rapidly from COVID\-19, there is a critical shortage of ATCOs, increasing safety risks and limiting capacity \[understaffed\-cbs\]\. Current ATCO training systems rely heavily on specialized human trainers known as simulation\-pilots \(or "simpilots"\), who role\-play both aircraft pilots \(or "pseudo pilots"\) and other ATCOs \(or "ghost controllers"\) that coordinate with the trainee, creating a realistic training environment \[faa\_simpilots\]\.
This paradigm struggles with scalability: the required complement of instructors and simpilots limits training throughput and constrains practice to moments when qualified personnel are co\-located and available\. Automating the simpilot role is therefore essential to enable more flexible training, while providing standardized and objective assessment independent of human availability \[brudnicki2005application\]\. A further constraint is localization: frontier models trained on American and British English fail on Singaporean\-accented speech and aviation terminology\.
This work introduces ASTRA, a training simulator that mimics the functionality of a simpilot\. ASTRA implements an end\-to\-end speech\-to\-speech pipeline that answers input commands from ATCO trainees with appropriate responses\.
This pipeline consists of five major stages, which are modeled after \[lin2021deep\]:
1. Automatic Speech Recognition \(ASR\): transcribes ATCO trainee speech into textual commands\.
2. Controller Instruction Understanding \(CIU\): parses textual commands and extracts structured target parameters \(STPs\)\.
3. Response Generation: generates contextually appropriate replies from STPs, voiced as either the pseudo pilot or the ghost controller\.
4. Text\-to\-Speech \(TTS\): synthesizes audio based on responses generated by the Response Generation module, which is then streamed to the ATCO trainee\.
5. Simulation of Aircraft Movement \(SAM\): utilizes STPs to reflect changes in the positions of the aircraft in the simulation environment\.
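The five stages above can be read as a chain of functions, with the STPs from CIU feeding both Response Generation and SAM. The following is a minimal sketch of that data flow only, not the ASTRA implementation: `TurnResult`, `run_turn`, and the stub stages are hypothetical names introduced for illustration, and each real stage would wrap a model or simulator call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TurnResult:
    transcript: str     # ASR output
    stps: dict          # structured target parameters from CIU
    reply_text: str     # generated pseudo pilot / ghost controller reply
    reply_audio: bytes  # synthesized speech for the reply


def run_turn(
    audio: bytes,
    asr: Callable[[bytes], str],
    ciu: Callable[[str], dict],
    respond: Callable[[dict], str],
    tts: Callable[[str], bytes],
    sam: Callable[[dict], None],
) -> TurnResult:
    """Run one trainee transmission through the five stages in order."""
    transcript = asr(audio)        # 1. ASR
    stps = ciu(transcript)         # 2. CIU
    reply_text = respond(stps)     # 3. Response Generation
    reply_audio = tts(reply_text)  # 4. TTS
    sam(stps)                      # 5. SAM consumes the same STPs
    return TurnResult(transcript, stps, reply_text, reply_audio)


# Stub stages standing in for the real models (all values are illustrative).
applied = []
result = run_turn(
    b"\x00\x01",
    asr=lambda audio: "eagle one climb flight level two eight zero",
    ciu=lambda text: {"callsign": "EAGLE 1", "command": "CLIMB_TO", "value": "FL280"},
    respond=lambda stps: f"climb {stps['value']}, {stps['callsign']}",
    tts=lambda text: text.encode("utf-8"),
    sam=applied.append,
)
print(result.reply_text)
```

Passing the stages in as callables keeps the sketch independent of any particular ASR, LLM, or TTS backend.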
Beyond simpilot automation, ASTRA addresses a second gap in existing training systems: the lack of objective, scalable performance assessment\. ASTRA therefore incorporates an AI\-assisted performance evaluation framework that scores trainees' radiotelephony across accuracy, brevity, and completeness, providing automated feedback that would previously have required an experienced instructor to deliver\.
The rest of this paper is structured as follows: [Section 2](https://arxiv.org/html/2606.18319#S2) reviews relevant literature across the key components of automated ATCO training systems, outlining the current state of the art\. [Section 3](https://arxiv.org/html/2606.18319#S3) describes the design and implementation of ASTRA, including the simulation environment and end\-to\-end speech pipeline\. [Section 4](https://arxiv.org/html/2606.18319#S4) presents experimental evaluations of the system, covering ASR, TTS, and communication assessment performance\. Finally, [Section 5](https://arxiv.org/html/2606.18319#S5) discusses the limitations of the current approach and outlines directions for future work, before concluding in [Section 6](https://arxiv.org/html/2606.18319#S6)\.
## 2 Existing Work
Our review of work related to simpilots is organized around four sub\-modules: ASR, CIU, Response Generation, and TTS\.
### 2\.1 Automatic Speech Recognition \(ASR\)
Modern ASR systems such as *Whisper* \[radford2023robust\] have made significant advancements, resulting in broader applicability across domains\. Despite these advancements, frontier ASR models tend to struggle with two issues: transcribing Singaporean\-accented speech and accurately recognizing domain\-specific terminology\.
#### 2\.1\.1 Accent Robustness in ASR Models
Conventional ASR models perform poorly on Singaporean\-accented English because few speech corpora with these accents are available for training\.
\[he2024MERaLiON\] and \[wang2025advancing\] proposed *MERaLiON*, an audio foundation model that transcribes locally accented speech more accurately than existing frontier audio foundation models, and introduced the Multitask National Speech Corpus \(MNSC\), a large\-scale corpus for Singaporean\-accented transcription\.
#### 2\.1\.2 Domain Terminology in Aviation ASR
Conventional ASR models struggle to recognize the specific terminology used in radiotelephony communications\. To address this, many works explore fine\-tuning to adapt the weights of such models \(e\.g\., *Whisper*\) to domain\-specific corpora\.
\[van2024whisper\] presented *WhisperATC*, a family of models fine\-tuned on ATCOSIM \[hofbauer2008atcosim\] and ATCO2 \[zuluaga2022atco2\]\. These models achieve a Word Error Rate \(WER\) of 16\.74% for ATCO2 and 1\.19% for ATCOSIM, as compared to base *Whisper*, which achieved 24\.03% and 16\.74%, respectively\.
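Since WER is the headline metric throughout, a short worked example may help. WER is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words, which is why it can exceed 100% when a model inserts or hallucinates many words, as with the 107.80% figure quoted in the abstract. The sketch below is a plain dynamic-programming implementation; the example sentences are illustrative and not drawn from the paper's data.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("two" -> "to") in a six-word reference.
one_sub = wer("climb flight level two eight zero",
              "climb flight level to eight zero")

# A one-word reference transcribed as three wrong words: WER above 100%.
over_100 = wer("descend", "the send now")
print(f"{one_sub:.2%}, {over_100:.0%}")
```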
### 2\.2 Controller Instruction Understanding
\[prasad2022speech\] and \[zuluaga2023virtual\] trained a named entity recognition \(NER\) model based on BERT \[devlin2019bert\], which attempts to break down an ASR command into three key fields: 1\) callsign, 2\) command, and 3\) value\. \[jiang2024slkir\] proposed Small Sample Learning for Key Information Recognition \(SLKIR\), an end\-to\-end deep learning framework for information extraction from Chinese ATC commands\.
### 2\.3 Response Generation
\[lin2021deep\] proposes a pilot repetition generation method that trains a *Seq2Seq* model to generate pilot readbacks by reordering and preserving key elements of the ATCO instruction, such as aircraft callsign and command parameters\. Their bidirectional long short\-term memory network with attention helps capture ATC\-specific structures and cases where only partial readback is required\.
### 2\.4 Text\-to\-Speech \(TTS\)
Neural TTS has rapidly advanced with multilingual and voice\-cloning models such as *VITS* \[kim2021conditional\] and *XTTS* \[casanova2024xtts\], enabling natural and intelligible speech for aviation training\.
\[10765121\] introduced a TTS system for ATCO and pilot speech, fine\-tuning *VITS* and *XTTS* on ATCOSIM and multilingual pilot\-speech datasets\. Across 4100 subjective ratings, *XTTS* outperformed *VITS* in clarity, pronunciation, intonation, naturalness, and overall quality\.
Nonetheless, current ATC\-focused TTS systems still struggle with the following: 1\) generating consistent Singaporean\-accented speech, 2\) producing accurate pronunciation of aviation\-specific terms, and 3\) a lack of domain\-appropriate evaluation metrics\.
#### 2\.4\.1 Accent Robustness in TTS Models
Most aviation TTS systems struggle with underrepresented accents because training data is dominated by American or British English, limiting local phonetic and prosodic modeling\. Insufficient regional data leads to accent drift and unstable prosody \[10\.1109/TASLP\.2024\.3363414\]\. Zero\- and few\-shot models like *XTTS* mitigate this using multilingual representations for low\-resource accent transfer\.
#### 2\.4\.2 Domain Terminology in Aviation TTS
Accurate aviation radiotelephony requires TTS models to pronounce domain\-specific terms that are often missing from general corpora, and their absence causes mispronunciations and irregular pacing\. Low\-frequency or unseen tokens produce unstable pronunciation, and reliable output needs specialized lexicons or pronunciation dictionaries \[ttsdomainadapt\]\.
\[hu2019domain\] augmented training with synthetic word–pronunciation pairs and phoneme improvements, reducing errors on unseen terms, while adding domain\-specific lexicons and rules to stabilize output\.
#### 2\.4\.3 Limitations of TTS Evaluation Methods
TTS systems lack reliable, domain\-specific evaluation metrics\. Mean Opinion Scores \(MOS\) are subjective and difficult to obtain, while Mel Cepstral Distortion does not capture domain phraseology or radiotelephony timing\. ASR\-based intelligibility is also unreliable across mismatched domains \[salesky2021assessing\], and without a Singaporean\-accented aviation ASR, scores largely reflect ASR bias rather than TTS quality\. Recent work explores automated evaluators like Large Language Model \(LLM\)\-based scoring \[wang2025enabling\] and ASR\-ensemble methods combining ASR confidence, pronunciation, and acoustic similarity \[kirk2025mos\], but both still require domain\-matched or calibrated models\.
### 2\.5 AI\-Assisted Radiotelephony Performance Evaluation
Current evaluation methodologies rely on experienced instructors manually reviewing trainee communications, introducing variability and limiting throughput\.
\[aldridge2025identifying\] highlight that objective and continuous measurement of ATC performance remains relatively underexplored, emphasizing the need for structured and quantifiable evaluation methods\.
Early work by \[brudnicki2005application\] introduced Intelligent Tutoring Systems \(ITS\) to support structured evaluation in ATC training\. ITS frameworks define three key components: 1\) an Expert Model, representing expected performance, 2\) a Student Model, capturing observed trainee behavior, and 3\) an Instructor Model, supporting feedback and after\-action review \(AAR\)\. However, practical adoption remains limited, with evaluation processes still largely reliant on manual interpretation\.
To improve objectivity, rule\-based evaluation methods have been explored\. \[wu2020rulebased\] demonstrate that predefined scoring rules can provide consistent and interpretable assessments in ATC simulation environments\. Nevertheless, such approaches remain rigid and cannot effectively capture linguistic variation and contextual intent in radiotelephony communication\.
Recent advances in large language models \(LLMs\) provide new opportunities to address these limitations\. \[chiang2023llm\] show that LLMs can function as reliable evaluators when guided by structured prompts and rubrics, while \[zhang2020bertscore\] demonstrate that contextual embedding methods such as BERTScore enable semantic similarity evaluation beyond exact lexical matching\. These developments motivate hybrid evaluation approaches that combine rule\-based consistency with LLM\-as\-a\-judge to support more robust and scalable radiotelephony performance assessment\.
## 3 Technical Approach
ASTRA implements a realistic internal simulation engine with a context\-appropriate user interface, a built\-in scenario management system, and an end\-to\-end speech pipeline\.
### 3\.1 User Interface
To simulate traditional Air Traffic Control \(ATC\) interfaces with enhanced realism, ASTRA leverages the Unity game engine integrated with Cesium Ion for high\-fidelity geospatial visualization, including detailed 3D terrain and radar views that mirror real\-world ATC displays\.
Beyond the simulation environment, the ASTRA User Interface serves as a centralized platform that allows instructors to configure training scenarios and monitor trainees\.
### 3\.2 Training Scenarios
ASTRA supports two scenario modalities\. In Free\-for\-All mode, pre\-loaded aircraft create an open\-ended environment where trainees manage traffic reactively, suited for advanced users\.
In Story Mode, instructors design event\-driven scenarios using a node\-based tool\. Scenarios consist of aircraft profiles \(aircraft type, intent, route, and context\), with triggers controlling when and how aircraft are introduced, enabling dynamic sequencing that adapts to trainee actions without further intervention\.
Within instructor\-created sessions, each node represents an aircraft profile\. Connecting triggers across nodes establishes the sequence of actions that form a complete scenario\. Instructors can define an action or event in a scenario by setting three parameters:
- Event Trigger: A configurable rule attached to an aircraft in the scenario\.
- Event Type: The condition in the simulation to listen for \(e\.g\., AfterReportingWaypoint\)\. An Event Type might require additional parameters to be filled in \(e\.g\., Waypoint Name\)\.
- Event Handler: The action\(s\) that should run once the event type condition is met\.
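The trigger, type, and handler triple can be pictured as a small data structure. The sketch below is an assumption about how such a rule might be represented, not ASTRA's scenario format: the `EventTrigger` class, the `waypoint` parameter key, and the callsigns and waypoint names are hypothetical. Only the `AfterReportingWaypoint` event type name comes from the text.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class EventTrigger:
    """A rule attached to one aircraft: event type + parameters + handlers."""
    aircraft: str
    event_type: str                              # condition to listen for
    params: dict = field(default_factory=dict)   # e.g. the waypoint name
    handlers: list[Callable[[], None]] = field(default_factory=list)

    def fire_if_matched(self, event_type: str, **observed) -> bool:
        """Run all handlers when the observed event satisfies this trigger."""
        if event_type != self.event_type:
            return False
        if any(observed.get(key) != value for key, value in self.params.items()):
            return False
        for handler in self.handlers:
            handler()
        return True


# Hypothetical scenario step: once EAGLE 1 reports waypoint ALPHA,
# introduce a second aircraft into the session.
spawned = []
trigger = EventTrigger(
    aircraft="EAGLE 1",
    event_type="AfterReportingWaypoint",
    params={"waypoint": "ALPHA"},
    handlers=[lambda: spawned.append("EAGLE 2")],
)

trigger.fire_if_matched("AfterReportingWaypoint", waypoint="BRAVO")  # ignored
trigger.fire_if_matched("AfterReportingWaypoint", waypoint="ALPHA")  # fires
print(spawned)
```

Chaining scenarios then amounts to having one trigger's handler introduce the aircraft that carries the next trigger.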
### 3\.3 End\-to\-End Speech Pipeline
ASTRA implements an end\-to\-end speech pipeline to simulate the simpilot communications, which is decomposed into five sub\-modules: ASR, CIU, Response Generation, TTS, and SAM\.
#### 3\.3\.1 Automatic Speech Recognition \(ASR\)
ASTRA supports a range of frontier ASR models, such as *Whisper* Small and Large v3 and *Parakeet* TDT v3 \[sekoyan2025canary\], to transcribe speech data from ATCO trainees\. It also supports specialized ASR models such as *MERaLiON*, a model trained on Singaporean\-accented data, and *WhisperATC* Small and Large v3, a family of models trained on ATC\-specific data\.
ASTRA implements a multi\-stage ASR inference pipeline optimized for noisy, accented ATC audio\. The pipeline moves through four stages: audio preprocessing, voice activity detection, transcription, and post\-processing \([Figure 2](https://arxiv.org/html/2606.18319#S3.F2)\)\.
Figure 2: ASR inference pipeline

1. a\) Audio Preprocessing: Raw audio is resampled to 16 kHz, peak\-normalized to \-1\.0 dBFS, and passed through a high\-pass filter at 150 Hz to suppress wind rumble and plosive artefacts common in VHF radio microphones\. Noise reduction is then applied using *RNNoise* \[valin2018hybrid\], a recurrent neural network trained for real\-time speech denoising\. For this step, the audio is upsampled to 48 kHz via *soxr* for processing and then downsampled back to 16 kHz\.
2. b\) Voice Activity Detection: A *Silero VAD* \[silerovad\] model trims the audio to speech\-only regions before transcription\. Using a high confidence threshold \(0\.85\) with padding, the VAD aggressively removes leading and trailing silence, which is a known cause of *Whisper* hallucinations in quiet segments\.
3. c\) Transcription: ASTRA deploys a fine\-tuned *Whisper* Large v3 model, converted to CTranslate2 \[ctranslate2\] format via *faster\-whisper* \[fasterwhisper\] for reduced memory usage and faster inference\. Beam search is biased with 200 domain\-specific hotwords covering Singapore locations, waypoints, ATC procedures, callsigns, and confusable terms\. The ASR training corpus is mostly synthetic, generated using Chatterbox TTS and F5\-TTS\.
4. d\) Post\-Processing: To ensure correct pronunciation of Singaporean place names in the synthetic training audio, the TTS input uses simplified spoken forms \(e\.g\., "leh\-bar" for "lebar"\), while the ASR model is trained to output the standard spelling\. Post\-processing corrects edge cases where the model reproduces the spoken form instead\. A rule\-based formatter then converts the normalised text to a human\-readable format \(e\.g\., “eagle one climb flight level two eight zero” → “EAGLE 1 climb FL280”\)\.
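The post-processing stage can be illustrated with a small rule-based formatter. This is a minimal sketch that reproduces the two examples given above (the "leh-bar" correction and the "EAGLE 1 climb FL280" formatting); the rule tables, the `format_transcript` name, and the single-entry callsign list are assumptions for illustration, and a production formatter would need far more rules.

```python
import re

# Spoken digit words as they appear in normalised ASR output.
DIGIT_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "niner": "9",
}
# Simplified spoken forms mapped back to standard spellings.
SPOKEN_TO_STANDARD = {"leh-bar": "lebar"}
# Hypothetical callsign vocabulary; a real list would be much longer.
CALLSIGN_WORDS = {"eagle"}


def format_transcript(text: str) -> str:
    """Convert normalised ASR text to a compact human-readable form."""
    tokens: list[str] = []
    for word in text.lower().split():
        word = SPOKEN_TO_STANDARD.get(word, word)
        digit = DIGIT_WORDS.get(word)
        if digit is None:
            tokens.append(word.upper() if word in CALLSIGN_WORDS else word)
        elif tokens and tokens[-1].isdigit():
            tokens[-1] += digit   # merge consecutive spoken digits
        else:
            tokens.append(digit)
    joined = " ".join(tokens)
    # Collapse "flight level 280" into the conventional "FL280".
    return re.sub(r"\bflight level (\d+)\b", r"FL\1", joined)


print(format_transcript("eagle one climb flight level two eight zero"))
```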
#### 3\.3\.2 Controller Instruction Understanding \(CIU\)
ASTRA implements a two\-stage CIU processing pipeline for parsing ATC utterances, interpreting commands, and extracting key information\. To ensure low\-latency processing, all command types are stored in a *JSON* specification containing both structural definitions and keyword\-based tags\.
In Stage 1, RegEx\-based tag matching identifies likely command types to accommodate multiple actions or embedded intents\. For example, the command type CLIMB\_TO may be tagged with keywords such as climb, altitude, and increase, allowing a command like "Singapore 123, climb five thousand feet" to be accurately identified\. In Stage 2, only the filtered candidate structures are passed on, and *DSPy* \[khattab2024dspy\], in conjunction with frontier LLMs, extracts precise parameters, conditions, values, and contextual information\.
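Stage 1 can be sketched as keyword patterns compiled from the command specification. The example below is illustrative only: the `CLIMB_TO` tags come from the text, while the other command types, their tags, and the function names are hypothetical. Stage 2 (the DSPy and LLM extraction) is deliberately left out, since its prompts and signatures are not described here.

```python
import re

# Hypothetical excerpt of the JSON command specification, already parsed.
COMMAND_SPEC = {
    "CLIMB_TO": {"tags": ["climb", "altitude", "increase"]},
    "DESCEND_TO": {"tags": ["descend", "descent"]},  # assumed entry
    "TURN_HEADING": {"tags": ["turn", "heading"]},   # assumed entry
}


def compile_tag_patterns(spec: dict) -> dict[str, re.Pattern]:
    """Build one case-insensitive whole-word pattern per command type."""
    return {
        name: re.compile(
            r"\b(?:" + "|".join(re.escape(tag) for tag in entry["tags"]) + r")\b",
            re.IGNORECASE,
        )
        for name, entry in spec.items()
    }


def candidate_commands(utterance: str, patterns: dict[str, re.Pattern]) -> list[str]:
    """Stage 1: return every command type whose tags appear in the utterance."""
    return [name for name, pattern in patterns.items() if pattern.search(utterance)]


PATTERNS = compile_tag_patterns(COMMAND_SPEC)
single = candidate_commands("Singapore 123, climb five thousand feet", PATTERNS)
multi = candidate_commands(
    "Singapore 123, turn left heading 270, climb five thousand feet", PATTERNS
)
print(single, multi)
```

Returning a list rather than a single best match is what lets an utterance with several embedded instructions pass more than one candidate structure to Stage 2.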
Figure 3: Elements of an ATCO Utterance

As shown in [Figure 3](https://arxiv.org/html/2606.18319#S3.F3), each communication type has a differing structural ontology, linguistic patterns, and operational constraints\. As such, ASTRA implements CIU as two specialized modules tailored to each communication type\.
##### Radio Frequency CIU
For controller\-pilot interactions, each utterance consists of an aircraft callsign and one or more ATC commands, each containing a type, parameter list, optional conditions and values, and any operational reason\.
##### Landline CIU
For controller\-controller coordination, each utterance includes a coordination prefix and ATC command\(s\) with callsign, aircraft type, intended action, associated values, and further contextual information\. These utterances are considerably more variable because of the large set of information they may carry\.
#### 3\.3\.3 Response Generation
Output from the CIU module is then passed into the Response Generation module, which leverages *DSPy* in conjunction with frontier LLMs\. The Response Generation module is divided into two parallel pipelines: a Pilot Response Generator and a Controller Response Generator, each of which is prompted with role\-appropriate conventions and phraseology\.
#### 3\.3\.4 Text\-to\-Speech \(TTS\)
ASTRA implements a domain\-adapted text\-to\-speech \(TTS\) module to synthesize responses from pseudo\-pilots and ghost controllers\. Unlike general\-purpose TTS systems, ATC communication requires \(i\) strict pronunciation of specialized terminology, \(ii\) a recognizable Singaporean accent, \(iii\) fast but intelligible delivery, and \(iv\) stable long\-form synthesis without hallucinations\. To address these requirements, ASTRA adapts three models:
- *XTTS 2\.0*, an autoregressive multilingual voice\-cloning model with integrated vocoder\.
- *CSM* \(Conversational Speech Model; [https://hf\.co/sesame/csm\-1b](https://hf.co/sesame/csm-1b)\), a cross\-lingual text\-audio Transformer supporting contextual and expressive synthesis\.
- *Orpheus*, the third model adapted alongside *XTTS* and *CSM*\.
##### Experimental Setup
All TTS fine\-tuning and inference experiments were conducted on a workstation equipped with two NVIDIA GeForce RTX 4090 GPUs \(24 GB each\), an Intel Xeon Silver 4210 CPU, and 128 GB RAM\. Depending on the experiment, models were distributed across both GPUs \(e\.g\., parallel training runs or multi\-speaker streaming inference\)\. Mixed\-precision training and inference were enabled throughout to reduce memory usage and increase synthesis speed\.
##### Dataset Preparation
TTS models were trained on a mixed corpus constructed specifically for Singapore\-accented ATC speech\. The dataset combines \(i\) real Singaporean\-accent speech sourced from public corpora and internal recordings and \(ii\) synthetic ATC\-style speech generated using commercial systems such as Gemini \([https://ai\.google\.dev/gemini\-api/docs/speech\-generation](https://ai.google.dev/gemini-api/docs/speech-generation)\) and ElevenLabs \([https://elevenlabs\.io/app/speech\-synthesis/text\-to\-speech](https://elevenlabs.io/app/speech-synthesis/text-to-speech)\)\. Synthetic examples were used to cover a large range of callsigns, waypoints, runway identifiers, and multi\-value instructions that are difficult to obtain from natural recordings\.
To reduce overfitting and improve robustness, ASTRA applies a set of audio augmentations including speed perturbation, pitch shifting, volume scaling, silence trimming, and time\-stretching\. These augmentations compensate for the overly clean nature of synthetic speech and limited dataset size while increasing the variability of prosody and timing patterns in the training corpus\.
The final training corpus comprised Singaporean\-accent conversational speech, mixed Singaporean\-accent plus synthetic ATC speech, and internal Singaporean\-accented aviation speech, supplemented with European ATC data only for auxiliary comparisons\.
##### Accent Adaptation
Accent adaptation was necessary due to the lack of native Singaporean\-accent ATC data\. A multi\-stage strategy was used\. First, conversational Singaporean\-accent recordings were combined with scripted ATC commands designed to emphasize local vowel realizations, syllable\-timing patterns, and pitch contours characteristic of Singapore English\. Second, synthetic ATC speech generated with Singapore\-accent settings was incorporated to expand coverage of domain\-specific terminology\. Third, augmentations were applied to induce prosodic variability and prevent models from learning overly rigid or monotonic speech patterns\.
Parameter\-efficient fine\-tuning using LoRA \[hu2022lora\] was applied to XTTS, CSM, and Orpheus\. This allowed the models to acquire Singaporean\-accent phonetic cues and ATC timing conventions while preserving their multilingual generalization abilities\. Multiple datasets \(single\-speaker, multi\-speaker, conversational, and ATC\-specific\) were evaluated to examine the impact of speaker consistency on accent stability and pronunciation accuracy\.
##### Terminology and Text Processing
ATC communication contains specialized terminology such as callsigns, waypoints, runway identifiers, and ICAO\-standard phrasing\. Because these items are largely absent from public speech corpora, ASTRA expands terminology coverage through a combined text\-normalization and synthetic data construction process:
- Custom text\-normalization rules that convert transcripts into ICAO\-compliant spoken forms \(for example digit\-by\-digit numbers, flight levels, callsign expansions, and runway labels\)\.
- Synthetic creation of missing or rare aviation terms using *Gemini* and *ElevenLabs*, including callsigns, waypoint names, and local aerodromes\.
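To make the text-normalization step concrete, the sketch below expands flight levels, runway labels, and bare numbers into digit-by-digit spoken forms before synthesis. It is an illustration under stated assumptions: the regular expressions, the `normalize_for_tts` name, and the plain English digit words are choices made here, and the actual rule set (including callsign expansions and any ICAO-specific pronunciations) is not specified in the text.

```python
import re

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]
RUNWAY_SUFFIX = {"L": "left", "R": "right", "C": "centre"}


def spell_digits(digits: str) -> str:
    """Read a digit string one digit at a time."""
    return " ".join(DIGIT_WORDS[int(d)] for d in digits)


def normalize_for_tts(text: str) -> str:
    """Expand compact ATC notation into spoken forms for the TTS input."""
    # Flight levels: "FL280" -> "flight level two eight zero".
    text = re.sub(
        r"\bFL(\d{2,3})\b",
        lambda m: "flight level " + spell_digits(m.group(1)),
        text,
    )
    # Runway labels: "02L" -> "zero two left".
    text = re.sub(
        r"\b(\d{2})([LRC])\b",
        lambda m: spell_digits(m.group(1)) + " " + RUNWAY_SUFFIX[m.group(2)],
        text,
    )
    # Any remaining number is read digit by digit.
    return re.sub(r"\d+", lambda m: spell_digits(m.group(0)), text)


spoken = normalize_for_tts("EAGLE 1 climb FL280, runway 02L")
print(spoken)
```

The order of the rules matters: the flight-level and runway patterns must run before the generic digit rule, otherwise their digits would already have been expanded.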
##### Training Hyperparameters
*XTTS* was trained using the Coqui framework, while *CSM* and *Orpheus* were fine\-tuned using the Unsloth framework\. Across all models, we used the AdamW optimiser, a linear warm\-up over the first 1–3% of steps, followed by cosine learning\-rate decay\.
Batch sizes were constrained by model size\. *Orpheus* and *CSM* were trained with global batch sizes of 16–24 using gradient accumulation, whereas XTTS used an effective batch size of 32\. Gradient clipping at 1\.0 was applied throughout to prevent divergence during the early fine\-tuning phase\.
All models were trained for approximately 10–20 epochs, depending on convergence on the Singapore\-accent ATC dataset\. This range was chosen to balance adaptation and overfitting risks, given the limited volume of domain\-specific data\. Validation loss was tracked after every 100 steps, and the checkpoint with minimum validation loss was selected for evaluation\. Across runs, loss curves stabilized after roughly 8–12 epochs, with diminishing returns beyond that point\.
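The linear warm-up followed by cosine decay described above can be written as a single function of the step index. This is a generic sketch of that schedule shape, not the training configuration used for ASTRA: the peak learning rate and step count in the example are placeholders (the text does not report them), and the warm-up fraction of 2% is one value from the stated 1–3% range.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float, warmup_frac: float = 0.02) -> float:
    """Learning rate with linear warm-up, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp linearly so the last warm-up step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Placeholder values for illustration only.
TOTAL_STEPS = 1000
PEAK_LR = 1e-4
schedule = [lr_at(s, TOTAL_STEPS, PEAK_LR) for s in range(TOTAL_STEPS + 1)]
print(schedule[0], max(schedule), schedule[-1])
```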
##### Generation and Streaming
ASTRA generates real\-time TTS audio via chunk\-based synthesis, where each instruction is split into small, overlapping segments that are generated in order \(with parallel generation implemented for *XTTS*\)\. These partial audio segments are streamed to the frontend immediately via *WebSockets*, reducing end\-to\-end latency\.
Timestamps returned by the backend are used to align chunks and reconstruct a continuous stream, with short fades at chunk boundaries to avoid clicks or abrupt transitions\.
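The boundary fades can be illustrated with a linear crossfade over the overlapping samples of consecutive chunks. This sketch makes two simplifying assumptions that are not from the text: the overlap is given as a fixed sample count rather than derived from backend timestamps, and audio is represented as plain lists of floats. `crossfade_concat` is a hypothetical helper name.

```python
def crossfade_concat(chunks: list[list[float]], overlap: int) -> list[float]:
    """Join audio chunks, blending `overlap` samples at each boundary."""
    if not chunks:
        return []
    out = list(chunks[0])
    for chunk in chunks[1:]:
        n = min(overlap, len(out), len(chunk))
        start = len(out) - n
        for i in range(n):
            # Fade-in weight rises across the overlap; fade-out is its complement.
            w = (i + 1) / (n + 1)
            out[start + i] = out[start + i] * (1.0 - w) + chunk[i] * w
        out.extend(chunk[n:])  # append the non-overlapping remainder
    return out


# Two constant chunks stay constant through the blended region,
# so no click is introduced at the boundary.
steady = crossfade_concat([[1.0] * 4, [1.0] * 4], overlap=2)

# A loud chunk followed by silence ramps down instead of cutting off.
ramp = crossfade_concat([[1.0, 1.0], [0.0, 0.0, 0.0]], overlap=2)
print(len(steady), ramp)
```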
### 3\.4 Simulation of Aircraft Movement \(SAM\)
ASTRA operates at a fixed polling rate $\Delta t$, simulating radar sweeps occurring at regular intervals\. Each aircraft is defined by a set of performance parameters \(e\.g\., climb rate, cruise speed, ceiling, and runway requirements\), which are used to accurately replicate real\-world dynamics\. At each interval, the SAM module computes the aircraft state, represented by the current state vector $S_t$ and its target trajectory $T_t$, defined in Eq\. \([1](https://arxiv.org/html/2606.18319#S3.E1)\) and \([2](https://arxiv.org/html/2606.18319#S3.E2)\) respectively:
$$
S_t = \begin{bmatrix}
\phi_t \text{ (latitude)} \\
\lambda_t \text{ (longitude)} \\
h_t \text{ (altitude)} \\
\psi_t \text{ (heading)} \\
V_{\text{IAS},t} \text{ (IAS)} \\
V_{s,t} \text{ (vertical speed)} \\
\theta_t \text{ (bank angle)}
\end{bmatrix} \tag{1}
$$

$$
T_t = \begin{bmatrix}
h_{\text{target},t} \text{ (target altitude)} \\
\psi_{\text{target},t} \text{ (target heading)} \\
V_{\text{IAS,target},t} \text{ (target IAS)} \\
\delta_{\text{turn},t} \text{ (turn rate)}
\end{bmatrix} \tag{2}
$$
When STPs are received by the SAM module, it computes a new aircraft state $S_{t+\Delta t}$ using deterministic motion models\. This systematic update loop ensures that aircraft follow realistic trajectories while responding dynamically to control instructions or scenario\-driven goals\. The update process for heading and altitude is governed by the rules described in Eq\. \([3](https://arxiv.org/html/2606.18319#S3.E3)\) through \([6](https://arxiv.org/html/2606.18319#S3.E6)\)\.
#### Heading
$$
\dot{\psi}_t = \frac{1091 \tan(\theta_t)}{V_{\text{IAS},t}} \quad \text{(or constant if } \theta_t = 0\text{)} \tag{3}
$$

$$
\Delta\psi_t = \operatorname{sign}\left(\delta_{\text{turn},t}\right) \cdot \min\left(\left|\dot{\psi}_t \Delta t\right|, \left|\delta_{\text{turn},t}\right|\right) \tag{4}
$$
#### Altitude
$$
V_{s,t} = \begin{cases}
\min\left(\dot{h}_{\text{climb}}, \dfrac{h_{\text{target},t} - h_t}{\Delta t}\right) & h_{\text{target},t} \geq h_t \\
-\min\left(\dot{h}_{\text{descent}}, \dfrac{h_t - h_{\text{target},t}}{\Delta t}\right) & h_{\text{target},t} < h_t
\end{cases} \tag{5}
$$

$$
\Delta h_t = V_{s,t} \Delta t \tag{6}
$$
where $\dot{h}_{\text{climb}}$ and $\dot{h}_{\text{descent}}$ are predefined parameters\.
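Eq. (3) through (6) translate directly into a per-tick update. The sketch below follows the equations term by term under several labelled assumptions: heading and bank angle are in degrees, IAS is in knots (the usual units for the 1091 constant), climb and descent limits are in feet per second so that Eq. (6) is dimensionally consistent with $\Delta t$ in seconds, the "constant" rate at zero bank is taken to be 3 degrees per second, and the heading is wrapped to [0, 360). None of these unit or default choices are stated in the text, and the class and function names are hypothetical.

```python
import math
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AircraftState:
    heading_deg: float   # psi_t
    altitude_ft: float   # h_t
    ias_kt: float        # V_IAS,t
    bank_deg: float      # theta_t


def heading_step(state: AircraftState, delta_turn_deg: float, dt_s: float,
                 default_rate_deg_s: float = 3.0) -> float:
    """Eq. (3) and (4): heading change for one tick."""
    if state.bank_deg == 0:
        rate = default_rate_deg_s  # assumed constant when bank angle is zero
    else:
        rate = 1091.0 * math.tan(math.radians(state.bank_deg)) / state.ias_kt
    magnitude = min(abs(rate * dt_s), abs(delta_turn_deg))
    return math.copysign(magnitude, delta_turn_deg)


def vertical_speed(state: AircraftState, target_alt_ft: float, dt_s: float,
                   climb_fps: float, descent_fps: float) -> float:
    """Eq. (5): rate-limited vertical speed that never overshoots the target."""
    if target_alt_ft >= state.altitude_ft:
        return min(climb_fps, (target_alt_ft - state.altitude_ft) / dt_s)
    return -min(descent_fps, (state.altitude_ft - target_alt_ft) / dt_s)


def step(state: AircraftState, target_alt_ft: float, delta_turn_deg: float,
         dt_s: float, climb_fps: float, descent_fps: float) -> AircraftState:
    """Advance heading and altitude by one polling interval."""
    d_psi = heading_step(state, delta_turn_deg, dt_s)
    v_s = vertical_speed(state, target_alt_ft, dt_s, climb_fps, descent_fps)
    return replace(
        state,
        heading_deg=(state.heading_deg + d_psi) % 360.0,
        altitude_ft=state.altitude_ft + v_s * dt_s,  # Eq. (6)
    )


start = AircraftState(heading_deg=90.0, altitude_ft=5000.0, ias_kt=250.0, bank_deg=25.0)
after = step(start, target_alt_ft=5100.0, delta_turn_deg=30.0,
             dt_s=4.0, climb_fps=30.0, descent_fps=20.0)
print(after)
```

In this example the 100 ft still to climb is less than the 120 ft the climb limit would allow in one tick, so the `min` in Eq. (5) levels the aircraft off exactly at the target altitude.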
### 3\.5 AI\-Assisted Radiotelephony Performance Evaluation
ASTRA incorporates an AI\-assisted performance evaluation framework designed to standardize and scale the assessment of trainee radiotelephony \(RT\) communications\. The framework analyzes commands at the utterance level to generate objective quantitative scores and actionable qualitative feedback for after\-action reviews\.
##### Evaluation Scope and Design Principles
The proposed evaluation framework focuses on aspects of ATCO performance that can be objectively inferred from radiotelephony communications and simulator state:
- Communication Correctness: adherence to ICAO\-standard phraseology and structure
- Operational Validity: alignment of instructions with safety constraints and flight trajectories
Higher\-order competencies \(e\.g\., planning ability, conflict resolution\) are outside the current scope, as they require temporal reasoning and direct observation of controller behavior beyond RT transcripts\.
##### Evaluation Workflow
Each trainee command is processed through a hybrid architecture combining deterministic rules and LLMs \([Figure 4](https://arxiv.org/html/2606.18319#S3.F4)\):

Figure 4: Evaluation pipeline

1. Speech Transcription: Trainee speech is transcribed into text using the ASR module
2. Structured Extraction: The CIU module extracts STPs, including command type, values, and contextual attributes
3. Hybrid Evaluation: The extracted command is evaluated using two complementary components:
   - Rule\-Based Evaluation: Deterministic rules assess compliance with ICAO phraseology, structure, and numerical correctness
   - LLM\-Based Analysis: A DSPy\-optimized LLM evaluates semantic correctness, contextual intent, and linguistic nuances not captured by rules
4. Score Aggregation and Feedback Generation: Outputs from both components are combined into metric\-specific scores and explanatory feedback for after\-action review \(AAR\)
To enhance robustness, BERT\-based contextual embeddings are used to measure semantic similarity between trainee and expected phraseology\. These representations capture meaning beyond exact keyword matching, enabling the system to recognize correct intent even when phrasing varies\.
This semantic similarity layer complements the DSPy\-optimized LLM analysis by providing a structured and consistent similarity signal, reducing over\-reliance on generative reasoning alone and improving evaluation stability\.
This hybrid approach ensures both precision \(through rule enforcement\) and flexibility \(through contextual reasoning\) across highly structured scenarios and semi\-open ATC scenarios, where task objectives are fixed but variations in phraseology and response formulation are permitted\.

| Metric | Description |
|---|---|
| Accuracy | Correct use of standard phraseology and values |
| Brevity | Concise transmission, free of redundancy |
| Completeness | Includes all necessary elements |
| Safety Adherence | Maintain safe separation and avoid conflicts |
| Route Compliance | Follows assigned trajectory |

Table 1: Core performance metrics used in the ASTRA evaluation framework
##### Performance Metrics
ASTRA evaluates trainee performance across five core metrics, each scored independently on a scale of 0–100\.
##### Communication Metrics
The first three metrics evaluate the linguistic and structural quality of RT communications\.
- Accuracy: Measures adherence to ICAO phraseology and correctness of operational values\. Rule\-based penalties are applied for incorrect terminology and critical safety\-impacting errors, while the LLM identifies subtle deviations in phrasing and intent\.
- Brevity: Evaluates whether transmissions are concise and free of redundancy\. This includes analysis of word count deviation, speaking rate, filled pauses, and repeated information\. Semantic similarity checks ensure that concise commands preserve intended meaning\.
- Completeness: Assesses whether all required parameters for a given command type are present and correctly structured\. A slot\-based validation mechanism verifies the presence and ordering of mandatory elements based on the intent, while LLMs identify context\-dependent omissions\.
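The slot-based validation described under Completeness can be illustrated with a short sketch. The slot schemas, command types, and function name below are hypothetical and chosen only for the example; the paper does not list the actual slot definitions.

```python
# Hypothetical schemas: mandatory slots, in expected order, for each command type.
REQUIRED_SLOTS = {
    "climb": ["callsign", "action", "altitude"],
    "heading": ["callsign", "action", "direction", "heading"],
}


def check_completeness(command_type, extracted_slots):
    """Return (missing_slots, in_expected_order) for an ordered list of slot names."""
    required = REQUIRED_SLOTS[command_type]
    missing = [s for s in required if s not in extracted_slots]
    # Compare the order in which mandatory slots were spoken with the schema order.
    spoken = [s for s in extracted_slots if s in required]
    expected = [s for s in required if s in extracted_slots]
    return missing, spoken == expected


print(check_completeness("climb", ["callsign", "action", "altitude"]))
print(check_completeness("climb", ["action", "callsign"]))
```

A deterministic check of this kind reports which mandatory elements are absent and whether those present are in order, leaving context-dependent omissions to the LLM component.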
##### Operational Metrics
In addition to communication quality, ASTRA evaluates whether the instructions issued are operationally valid within the simulation environment\.
- Safety Adherence: Measures whether commands maintain required aircraft separation minima based on aviation safety standards\. Two primary criteria are considered: vertical separation \(≥ 500 ft\) and horizontal separation \(≥ 3 NM\)\. A separation violation is detected when either minimum is not satisfied\. Separation events are evaluated across aircraft pairs, and a configurable time\-to\-correct \(TTC\) window allows trainees to resolve conflicts before penalties are applied\. Penalties are assigned based on severity \(near\-miss, major breach, critical breach\), with unresolved violations resulting in greater score reductions\.
- Route Compliance: Evaluates whether aircraft trajectories remain consistent with assigned routes and scenario constraints\. Deviations such as missed waypoints and off\-track movement are detected using configurable tolerance thresholds\. A TTC window is likewise applied to allow corrective instructions before penalties are enforced\.
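The Safety Adherence check above can be sketched as a pairwise test followed by a TTC filter. This is an illustrative sketch only: the flat-plane coordinates in nautical miles, the tuple layout, the function names, and the way the TTC window is measured are assumptions, and the violation rule follows the wording of the text (either minimum unmet).

```python
import math

VERTICAL_MIN_FT = 500.0    # vertical separation minimum from the text
HORIZONTAL_MIN_NM = 3.0    # horizontal separation minimum from the text


def separation_violated(a, b):
    """a, b are (x_nm, y_nm, altitude_ft); flags a pair when either minimum is unmet."""
    horizontal = math.hypot(a[0] - b[0], a[1] - b[1])
    vertical = abs(a[2] - b[2])
    return horizontal < HORIZONTAL_MIN_NM or vertical < VERTICAL_MIN_FT


def penalty_due(violation_flags, dt, ttc):
    """True if a violation persists for longer than the TTC window (seconds)."""
    run = 0.0
    for flag in violation_flags:
        run = run + dt if flag else 0.0  # reset once the trainee resolves the conflict
        if run > ttc:
            return True
    return False


print(separation_violated((0.0, 0.0, 5000.0), (10.0, 0.0, 6000.0)))  # False
print(penalty_due([True, True, False, True], dt=5.0, ttc=10.0))      # False
```

Severity grading (near-miss, major breach, critical breach) would sit on top of this, for example by binning the smallest observed separation, but the paper does not give those thresholds.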
##### Hybrid Scoring Strategy
The evaluation framework employs a penalty\-based scoring system, where deviations from expected behavior reduce the metric score according to severity\. Rule\-based penalties ensure consistent handling of deterministic errors, while LLM\-generated feedback provides contextual explanations and identifies nuanced issues\.
LLM outputs are constrained using structured prompting via DSPy, enabling consistent output formats and reducing variability across evaluations\.
The final session score is computed in Eq\. \([7](https://arxiv.org/html/2606.18319#S3.E7)\) as a weighted aggregation of these metrics, where safety and correctness metrics are assigned higher importance to reflect operational priorities:
$$
S=100-\sum_{i=1}^{n}w_{i}P_{i}\tag{7}
$$

where $P_{i}$ represents the penalties incurred from individual deviations and $w_{i}$ denotes the corresponding weight assigned to each metric\.
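As a worked illustration of Eq. (7), the sketch below aggregates per-metric penalties into a session score. The weights and penalty values are invented for the example (the paper does not report them); they only reflect the stated idea that safety and correctness carry more weight.

```python
def session_score(penalties, weights):
    """Eq. (7): S = 100 - sum_i w_i * P_i."""
    if len(penalties) != len(weights):
        raise ValueError("penalties and weights must have the same length")
    return 100.0 - sum(w * p for w, p in zip(weights, penalties))


# Hypothetical order: accuracy, brevity, completeness, safety adherence, route compliance.
weights = [0.25, 0.10, 0.15, 0.30, 0.20]
penalties = [10.0, 5.0, 0.0, 20.0, 4.0]
print(round(session_score(penalties, weights), 1))
```

Here the weighted penalties are 2.5, 0.5, 0, 6.0 and 0.8, which sum to 9.8, so the score is about 90.2. Eq. (7) as written has no lower bound, so whether scores are clipped at zero is left open here.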
##### Session\-Level Feedback and Replay System
Beyond per\-command evaluation, ASTRA aggregates results across a training session to generate a comprehensive post\-session performance report\. This includes:
- Timeline of all trainee commands
- Individual metric score breakdown
- Identified strengths and errors
- Natural language feedback for improvement

To support detailed analysis, the system integrates synchronized audio playback and visual replay of aircraft movements\. This allows trainees and instructors to review decisions in context, bridging communication performance with operational outcomes\.
By combining structured rule\-based validation, BERT\-based semantic similarity, and LLM\-driven contextual analysis, the proposed framework bridges the gap between linguistic correctness and operational intent\. This enables consistent, explainable, and scalable evaluation of ATCO radiotelephony performance, reducing reliance on subjective instructor judgment while enhancing training quality\.
| Model | Params | ATCOSIM WER | ATCOSIM CER | MNSC ASR Part 1 WER | MNSC ASR Part 1 CER | SG-Aviation WER | SG-Aviation CER |
|---|---|---|---|---|---|---|---|
| *Whisper* (Large) | 1.5B | 80.49 | 53.81 | 24.44 | 7.07 | 78.81 | 35.68 |
| *Whisper* (Small) | 0.2B | 91.05 | 61.51 | 27.05 | 8.67 | 107.80 | 80.57 |
| *Parakeet* | 0.6B | 53.22 | 23.07 | 21.90 | 6.07 | 63.12 | 25.83 |
| *WhisperATC* (Large) | 1.5B | 22.51 | 2.17 | 35.61 | 21.24 | 51.16 | 24.33 |
| *WhisperATC* (Small) | 0.2B | 54.33 | 19.79 | 18.68 | 7.34 | 40.93 | 16.52 |
| *MERaLiON 2* | 10B | 48.63 | 19.56 | 19.48 | 11.44 | 73.95 | 36.11 |
| *Whisper* (Fine-tuned) | 1.5B | 21.67 | 13.25 | 21.17 | 17.43 | 19.08 | 15.51 |
| ASTRA ASR Pipeline ([Figure 2](https://arxiv.org/html/2606.18319#S3.F2)) | 1.5B | 20.72 | 14.14 | 19.63 | 15.19 | 14.45 | 11.62 |

Table 2: ASR performance across datasets (WER/CER, in %)
## 4 Experimental Results
To assess the effectiveness of the proposed methods, we conducted a series of evaluations using task\-appropriate qualitative and quantitative indicators\.
### 4\.1 Automatic Speech Recognition \(ASR\)
The performance of several ASR models was evaluated on the following public and internal datasets using word error rate \(WER\) and character error rate \(CER\):
- ATCOSIM \[hofbauer2008atcosim\], which consists predominantly of Western ATCO speech
- Multi\-Task National Speech Corpus \(MNSC\) \[wang2025advancing\] ASR Part 1, which consists of Singaporean\-accented speech
- SG\-Aviation \(internal dataset\), which consists of Singaporean\-accented ATCO speech
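For readers unfamiliar with the two metrics, WER and CER are both edit distances normalised by the reference length, computed over words and characters respectively. The sketch below is a minimal pure-Python version with hypothetical function names; it applies no text normalisation (casing, punctuation, number formatting), which a real evaluation would normally fix beforehand.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (0 if tokens match)
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent."""
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


ref = "climb flight level one two zero"
hyp = "climb flight level one too zero"
print(round(wer(ref, hyp), 2), round(cer(ref, hyp), 2))
```

Because insertions count as errors while the denominator is the reference length, WER can exceed 100% when the hypothesis contains many spurious words.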
As seen in [Table 2](https://arxiv.org/html/2606.18319#S3.T2), base *Whisper* models perform poorly across all datasets, indicating the need for accent\- and domain\-specific adaptation\. *WhisperATC* Small remains the strongest evaluated model for Singaporean\-accented speech, while *MERaLiON* achieves competitive results on ATCOSIM and MNSC but degrades on SG\-Aviation\.
Based on these evaluations, *Whisper* Large v3 was selected as the base model and fine\-tuned on a combined corpus of ATCOSIM, Singaporean\-accented ATC recordings, and augmented synthetic data\. The fine\-tuned model and full ASR pipeline \([Figure 2](https://arxiv.org/html/2606.18319#S3.F2)\) achieve the lowest SG\-Aviation WER of 26\.08% and 23\.45% respectively\. Error rates on the out\-of\-domain ATCOSIM and MNSC datasets are higher, showing the trade\-off of domain\-specific adaptation\. While specialized models exhibit high variance across datasets, the ASTRA pipeline maintains a consistent error profile within the 20–35% range, showing its stability across both Western and local aviation contexts\.
### 4\.2 Text\-to\-Speech \(TTS\)
TTS models were evaluated for real\-time ATC simulation on perceptual quality, using Mean Opinion Scores \(MOS\) and A/B comparisons\. Audio clips were anonymized as *Model A*, *B*, and *C* to avoid recognition bias\.
| Model | Source | Clarity | Pron. | Prosody | Naturalness | Overall | Gender Acc. (%) | Male | Female | # Ratings |
|---|---|---|---|---|---|---|---|---|---|---|
| XTTS | Human | 4.06 | 3.81 | 3.74 | 3.79 | 3.71 | 98.41 | 31 | 32 | 63 |
| CSM | Human | 3.52 | 3.50 | 3.27 | 3.44 | 3.21 | 100.00 | 35 | 27 | 62 |
| Orpheus | Human | 3.95 | 3.87 | 3.77 | 3.54 | 3.70 | 96.83 | 34 | 29 | 63 |
| XTTS | LLM | 4.13 | 4.57 | 3.36 | 3.07 | 3.37 | 92.68 | 205 | 205 | 410 |
| CSM | LLM | 4.10 | 4.57 | 3.41 | 3.10 | 3.44 | 95.15 | 205 | 207 | 412 |
| Orpheus | LLM | 4.22 | 4.67 | 3.51 | 3.19 | 3.59 | 95.60 | 205 | 204 | 409 |

Table 3: Comparison of human and LLM\-based MOS scores for TTS models\.

| Comparison | Option 1 (%) | Option 2 (%) | No Pref. (%) | # Ratings |
|---|---|---|---|---|
| XTTS vs CSM | 43.3 | 30.0 | 26.7 | 30 |
| XTTS vs Orpheus | 44.9 | 34.7 | 20.4 | 49 |
| CSM vs Orpheus | 34.0 | 53.2 | 12.8 | 47 |
| Male vs Female (XTTS) | 35.7 | 40.5 | 23.8 | 42 |
| Male vs Female (CSM) | 19.4 | 63.9 | 16.7 | 36 |
| Male vs Female (Orpheus) | 22.2 | 44.4 | 33.3 | 36 |

Table 4: A/B preference test results for model\-level and gender\-level comparisons\.

##### Mean Opinion Scores \(MOS\)
MOS was rated on five dimensions using male and female voices on a diverse set of ATC utterances\. Ratings were provided by \(i\) 21 aviation personnel familiar with ICAO radiotelephony, and \(ii\) an automated LLM evaluator \(*Gemini 2\.5 Flash*\) to provide robustness and highlight divergences between human and model perception\. Both followed brief guidelines and rated each dimension on a 1–5 scale \(5 being optimal\):
1. Clarity: intelligibility of all spoken content
2. Pronunciation: correctness of terminologies
3. Prosody: naturalness of rhythm and pacing
4. Naturalness: closeness to human speech
5. Overall Quality: overall quality, reflecting other dimensions and any artifacts

Each human rater evaluated 9 audio clips covering all models and both genders, with optional annotations for gender mismatch or comments\.
As shown in [Table 3](https://arxiv.org/html/2606.18319#S4.T3), *XTTS* and *Orpheus* achieve the highest overall human MOS scores \(3\.71 and 3\.70 respectively\)\. *XTTS* leads on clarity, making it well\-suited for intelligibility\-critical ATC communications, while *Orpheus* scores higher on prosody and pronunciation\. *CSM* trails both models across all dimensions, largely due to truncations and incomplete utterance endings that disrupted perceived quality\.
LLM\-based scores rank the models differently: *Orpheus* scores highest at 3\.59, followed by *CSM* \(3\.44\) and *XTTS* \(3\.37\)\. Notably, *XTTS* exhibits the largest human–LLM divergence \(3\.71 vs\. 3\.37\), suggesting that the automated evaluator underweights the prosodic and acoustic qualities that human raters, particularly those accustomed to ATC speech, find most salient\. Gender accuracy is high across all three models, exceeding 96% in every case for human raters and 92% for the LLM evaluator\.
##### A/B Comparison Tests
Pairwise A/B tests were conducted with human evaluators to capture relative preferences\. Two comparison types were used: \(i\) Model A/B Comparisons: pairs of clips generated from the same script and gender, enabling direct comparison of model behavior\. \(ii\) Gender Comparisons: pairs comparing male and female voices produced by the same model\.
Raters selected their preferred clip for each pair and could optionally provide a brief justification\.
[Table 4](https://arxiv.org/html/2606.18319#S4.T4) shows a clear preference for both *XTTS* and *Orpheus* over *CSM*\. *XTTS* is favored over *CSM* 43\.3% vs 30\.0%, and *Orpheus* shows an even stronger advantage at 53\.2% vs 34\.0%, aligning with *CSM*’s lower MOS scores and frequent truncations\. Between *XTTS* and *Orpheus*, *XTTS* is preferred \(44\.9% vs 34\.7%\) for clarity, while *Orpheus* is noted for smoother pacing\. In gender comparisons, female voices are preferred across all models, with comments noting smoother and more natural delivery for the female voices of *XTTS* and *Orpheus*\.
### 4\.3 AI\-Assisted Radiotelephony Performance Evaluation
To evaluate the effectiveness of the proposed communication assessment module, we conducted experiments on three key communication factors: accuracy, brevity, and completeness\. Each evaluator was optimized using the DSPy framework to compare two strategies, BootstrapFewShotWithRandomSearch \(hereafter referred to as RS\) and MIPROv2\. Both methods leverage a teacher model to bootstrap high\-quality demonstrations; MIPROv2 additionally optimizes the underlying task instructions jointly with these demonstrations\.
All evaluators were trained on labeled examples and evaluated on held\-out development sets\. Performance was measured using inverted Mean Absolute Error \(MAE\), where higher scores indicate better alignment with ground truth grading\. For brevity, BERTScore \(F1\) was additionally used to capture semantic similarity between generated and expected radiotelephony, ensuring both linguistic correctness and scoring consistency\.
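The paper does not spell out how MAE is inverted, so the sketch below shows one plausible reading: scores lie on a 0–100 scale and the metric is `100 * (1 - MAE / 100)`, so that perfect agreement with the ground-truth grades gives 100. The function name, the formula, and the sample grades are all assumptions for illustration.

```python
def inverted_mae(predicted, reference, scale=100.0):
    """Assumed definition: 100 * (1 - MAE / scale), so higher means closer agreement."""
    if not predicted or len(predicted) != len(reference):
        raise ValueError("inputs must be non-empty and of equal length")
    mae = sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)
    return 100.0 * (1.0 - mae / scale)


# Hypothetical evaluator outputs vs. instructor grades for three utterances.
print(round(inverted_mae([90.0, 70.0, 80.0], [100.0, 60.0, 85.0]), 2))
```

Under this reading, a reported score of 91.7% would correspond to an average grading gap of about 8.3 points on a 100-point scale.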
| Evaluator | Baseline (%) | Compiled (%) | Improvement |
|---|---|---|---|
| Accuracy | 83.8 | 91.7 | +7.9 |
| Brevity | 85.5 | 89.7 | +4.2 |
| Completeness | 81.5 | 88.1 | +6.6 |

Table 5: Performance of Communication Evaluators Before and After DSPy Optimization

[Table 5](https://arxiv.org/html/2606.18319#S4.T5) shows prompt optimization improved performance across all evaluators, with the largest gain observed in accuracy \(\+7\.9%\)\.
| Evaluator | FewShot (%) | FewShot+RS (%) | MIPROv2 (%) |
|---|---|---|---|
| Accuracy | 83.8 | 91.7 | 88.2 |
| Brevity | 85.5 | 88.2 | 89.7 |
| Completeness | 81.5 | 86.9 | 88.1 |

Table 6: Comparison of DSPy Optimization Strategies across Communication Factors

To further evaluate the impact of different optimization strategies, [Table 6](https://arxiv.org/html/2606.18319#S4.T6) compares standard BootstrapFewShot, BootstrapFewShot with random search \(RS\), and MIPROv2\. While RS achieved the peak score for accuracy, MIPROv2 yielded the best results for brevity and completeness\. This suggests that instruction\-aware optimization is particularly beneficial for stylistic and multi\-parameter assessments, whereas random search excels in calibrating purely phraseology\-based scoring\.
## 5 Limitations & Future Work
While ASTRA aims to provide a realistic ATCO training simulator with an in\-built end\-to\-end communications pipeline, we acknowledge some limitations that will be addressed as feature enhancements in future work\.
##### ASR Performance
The fine\-tuned model is trained primarily on Singaporean\-accented ATC data, limiting generalization to other airspaces, accents, and languages\. Future work will expand the training corpus with international ATC recordings, radio static profiles, and filler words \("uh", "um"\) to improve noise robustness and filler detection accuracy for performance evaluation\. Techniques such as contextual biasing via *TCPGen* \[sun2023tcpgen\] and keyword spotting will also be explored to further improve recognition of domain\-specific terminology\.
##### Live Scenario Injects
ASTRA currently supports a set of predefined scenarios\. Future work will allow instructors to inject live edits into running scenarios, making them more customized and challenging\.
##### TTS Hallucinations
Long text inputs caused our TTS models to produce errors such as jumbled phrasing and cut\-offs\. Noting that autoregressive TTS models are unstable on long sequences and that chunk\-wise decoding reduces alignment drift \[li2025robustefficientautoregressivespeech\], we plan to apply inference\-time text chunking as a simple, practical first step, with more complex methods considered only if needed\.
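A minimal sketch of inference-time text chunking is shown below, assuming that splitting at sentence boundaries and packing sentences up to a character budget is acceptable. The function name, the `max_chars` budget, and the regex-based sentence splitter are illustrative choices, not the planned implementation.

```python
import re


def chunk_text(text, max_chars=120):
    """Pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept intact as its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # budget exceeded: close the current chunk
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


message = ("Singapore one two three, climb flight level one two zero. "
           "Turn left heading two seven zero. "
           "Contact approach one one niner decimal three.")
for chunk in chunk_text(message, max_chars=60):
    print(chunk)
```

Each chunk would then be synthesized separately and the audio concatenated, keeping every TTS call within a sequence length the model handles reliably.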
##### TTS Evaluation Constraints
The human MOS evaluation was conducted with 21 raters, each evaluating 9 clips, a sample size that limits the statistical robustness of the results\. Future evaluations should recruit a larger and more diverse pool of aviation personnel to improve confidence in perceptual quality rankings\. The automated LLM evaluator provides useful signal at scale but diverges from human judgement on prosodic and acoustic dimensions, as seen in the XTTS human–LLM gap in [Table 3](https://arxiv.org/html/2606.18319#S4.T3)\. Calibrating such evaluators against a larger human baseline remains an important direction for future work\.
##### Adaptive Scenario Generation
Rather than relying solely on fixed scenarios, future versions of ASTRA will support the automatic generation of training situations based on trainee performance\. By analyzing real\-time interactions and RT performance metrics, the system will identify performance gaps and dynamically introduce scenarios that target specific weaknesses\. For example, a trainee demonstrating difficulty with altitude management may be presented with scenarios of increased altitude\-control complexity\. This creates a closed\-loop training environment that continuously adapts to the trainee’s competency, reinforcing learning without requiring direct instructor intervention\.
## 6 Conclusion
In this paper, we propose ASTRA, a training simulator enabling autonomous ATCO training without simpilots by combining an end\-to\-end speech pipeline with advanced speech recognition, custom text\-to\-speech models, and systems for parsing and responding to ATC utterances\. Designed for the Singaporean context, ASTRA addresses reliance on simpilots and localization limitations, enabling scalable, high\-fidelity, instructor\-independent training\.
## Acknowledgment
This paper is made possible with the support of RSAF Agile Innovation Digital \(RAiD\), Republic of Singapore Air Force, Singapore\.
## References