Tag
This paper proposes a safety-oriented, consequence-aware evaluation framework for large language models in Air Traffic Control and shows that high aggregate accuracy masks significant reliability issues in handling high-risk semantic errors.