Evaluating Explainability in Safety-Critical ATR Systems: Limitations of Post-Hoc Methods and Paths Toward Robust XAI

arXiv cs.AI Papers

Summary

This paper evaluates explainability methods in safety-critical Automatic Target Recognition (ATR) systems, highlighting the limitations of post-hoc techniques like saliency and attention maps. It proposes a taxonomy and assessment framework to address issues such as spurious explanations and instability, advocating for more robust, causally grounded XAI approaches.

arXiv:2605.05748v1 Announce Type: new

Cached at: 05/08/26, 08:47 AM

# Evaluating Explainability in Safety-Critical ATR Systems: Limitations of Post-Hoc Methods and Paths Toward Robust XAI
Source: [https://arxiv.org/html/2605.05748](https://arxiv.org/html/2605.05748)
¹ Fraunhofer Institute of Optronics, System Technologies and Image Exploitation IOSB, Ettlingen, Germany
Emails: vanessa.buhrmester@iosb.fraunhofer.de, david.muench@iosb.fraunhofer.de

###### Abstract

Explainable Artificial Intelligence \(XAI\) is increasingly recognized as essential for deploying machine learning systems in safety\-critical environments\. In Automatic Target Recognition \(ATR\), where models operate on image, video, radar, and multisensor data, high predictive performance alone is insufficient\. Model decisions must also be interpretable, reliable, and suitable for validation\.

This paper presents a structured evaluation of explainability methods in the context of safety\-critical ATR systems: We identify major XAI paradigms, including saliency\-based, attention\-based, and surrogate approaches, as well as recent detection\-aware extensions\. Based on this, we formalize explainability as an assurance\-oriented assessment problem, introduce a taxonomy, and assess these methods with respect to four key dimensions: interpretability, robustness, vulnerability to manipulation, and suitability for validation and verification\. The analysis identifies systematic limitations of current post\-hoc explanation methods\. In particular, we derive critical failure modes such as spurious explanations, instability under perturbations, and overtrust induced by visually convincing outputs\. These findings indicate that widely used XAI techniques may be insufficient for safety\-critical deployment\.

Finally, we discuss implications for ATR systems and outline directions toward more robust, causally grounded, and physically informed explainability methods\. Our results emphasize the need to move beyond visually plausible explanations toward approaches that support reliable decision\-making and system\-level assurance\.

## 1 Introduction

Explainable Artificial Intelligence \(XAI\) is increasingly regarded as a prerequisite for deploying machine learning systems in safety\-critical environments\. This is particularly evident in Automatic Target Recognition \(ATR\), where models operate on heterogeneous data sources such as imagery, video streams, radar, and multisensor data\. In these settings, predictive accuracy alone is insufficient\. Model decisions must also be transparent, reliable, and suitable for technical assessment, since misinterpretations may have significant operational consequences\.

Despite their strong performance, modern deep learning models, especially convolutional and transformer\-based architectures, remain inherently opaque\. A broad range of post\-hoc explanation techniques has been proposed to address this issue, including saliency\-based, attention\-based, and surrogate approaches\[[3](https://arxiv.org/html/2605.05748#bib.bib21),[13](https://arxiv.org/html/2605.05748#bib.bib2),[19](https://arxiv.org/html/2605.05748#bib.bib1),[22](https://arxiv.org/html/2605.05748#bib.bib6)\]\. While these methods often produce visually convincing explanations, their reliability is increasingly questioned\. Empirical studies indicate that explanations can remain stable under model randomization, react strongly to minor perturbations, or fail to reflect the underlying decision process altogether\[[1](https://arxiv.org/html/2605.05748#bib.bib7)\]\.

These limitations are particularly critical in the ATR context\. Unlike standard image classification tasks, ATR systems must operate under varying environmental conditions, sensor noise, incomplete observations, and real\-time constraints\. Furthermore, explanations are not only used for human interpretation, but also for system validation, robustness analysis, and decision support\. Consequently, explanation methods must satisfy stronger requirements, including stability, resistance to manipulation, and compatibility with validation and verification processes\.

Recent research has emphasized the need for structured evaluation of XAI methods, especially in object detection scenarios where explanations must capture both spatial localization and semantic relevance\. At the same time, application\-driven studies in domains such as remote sensing and UAV\-based perception highlight the growing importance of explainability in real\-world and safety\-critical settings\.

This work provides a structured analysis of XAI methods in the context of modern ATR systems, focusing on saliency\-based, attention\-based, and surrogate\-based approaches, including recent detection\-aware extensions, and analyzing their strengths, limitations, and common failure modes in safety\-critical settings\. This paper does not propose a new explanation algorithm\. Instead, it provides an ATR\-specific assessment framework for evaluating whether existing XAI paradigms are suitable for safety\-critical use, with particular emphasis on robustness, manipulation resistance, and validation and verification\.

The main contributions of this work are as follows:

- We formalize explainability in safety-critical ATR as an assurance-oriented assessment problem and introduce four evaluation dimensions: interpretability, robustness, vulnerability to manipulation, and suitability for validation and verification.
- We provide an ATR-specific, unified taxonomy of XAI methods, covering saliency-based, attention-based, surrogate, detection-aware, concept-based, and intrinsic or physics-informed approaches.
- We present a systematic, cross-paradigm evaluation of major XAI methods with respect to these dimensions, identifying their suitability for exploratory analysis, structured validation, and high-assurance deployment.
- We identify systematic gaps between current post-hoc XAI practice and safety-critical requirements, including instability, spurious explanations, overtrust, and limited integration into V&V workflows.
- We derive concrete research directions toward robust, causally grounded, and physically informed XAI methods, highlighting the need for explanations that support verification, resist manipulation, and align with domain constraints.

## 2 Related Work

### 2.1 XAI in Computer Vision

Explainable Artificial Intelligence has been widely studied in computer vision, where most approaches can be grouped into attribution-based, activation-based, perturbation-based, concept-based, and transformer-oriented methods. Recent survey papers provide increasingly structured overviews of this landscape and highlight the growing shift from image classification toward more complex tasks such as object detection and tracking [[5](https://arxiv.org/html/2605.05748#bib.bib14),[14](https://arxiv.org/html/2605.05748#bib.bib13)]. Attribution-based methods such as gradient saliency and Integrated Gradients [[24](https://arxiv.org/html/2605.05748#bib.bib3)] focus on local feature relevance, while activation-based approaches such as CAM and Grad-CAM exploit intermediate feature maps for spatial localization. More recent reviews also emphasize transformer-based explainability methods and discuss their relevance for modern vision architectures [[5](https://arxiv.org/html/2605.05748#bib.bib14)].

At the same time, a growing body of work points to fundamental shortcomings of existing XAI approaches\. Visually coherent explanations do not necessarily correspond to model\-relevant features, and many methods exhibit pronounced sensitivity to noise and perturbations\. These observations have motivated a shift toward more critical evaluation protocols and application\-specific analysis frameworks, especially in high\-stakes domains\.

### 2.2 Explainability for Object Detection

Compared with image classification, explainability in object detection poses additional challenges because explanations must account not only for class predictions but also for localization decisions and multi-object scenarios. Classical saliency-based approaches such as Grad-CAM [[22](https://arxiv.org/html/2605.05748#bib.bib6)] have therefore been extended to object detection pipelines such as Faster R-CNN [[18](https://arxiv.org/html/2605.05748#bib.bib11)] and YOLO [[17](https://arxiv.org/html/2605.05748#bib.bib12)], where explanations are typically conditioned on specific detection heads or bounding boxes.

Recent work has proposed detection\-aware extensions of CAM\-based methods, including the Gaussian\-Class Activation Mapping Explainer \(G\-CAME\)\[[2](https://arxiv.org/html/2605.05748#bib.bib20)\], which improves instance\-level localization and computational efficiency in object detection settings\. In parallel, there is growing interest in benchmarking explainability methods specifically for detection tasks\. ODExAI, for example, introduces a dedicated evaluation framework for object detection explainability based on localization accuracy, faithfulness, and computational complexity\[[15](https://arxiv.org/html/2605.05748#bib.bib15)\]\. Such work is particularly relevant for ATR scenarios, where explanations must capture both semantic target evidence and spatial precision\.

Transformer-based object detectors such as DETR [[4](https://arxiv.org/html/2605.05748#bib.bib5)] have further expanded the methodological landscape. Their attention mechanisms are often interpreted as explanatory signals, but prior work has shown that attention weights do not necessarily correspond to causal relevance [[7](https://arxiv.org/html/2605.05748#bib.bib8)]. This limits their direct interpretability and calls for critical assessment in practical use cases.

### 2.3 XAI in Safety-Critical and ATR Applications

![Figure 1](https://arxiv.org/html/2605.05748v1/image_OBJ.png)

Figure 1: Illustrative example of safety-critical applications: real-time automatic target recognition from a UAV-based thermal infrared perspective.

The use of XAI in safety-critical applications has received increasing attention, particularly in domains where incorrect or poorly understood model decisions may lead to severe consequences. In such contexts, explainability is not only a diagnostic tool for model developers, but also a prerequisite for trust, verification, and operational acceptance.

In the ATR domain, explainability has been studied across multiple sensing modalities, including image, video, radar, and SAR data\. Image\-based ATR and object detection systems predominantly rely on post\-hoc saliency or attention\-based approaches, while radar\- and SAR\-based ATR increasingly explore more structured or physically grounded forms of interpretability\. Application\-driven research in satellite imagery demonstrates that explainable object detection can improve robustness and support structured reasoning in remote sensing scenarios\[[20](https://arxiv.org/html/2605.05748#bib.bib18)\]\. Similarly, explainability is becoming increasingly relevant in monocular vision\-based UAV systems, where transparent obstacle detection and navigation decisions are critical for safe and trustworthy autonomous operation\[[8](https://arxiv.org/html/2605.05748#bib.bib19)\]\.

Overall, prior work shows that explainability in ATR is moving beyond generic visualization methods toward task\-specific, detection\-aware, and application\-driven approaches\. However, the literature still lacks a unified evaluation perspective tailored to safety\-critical ATR systems\. In particular, there remains a need for systematic comparison of XAI methods with respect to robustness, manipulability, and suitability for validation and verification\.

### 2.4 Gap Analysis and Positioning

Although recent surveys provide comprehensive overviews of XAI methods in computer vision, they remain largely application\-agnostic and do not fully address the requirements of safety\-critical ATR systems\. Similarly, detection\-aware XAI frameworks improve the evaluation of explanations for object detection, but they primarily focus on localization, faithfulness, and computational aspects\. These criteria are necessary but insufficient for high\-assurance ATR applications, where explanations must also be robust under sensor and environmental perturbations, resistant to manipulation, and usable within validation and verification processes\. This paper therefore positions itself between general XAI surveys, object\-detection\-specific explanation benchmarks, and safety\-critical AI research\. Its contribution is an ATR\-specific assessment framework that links major XAI paradigms to operational assurance requirements\. The goal is not to introduce a new explanation method, but to identify which classes of explanations are suitable, insufficient, or promising for safety\-critical ATR\.

The discussion above shows that existing XAI literature provides either broad methodological taxonomies, object\-detection\-specific explanation techniques, or general critiques of post\-hoc explanations in high\-stakes domains\. What is still missing is an ATR\-specific perspective that connects explanation methods to operational assurance requirements\. The following section therefore formulates XAI evaluation in ATR as a multi\-criteria assessment problem\.

## 3 Evaluation Framework

### 3.1 Problem Formulation

Automatic Target Recognition systems can be formalized as predictive models that map sensor data to target labels. Let $\mathcal{X} \subseteq \mathbb{R}^{n}$ denote the input space and $\mathcal{Y}$ the output space, e.g., a set of target classes or, more generally, continuous predictions. Given an input $x \in \mathcal{X}$, representing data from one or multiple sensors, a model $f$ produces a prediction $y \in \mathcal{Y}$, typically by maximizing the posterior probability:

$$f(x) = \arg\max_{y \in \mathcal{Y}} p(y \mid x). \tag{1}$$

In modern ATR systems, $f$ is commonly implemented as a deep neural network, such as a convolutional or transformer-based architecture. In object detection settings, the model output may additionally include spatial information, such as bounding boxes or segmentation masks, resulting in structured predictions of the form:

$$f(x) = \{(y_i, b_i)\}_{i=1}^{N}, \tag{2}$$

where $y_i$ denotes the predicted class, $b_i$ the corresponding spatial localization as a bounding box, and $N$ the number of detections.

To improve transparency, an explanation function $E$ is introduced, which maps a model $f$ and an input $x$ to an interpretable representation:

$$E: (f, x) \mapsto e, \tag{3}$$

where the explanation $e$ may take different forms, such as saliency maps, attention distributions, or surrogate model approximations.
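
To make Eqs. (1) and (3) concrete, the following minimal sketch uses a hypothetical toy linear-softmax classifier on a 3-dimensional input. `predict` plays the role of $f$, and `gradient_explanation` is one possible choice of $E$: the vanilla gradient of the predicted class probability with respect to the input. The weights `W`, `b` and all function names are illustrative assumptions, not an ATR model or the paper's implementation.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical toy model: logits = W x + b for 3 classes over 3 input features.
W = [[1.0, -0.5, 0.0],
     [-0.3, 0.8, 0.2],
     [0.0, 0.1, -0.7]]
b = [0.0, 0.1, -0.1]

def softmax(z: Sequence[float]) -> List[float]:
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def posterior(x: Sequence[float]) -> List[float]:
    """p(y | x) for every class y."""
    logits = [sum(w_j * x_j for w_j, x_j in zip(row, x)) + b_k
              for row, b_k in zip(W, b)]
    return softmax(logits)

def predict(x: Sequence[float]) -> int:
    """f(x) = argmax_y p(y | x), as in Eq. (1)."""
    p = posterior(x)
    return max(range(len(p)), key=lambda k: p[k])

def gradient_explanation(x: Sequence[float]) -> Tuple[int, List[float]]:
    """One possible E(f, x): gradient of p(c | x) w.r.t. x for the predicted class c.

    For a linear-softmax model, d p_c / d x_j = p_c * (W[c][j] - sum_k p_k W[k][j]).
    """
    p = posterior(x)
    c = max(range(len(p)), key=lambda k: p[k])
    mean_row = [sum(p[k] * W[k][j] for k in range(len(W))) for j in range(len(x))]
    e = [p[c] * (W[c][j] - mean_row[j]) for j in range(len(x))]
    return c, e
```

The explanation `e` has the same shape as the input, which is what makes saliency-style explanations easy to overlay on imagery; it says nothing yet about whether `e` is stable or faithful, which is the subject of the criteria below.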

The central objective of explainability is to ensure that $e$ provides meaningful insight into the decision-making process of $f$. In safety-critical ATR systems, however, this objective is subject to additional requirements. Explanations must satisfy properties that go beyond visual interpretability, including stability under input perturbations, resistance to manipulation, and consistency with the underlying model behavior.

This leads to a fundamental challenge: given a model $f$ and an explanation method $E$, how can the quality of the resulting explanation $e$ be assessed in a principled and application-relevant manner? In contrast to standard machine learning evaluation, there is typically no ground truth for explanations [[9](https://arxiv.org/html/2605.05748#bib.bib17),[11](https://arxiv.org/html/2605.05748#bib.bib16)], which makes it difficult to directly measure correctness.

ATR systems introduce additional complexity due to multisensor inputs, varying environmental conditions, and real\-time constraints\. Explanations must therefore capture both semantic relevance, i\.e\., which features influence the predicted class, and spatial localization, i\.e\., where relevant information is located, while remaining robust across different sensing modalities\.

In this work, we address this challenge by formulating the evaluation of explainability methods as a multi\-criteria problem\. Specifically, we assess explanation methods along four key dimensions: interpretability, robustness, vulnerability to manipulation, and suitability for validation and verification\. This formulation provides the basis for the comparative analysis presented in the following sections\.

### 3.2 Evaluation Criteria

Based on the problem formulation introduced above, the evaluation of explainability methods in ATR systems can be formulated as a multi\-dimensional assessment problem\. Since no ground truth explanations are available in most practical scenarios, explanation quality must be evaluated indirectly through a set of well\-defined criteria\.
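
One widely used indirect proxy for faithfulness is a deletion test: features are removed in order of decreasing attribution while the model score is tracked, and a faithful explanation should make the score drop quickly. The sketch below is illustrative only; it assumes a black-box `score` callable, treats "removal" as setting a feature to a `baseline` value of 0, and summarizes the curve with a trapezoidal area. These choices, and the function names, are assumptions rather than part of the proposed framework.

```python
from typing import Callable, List, Sequence

def deletion_curve(score: Callable[[List[float]], float],
                   x: Sequence[float],
                   attribution: Sequence[float],
                   baseline: float = 0.0) -> List[float]:
    """Model score after deleting the k most-attributed features, for k = 0..n."""
    order = sorted(range(len(x)), key=lambda i: attribution[i], reverse=True)
    current = list(x)
    curve = [score(current)]
    for i in order:
        current[i] = baseline  # "delete" feature i by replacing it with the baseline
        curve.append(score(current))
    return curve

def deletion_auc(curve: Sequence[float]) -> float:
    """Trapezoidal area on a [0, 1] fraction-deleted axis; lower suggests a more faithful ranking."""
    n = len(curve) - 1
    return sum((curve[k] + curve[k + 1]) / 2 for k in range(n)) / n
```

In imagery, the choice of baseline (zero, blur, or mean fill) can itself bias the result, so such scores are best compared across methods under one fixed protocol rather than read as absolute quality.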

We consider four dimensions that are particularly relevant for safety\-critical ATR applications: interpretability, robustness, vulnerability to manipulation, and suitability for validation and verification\.

**Interpretability** refers to the extent to which an explanation can be understood by a human user and provides meaningful insight into the model decision. It is important to distinguish between visual plausibility and semantic correctness. An explanation may appear intuitive while not accurately reflecting the underlying decision process of the model. Therefore, interpretability must be considered together with other criteria.

**Robustness** describes the stability of an explanation under variations in the input or the model. In practical ATR scenarios, sensor data is often affected by noise, changing viewpoints, or environmental conditions. A robust explanation method should produce consistent outputs under such perturbations. High sensitivity to small input changes reduces explanation reliability and limits practical applicability.
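
One way to quantify this criterion is a max-sensitivity style estimate: perturb the input with small random noise inside a radius and record the largest change in the explanation. The sketch assumes uniform noise in an L-infinity ball, an L2 distance between explanations, and a user-supplied `explain` callable; all of these are illustrative choices. For ATR imagery one would substitute sensor-realistic perturbations such as noise, small shifts, or viewpoint changes.

```python
import math
import random
from typing import Callable, List, Sequence

def l2(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def max_sensitivity(explain: Callable[[List[float]], List[float]],
                    x: Sequence[float],
                    radius: float = 0.05,
                    n_samples: int = 50,
                    seed: int = 0) -> float:
    """Largest L2 change of the explanation over random perturbations with |delta_i| <= radius."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    e0 = explain(list(x))
    worst = 0.0
    for _ in range(n_samples):
        x_pert = [xi + rng.uniform(-radius, radius) for xi in x]
        worst = max(worst, l2(e0, explain(x_pert)))
    return worst
```

Because the estimate is a maximum over random samples, it is a lower bound on the true worst case; a larger `n_samples` or an adversarial search tightens it.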

**Vulnerability to manipulation** captures the susceptibility of explanation methods to adversarial or intentional modifications. Explanations can be altered without significantly affecting the model prediction [[12](https://arxiv.org/html/2605.05748#bib.bib9)], leading to misleading or deceptive interpretations. In safety-critical systems, such behavior poses a significant risk, as it undermines trust in the explanation process.
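
A simple audit for this failure mode compares an original input $x$ with a candidate manipulated input $x'$: if the predicted class is unchanged while the ranking of attributed features changes substantially, the explanation has moved without a corresponding change in the decision. The sketch assumes user-supplied `predict` and `explain` callables and uses Spearman rank correlation without tie correction; the `min_rank_corr` threshold is a hypothetical placeholder, not a value from the paper.

```python
from typing import Callable, List, Sequence

def ranks(v: Sequence[float]) -> List[int]:
    """Rank of each entry (0 = smallest); ties are broken by position."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman rank correlation 1 - 6*sum(d^2) / (n(n^2 - 1)), assuming no ties."""
    n = len(a)
    ra, rb = ranks(a), ranks(b)
    d2 = sum((p - q) ** 2 for p, q in zip(ra, rb))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

def explanation_manipulated(predict: Callable[[List[float]], int],
                            explain: Callable[[List[float]], List[float]],
                            x: Sequence[float],
                            x_prime: Sequence[float],
                            min_rank_corr: float = 0.5) -> bool:
    """True if the decision is unchanged but the explanation ranking diverges (hypothetical threshold)."""
    same_decision = predict(list(x)) == predict(list(x_prime))
    rho = spearman(explain(list(x)), explain(list(x_prime)))
    return same_decision and rho < min_rank_corr
```

The check only detects a manipulation once a candidate $x'$ exists; generating such inputs, for example by optimizing against the explanation as in [12], is a separate step.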

**Suitability for validation and verification (V&V)** refers to the extent to which an explanation method supports systematic analysis of model behavior. This includes the ability to test, validate, and potentially certify model decisions. Explanations that lack structure or consistency are difficult to integrate into formal verification workflows, limiting their usefulness in high-assurance applications.

These criteria are not independent but exhibit inherent trade\-offs\. For example, highly interpretable explanations may lack robustness, while methods optimized for stability may provide less intuitive representations\. In addition, different ATR modalities, such as image\-based or radar\-based systems, impose different requirements on explanation methods\.
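
Because of these trade-offs, averaging the four dimensions into one score can hide a dimension that fails outright. A minimal sketch of one alternative applies a weakest-link rule, where the usage tier is set by the lowest-rated dimension. The ordinal scale, the inversion of vulnerability into `manipulation_resistance`, and the thresholds are hypothetical; only the tier names are taken from the contributions listed in Section 1, and the rule is not claimed to be the paper's method.

```python
from dataclasses import dataclass

# Hypothetical ordinal scale for each dimension.
LOW, MEDIUM, HIGH = 0, 1, 2

@dataclass(frozen=True)
class Assessment:
    interpretability: int
    robustness: int
    manipulation_resistance: int  # HIGH means low vulnerability to manipulation
    vv_suitability: int

    def weakest(self) -> int:
        return min(self.interpretability, self.robustness,
                   self.manipulation_resistance, self.vv_suitability)

def usage_tier(a: Assessment) -> str:
    """Weakest-link mapping from ratings to a usage tier (illustrative thresholds)."""
    w = a.weakest()
    if w >= HIGH:
        return "high-assurance deployment"
    if w >= MEDIUM:
        return "structured validation"
    return "exploratory analysis"
```

Under this rule a method rated HIGH on interpretability but LOW on robustness stays in the exploratory tier, reflecting the point above that interpretability must be considered together with the other criteria. The ratings used with it would have to come from an actual assessment.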

The proposed criteria provide a structured basis for comparing XAI methods in the ATR context\. In the following section, we apply this framework to analyze the strengths and limitations of different explainability paradigms\.

### 3.3 Method Categorization

Based on the evaluation criteria defined above, we categorize explainability methods in the ATR context into five main paradigms: saliency\-based methods, attention\-based approaches, surrogate models, detection\-aware XAI methods, and intrinsic or physics\-informed models\. This categorization reflects both methodological differences and practical applicability in safety\-critical systems\.

**Saliency-based methods** represent the most widely used class of post-hoc explainability techniques. These approaches attribute model predictions to input features by estimating relevance scores, typically derived from gradients or feature activations [[23](https://arxiv.org/html/2605.05748#bib.bib22),[24](https://arxiv.org/html/2605.05748#bib.bib3)]. Methods such as Vanilla Gradients, Integrated Gradients, and Grad-CAM [[22](https://arxiv.org/html/2605.05748#bib.bib6)] fall into this category. More recent extensions, including detection-aware variants such as G-CAME [[2](https://arxiv.org/html/2605.05748#bib.bib20)], improve spatial localization in object detection scenarios. While saliency methods are easy to apply and broadly compatible with different architectures, they are often limited in terms of robustness and causal interpretability [[1](https://arxiv.org/html/2605.05748#bib.bib7),[6](https://arxiv.org/html/2605.05748#bib.bib23)].
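
As a concrete instance of a gradient-derived relevance score, the sketch below approximates Integrated Gradients with a midpoint Riemann sum along the straight path from a baseline $x'$ to $x$, i.e. $\mathrm{IG}_i(x) = (x_i - x'_i)\int_0^1 \partial_i F(x' + \alpha(x - x'))\,d\alpha$. To stay self-contained it assumes a hypothetical logistic model $F(x) = \sigma(w^\top x + b)$ with an analytic gradient; for a real network the gradient would come from automatic differentiation.

```python
import math
from typing import List, Sequence

# Hypothetical toy model F(x) = sigmoid(w . x + b); weights are illustrative.
w = [1.5, -2.0, 0.5]
b = -0.25

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def F(x: Sequence[float]) -> float:
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def grad_F(x: Sequence[float]) -> List[float]:
    s = F(x)
    return [s * (1.0 - s) * wi for wi in w]  # d sigmoid(z)/dz = s(1 - s)

def integrated_gradients(x: Sequence[float], baseline: Sequence[float],
                         steps: int = 200) -> List[float]:
    """Midpoint Riemann approximation of IG along the straight-line path."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        g = grad_F(point)
        total = [t + gi for t, gi in zip(total, g)]
    return [(xi - bi) * t / steps for xi, bi, t in zip(x, baseline, total)]
```

The attributions should sum approximately to $F(x) - F(x')$ (the completeness property). That check confirms the numerical integration, not that the attribution is causally meaningful, which is the limitation noted above.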

**Attention-based approaches** leverage internal attention mechanisms, particularly in transformer-based architectures, to provide insight into model behavior [[7](https://arxiv.org/html/2605.05748#bib.bib8),[25](https://arxiv.org/html/2605.05748#bib.bib4)]. In object detection models such as DETR [[4](https://arxiv.org/html/2605.05748#bib.bib5)], attention weights are often visualized as explanatory signals. These methods offer a global perspective on feature relationships but do not necessarily reflect causal importance. As a result, their interpretability is limited, and they should be used with caution in safety-critical applications.
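
A common way to turn per-layer attention weights into a single token-relevance map is the attention-rollout technique: average the heads of each layer, add the identity to account for residual connections, renormalize rows, and multiply the resulting matrices across layers. The sketch assumes attention is given as nested lists indexed `[layer][head][query][key]` with row-stochastic matrices. As the paragraph above notes, the result describes information flow through attention, not causal importance.

```python
from typing import List

Matrix = List[List[float]]

def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def attention_rollout(attn: List[List[Matrix]]) -> Matrix:
    """attn[layer][head] is a row-stochastic T x T attention matrix."""
    T = len(attn[0][0])
    rollout = [[1.0 if i == j else 0.0 for j in range(T)] for i in range(T)]
    for heads in attn:
        # Average over heads, then add the identity for the residual path.
        avg = [[sum(h[i][j] for h in heads) / len(heads) for j in range(T)] for i in range(T)]
        aug = [[avg[i][j] + (1.0 if i == j else 0.0) for j in range(T)] for i in range(T)]
        # Renormalize each row so it remains a probability distribution.
        aug = [[v / sum(row) for v in row] for row in aug]
        rollout = matmul(aug, rollout)  # later layers are applied on the left
    return rollout
```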

**Surrogate models and model simplification techniques** aim to approximate complex models with simpler, interpretable representations. Methods such as LIME [[19](https://arxiv.org/html/2605.05748#bib.bib1)] and SHAP [[13](https://arxiv.org/html/2605.05748#bib.bib2)] generate local explanations by fitting interpretable models to the behavior of the original model in the vicinity of a given input. While these approaches improve transparency, their reliability depends on approximation fidelity. In high-dimensional ATR scenarios, surrogate models may fail to capture the true decision boundary, leading to misleading explanations.
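
The local-surrogate idea can be sketched as a weighted least-squares fit: sample perturbations around $x$, weight them by proximity, fit a linear model to the black-box outputs, and report the weighted $R^2$ as a fidelity estimate. This is a simplified illustration in the spirit of LIME, not the LIME or SHAP libraries; the Gaussian sampling, the exponential kernel, and the `sigma` and `kernel_width` defaults are assumptions. A low fidelity value is the warning sign that the surrogate does not capture the local decision boundary.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

def solve(A: List[List[float]], y: List[float]) -> List[float]:
    """Solve A beta = y by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [yi] for row, yi in zip(A, y)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def local_surrogate(model: Callable[[List[float]], float], x: Sequence[float],
                    n_samples: int = 500, sigma: float = 0.5,
                    kernel_width: float = 0.75, seed: int = 0) -> Tuple[List[float], float]:
    """Return (coefficients [intercept, w_1..w_n], weighted R^2 fidelity)."""
    rng = random.Random(seed)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        z = [xi + rng.gauss(0.0, sigma) for xi in x]
        d2 = sum((zi - xi) ** 2 for zi, xi in zip(z, x))
        rows.append([1.0] + z)
        targets.append(model(z))
        weights.append(math.exp(-d2 / kernel_width ** 2))  # proximity kernel
    p = len(rows[0])
    # Normal equations of weighted least squares: (X^T W X) beta = X^T W y.
    XtWX = [[sum(w * r[i] * r[j] for w, r in zip(weights, rows)) for j in range(p)]
            for i in range(p)]
    XtWy = [sum(w * r[i] * t for w, r, t in zip(weights, rows, targets)) for i in range(p)]
    beta = solve(XtWX, XtWy)
    preds = [sum(bi * ri for bi, ri in zip(beta, r)) for r in rows]
    mean = sum(w * t for w, t in zip(weights, targets)) / sum(weights)
    ss_res = sum(w * (t - q) ** 2 for w, t, q in zip(weights, targets, preds))
    ss_tot = sum(w * (t - mean) ** 2 for w, t in zip(weights, targets))
    return beta, 1.0 - ss_res / ss_tot
```

Reporting fidelity alongside the coefficients keeps the approximation error visible instead of presenting the surrogate's weights as the model's reasoning.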

**Detection-aware XAI methods** form a distinct paradigm in ATR, as they explicitly incorporate the structure of detection models. These approaches, including detector-specific CAM variants, G-CAME [[2](https://arxiv.org/html/2605.05748#bib.bib20)], and object detection (OD)-specific evaluation methods, align explanations with detection outputs such as bounding boxes or object instances.

In contrast to generic saliency methods, they explicitly couple explanations to object\-level predictions, improving spatial localization and interpretability in object detection tasks, which are central to ATR\. However, these methods can still inherit limitations from underlying gradient\-based techniques, including sensitivity to perturbations and limited causal interpretability\.
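
Coupling an explanation to a single detection $(y_i, b_i)$ also makes localization directly checkable. One simple measure, sometimes called the energy-based pointing game, is the fraction of positive saliency mass that falls inside the predicted box $b_i$. The sketch assumes a 2-D saliency map as nested lists and a box given as integer pixel bounds `(x0, y0, x1, y1)` with exclusive upper bounds; both conventions are assumptions.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), upper bounds exclusive

def energy_in_box(saliency: List[List[float]], box: Box) -> float:
    """Fraction of positive saliency mass inside the box (0.0 if the map has no positive mass)."""
    x0, y0, x1, y1 = box
    total = inside = 0.0
    for y, row in enumerate(saliency):
        for x, v in enumerate(row):
            v = max(v, 0.0)  # count only positive evidence
            total += v
            if x0 <= x < x1 and y0 <= y < y1:
                inside += v
    return inside / total if total > 0 else 0.0
```

A high value shows spatial alignment with the detection only; it says nothing about robustness to perturbations or causal validity, which is why such scores leave the limitations above unresolved.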

**Intrinsic and physics-informed models** integrate interpretability directly into the model architecture. Instead of explaining decisions post hoc, these approaches aim to make the decision process itself transparent. In radar- and SAR-based ATR systems, this often involves incorporating physical models or constraints into the learning process [[16](https://arxiv.org/html/2605.05748#bib.bib24),[10](https://arxiv.org/html/2605.05748#bib.bib25)]. Such methods generally provide higher robustness and stronger alignment with domain knowledge, making them particularly suitable for safety-critical applications.

This categorization highlights fundamental differences between post\-hoc and model\-inherent explainability approaches\. While post\-hoc methods offer flexibility and ease of use, intrinsic approaches provide more reliable and structured explanations\. The trade\-offs between these paradigms are analyzed in the following section using the evaluation criteria defined above\.

## 4 Comparative Analysis

### 4.1 Comparison of XAI Paradigms

Using the evaluation criteria introduced in Section 3, we compare the main XAI paradigms with respect to their behavior in ATR scenarios\. Rather than providing a purely descriptive overview, the focus lies on identifying systematic strengths and limitations across method classes\.

Saliency\-based methods provide intuitive and visually accessible explanations by directly mapping relevance to the input space\. This makes them particularly attractive for image\-based ATR systems, where spatial interpretability is essential\. However, their interpretability is often limited to local sensitivity analysis, and their outputs can be highly unstable under small input perturbations\. Moreover, these methods lack causal grounding, as highlighted by prior work showing that saliency maps may remain visually similar even when model parameters are randomized\[[1](https://arxiv.org/html/2605.05748#bib.bib7)\]\. This raises concerns about their reliability in safety\-critical settings\.

Attention\-based approaches offer a complementary perspective by capturing global dependencies within the data\. In transformer\-based architectures, attention weights can provide insight into relationships between different input regions\. While this enables a more holistic view of model behavior, attention mechanisms do not necessarily reflect causal importance\. Empirical studies have demonstrated that significantly different attention distributions can lead to similar model outputs, limiting their interpretability and robustness\.

Surrogate models aim to improve interpretability by approximating complex models with simpler, more transparent representations\. This enables structured explanations and facilitates human understanding of decision boundaries\. However, approximation fidelity is critical\. In high\-dimensional ATR scenarios, surrogate models may fail to accurately capture the behavior of the original model, resulting in misleading explanations\. Their robustness is therefore dependent on the stability of the approximation process\.

In addition to these general post\-hoc approaches, detection\-aware XAI methods bridge the gap between generic saliency approaches and task\-specific requirements of object detection\. By aligning explanations with detection outputs such as bounding boxes or object instances, they improve spatial interpretability in ATR scenarios\. However, they still rely on underlying gradient\-based mechanisms and therefore inherit key limitations in robustness and causal validity\.

In contrast, intrinsic and physics-informed models integrate interpretability directly into the model architecture. These approaches often leverage domain knowledge, such as physical signal properties in radar or SAR systems, to produce explanations that are both meaningful and consistent with underlying processes. As a result, they tend to exhibit higher robustness and lower susceptibility to manipulation. However, this comes at the cost of reduced flexibility and increased modeling complexity.

Overall, the comparison reveals fundamental trade-offs between interpretability, robustness, and methodological complexity. Post-hoc methods such as saliency and attention provide flexible and easily deployable explanations but suffer from limitations in stability and causal validity. Intrinsic approaches, while more reliable, require stronger assumptions and domain-specific modeling.

### 4.2 Summary Table

Table 1: Assessment of XAI paradigms according to safety-critical ATR requirements.

To provide a structured overview, Table [1](https://arxiv.org/html/2605.05748#S4.T1) summarizes the considered XAI paradigms with respect to the evaluation criteria introduced in Section [3.2](https://arxiv.org/html/2605.05748#S3.SS2).

Table [1](https://arxiv.org/html/2605.05748#S4.T1) reveals a clear trade-off between accessibility and reliability. Methods that are easy to apply and visually intuitive, such as saliency- and attention-based approaches, tend to fall short in terms of robustness, causal validity, and verifiability. Detection-aware XAI methods, including detector-specific CAM variants and approaches such as G-CAME, improve spatial alignment with object detections and enhance interpretability in detection tasks, but do not fully resolve these fundamental limitations. Surrogate models increase transparency through simplified approximations, yet remain dependent on approximation fidelity and local sampling assumptions. In contrast, intrinsic and physics-informed approaches are better aligned with the requirements of safety-critical ATR, as they ground explanations in domain knowledge and signal formation processes.

Overall, this comparison demonstrates the need for a shift from visually convincing explanations toward methods that provide robust, causally meaningful, and operationally relevant insight\.

## 5 Discussion and Failure Modes

The analysis above indicates that current XAI approaches exhibit systematic weaknesses when applied to safety\-critical ATR systems\. Beyond quantitative evaluation criteria, it is essential to consider typical failure modes that arise in practice and may compromise the reliability and interpretability of explanations\.

### 5.1 Spurious Explanations

A central issue of many post\-hoc explanation methods is the occurrence of spurious explanations\. These are explanations that appear visually plausible but do not accurately reflect the underlying decision process of the model\. In particular, gradient\-based saliency methods may produce structured patterns that are largely influenced by the input distribution rather than the learned model parameters\.

Previous work has demonstrated that some saliency methods can generate similar explanations even when model weights are randomized\[[1](https://arxiv.org/html/2605.05748#bib.bib7)\], indicating a weak dependence on the actual model behavior\. In the ATR context, this can lead to misleading interpretations, where explanations highlight features that are not causally relevant for target recognition\. Such discrepancies pose a significant risk in safety\-critical applications, where decisions must be both accurate and interpretable\.
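
The randomization test referred to above can be expressed as a small harness: compute explanations for the trained model and for a copy with randomized parameters, then compare them with a rank correlation. High similarity means the explanation barely depends on what the model learned. The sketch assumes a hypothetical linear model, with gradient-times-input and an input-magnitude map as the two explanation methods, purely to stay self-contained; in practice the real explainer and the network's layers would be substituted.

```python
import random
from typing import Callable, List, Sequence

def ranks(v: Sequence[float]) -> List[int]:
    """Rank of each entry (0 = smallest); ties are broken by position."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman rank correlation, assuming no ties."""
    n = len(a)
    d2 = sum((p - q) ** 2 for p, q in zip(ranks(a), ranks(b)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# (weights, x) -> attribution; a hypothetical interface for this sketch.
Explainer = Callable[[Sequence[float], Sequence[float]], List[float]]

def randomize(weights: Sequence[float], seed: int = 0) -> List[float]:
    """Replace trained weights with i.i.d. standard normal draws."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in weights]

def randomization_similarity(explain: Explainer, weights: Sequence[float],
                             random_weights: Sequence[float], x: Sequence[float]) -> float:
    """Rank similarity between explanations of the trained and the randomized model."""
    return spearman(explain(weights, x), explain(random_weights, x))

# Gradient x input for a linear model depends on the weights ...
grad_x_input: Explainer = lambda w, x: [wi * xi for wi, xi in zip(w, x)]
# ... whereas a map that only reflects input magnitude does not.
input_only: Explainer = lambda w, x: [abs(xi) for xi in x]
```

An input-only map can look plausible on imagery, yet it scores a perfect similarity under any randomization, which is the kind of weak model dependence this test is designed to expose.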

### 5.2 Overtrust in Visual Explanations

Another important failure mode is overtrust induced by visually convincing explanations\. Heatmaps and attention visualizations are often intuitive and easy to interpret, which may lead users to overestimate their reliability\.

Empirical studies suggest that humans tend to trust explanations that are simple and visually coherent, even when they are incomplete or incorrect\. In ATR systems, this can result in operators placing unjustified confidence in model decisions, potentially leading to incorrect or unsafe actions\. The combination of high visual plausibility and limited causal validity therefore creates a systematic risk of misinterpretation\.

### 5.3 Instability and Sensitivity to Perturbations

Many XAI methods exhibit significant sensitivity to small input perturbations\. Minor changes in the input data, such as noise, slight shifts, or variations in viewing conditions, can lead to substantially different explanations\.

This instability is particularly problematic in ATR applications, where sensor data is inherently noisy and subject to environmental variability\. Explanations that are not robust under such conditions cannot be reliably used for system validation or decision support\. Moreover, instability reduces the reproducibility of explanations, which is a critical requirement in safety\-critical systems\.
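Instability of this kind can be quantified by recomputing an explanation under small input perturbations and measuring how much its most relevant features change. The sketch below uses top-k feature overlap, in the spirit of the fragility analysis of Ghorbani et al. [6]; the additive Gaussian noise model with scale `sigma`, the value of `k`, and the toy gradient × input explainer are illustrative assumptions rather than an ATR-validated protocol.

```python
import numpy as np


def topk_overlap(a, b, k):
    """Fraction of the k highest-|attribution| features shared by a and b."""
    top_a = set(np.argsort(-np.abs(a))[:k].tolist())
    top_b = set(np.argsort(-np.abs(b))[:k].tolist())
    return len(top_a & top_b) / k


def explanation_stability(explain, x, sigma, k, n_trials, rng):
    """Mean top-k overlap between explain(x) and explain(x + Gaussian noise)."""
    reference = explain(x)
    overlaps = [
        topk_overlap(reference, explain(x + sigma * rng.normal(size=x.shape)), k)
        for _ in range(n_trials)
    ]
    return float(np.mean(overlaps))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(16, 100))  # hypothetical hidden-layer weights
    v = rng.normal(size=16)

    def explain(x):
        # Gradient x input of the toy scorer s(x) = v . tanh(W x).
        grad = W.T @ (v * (1.0 - np.tanh(W @ x) ** 2))
        return grad * x

    x = rng.normal(size=100)
    for sigma in (0.0, 0.01, 0.1):
        print(sigma, explanation_stability(explain, x, sigma, k=10, n_trials=20, rng=rng))
```

Reporting such a score across noise levels representative of the sensor (rather than a single arbitrary `sigma`) would tie the stability requirement to the operating conditions described above.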

### 5.4 Limits of Explainability in Deep Learning Systems

Beyond method\-specific limitations, there exist fundamental constraints on explainability in complex deep learning models\. Modern neural networks operate in high\-dimensional and highly nonlinear representation spaces, where decision boundaries cannot always be decomposed into simple, human\-interpretable components\.

Furthermore, multiple internal representations may lead to equivalent model outputs, making it difficult to establish unique causal explanations\. In many cases, no ground truth for explanations exists, which complicates both evaluation and validation\. These limitations suggest that post\-hoc explanation methods may inherently fall short in fully capturing model behavior in complex ATR systems\.
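The non-uniqueness of internal representations can be illustrated with a well-known symmetry of ReLU networks: scaling a hidden unit's incoming weights by a positive factor c and its outgoing weight by 1/c, or permuting hidden units, leaves the input-output function unchanged while the hidden activations differ. The bias-free two-layer network and the helper names below are hypothetical; the point is only that an explanation read off individual neurons is not uniquely determined by the model's observable behaviour.

```python
import numpy as np


def forward(W1, w2, x):
    """Return (output, hidden activations) of s(x) = w2 . relu(W1 x)."""
    h = np.maximum(W1 @ x, 0.0)
    return float(w2 @ h), h


def equivalent_network(W1, w2, rng):
    """Functionally identical network via positive rescaling and permutation."""
    c = rng.uniform(0.1, 10.0, size=W1.shape[0])  # positive per-unit scales
    perm = rng.permutation(W1.shape[0])
    # relu(c * z) = c * relu(z) for c > 0, so dividing w2 by c cancels the scaling.
    return (c[:, None] * W1)[perm], (w2 / c)[perm]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W1, w2 = rng.normal(size=(8, 4)), rng.normal(size=8)
    W1_alt, w2_alt = equivalent_network(W1, w2, rng)
    x = rng.normal(size=4)
    y, h = forward(W1, w2, x)
    y_alt, h_alt = forward(W1_alt, w2_alt, x)
    print("same output:", np.isclose(y, y_alt))
    print("same hidden representation:", np.allclose(h, h_alt))
```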

### 5.5 Implications for Safety-Critical ATR Systems

The identified failure modes have direct implications for the deployment of XAI in safety\-critical ATR applications\. In such systems, explanations must not only be interpretable but also reliable, stable, and aligned with the actual decision process of the model\.

The analysis indicates that relying solely on post-hoc explanation methods is insufficient for high-assurance applications [[21](https://arxiv.org/html/2605.05748#bib.bib10)]. Instead, there is a need for approaches that integrate interpretability into the model design, incorporate domain knowledge, and support formal validation and verification processes.

Overall, these findings reinforce the need for a shift toward more robust, causally grounded, and physically informed explainability methods, particularly in domains where incorrect interpretations may have severe consequences\.

## 6 Conclusion and Outlook

This work examined the applicability of current XAI methods in safety\-critical ATR systems\. We introduced a taxonomy of commonly used approaches, including saliency\-based, attention\-based, surrogate\-based, detection\-aware, and intrinsic or physics\-informed methods, and proposed an ATR\-specific assessment framework with respect to interpretability, robustness, vulnerability to manipulation, and suitability for validation and verification\.

The analysis revealed fundamental limitations of current post\-hoc explainability methods\. While saliency\- and attention\-based approaches provide intuitive and easily deployable explanations, they often lack robustness, causal validity, and reliability in safety\-critical scenarios\. Detection\-aware XAI methods improve alignment with object detection outputs and enhance spatial interpretability, but still inherit key limitations of underlying gradient\-based techniques, including sensitivity to perturbations and limited causal grounding\. Surrogate models increase transparency but remain dependent on approximation fidelity, which can be problematic in complex, high\-dimensional ATR settings\. In contrast, intrinsic and physics\-informed approaches show greater potential for reliable and consistent explanations, particularly when aligned with domain knowledge\.

The discussion of failure modes further highlighted critical challenges, including spurious explanations, overtrust in visually plausible outputs, and instability under perturbations\. These findings indicate that explainability in ATR systems must be evaluated not only in terms of interpretability, but also with respect to robustness and operational reliability\.

Future research should therefore focus on the development of hybrid and intrinsically interpretable models that combine the flexibility of deep learning with structured, physically grounded representations\. In addition, there is a need for standardized evaluation frameworks and benchmarks tailored to safety\-critical applications, enabling systematic comparison and validation of XAI methods, including detection\-aware approaches\.

Advancing XAI in ATR will require moving beyond perceptually plausible explanations toward explanations that are robust, causally meaningful, verifiable, and grounded in the underlying physical processes of the data\.

## References

- [1] J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt, and B. Kim (2018). Sanity checks for saliency maps. In NeurIPS.
- [2] R. Agarwal, A. Singh, P. Gupta, and P. H. S. Torr (2024). Gaussian-class activation mapping explainer (G-CAME): improved visual explanations for deep neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- [3] V. Buhrmester, D. Münch, and M. Arens (2021). Analysis of explainers of black box deep neural networks for computer vision: a survey. Machine Learning and Knowledge Extraction 3(4), pp. 966–989.
- [4] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko (2020). End-to-end object detection with transformers. In European Conference on Computer Vision (ECCV).
- [5] Z. Cheng, Y. Wu, Y. Li, L. Cai, and B. Ihnaini (2025). A comprehensive review of explainable artificial intelligence in computer vision. Sensors.
- [6] A. Ghorbani, A. Abid, and J. Zou (2019). Interpretation of neural networks is fragile. In AAAI.
- [7] S. Jain and B. C. Wallace (2019). Attention is not explanation. In NAACL.
- [8] S. Javaid et al. (2025). Explainable AI and monocular vision for UAV navigation. Frontiers in Sustainable Cities.
- [9] A. Kadir et al. (2023). On the evaluation of explainable artificial intelligence methods.
- [10] G. E. Karniadakis, I. G. Kevrekidis, L. Lu, P. Perdikaris, S. Wang, and L. Yang (2021). Physics-informed machine learning. Nature Reviews Physics.
- [11] H. Lakkaraju and O. Bastani (2023). Evaluating explainable AI: which algorithm should I choose? Foundations and Trends in Machine Learning.
- [12] D. Slack, S. Hilgard, E. Jia, S. Singh, and H. Lakkaraju (2020). Fooling LIME and SHAP: adversarial attacks on post hoc explanation methods. In AIES.
- [13] S. Lundberg and S. Lee (2017). A unified approach to interpreting model predictions. In NeurIPS.
- [14] J. Mi, X. Jiang, L. Luo, and Y. Gao (2024). Toward explainable artificial intelligence: a survey and overview. Neurocomputing.
- [15] L. P. T. Nguyen, H. T. T. Nguyen, and H. Cao (2025). ODExAI: a comprehensive object detection explainable AI evaluation. arXiv preprint arXiv:2504.19249.
- [16] M. Raissi, P. Perdikaris, and G. E. Karniadakis (2019). Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics.
- [17] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016). You only look once: unified, real-time object detection. In CVPR.
- [18] S. Ren, K. He, R. Girshick, and J. Sun (2015). Faster R-CNN: towards real-time object detection with region proposal networks. In NeurIPS.
- [19] M. T. Ribeiro, S. Singh, and C. Guestrin (2016). "Why should I trust you?" Explaining the predictions of any classifier. In KDD.
- [20] A. Roy et al. (2025). Explainable AI for object detection in satellite imagery. IEEE Access.
- [21] C. Rudin (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1, pp. 206–215.
- [22] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017). Grad-CAM: visual explanations from deep networks via gradient-based localization. In ICCV.
- [23] K. Simonyan, A. Vedaldi, and A. Zisserman (2014). Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034.
- [24] M. Sundararajan, A. Taly, and Q. Yan (2017). Axiomatic attribution for deep networks. In ICML.
- [25] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017). Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS).
