GLARE: A Natural Language Interface for Querying Global Explanations
Summary
GLARE is an LLM-based interface that translates natural language questions into SQL queries over local explanation data, enabling users to interactively explore global explanations of black-box image classifiers.
# GLARE: A Natural Language Interface for Querying Global Explanations
Source: [https://arxiv.org/html/2606.19735](https://arxiv.org/html/2606.19735)
Oregon State University, Corvallis OR 97330, USA\. Email: \{vasub,mangannr\}@oregonstate\.edu

###### Abstract
While global explanations are crucial for understanding vision models across datasets, classes, and decision contexts, their complex and monolithic nature often hinders practical exploration\. Because users typically seek targeted answers to specific questions rather than static artifacts, we present an LLM\-based interactive interface that provides natural language access to global explanations for black\-box image classifiers\. The system’s core LLM acts as a mediator, translating natural language questions into structured SQL queries over local explanation data\. This enables flexible aggregation without exposing users to low\-level representations\. For each query, the interface outputs statistics\-augmented natural language responses, supporting local explanations, and intent\-aligned visualizations\. We evaluate the system on intent interpretation, query mapping accuracy, generalization to novel queries and datasets, and robustness to linguistic errors\. Our results demonstrate that LLM\-mediated querying substantially improves the accessibility and usability of global explanations for human\-centered XAI\.
## 1 Introduction
Deep vision models have achieved remarkable success in tasks ranging from medical diagnosis to autonomous driving, yet their decision\-making processes remain largely opaque\. In high\-stakes settings, users require explanations not only to justify individual predictions but also to build a mental model of the system’s behavior, assess reliability, and diagnose failure modes\. Explainable AI \(XAI\) methods have traditionally been dichotomized into *local* and *global* approaches\. Local methods, such as saliency maps \[[17](https://arxiv.org/html/2606.19735#bib.bib1)\], concept bottlenecks \[[9](https://arxiv.org/html/2606.19735#bib.bib2)\], or counterfactuals, explain specific instances \(e\.g\., “Why was *this* image classified as a wolf?”\)\. While useful for auditing single errors, they fail to reveal systemic biases or general reasoning patterns\. Conversely, *global* explanations aim to summarize the model’s behavior across the entire input space, often by identifying globally important features \[[6](https://arxiv.org/html/2606.19735#bib.bib5)\]\[[1](https://arxiv.org/html/2606.19735#bib.bib4)\] or distilling the model into a transparent surrogate such as a decision tree \[[10](https://arxiv.org/html/2606.19735#bib.bib3)\]\[[5](https://arxiv.org/html/2606.19735#bib.bib7)\] or a Disjunctive Normal Form \(DNF\) formula \[[21](https://arxiv.org/html/2606.19735#bib.bib6)\]\[[2](https://arxiv.org/html/2606.19735#bib.bib8)\]\.
However, a critical usability gap plagues modern global explanations: they are often overwhelming in scale and complexity\. For a deep network trained on a complex visual domain, a high\-fidelity global explanation might consist of thousands of logical rules or prototypes\. Presenting this “explanation dump” to a user induces cognitive overload, obscuring the very insights it aims to reveal\. We argue that users rarely need a static, monolithic summary of the entire model\. Instead, human explanation\-seeking is an iterative, query\-driven process \[[12](https://arxiv.org/html/2606.19735#bib.bib10)\]\[[11](https://arxiv.org/html/2606.19735#bib.bib9)\]\. Users approach the model with specific hypotheses or information needs, such as: *“What features are necessary for the ‘bedroom’ class?”*, *“Does the model rely on background snow to classify wolves?”*, or *“Show me examples where the model relies on shape rather than texture\.”* Current XAI tools force users to manually filter and aggregate local explanations to answer these questions, creating friction that limits the practical utility of global insights\.
In this paper, we present GLARE \(Global Language\-based Analysis and Retrieval of Explanations\), an interactive interface that mediates between users and large\-scale global explanations\. Rather than treating a global explanation as a static artifact to be viewed, we treat it as a *database* to be queried\. We choose logical explanations \[[21](https://arxiv.org/html/2606.19735#bib.bib6)\]\[[20](https://arxiv.org/html/2606.19735#bib.bib11)\], which aggregate local Minimal Sufficient Explanations \(MSXs\) into a global DNF structure, because their logical rules give a binary account of whether a concept is important or not\. We ingest these logic\-based local explanations into a relational database, enabling precise structural queries over the model’s reasoning patterns\. Although our experiments are limited to explanations generated by \[[21](https://arxiv.org/html/2606.19735#bib.bib6)\], the interface presented in the paper can work with any concept\-based local explanation method\. GLARE allows users to interrogate this database using natural language\. The core of our system is a Large Language Model \(LLM\) fine\-tuned to act as a semantic parser, translating user questions into structured SQL queries\. Unlike general text\-to\-SQL approaches, we constrain the LLM to select from a taxonomy of analytical query templates specialized in explaining decisions, ranging from simple object frequency counts to complex counterfactual set operations\. By employing a loss\-masking technique during fine\-tuning \(“fence masking”\) that focuses learning exclusively on the explanation\-specific SQL structure, we encourage the model to learn the *relational algebra* of explanation querying rather than memorizing dataset\-specific entity names, leading to zero\-shot transfer to new datasets\.
We evaluate GLARE on global explanations derived from the ADE20K scene parsing dataset\. Our results demonstrate that the system achieves over 95% accuracy on in\-distribution queries and exhibits strong robustness to spelling errors, grammatical noise, and phrasing variations\. Most notably, we demonstrate zero\-shot cross\-dataset transfer: a model trained exclusively on ADE20K metadata effectively interprets queries for a Pascal VOC database, a domain with a completely disjoint object vocabulary, suggesting that our approach learns generalized reasoning patterns applicable to diverse vision tasks\. To summarize, our contributions are threefold: \(i\) we introduce a natural\-language interface for interrogating global explanations as queryable databases; \(ii\) we define a SQL\-based intermediate representation for aggregating, filtering, and contrasting local explanations; and \(iii\) we demonstrate that synthetic\-data fine\-tuning with SQL\-fence loss masking yields robust query interpretation, including cross\-dataset transfer to Pascal VOC without retraining\.
## 2 Related Work
#### Interactive and Conversational XAI:
This challenge highlights the "social" nature of explanations\[[12](https://arxiv.org/html/2606.19735#bib.bib10)\], suggesting they should be interactive dialogues rather than static artifacts\. While early XAI systems categorized user intent into queries like Why? or What if?\[[11](https://arxiv.org/html/2606.19735#bib.bib9)\], and visual tools like Gamut\[[7](https://arxiv.org/html/2606.19735#bib.bib18)\]or the What\-If Tool\[[22](https://arxiv.org/html/2606.19735#bib.bib19)\]enabled manual counterfactual inspection, these interfaces often require significant domain expertise\. Conversational XAI lowers this barrier by allowing natural language interaction; while systems like TalkToModel\[[19](https://arxiv.org/html/2606.19735#bib.bib21)\]pioneered this for tabular data, our work extends this paradigm to global vision explanations, supporting complex structural queries over necessary and sufficient conditions\.
#### LLMs as Neuro\-Symbolic Interpreters:
To bridge the gap between natural language and formal logic, we leverage Large Language Models \(LLMs\) as neuro\-symbolic interpreters rather than direct explanation generators, which are prone to hallucination\. Existing work uses LLMs as semantic parsers \(e\.g\., text\-to\-SQL\[[23](https://arxiv.org/html/2606.19735#bib.bib22)\]\[[16](https://arxiv.org/html/2606.19735#bib.bib23)\]\) or tool\-augmented routers\[[15](https://arxiv.org/html/2606.19735#bib.bib24)\]\. Within XAI, LLMs have been used to describe neurons\[[3](https://arxiv.org/html/2606.19735#bib.bib27)\]or retrieve static artifacts\[[18](https://arxiv.org/html/2606.19735#bib.bib26)\]\. In contrast, GLARE treats the LLM as a logic\-constrained parser that maps user intent to a deterministic "grammar of explanations" via verifiable SQL templates\. This ensures that the flexibility of natural language is grounded in formal correctness, enabling precise logical aggregations and counterfactual queries over the model’s reasoning structure\.
## 3 Methodology
We present a natural language interface for querying global explanations of image classifiers\. Our system builds upon the global explanation framework of Vasu et al\. \[[21](https://arxiv.org/html/2606.19735#bib.bib6)\], which generates concept\-based explanations for black\-box image classifiers, expressed as Disjunctive Normal Form \(DNF\) formulas that describe important object combinations for each class\. Our system enables users to pose analytical questions in natural language and receive structured, interpretable answers along with supporting evidence images\. More formally, let $f_{\theta}: \mathcal{X} \rightarrow \mathcal{Y}$ denote an image classifier mapping images $x \in \mathcal{X}$ to labels $y \in \mathcal{Y}$\. We assume a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ and predictions $\hat{y}_i = f_{\theta}(x_i)$\. We assume access to a local explanation generator $E$ producing an explanation artifact $e_i = E(x_i, f_{\theta})$ for each input, where $e_i$ is the output of a concept\-based attribution method or a prototype\-based method\. Users pose natural\-language questions $q$ about aggregate model behavior over subsets of $\mathcal{D}$ \(e\.g\., by class, confusion pair, or attribute\)\. We aim to answer $q$ by retrieving and aggregating relevant local explanation artifacts $\{e_i\}$\.
### 3\.1 System Overview
Our system follows a *parse\-validate\-execute* pipeline \(Figure [1](https://arxiv.org/html/2606.19735#S3.F1)\)\. A user poses a question in natural language \(e\.g\., *“What percentage of bedroom images contain both bed and wall?”*\)\. Our LLM\-based query interpreter translates the question into a structured SQL query by selecting from a set of predefined query templates and extracting the relevant parameters \(class names, object names, thresholds\)\. The parameterized template is instantiated as an executable SQL query, validated for correctness and safety, then executed against a database encoding the global explanations\. Results are returned as structured data along with supporting evidence images highlighting the relevant objects\. The structured data returned is then converted back to natural language using a generic small LLM of the same size\. The evidence images illustrate the supporting local explanations by highlighting contributing regions of the original image using a segmentation map, as shown in Figure [2](https://arxiv.org/html/2606.19735#S5.F2)\.
Because the model learns to generate SQL over a fixed relational schema, it acquires the compositional structure of the query language itself and not merely a mapping from phrases to template indices\. This enables generalization along multiple axes: to unseen entity combinations, linguistic variations, novel compositions of known SQL fragments, and even entirely new datasets sharing the same schema\. At the same time, anchoring generation in predefined templates mitigates risks of malformed or semantically incorrect queries while preserving the expressiveness needed for meaningful explanation interrogation\. Given a query, the interface returns: \(1\) a natural\-language summary with relevant statistics; \(2\) supporting local explanations \(examples\); and \(3\) visualizations aligned with query intent\.
Figure 1: End\-to\-end pipeline\. Top: The upstream framework generates mDNF explanations by aggregating local concept\-based or logic\-based explanations\. Bottom: Our system translates natural language questions into validated SQL queries executed against the explanation database, returning structured answers with supporting evidence images\.
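The validate and execute stages described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the single-table schema `explanations(image_id, class_name, object_name)`, the helper names `validate_sql` and `run_query`, and the keyword blocklist are all assumptions made for the example, and SQLite stands in for whatever database engine the system actually uses.

```python
import re
import sqlite3

# Hypothetical schema: one row per object that a local explanation marks as important.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE explanations (image_id INTEGER, class_name TEXT, object_name TEXT)"
)
conn.executemany(
    "INSERT INTO explanations VALUES (?, ?, ?)",
    [(1, "bedroom", "bed"), (1, "bedroom", "wall"),
     (2, "bedroom", "bed"), (3, "kitchen", "stove")],
)

# Conservative blocklist (assumption): any statement mentioning these keywords is rejected.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|attach|pragma)\b", re.IGNORECASE
)

def validate_sql(conn, sql):
    """Return True if sql is a single read-only SELECT that SQLite can compile."""
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:                       # more than one statement
        return False
    if not stripped.lower().startswith("select"):
        return False
    if FORBIDDEN.search(stripped):
        return False
    try:
        # Compiles the statement against the schema without returning its rows.
        conn.execute("EXPLAIN QUERY PLAN " + stripped)
    except sqlite3.Error:
        return False
    return True

def run_query(conn, sql):
    """Execute sql only if it passes validation."""
    if not validate_sql(conn, sql):
        raise ValueError("query rejected by validator")
    return conn.execute(sql).fetchall()

rows = run_query(
    conn,
    "SELECT COUNT(DISTINCT image_id) FROM explanations WHERE class_name = 'bedroom'",
)
```

Compiling the statement before running it catches references to missing tables or columns early, which is one plausible way to realize the "validated for correctness and safety" step.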
### 3\.2 LLM\-Based Query Interpretation
We formulate natural\-language query interpretation as a structured SQL generation problem: given a user query $q$, the model selects an appropriate query pattern and extracts the parameters $\phi(q)$ needed to instantiate it as executable SQL\. We fine\-tune Gemma 2\-9B \[[14](https://arxiv.org/html/2606.19735#bib.bib30)\] using Low\-Rank Adaptation \(LoRA\) \[[8](https://arxiv.org/html/2606.19735#bib.bib31)\] with 4\-bit quantization \(QLoRA\), using LoRA rank 16, alpha 32, and dropout 0\.05\.
#### Training Data Generation\.
We generate synthetic training data covering 24 distinct query templates \(Section [3\.3](https://arxiv.org/html/2606.19735#S3.SS3)\), producing 50,000 training and 2,000 validation examples\. For each template, we sample random class\-object combinations and apply natural language variation: \(1\) Synonym substitution: operators \(*“and”*, *“&”*, *“together with”*\), quantifiers \(*“percentage”*, *“%”*, *“proportion”*\), and ranking terms \(*“top”*, *“most common”*, *“leading”*\)\. \(2\) Phrasing templates: natural language templates per query type \(e\.g\., *“What % of X have Y?”* vs\. *“What proportion of X contain Y?”*\)\. During fine\-tuning, we employ a custom collator \(SqlFenceCollator\) that masks the training loss to *only* tokens between SQL\_START and SQL\_END\.
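The label-masking idea behind the collator can be sketched in plain Python. This is an illustration under stated assumptions, not the SqlFenceCollator itself: the function names are hypothetical, token ids are toy integers rather than real tokenizer output, and excluding the marker tokens themselves from the loss is an assumption. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss, so positions carrying it contribute nothing to the gradient.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def find_subsequence(seq, pattern, begin=0):
    """Return the first index >= begin where pattern occurs in seq, or -1."""
    for i in range(begin, len(seq) - len(pattern) + 1):
        if seq[i:i + len(pattern)] == pattern:
            return i
    return -1

def fence_mask_labels(token_ids, start_ids, end_ids):
    """Build labels that keep only the tokens strictly between the two markers."""
    labels = [IGNORE_INDEX] * len(token_ids)
    start = find_subsequence(token_ids, start_ids)
    if start < 0:
        return labels                      # no fence found: nothing is trained on
    body_begin = start + len(start_ids)
    end = find_subsequence(token_ids, end_ids, body_begin)
    if end < 0:
        return labels                      # unterminated fence: nothing is trained on
    labels[body_begin:end] = token_ids[body_begin:end]
    return labels

# Toy sequence: prompt tokens, SQL_START (101), three SQL tokens, SQL_END (102), padding.
tokens = [5, 6, 101, 7, 8, 9, 102, 3]
labels = fence_mask_labels(tokens, start_ids=[101], end_ids=[102])
```

With this masking, the system prompt and user question still condition the model as input but never appear as prediction targets.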
### 3\.3 Query Templates
We define 24 query templates corresponding to common analytical tasks over global explanations\. Each template captures a specific question type and is parameterized by entities extracted from the user query \(target class, object names, comparison class, thresholds, etc\.\)\. The templates are organized into three tiers of increasing complexity: *Core* queries cover fundamental object\-class relationships such as frequency, boolean combinations, top\-$k$ ranking, co\-occurrence, and class ranking; *Extended* queries leverage SQL features like self\-joins for $N$\-way combinations, cross\-class comparisons, set operations, conditional co\-occurrence, and confidence\-filtered analysis; and *Contrastive* queries enable counterfactual analysis including absence analysis, threshold queries, and distinguishing features between classes\. The full taxonomy with all 24 templates and the prompt structure is provided in the supplementary material\. Crucially, the template set is not a fixed system boundary: adding a new question type requires only defining a new SQL pattern and regenerating synthetic training data, after which the entire pipeline \(data generation, fine\-tuning, and evaluation\) runs automatically without manual annotation\.
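To make the notion of a parameterized template concrete, the sketch below instantiates one boolean-combination question of the kind used as an example in Section 3.1 ("What percentage of bedroom images contain both bed and wall?"). The actual 24 templates are in the supplementary material and are not reproduced here; the schema `explanations(image_id, class_name, object_name)`, the function names, and the toy rows are all assumptions for illustration.

```python
import sqlite3

def pct_with_all_objects_sql(n_objects):
    """Hypothetical template: % of a class's images whose explanation has all N objects."""
    placeholders = ", ".join("?" for _ in range(n_objects))
    return (
        "SELECT 100.0 * COUNT(*) / "
        "(SELECT COUNT(DISTINCT image_id) FROM explanations WHERE class_name = ?) "
        "FROM (SELECT image_id FROM explanations "
        f"WHERE class_name = ? AND object_name IN ({placeholders}) "
        "GROUP BY image_id HAVING COUNT(DISTINCT object_name) = ?)"
    )

def percentage_with_all(conn, class_name, objects):
    """Fill the template with extracted parameters and execute it."""
    sql = pct_with_all_objects_sql(len(objects))
    # Parameters are bound in the order the placeholders appear in the SQL text.
    params = [class_name, class_name, *objects, len(objects)]
    return conn.execute(sql, params).fetchone()[0]

# Toy explanation database (fabricated rows for the example only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE explanations (image_id INTEGER, class_name TEXT, object_name TEXT)"
)
conn.executemany(
    "INSERT INTO explanations VALUES (?, ?, ?)",
    [(1, "bedroom", "bed"), (1, "bedroom", "wall"),
     (2, "bedroom", "bed"), (2, "bedroom", "wall"), (2, "bedroom", "lamp"),
     (3, "bedroom", "bed"),
     (4, "bedroom", "wall"),
     (5, "kitchen", "wall"), (5, "kitchen", "stove")],
)

answer = percentage_with_all(conn, "bedroom", ["bed", "wall"])
```

Entity names enter only as bound parameters, so the SQL structure is independent of the vocabulary, which mirrors the separation between query structure and entity names that the paper relies on for cross-dataset transfer.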
## 4 Experimental Setup
We evaluate GLARE along four axes: \(i\) in\-distribution accuracy on held\-out queries over the training dataset, \(ii\) robustness to natural\-language perturbations, \(iii\) out\-of\-distribution generalization to novel phrasing and unseen SQL constructs, and \(iv\) cross\-dataset transfer to an entirely different object vocabulary and scene taxonomy\.
### 4\.1 Datasets
#### Training data\.
Training examples are generated synthetically by sampling from the 24 query templates \(Section [3\.3](https://arxiv.org/html/2606.19735#S3.SS3)\)\. Each template produces a \(natural\-language question, SQL query\) pair by randomly selecting objects and scene classes from the ADE20K vocabulary of 150 objects and 35 scene categories\. We generate 50,000 training pairs \(seed = 42\) and 2,000 validation pairs \(seed = 1,042\), each formatted as chat\-style messages with the system prompt, user question, and gold SQL delimited by SQL\_START / SQL\_END markers\. The training set uses only object and class *name lists*; no data from the ground\-truth explanation database is accessed during training\. The reference database is built from the mDNF explanations \[[21](https://arxiv.org/html/2606.19735#bib.bib6)\] of a VGG19 model computed over the ADE20K \[[24](https://arxiv.org/html/2606.19735#bib.bib28)\] dataset\. The local explanations achieve a Fidelity\+ and Fidelity\- of $0.113 \pm 0.297$ and $0.992 \pm 0.090$, respectively, reflecting their faithfulness to the underlying model\. Note that our system can accommodate any local explanation algorithm that generates explanations in symbolic form\. The reference database serves exclusively as the *execution environment* during evaluation: generated SQL queries are executed against it to verify their correctness\.
#### Evaluation splits:
We construct three evaluation sets that show progressively more challenging forms of generalization:
1. OOD Test \(300 examples\): Split into *phrasing variations* of trained templates \(182 examples\) and *novel SQL constructs* absent from training \(118 examples\)\. Phrasing variations include informal language \(e\.g\., “yo how many”\), negations, double negations, ambiguous wording, alias usage, extreme boundary values, and multi\-step reasoning\. Novel SQL constructs include window functions \(RANK\(\) OVER\), CASE expressions, correlated subqueries, complex HAVING clauses, XOR logic, relative comparisons, and cross\-class set differences, none of which appear in any training template\. This split directly tests whether the model has learned generalizable SQL structure beyond the specific patterns it was trained on\.
2. Robustness Test \(500 examples\): Each Fresh Test query is subjected to seven perturbation types: *spelling* errors, *grammar* corruptions, *synonym* substitution, *verbose* padding, *telegraphic* compression, *word drop*, and *word swap*\. We additionally measure *consistency* \(whether paraphrases of the same query yield identical results\) and *sanity* \(whether outputs satisfy domain constraints\)\.
3. Pascal VOC \(500 examples\): A cross\-dataset evaluation in which the model, trained exclusively on ADE20K entity names, is tested against a database built from Pascal VOC \[[4](https://arxiv.org/html/2606.19735#bib.bib29)\] annotations with a disjoint object vocabulary \(166 objects, different scene taxonomy\)\. Test queries follow the same template distribution\. This evaluates whether the learned SQL structure transfers to an entirely new domain without retraining\.
### 4\.2 Models
To assess the impact of fine\-tuning and the generality of our approach across model scales, we evaluate two instruction\-tuned model families:
- Gemma 2 \[[14](https://arxiv.org/html/2606.19735#bib.bib30)\]: 2B / 9B / 27B parameters
- Qwen 2\.5 \[[13](https://arxiv.org/html/2606.19735#bib.bib32)\]: 0\.5B / 7B / 14B parameters
All fine\-tuned models use the identical QLoRA configuration and SQL\-fence loss masking described in Section [3\.2](https://arxiv.org/html/2606.19735#S3.SS2), with the same 50,000 synthetic training examples\. We additionally evaluate each model in its base configuration \(without task\-specific fine\-tuning\), keeping all other aspects the same, to isolate the contribution of our synthetic training pipeline \(Section [5\.1](https://arxiv.org/html/2606.19735#S5.SS1)\)\.
### 4\.3 Evaluation Metrics
We report the following metrics, evaluated by executing generated SQL against the ground\-truth database and comparing the returned result sets:
- Fence Detection \(%\): Fraction of outputs correctly delimited by SQL\_START / SQL\_END markers, confirming adherence to the expected output format\.
- SQL Parse Rate \(%\): Fraction of outputs containing syntactically valid SQL\.
- Execution Rate \(%\): Fraction of SQL queries that execute without runtime error against the explanation database\.
- Result Match \(%\): Fraction of queries whose executed results match the gold\-standard output\. We use *relaxed* matching: row\-order and LIMIT differences are tolerated, extra columns are permitted, and numeric values are compared with 1% relative tolerance\. This reflects the principle that semantically equivalent SQL may differ in surface form\.
- Partial Match \(%\): For queries that fail exact match, we compute the Jaccard similarity over first\-column value sets\. Queries with Jaccard $> 0.5$ are counted as partial matches, indicating substantially correct query intent despite structural differences\.
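The Result Match and Partial Match criteria can be approximated with a few lines of Python. This is a simplified sketch, not the paper's evaluation code: it ignores row order, tolerates extra trailing columns in the prediction, and applies the 1% relative tolerance, but it does not model the tolerance for LIMIT differences and uses greedy row pairing. The function names are hypothetical.

```python
import math

def values_equal(a, b, rel_tol=0.01):
    """Compare two cells, using relative tolerance when both are numeric."""
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return math.isclose(a, b, rel_tol=rel_tol)
    return a == b

def relaxed_match(gold_rows, pred_rows, rel_tol=0.01):
    """Order-insensitive match; predicted rows may carry extra trailing columns."""
    if len(gold_rows) != len(pred_rows):
        return False
    remaining = list(pred_rows)
    for gold in gold_rows:
        for i, pred in enumerate(remaining):
            if len(pred) >= len(gold) and all(
                values_equal(g, p, rel_tol) for g, p in zip(gold, pred)
            ):
                del remaining[i]          # each predicted row is used at most once
                break
        else:
            return False                  # no predicted row matched this gold row
    return True

def first_column_jaccard(gold_rows, pred_rows):
    """Jaccard similarity between the sets of first-column values."""
    gold = {row[0] for row in gold_rows}
    pred = {row[0] for row in pred_rows}
    if not gold and not pred:
        return 1.0
    return len(gold & pred) / len(gold | pred)

gold = [("wall", 80.0), ("sofa", 60.0)]
reordered = [("sofa", 60.3, 9), ("wall", 80.0, 12)]   # different order, extra column
overlapping = [("wall",), ("sofa",), ("floor",)]      # one extra row, fewer columns
```

Here `reordered` counts as a result match, while `overlapping` fails the match but has a first-column Jaccard of 2/3 and would therefore count as a partial match under the 0.5 threshold.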
The robustness evaluation \(Section [5\.2](https://arxiv.org/html/2606.19735#S5.SS2)\) additionally reports:
- Robustness Score \(%\): The percentage of the unperturbed baseline result\-match rate preserved under each perturbation type\. A score of $100\%$ indicates zero degradation\.
- Consistency Rate \(%\): For each baseline query that executes successfully, two paraphrases are generated \(via synonym substitution and verbose rephrasing\)\. Consistency is the fraction of paraphrases that produce result sets identical to the original\.
## 5 Results and Discussion
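Read literally, the two definitions above amount to a ratio of match rates and a fraction of agreeing paraphrases. The sketch below encodes that reading; the function names, the dictionary layout, and the choice to compare result sets as multisets of rows are assumptions, and the numbers in the usage lines are made up for illustration.

```python
from collections import Counter

def robustness_score(perturbed_match_rate, baseline_match_rate):
    """Percentage of the unperturbed match rate preserved under a perturbation."""
    if baseline_match_rate <= 0:
        raise ValueError("baseline match rate must be positive")
    return 100.0 * perturbed_match_rate / baseline_match_rate

def consistency_rate(original_results, paraphrase_results):
    """Percentage of paraphrases whose rows equal the original's rows (order ignored)."""
    total = 0
    same = 0
    for query_id, rows in original_results.items():
        for para_rows in paraphrase_results.get(query_id, []):
            total += 1
            if Counter(para_rows) == Counter(rows):
                same += 1
    return 100.0 * same / total if total else 0.0

# Illustrative values only, not taken from the paper's tables.
score = robustness_score(perturbed_match_rate=45.0, baseline_match_rate=90.0)

originals = {"q1": [("wall", 12)], "q2": [("bed", 3)]}
paraphrases = {
    "q1": [[("wall", 12)], [("wall", 12)]],
    "q2": [[("bed", 3)], [("bed", 4)]],
}
consistency = consistency_rate(originals, paraphrases)
```

Because the score is a ratio, it can exceed 100 whenever the perturbed match rate is higher than the unperturbed one on the same queries, which is one way to read a value above 100 in a robustness table.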
Table 1: Performance on the ADE20K Fresh Test set \(500 examples\)\. Fine\-tuned \(FT\) models use QLoRA with SQL\-fence loss masking on 50,000 synthetic training pairs generated from 24 query templates\. Base models receive the same prompt but no task\-specific fine\-tuning\. Cross rows denote zero\-shot transfer performance to a different taxonomy from PASCAL VOC \(166 objects\), while ‘Regex’ denotes a non\-learning baseline based on regular expressions\.

### 5\.1 In\-Distribution Query Accuracy
Table [1](https://arxiv.org/html/2606.19735#S5.T1) summarizes performance on the 500\-example Fresh Test set, comparing both fine\-tuned and base \(not fine\-tuned\) model configurations\.
Among fine\-tuned models, Gemma 2 9B achieves perfect scores on all three structural metrics \(fence detection, SQL parse, and execution\) and a result\-match rate of 95\.2%, with an additional 2\.4% of examples achieving partial matches \(Jaccard $> 0.5$\), bringing the effective accuracy above 97%\. Gemma 2 27B and Gemma 2 2B achieve comparable result\-match rates of 95\.2% and 95\.4% respectively, both with perfect structural metrics, indicating that the fine\-tuning pipeline saturates in\-distribution performance across a wide range of model scales within the Gemma 2 family\. While the non\-learning Regex baseline achieves perfect structural metrics, our fine\-tuned models substantially outperform its 74\.4% result\-match rate, achieving over 95% match accuracy across the Gemma 2 family\. Base models, despite receiving the identical prompt, achieve near\-zero result\-match accuracy \(Table [1](https://arxiv.org/html/2606.19735#S5.T1)\), confirming that the observed generalization is the product of task\-specific fine\-tuning rather than pre\-existing SQL knowledge or in\-context learning\. Across architectures, a minimum capacity threshold is required: Qwen 2\.5 0\.5B fails even after fine\-tuning, while Qwen 2\.5 7B reaches 93\.0%\. Please refer to the Appendix [5\.1](https://arxiv.org/html/2606.19735#S5.SS1) for the breakdown of performance across different query types\.
### 5\.2 Robustness to Input Perturbations
Table [2](https://arxiv.org/html/2606.19735#S5.T2) reports robustness to seven perturbation types applied to the 500\-example Fresh Test set, along with consistency and sanity metrics, explicitly comparing the fine\-tuned Gemma 2 9B model against the regex baseline\.
The fine\-tuned model demonstrates significant architectural advantages over regex, particularly on perturbations that reflect authentic user behavior:
Spelling errors: The model maintains a 79\.6% match rate compared to the regex’s 48\.7%, a stark \+31 percentage point \(pp\) gap\. Real users frequently mistype words \(e\.g\., “bedrrom”, “kitchn”, “chiar”\)\. A regex requires the canonical token verbatim and offers no path forward without engineering a separate fuzzy\-matching system, whereas the model natively absorbs typographical noise\. This \+31 pp gap is arguably the cleanest evidence in the suite demonstrating the superiority of fine\-tuning\.

Synonym substitution: The model exhibits zero degradation \(100% robustness\) to synonyms, outperforming the regex by \+17 pp in absolute match rate\. Users naturally express concepts using diverse vocabulary \(e\.g\., “proportion”, “share”, “fraction”, “slice of the pie”\)\. The regex only recognizes hand\-written synonyms; the moment a user employs an unmapped term, it silently falls through to an incorrect template\. The model, conversely, leverages broad lexical knowledge from pretraining to infer meaning and degrade gracefully\.

Word drop: Real users often submit terse, ungrammatical queries \(e\.g\., “percent bedroom bed”, “top chair classroom”\)\. Word drop is the most damaging perturbation overall, since removing core content words inherently destroys the information needed to generate correct SQL, yet the model still maintains a \+2\.7 pp advantage \(42\.6% vs\. 39\.9%\)\. The regex strictly requires trigger tokens to be present and adjacent to fire its rules\. The model’s ability to infer intent from partial context is the only scalable path forward, as simply writing more regex rules cannot recover missing tokens\.
The other four perturbations \(verbose padding, telegraphic compression, grammar variations, and word swap\) are primarily designed to test paraphrase invariance, varying surface structure while preserving core content words\. The regex baseline scores artificially high on some of these metrics because a regex ignores surface structure and therefore, by construction, cannot be confused by it\. This is not robustness in the language\-understanding sense; it is merely deafness to syntax\. Because of this architectural blindness, direct comparisons to the regex baseline on syntactic paraphrase metrics are fundamentally uninformative\.
Finally, model consistency across paraphrases reaches 94\.1% \(744/791 tests\), indicating that semantically equivalent questions reliably yield identical SQL, a critical property for user trust\. Furthermore, all 1,485 sanity checks pass \(100%\), confirming that generated results never violate underlying domain constraints \(such as percentages outside $[0, 100]$ or negative counts\)\.
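Table 2: Robustness evaluation \(Gemma 2 9B and Regex Baseline, Fresh Test, 500 examples\)\. Robustness score is the percentage of the unperturbed baseline match rate preserved under each perturbation\. Orig\. columns report the fine\-tuned Gemma 2 9B model; Prs\. is the parse rate and Exec\. the execution rate\.

| Perturbation | n | Prs\. Orig\. | Prs\. Regex | Exec\. Orig\. | Exec\. Regex | Match Orig\. | Match Regex | Rob\. \(%\) Orig\. | Rob\. \(%\) Regex |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Synonym | 150 | 97\.3 | 100\.0 | 97\.3 | 100\.0 | 89\.3 | 72\.0 | 100 | 89 |
| Verbose | 500 | 98\.4 | 100\.0 | 98\.4 | 100\.0 | 86\.6 | 80\.8 | 97 | 100 |
| Spelling | 275 | 98\.9 | 100\.0 | 98\.9 | 100\.0 | 79\.6 | 48\.7 | 89 | 60 |
| Telegraphic | 450 | 99\.1 | 100\.0 | 99\.1 | 100\.0 | 75\.3 | 74\.9 | 84 | 93 |
| Word swap | 384 | 98\.7 | 100\.0 | 98\.7 | 100\.0 | 73\.4 | 67\.2 | 82 | 83 |
| Grammar | 140 | 98\.6 | 100\.0 | 98\.6 | 100\.0 | 67\.9 | 84\.3 | 76 | 104 |
| Word drop | 303 | 96\.7 | 100\.0 | 96\.7 | 100\.0 | 42\.6 | 39\.9 | 48 | 49 |

| Aggregate check | Orig\. | Regex |
| --- | --- | --- |
| Consistency \(paraphrase → same result\) | 94\.1 | 95\.0 |
| Sanity \(domain constraints satisfied\) | 100\.0 | 100\.0 |

### 5\.3 Out\-of\-Distribution Generalization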
The OOD evaluation \(Table [3](https://arxiv.org/html/2606.19735#S5.T3)\) probes two distinct aspects of generalization: resilience to novel natural\-language phrasing of trained query patterns, and the ability to produce SQL constructs never seen during training\. Across all 300 examples, the model maintains 99\.3% parse and execution rates, indicating that the learned SQL structure generalizes broadly even when semantics diverge from training\.
#### Phrasing variations\.
On the 182 examples that rephrase trained query types using unfamiliar surface forms, Gemma 2 9B achieves 45\.1% exact match\. Performance varies widely: zero\_threshold \(100%\) and nested\_question \(90%\) are handled well, showing that the model correctly interprets complex nested clause structure and edge\-case thresholds, and negation \(70%\) shows reasonable handling of NOT\-style queries\. However, informal\_question \(12%\), double\_negation \(0%\), and comparative \(0%\) reveal brittleness: the model has learned the underlying SQL patterns but struggles when the surface form deviates substantially from training templates\.
#### Novel SQL constructs\.
On the 118 examples requiring SQL constructs entirely absent from training, the model achieves 19\.5% exact match\. chained\_filter \(100%\) and string\_pattern \(100%\) represent compositional generalization, i\.e\., the model assembles familiar SQL fragments \(WHERE clauses, LIKE operators\) into structures it was never explicitly trained on\. Conversely, window\_function, case\_expression, subquery\_select, having\_complex, and relative\_comparison all score 0%, confirming that the model learns the compositional structure of the SQL it is trained on, but cannot extrapolate to truly novel syntax \(e\.g\., RANK\(\) OVER, CASE WHEN\)\. Notably, even on these unsupported constructs, the model maintains 100% execution rate by producing the closest known template rather than generating invalid SQL\. This is a form of graceful degradation, and it means that expanding the system’s analytical scope requires only adding new templates to the synthetic training pipeline\.
Table 3: Out\-of\-distribution results \(Gemma 2 9B, 300 examples\)\. Left: phrasing variations of trained query patterns\. Right: novel SQL constructs absent from training\.
### 5\.4 Cross\-Dataset Transfer
Table [1](https://arxiv.org/html/2606.19735#S5.T1) \(row Cross\) presents a cross\-dataset evaluation where the model, trained exclusively on ADE20K entity names, is tested on a database built from Pascal VOC annotations, with a different object vocabulary \(166 objects vs\. 150\) and scene taxonomy\. Gemma 2 9B achieves 89\.6% result\-match accuracy with perfect structural metrics \(100% fence detection, parse, and execution\)\. Gemma 2 27B achieves the highest result\-match accuracy at 90\.6%, while Gemma 2 2B reaches 90\.0%, both slightly outperforming the 9B variant\. Performance is tightly clustered across all three Gemma 2 scales, suggesting that the learned SQL structure transfers robustly even at 2B parameters\. Performance is strong across all query types, with 17 of 25 types at ≥95% accuracy\. These results provide the strongest evidence that the model has learned generalizable *relational structure*: SQL’s compositionality separates query structure from vocabulary, enabling deployment on any new explanation database conforming to the same schema without retraining, given only an entity name list\.
### 5\.5 Cross\-Model Comparison
Table [4](https://arxiv.org/html/2606.19735#S5.T4) summarizes fine\-tuned model performance across all evaluation axes\. Gemma 2 9B and Qwen 2\.5 7B achieve comparable Fresh Test accuracy \(95\.2% vs\. 93\.0%\) despite belonging to architecturally distinct model families: Gemma 2 and Qwen 2\.5 differ in pre\-training data, tokenizer, and architectural details such as attention head configuration\. Gemma 2 27B and Gemma 2 2B achieve 95\.2% and 95\.4% respectively, confirming that in\-distribution performance saturates across model scales within the Gemma 2 family\. This close agreement confirms that our synthetic training pipeline and SQL\-fence loss masking are *architecture\-agnostic*: any sufficiently large instruction\-tuned causal language model can serve as the backbone\.
On OOD generalization, Gemma 2 27B achieves 34\.3%, comparable to Gemma 2 9B \(35\.0%\), while Gemma 2 2B achieves 30\.3%, suggesting that OOD performance does not scale as strongly with model size as in\-distribution accuracy\. Qwen 2\.5 7B slightly outperforms all Gemma 2 variants on OOD \(40\.0%\), suggesting that OOD performance may depend on the base model’s pre\-training mixture as much as on fine\-tuning\. On cross\-dataset transfer, Gemma 2 27B achieves 90\.6% on Pascal VOC, with Gemma 2 2B at 90\.0%, both slightly outperforming the 9B variant \(89\.6%\)\. At the lower end, Qwen 2\.5 0\.5B fails entirely \(4\.4% Fresh, 4\.7% OOD\), confirming that a minimum model capacity is required for the fine\-tuning pipeline to take effect\.
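As a consistency check, the 35\.0% overall OOD figure for Gemma 2 9B agrees with the per\-split rates reported in Section 5\.3 \(45\.1% on 182 phrasing examples and 19\.5% on 118 novel\-construct examples\) when the two splits are pooled by example count\. The short sketch below reproduces that arithmetic; the helper name `pooled_rate` is hypothetical.

```python
def pooled_rate(splits):
    """Pool per-split match rates (in %) into one rate weighted by example count."""
    total = sum(n for n, _ in splits)
    matched = sum(n * rate / 100.0 for n, rate in splits)
    return 100.0 * matched / total

# (number of examples, exact-match rate in %) for the two OOD splits of Gemma 2 9B.
overall_ood = pooled_rate([(182, 45.1), (118, 19.5)])
```

The pooled value rounds to 35\.0%, which corresponds to roughly 105 matched queries out of 300\.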
Table 4: Cross\-model comparison of fine\-tuned models across all evaluation axes\. Fresh and Pascal report result\-match \(%\); OOD reports overall match \(%\); Robust reports the mean weighted\-average robustness score \(%\) across all seven perturbation types\.
### 5\.6 Qualitative Case Study
We present representative examples drawn from actual evaluation outputs, illustrating system behavior across the generalization spectrum\. For each case we show the user’s natural\-language question, the model\-generated SQL, the executed result, and the natural\-language answer returned to the user by the downstream NL generation stage\. Where noted, SQL is abbreviated to highlight the structurally relevant clauses\.
#### Case Study: End\-to\-end multi\-modal response to a co\-occurrence query with visual grounding\.
> Question:“In living\_room, what objects appear with sculpture?”
NL Output: *“Sculptures in living rooms most often appear with walls \(80% of the time\), followed by sofas \(60%\)\. Other common objects found alongside sculptures include cushions, coffee tables, and windowpanes\.”*
Top co\-occurrences \(count / % of anchor images\): wall 12 / 80%, sofa 9 / 60%, cushion 8 / 53%, coffee\_table 7 / 47%, windowpane 5 / 33%, floor 4 / 27%, painting 4 / 27%, ceiling 3 / 20%, curtain 3 / 20%, fireplace 3 / 20%\.
The system also provides visual grounding \(Figure[2](https://arxiv.org/html/2606.19735#S5.F2)\): the top\-3 evidence images with relevant objects identified, enabling users to verify that co\-occurrence statistics correspond to meaningful patterns\. This case demonstrates the full explanatory loop: a natural\-language question translated into SQL, executed, and returned as a multi\-modal response comprising a fluent summary, quantitative statistics, and visual evidence\.
Figure 2: Visual grounding evidence for “In living rooms, what objects appear with sculpture?”: top\-3 evidence images showing the original image \(Column 1\), the objects deemed important \(with value 1\) by local explanations \(Column 2\), and the masked image highlighting only the important objects \(Column 3\)\.
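A co\-occurrence question like the one in this case study maps naturally onto a join between the anchor images and the objects they contain. The sketch below is illustrative only: it assumes a hypothetical single\-table schema `image_objects(image_id, scene, object)` with one row per object marked important in an image, and the table, the column names, the toy rows, and the helper `cooccurrences` are assumptions, not the GLARE schema or the SQL the model generated.

```python
import sqlite3

# Hypothetical query: objects that appear alongside an anchor object in a scene,
# with the image count and the percentage of anchor images for each.
COOCCUR_SQL = """
WITH anchor AS (              -- images of the scene that contain the anchor object
    SELECT DISTINCT image_id
    FROM image_objects
    WHERE scene = :scene AND object = :anchor
)
SELECT o.object,
       COUNT(DISTINCT o.image_id) AS n_images,
       CAST(ROUND(100.0 * COUNT(DISTINCT o.image_id)
                  / (SELECT COUNT(*) FROM anchor)) AS INTEGER) AS pct_of_anchor
FROM image_objects AS o
JOIN anchor AS a ON a.image_id = o.image_id
WHERE o.object <> :anchor
GROUP BY o.object
ORDER BY n_images DESC, o.object
LIMIT :k
"""


def build_toy_db():
    """Create an in-memory database with a few made-up rows for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE image_objects (image_id INTEGER, scene TEXT, object TEXT)")
    rows = [
        (1, "living_room", "sculpture"), (1, "living_room", "wall"), (1, "living_room", "sofa"),
        (2, "living_room", "sculpture"), (2, "living_room", "wall"),
        (3, "living_room", "sculpture"), (3, "living_room", "wall"), (3, "living_room", "sofa"),
        (4, "living_room", "wall"), (4, "living_room", "sofa"),
        (5, "bedroom", "sculpture"), (5, "bedroom", "bed"),
    ]
    conn.executemany("INSERT INTO image_objects VALUES (?, ?, ?)", rows)
    return conn


def cooccurrences(conn, scene, anchor, k=10):
    """Return up to k (object, image count, percent of anchor images) rows."""
    params = {"scene": scene, "anchor": anchor, "k": k}
    return conn.execute(COOCCUR_SQL, params).fetchall()


if __name__ == "__main__":
    for row in cooccurrences(build_toy_db(), "living_room", "sculpture"):
        print(row)
```

Dividing by the number of anchor images, rather than by all images of the scene, is what makes the percentages read as "share of sculpture images", matching the "count / % of anchor images" format of the statistics above.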
## 6 Conclusion and Future Work
We have presented GLARE, an LLM\-mediated interface that transforms global explanation consumption from the passive inspection of static artifacts into an active, query\-driven dialogue\. By constraining generation to a SQL intermediate representation over a fixed relational schema, the system ensures compositional expressiveness and formal correctness while remaining accessible to users without programming expertise\. Our synthetic training pipeline and fence\-masked fine\-tuning are architecture\-agnostic, requiring no manual annotation and enabling zero\-shot transfer to new datasets given only an entity name list\. Empirically, small fine\-tuned models \(≥7B parameters\) achieve \>95% accuracy on in\-distribution queries, transfer to unseen domains with ∼90% accuracy, and maintain high robustness to linguistic perturbations\.
One direction for future work is broadening the template taxonomy, which is extensible but not yet exhaustive; covering additional SQL constructs requires only additions to the synthetic generator\. Overall, these results suggest that LLM\-mediated querying offers a practical and reliable path toward more accessible, human\-centered global explanations in XAI workflows\.
## References
- \[1\] S\. Azzolin, A\. Longa, P\. Barbiero, P\. Lio, A\. Passerini, et al\. \(2023\) Global explainability of GNNs via logic combination of learned concepts\. In 11th International Conference on Learning Representations \(ICLR 2023\), pp\. 1–19\.
- \[2\] K\. G\. Baugh, L\. Dickens, and A\. Russo \(2025\) Neural DNF\-MT: a neuro\-symbolic approach for learning interpretable and editable policies\. arXiv preprint arXiv:2501\.03888\.
- \[3\] S\. Bills, N\. Cammarata, D\. Mossing, H\. Tillman, L\. Gao, G\. Goh, I\. Sutskever, J\. Leike, J\. Wu, and W\. Saunders \(2023\) Language models can explain neurons in language models\. Note: [https://openaipublic\.blob\.core\.windows\.net/neuron\-explainer/paper/index\.html](https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html)
- \[4\] X\. Chen, R\. Mottaghi, X\. Liu, S\. Fidler, R\. Urtasun, and A\. Yuille \(2014\) Detect what you can: detecting and representing objects using holistic models and body parts\. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp\. 1971–1978\.
- \[5\] M\. Craven and J\. Shavlik \(1995\) Extracting tree\-structured representations of trained networks\. Advances in neural information processing systems 8\.
- \[6\] A\. Darwiche and C\. Ji \(2022\) On the computation of necessary and sufficient explanations\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 36, pp\. 5582–5591\.
- \[7\] F\. Hohman, A\. Head, R\. Caruana, R\. DeLine, and S\. M\. Drucker \(2019\) Gamut: a design probe to understand how data scientists understand machine learning models\. In Proceedings of the 2019 CHI conference on human factors in computing systems, pp\. 1–13\.
- \[8\] E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, and W\. Chen \(2021\) LoRA: low\-rank adaptation of large language models\. ArXiv abs/2106\.09685\. External Links: [Link](https://api.semanticscholar.org/CorpusID:235458009)
- \[9\] P\. W\. Koh, T\. Nguyen, Y\. S\. Tang, S\. Mussmann, E\. Pierson, B\. Kim, and P\. Liang \(2020\) Concept bottleneck models\. In International conference on machine learning, pp\. 5338–5348\.
- \[10\] H\. Lakkaraju, S\. H\. Bach, and J\. Leskovec \(2016\) Interpretable decision sets: a joint framework for description and prediction\. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp\. 1675–1684\.
- \[11\] Q\. V\. Liao, D\. Gruen, and S\. Miller \(2020\) Questioning the AI: informing design practices for explainable AI user experiences\. In Proceedings of the 2020 CHI conference on human factors in computing systems, pp\. 1–15\.
- \[12\] T\. Miller \(2019\) Explanation in artificial intelligence: insights from the social sciences\. Artificial intelligence 267, pp\. 1–38\.
- \[13\] Qwen, A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, et al\. \(2025\) Qwen2\.5 Technical Report\. arXiv preprint arXiv:2412\.15115\. External Links: [Link](http://arxiv.org/abs/2412.15115), [Document](https://dx.doi.org/10.48550/arXiv.2412.15115)
- \[14\] M\. Rivière, S\. Pathak, P\. G\. Sessa, C\. Hardin, S\. Bhupatiraju, L\. Hussenot, T\. Mesnard, B\. Shahriari, A\. Ramé, J\. Ferret, et al\. \(2024\) Gemma 2: improving open language models at a practical size\. CoRR\.
- \[15\] T\. Schick, J\. Dwivedi\-Yu, R\. Dessì, R\. Raileanu, M\. Lomeli, E\. Hambro, L\. Zettlemoyer, N\. Cancedda, and T\. Scialom \(2023\) Toolformer: language models can teach themselves to use tools\. Advances in Neural Information Processing Systems 36, pp\. 68539–68551\.
- \[16\] T\. Scholak, N\. Schucher, and D\. Bahdanau \(2021\) PICARD: parsing incrementally for constrained auto\-regressive decoding from language models\. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp\. 9895–9901\.
- \[17\] R\. R\. Selvaraju, M\. Cogswell, A\. Das, R\. Vedantam, D\. Parikh, and D\. Batra \(2017\) Grad\-CAM: visual explanations from deep networks via gradient\-based localization\. In Proceedings of the IEEE international conference on computer vision, pp\. 618–626\.
- \[18\] C\. Singh, J\. P\. Inala, M\. Galley, R\. Caruana, and J\. Gao \(2024\) Rethinking interpretability in the era of large language models\. CoRR\.
- \[19\] D\. Slack, S\. Krishna, H\. Lakkaraju, and S\. Singh \(2022\) TalkToModel: explaining machine learning models with interactive natural language conversations\. arXiv preprint arXiv:2207\.04154\.
- \[20\] B\. K\. Vasu and P\. Tadepalli \(2023\) Global explanations for image classifiers \(student abstract\)\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 37, pp\. 16352–16353\.
- \[21\] B\. Vasu, G\. Raffa, and P\. Tadepalli \(2026\) Local\-to\-global logical explanations for deep vision models\. arXiv preprint arXiv:2601\.13404\.
- \[22\] J\. Wexler, M\. Pushkarna, T\. Bolukbasi, M\. Wattenberg, F\. Viégas, and J\. Wilson \(2019\) The what\-if tool: interactive probing of machine learning models\. IEEE transactions on visualization and computer graphics 26\(1\), pp\. 56–65\.
- \[23\] T\. Yu, R\. Zhang, K\. Yang, M\. Yasunaga, D\. Wang, Z\. Li, J\. Ma, I\. Li, Q\. Yao, S\. Roman, et al\. \(2018\) Spider: a large\-scale human\-labeled dataset for complex and cross\-domain semantic parsing and text\-to\-SQL task\. arXiv preprint arXiv:1809\.08887\.
- \[24\] B\. Zhou, H\. Zhao, X\. Puig, S\. Fidler, A\. Barriuso, and A\. Torralba \(2017\) Scene parsing through ADE20K dataset\. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp\. 633–641\.
## Appendix A
### A\.1 Query type breakdown
Table [5](https://arxiv.org/html/2606.19735#Pt0.A1.T5) provides a per\-query\-type breakdown for Gemma 2 9B\. The model achieves 100% accuracy on 18 of 25 evaluated query types, spanning percentage calculations, top\-k ranking, co\-occurrence joins, set operations, conditional existence checks, threshold filtering, and statistical aggregations\. In contrast, the non\-learning Regex baseline proves highly brittle on queries requiring complex structural constraints, suffering severe accuracy drops on types like combos \(8%\), absence\_count \(36%\), and topclass \(40%\)\. The two underperforming categories for the model, combos \(49%\) and images\_with\_exact\_count \(43%\), involve complex multi\-way self\-joins and exact\-count HAVING clauses, respectively\. Notably, the model achieves ≥98% on 56 examples labeled unknown \(edge\-case queries outside the standard template taxonomy\), suggesting robustness to minor distribution shifts even within the in\-distribution evaluation\. These results confirm that the model internalizes the compositional structure of SQL \(joins, filters, aggregations, and subqueries\) as reusable rules that can be instantiated with arbitrary entity names, rather than memorizing specific query\-answer associations\.
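To make the two harder constructs concrete, the sketch below shows what an exact\-count HAVING filter and a self\-join over object pairs might look like. It is an illustration under stated assumptions, not the paper's templates: the table `object_instances(image_id, object)` with one row per object instance, the toy rows, and the reading of combos as pairwise object combinations are all hypothetical.

```python
import sqlite3

# Hypothetical exact-count query: images containing exactly N instances of an object.
EXACT_COUNT_SQL = """
SELECT image_id
FROM object_instances
WHERE object = ?
GROUP BY image_id
HAVING COUNT(*) = ?
ORDER BY image_id
"""

# Hypothetical pair-combination query: a self-join that counts, for each unordered
# pair of distinct objects, the number of images containing both.
PAIR_COMBOS_SQL = """
SELECT a.object, b.object, COUNT(DISTINCT a.image_id) AS n_images
FROM object_instances AS a
JOIN object_instances AS b
  ON a.image_id = b.image_id AND a.object < b.object
GROUP BY a.object, b.object
ORDER BY n_images DESC, a.object, b.object
"""


def build_toy_db():
    """Create an in-memory database with a few made-up object instances."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE object_instances (image_id INTEGER, object TEXT)")
    rows = [
        (1, "chair"), (1, "chair"), (1, "table"),
        (2, "chair"), (2, "table"), (2, "lamp"),
        (3, "chair"), (3, "chair"), (3, "chair"),
    ]
    conn.executemany("INSERT INTO object_instances VALUES (?, ?)", rows)
    return conn


def images_with_exact_count(conn, obj, count):
    """Return ids of images that contain exactly `count` instances of `obj`."""
    return [row[0] for row in conn.execute(EXACT_COUNT_SQL, (obj, count))]


def pair_combos(conn):
    """Return (object_a, object_b, image count) for every co-occurring pair."""
    return conn.execute(PAIR_COMBOS_SQL).fetchall()
```

The `a.object < b.object` condition keeps each unordered pair once and drops self\-pairs, while `COUNT(DISTINCT ...)` prevents repeated instances in one image from inflating the count. Extending the pattern to triples requires a third alias and a second ordering condition, which gives a sense of why multi\-way combinations are harder to generate correctly.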
Table 5: Per\-query\-type result\-match accuracy \(%\) on the Fresh Test set for Gemma 2 9B and the Regex \(RX\) baseline, sorted by example count\.

| Query Type | n | Gemma | RX |
|---|---|---|---|
| percent\_simple | 70 | 99 | 67 |
| cooccur | 65 | 100 | 92 |
| topk\_objects | 37 | 100 | 81 |
| percent\_exclude | 36 | 100 | 90 |
| combos | 35 | 49 | 8 |
| topclass | 29 | 100 | 40 |
| percent\_prob\. | 27 | 100 | 83 |
| cross\_cls\_comp\. | 15 | 100 | 100 |
| set\_difference | 13 | 100 | 100 |
| obj\_per\_img\_st\. | 13 | 100 | 100 |
| threshold\_qry | 11 | 100 | 100 |
| cnt\_distinct\_obj | 11 | 100 | 100 |
| percent\_and | 10 | 100 | 100 |
| existence\_chk | 8 | 100 | 80 |
| least\_common | 8 | 100 | 89 |
| cond\_cooccur | 7 | 100 | 100 |
| absence\_count | 7 | 100 | 36 |
| class\_count | 7 | 100 | 54 |
| set\_intersect\. | 7 | 100 | 100 |
| img\_w\_exact\_cnt | 7 | 43 | 100 |
| percent\_or | 7 | 100 | 82 |
| all\_objects | 6 | 100 | 100 |
| all\_classes | 4 | 100 | 100 |
| object\_ratio | 4 | 100 | 75 |

Overall Accuracy: 95\.2% \(Gemma\) / 74\.4% \(Regex\)