When Rule Violations Are Rare: Chimera Training for Logical Anomaly Detection

arXiv cs.LG Papers

Summary

The paper introduces Chimera Training, a method for logical anomaly detection that uses counterfactual construction at the feature level to train neural rule evaluators without requiring real anomalous images, improving rule-level anomaly detection performance on benchmarks like CLEVRER, OpenImages, and VidOR.

arXiv:2605.26171v1 Announce Type: new

Cached at: 05/27/26, 09:04 AM

# When Rule Violations Are Rare: Chimera Training for Logical Anomaly Detection
Source: [https://arxiv.org/html/2605.26171](https://arxiv.org/html/2605.26171)
1st Alejandro Ascárate, a.ascaratecastro@hdr.qut.edu.au, School of Electrical Engineering and Robotics, Faculty of Engineering, Queensland University of Technology, Brisbane, Queensland, Australia; 3rd Rodrigo Santa Cruz∗, rodrigo.santacruz@qut.edu.au; 4th Clinton Fookes∗, c.fookes@qut.edu.au; 5th Olivier Salvado∗, olivier.salvado@qut.edu.au

###### Abstract

Many practical anomalies are not merely rare inputs, but violations of semantic constraints: objects co\-occur in structured ways, actions imply preconditions, and events satisfy temporal or relational regularities\. We study anomaly detection in this setting, where constraints are given as logical rules over learned visual concepts, but real rule violations are rare or absent during training\. We propose a neural rule evaluator that compiles each constraint into a directed acyclic graph and learns feature\-aware subtree MLP gates for its internal logical operators\. Each gate maps child features and edge\-level negations to a parent representation and a rule\-satisfaction probability, with intermediate supervision obtained from exact Boolean propagation over ground\-truth concept labels\. The key difficulty is that same\-image training data often provide insufficient coverage of informative truth configurations and also allow shortcut solutions\. To address this, we introduce chimera training: an operand\-level counterfactual construction at the feature level\. Instead of mixing input images, we concatenate subtree features from different samples; each operand keeps the hard truth label of the sample it came from, and the chimera target is obtained by applying the node’s logical operator to those inherited labels\. This supplies supervised logical counterexamples without requiring real anomalous images\. Across CLEVRER, OpenImages, and VidOR, the resulting evaluator improves rule\-level anomaly AUROC over independent\-events and same\-image semantic\-training baselines, especially for compositional and relational rules\. The method yields both scalar anomaly scores and rule\-level attributions\.

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2605.26171v1/imgs/fig1.jpg)

Figure 1: Training and inference of our proposed method. In the first stage, a classifier is trained; its head is then discarded and the backbone features are reused. Each node-level logical rule is learned with its own model (stage 2). The backbones of the node rules can be used to learn more complex expressions (stage 3). Once a logical expression has been learned (e.g., a logical ‘$\mathrm{and}$’), it can be frozen and assembled into more complex rules (alternate stage 3). During inference (stage 4), the frozen operators and the classifier backbone are applied to a new image to evaluate a rule. The output can then be used for anomaly detection or for logical inference from the image attributes.

Detecting anomalies in a dataset is often framed as statistical detection: finding outliers, i.e., samples outside the data distribution (out of distribution, OOD) \[Chandola et al., [2009](https://arxiv.org/html/2605.26171#bib.bib1), Hendrycks and Gimpel, [2017](https://arxiv.org/html/2605.26171#bib.bib2), Ruff et al., [2021](https://arxiv.org/html/2605.26171#bib.bib3)\]. Instead, we are concerned with detecting anomalous samples on the basis that they break known rules usually satisfied by the initial distribution. This Logical Anomaly Detection approach usually requires detecting and/or learning all the possible logical cases. Our approach does not require identifying all the logical cases, and can thus detect that a rule is broken even when the training dataset contains only samples that are consistent with the rule (i.e., the anomalous cases are missing). This is a key distinguishing aspect of the anomaly detection problem. In comparison, fully supervised binary classification would require access to training samples of anomalies, which are by definition rare and usually not available in large quantities.

Using the classic example of digit classification on MNIST, our method can detect all the images labeled “7” that look like “1” (see Fig. [2](https://arxiv.org/html/2605.26171#S1.F2)). When applied to natural images, one can identify the atypical presentations among images labeled “(wo)man”, as shown in Fig. [3](https://arxiv.org/html/2605.26171#S1.F3) (from the OpenImages dataset). More complex ‘rules’ can be learned as long as they can be expressed as functions of attributes that can be estimated using an appropriate model. We show examples of causal rules from videos of basic moving shapes, such as “$\text{collide}(\text{shape=sphere},\text{color=red})\implies\text{collide\_before\_half\_of\_video}$” (using the CLEVRER dataset), and realistic rules on complex and changing scenes in videos, such as “$\text{obj:baby}\implies(\text{rel:baby-in\_front\_of-adult}\wedge\text{rel:adult-watch-baby})$” (using the VidOR dataset).

![Refer to caption](https://arxiv.org/html/2605.26171v1/imgs/mnistchims2.jpg)

Figure 2: Score-sorted MNIST test images for the rule $1\wedge 7$, restricted to true digit-$7$ samples. Images are ordered by increasing learned conjunction score $\widehat{P}(1\wedge 7)$; high-scoring examples correspond to atypical $7$’s whose stroke geometry also activates evidence for digit $1$.

![Refer to caption](https://arxiv.org/html/2605.26171v1/imgs/mannotman.jpg)

Figure 3: Qualitative visualization of results for the OpenImages contradiction rule $A\Leftrightarrow\neg A$. In this experiment, the anomaly score is the model output for the contradiction rule itself (not $1-p$). Independent Events calculates it only on the basis of the initial classifier’s logits, whereas the chimera training version trains an MLP gate for the rule while additionally introducing synthetic contradictory examples at the feature level that cannot occur in real data. For a fixed test class, we sort samples by anomaly score and show the 10 smallest on the left and the 10 largest on the right. Although all shown images share the same dataset label, the high-score samples are visually more distorted and less prototypical. Chimera produces a visibly cleaner separation, tending to place more normal-looking instances on the left and more abnormal-looking ones on the right, which suggests that the synthetic contradictory supervision helps the evaluator detect within-class visual abnormality more effectively. Upper panel: the rule is ‘$\text{man}\Leftrightarrow\neg\,\text{man}$’; lower panel: the rule is ‘$\text{woman}\Leftrightarrow\neg\,\text{woman}$’.

Thus, many anomaly scenarios are better characterized not merely by “rarity” in pixel (or even latent) space, but by violations of *domain constraints*: objects co-occur in structured ways; actions imply preconditions; and relationships satisfy logical regularities. If such constraints are available (hand-written, mined, or curated), they can serve as a semantically meaningful interface for detection: an input is anomalous if it contradicts one or more constraints.

However, integrating constraint evaluation with high\-dimensional perception is nontrivial\. Fully symbolic pipelines require brittle perception outputs; fully neural pipelines often re\-learn constraints implicitly and entangle them with spurious cues\.

This paper proposes a neuro-symbolic anomaly detection framework that treats constraints as *explicit computation graphs* (e.g., binary trees) and learns reusable *neural operators* (which we call ‘gates’) that implement logical composition. We refer to the method as ‘Neural Evaluator’ in the rest of the paper (see Fig. [1](https://arxiv.org/html/2605.26171#S1.F1)).

A central challenge is preventing gates from collapsing into shortcut classifiers for the entire rule (e.g., recognizing an anomaly template directly from image features). We address this via *‘chimera’ negative training* (footnote 1: [Wikipedia-Chimera\_(mythology)](https://en.wikipedia.org/wiki/Chimera_(mythology)); see Fig. [1](https://arxiv.org/html/2605.26171#S1.F1), also Sec. [3](https://arxiv.org/html/2605.26171#S3)). This construction is conceptually related to mixup-style interventions \[Zhang et al., [2018](https://arxiv.org/html/2605.26171#bib.bib6), Yun et al., [2019](https://arxiv.org/html/2605.26171#bib.bib7)\], but operates at the level of *subtree operands* rather than raw pixels or labels. Empirically, this encourages gates to behave as compositional operators and improves transfer of learned subtrees across constraints. Furthermore, for many rules, informative counterexamples are exceedingly rare under the natural data distribution (e.g., implication violations require the antecedent to be true and the consequent false). Consequently, training a global predictor only on real observed samples yields degenerate solutions (see Table [2](https://arxiv.org/html/2605.26171#S5.T2), ‘SEM’).

##### Contributions\.

- We propose *subtree gates*, a bottom-up, node-locally learned evaluator that composes concept-conditioned features into truth probabilities under hard Boolean supervision.
- We introduce *chimera negative training* to enforce operator-level compositionality, to reduce shortcut learning in rule evaluators, and to operate in cases where real counterexamples to the rules are completely absent from the training data (as in most real anomaly detection applications).
- We demonstrate effectiveness on structured image/video benchmarks and real-world data (CLEVRER, OpenImages, VidOR).

## 2 Related Work

##### Deep anomaly and OOD detection

A large body of work scores anomalies using density or reconstruction surrogates (e.g., autoencoders/VAEs), feature-distance criteria, or uncertainty estimates \[Chandola et al., [2009](https://arxiv.org/html/2605.26171#bib.bib1), Ruff et al., [2021](https://arxiv.org/html/2605.26171#bib.bib3)\]. In modern deep OOD detection, common baselines include softmax confidence \[Hendrycks and Gimpel, [2017](https://arxiv.org/html/2605.26171#bib.bib2)\], input perturbation and temperature scaling \[Liang et al., [2018](https://arxiv.org/html/2605.26171#bib.bib9), Guo et al., [2017](https://arxiv.org/html/2605.26171#bib.bib4)\], and feature-space detectors such as Mahalanobis distances \[Lee et al., [2018](https://arxiv.org/html/2605.26171#bib.bib10)\]. Energy-based scoring has also emerged as a unifying view for some classification models \[Liu et al., [2020](https://arxiv.org/html/2605.26171#bib.bib11)\]. These approaches typically provide a scalar score with limited semantic attribution: they rarely explain *which* structured expectation is violated.

##### Semantic and constraint\-aware anomaly detection

A complementary line of work leverages structure, constraints, or knowledge to detect implausible samples. Constraint-based and logic-guided learning often imposes penalties for rule violations or encourages outputs to satisfy known relations \[Hu et al., [2016](https://arxiv.org/html/2605.26171#bib.bib12), Xu et al., [2018](https://arxiv.org/html/2605.26171#bib.bib13)\]. Related ideas appear in weakly-supervised and knowledge-driven settings, where symbolic constraints regularize predictors without requiring full labels \[Ganchev et al., [2010](https://arxiv.org/html/2605.26171#bib.bib14)\]. While effective, many methods treat constraints as a global regularizer and do not explicitly construct reusable *modules* implementing logical composition that can be transferred across many rules.

##### Neuro\-symbolic reasoning and differentiable logic

Neuro-symbolic methods aim to combine sub-symbolic perception with symbolic reasoning, including differentiable logical frameworks and probabilistic logic programming \[d’Avila Garcez et al., [2009](https://arxiv.org/html/2605.26171#bib.bib15), Manhaeve et al., [2018](https://arxiv.org/html/2605.26171#bib.bib16), Donadello et al., [2017](https://arxiv.org/html/2605.26171#bib.bib17)\]. In vision-and-language and synthetic reasoning benchmarks (e.g., CLEVR), neural module networks and related compositional models assemble learned operators according to an explicit program or graph structure \[Andreas et al., [2016](https://arxiv.org/html/2605.26171#bib.bib18), Johnson et al., [2017b](https://arxiv.org/html/2605.26171#bib.bib19)\]. These works motivate our design choice of compiling constraints into explicit computation graphs. However, much of this literature targets question answering or program execution rather than anomaly detection, and many methods learn operators end-to-end without a mechanism to (i) supervise intermediate truth semantics from ground-truth concepts, (ii) prevent shortcut learning at internal nodes, (iii) deal with highly unbalanced training data for a standard supervised fit, and (iv) reuse learned subtrees safely across rule sets and runs.

##### Concept bottlenecks and interpretable interfaces

Concept bottleneck models (CBMs) and related “predict-then-reason” pipelines provide an interpretable intermediate representation through human-aligned concepts \[Koh et al., [2020](https://arxiv.org/html/2605.26171#bib.bib5)\]. Our leaf concept bank is similar in spirit: it exposes a semantically meaningful interface for downstream reasoning. Unlike standard CBM pipelines that apply a fixed symbolic reasoner (or a shallow classifier) atop concept predictions, we learn a *structured evaluator* that maps concept-conditioned features through a constraint DAG, producing per-rule satisfaction probabilities and anomaly attributions.

##### Compositional regularization and counterfactual mixing

Data mixing strategies such as mixup and CutMix improve robustness by constructing interpolated or patched examples \[Zhang et al., [2018](https://arxiv.org/html/2605.26171#bib.bib6), Yun et al., [2019](https://arxiv.org/html/2605.26171#bib.bib7)\]. Our *chimera negative training* is related in the sense of creating counterfactual combinations, but differs in locus and supervision: we mix *subtree operands* (child features) rather than raw inputs, and supervise targets using exact Boolean semantics computed from the corresponding hard child truths. This pushes internal gates to implement the intended connective rather than overfitting to global visual templates of a particular rule.

##### Neural Algebra of Classifiers \(NAC\) and similar methods, and relation to our evaluator\.

Neural Algebra of Classifiers (NAC; Santa Cruz et al. \[[2018](https://arxiv.org/html/2605.26171#bib.bib28)\]), along with similar methods \[Misra et al., [2017](https://arxiv.org/html/2605.26171#bib.bib31), Nagarajan and Grauman, [2018](https://arxiv.org/html/2605.26171#bib.bib32), Yang et al., [2020](https://arxiv.org/html/2605.26171#bib.bib33), Li et al., [2021](https://arxiv.org/html/2605.26171#bib.bib34)\], is the closest conceptual precedent to our work in that it learns neural modules intended to implement Boolean connectives and composes them along an expression tree. However, NAC composes classifier parameters (e.g., weight vectors for primitive concept classifiers) to synthesize a new classifier for a composed expression, and is trained primarily with expression-level supervision (labels for the whole composed concept).

## 3 Method

### 3.1 Problem setup, training, and inference

##### Problem setup

We consider a multi\-label dataset

$$\mathcal{D}=\{(x_i,y_i)\}_{i=1}^{M},\qquad y_i\in\{0,1\}^{N},\tag{1}$$

where $y_{i,c}=1$ indicates that concept $c$ is present in $x_i$. We are given rules $\{\mathcal{R}_r\}_{r=1}^{R}$, each compiled into a directed acyclic graph $G_r=(V_r,E_r)$ whose leaves are concept IDs and whose internal nodes are logical operators in $\{\mathrm{IFF},\mathrm{IMPLIES},\mathrm{AND},\mathrm{OR}\}$. Edges may carry negation flags. The task is to output an anomaly score $s(x)\in[0,1]$ together with per-rule violation scores $\{s_r(x)\}$.

##### Training

A leaf concept bank first produces

$$z=E_{\phi}(x)\in\mathbb{R}^{F},\qquad \ell(x)\in\mathbb{R}^{N},\qquad p(x)=\sigma(\ell(x))\in(0,1)^{N}.\tag{2}$$

The encoder feature $z$ is used as the evidence carrier for neural rule evaluation, while $p(x)$ is used for scoring.

Each rule graph stores node attributes identifying leaves, concept IDs, and operator codes, and edge attributes encoding negation and, for implications, operand order. Given hard concept labels $y$, exact Boolean semantics are propagated bottom-up through the graph to obtain node-level targets $t_v(y)\in\{0,1\}$ for every internal node, not only the root.
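The bottom-up propagation of hard targets can be sketched in a few lines of Python. This is a minimal illustration under assumed data structures, not the authors' implementation: the dictionary-based graph encoding, the node names `and_node` and `root`, and the function `propagate_targets` are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding (not from the paper): each internal node maps to
# (operator, [(child_name, negated_edge), (child_name, negated_edge)]).
OPS = {
    "AND": lambda a, b: a and b,
    "OR": lambda a, b: a or b,
    "IMPLIES": lambda a, b: (not a) or b,  # operand order matters: a => b
    "IFF": lambda a, b: a == b,
}

Graph = Dict[str, Tuple[str, List[Tuple[str, bool]]]]


def propagate_targets(graph: Graph, order: List[str], labels: Dict[str, int]) -> Dict[str, int]:
    """Compute hard targets t_v(y) for the leaves and every internal node."""
    truth = {concept: bool(value) for concept, value in labels.items()}
    for node in order:  # `order` lists children before their parents
        op, children = graph[node]
        # XOR with the edge flag applies negation to the child truth value.
        left, right = (truth[child] != negated for child, negated in children)
        truth[node] = OPS[op](left, right)
    return {name: int(value) for name, value in truth.items()}


# Rule: obj:baby => (rel:baby-in_front_of-adult AND rel:adult-watch-baby)
graph = {
    "and_node": ("AND", [("rel:baby-in_front_of-adult", False), ("rel:adult-watch-baby", False)]),
    "root": ("IMPLIES", [("obj:baby", False), ("and_node", False)]),
}
order = ["and_node", "root"]

violating = propagate_targets(
    graph, order,
    {"obj:baby": 1, "rel:baby-in_front_of-adult": 1, "rel:adult-watch-baby": 0},
)
vacuous = propagate_targets(
    graph, order,
    {"obj:baby": 0, "rel:baby-in_front_of-adult": 0, "rel:adult-watch-baby": 0},
)
print(violating["and_node"], violating["root"], vacuous["root"])
```

Every internal node receives its own target, so the `AND` node can be supervised even when only the root truth value would be needed for scoring.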

For each internal node $v$ with children $c_1,\dots,c_{a_v}$, we assign a learned subtree gate $g_{\theta_v}$. With child features $h_{c_j}\in\mathbb{R}^{F}$ and negation indicators $b_j\in\{0,1\}$, define

$$\begin{aligned}
u_v &= [h_{c_1}\,\|\,b_1\,\|\,\cdots\,\|\,h_{c_{a_v}}\,\|\,b_{a_v}], &&\text{(3)}\\
h_v &= f_{\theta_v}(u_v), &&\text{(4)}\\
\hat{t}_v &= \sigma(w_v^{\top}h_v+\beta_v). &&\text{(5)}
\end{aligned}$$

Leaves are initialized with $h_v\leftarrow z=E_{\phi}(x)$. Thus, rule structure determines which concept role each copy of $z$ plays, while gates learn concept- and operator-specific composition in feature space.
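To make the shapes in Eqs. (3)-(5) concrete, here is a dependency-free sketch of a single gate forward pass. The one-hidden-layer ReLU form of $f_{\theta_v}$, the hidden width, the random initialization, and the helper names `make_gate`, `matvec`, and `gate_forward` are assumptions made for illustration; the text only states that gates are MLPs.

```python
import math
import random


def make_gate(feat_dim, arity=2, hidden=8, seed=0):
    """Randomly initialized parameters for one gate (illustrative only)."""
    rng = random.Random(seed)
    in_dim = arity * (feat_dim + 1)  # each child adds F features plus 1 negation flag

    def rand_matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "W1": rand_matrix(hidden, in_dim),
        "W2": rand_matrix(feat_dim, hidden),
        "w": [rng.uniform(-0.5, 0.5) for _ in range(feat_dim)],
        "beta": 0.0,
    }


def matvec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def gate_forward(gate, child_feats, neg_flags):
    # Eq. (3): concatenate each child feature with its negation indicator.
    u = []
    for h, b in zip(child_feats, neg_flags):
        u.extend(h)
        u.append(float(b))
    # Eq. (4): parent representation h_v = f(u_v); assumed one hidden ReLU layer.
    hidden = [max(0.0, a) for a in matvec(gate["W1"], u)]
    h_v = matvec(gate["W2"], hidden)
    # Eq. (5): satisfaction probability from a linear readout and a sigmoid.
    logit = sum(w * h for w, h in zip(gate["w"], h_v)) + gate["beta"]
    t_hat = 1.0 / (1.0 + math.exp(-logit))
    return h_v, t_hat


gate = make_gate(feat_dim=4)
z = [0.1, -0.2, 0.3, 0.0]  # stand-in for the encoder feature of one image
h_v, t_hat = gate_forward(gate, [z, z], [0, 1])  # both leaves start from the same z
print(len(h_v), t_hat)
```

The parent representation has the same dimension $F$ as its children, which is what allows a trained gate to feed a gate one level up.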

Training is performed bottom-up by depth. For depth level $\mathcal{V}_d=\{v:\mathrm{depth}(v)=d\}$, lower-depth gates are frozen and used to produce child features. Each gate at level $d$ is then trained with node-wise binary cross-entropy,

$$\mathcal{L}_v=\frac{1}{B}\sum_{i=1}^{B}\mathrm{BCE}\big(\hat{t}_v(x_i),t_v(y_i)\big).\tag{6}$$

This internal supervision forces local logical composition rather than a monolithic root-level rule classifier.
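The node-wise loss of Eq. (6) is a plain batch average, as the short sketch below shows. The clipping constant `eps` and the function names are illustrative choices, and the prediction values are made up. The example also shows why a batch whose targets are all 1 can be fit by a gate that ignores its inputs.

```python
import math


def bce(p, t, eps=1e-7):
    """Binary cross-entropy for one prediction p and hard target t."""
    p = min(max(p, eps), 1.0 - eps)  # clip to keep the logarithms finite
    return -(t * math.log(p) + (1 - t) * math.log(1.0 - p))


def node_loss(preds, targets):
    """Eq. (6): mean BCE over a batch for one node v."""
    return sum(bce(p, t) for p, t in zip(preds, targets)) / len(preds)


always_satisfied = [0.999, 0.999, 0.999, 0.999]  # a gate that ignores its inputs

loss_all_true = node_loss(always_satisfied, [1, 1, 1, 1])
loss_with_violation = node_loss(always_satisfied, [1, 1, 0, 1])
print(loss_all_true, loss_with_violation)
```

With no violating target in the batch, the constant predictor is nearly optimal; a single target of 0 makes it costly.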

To prevent shortcut learning, we use chimera training. For a binary node $v$ with children $(\ell,r)$, choose a permutation $\pi$ with $\pi(i)\neq i$ and form mixed operands

$$(u_v^{\mathrm{chim}})_i=[h_{\ell}(x_i)\,\|\,b_{\ell}\,\|\,h_r(x_{\pi(i)})\,\|\,b_r].\tag{7}$$

The target is computed by the intended Boolean operator:

$$(t_v^{\mathrm{chim}})_i=\mathrm{op}(v)\big(t_{\ell}(y_i),t_r(y_{\pi(i)})\big).\tag{8}$$

Thus, counterfactual mixing occurs at the operand level, giving informative truth assignments even when some rule outcomes are rare or absent in the observed data.
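A small sketch of the construction in Eqs. (7) and (8) follows. The toy features, the cyclic-shift way of drawing a fixed-point-free permutation, and the function names `derangement` and `chimera_batch` are assumptions for illustration only. Targets follow Eq. (8) literally, and both edge flags are set to 0 in the example.

```python
import random


def derangement(n, rng):
    """Permutation with pi(i) != i, built as a single random n-cycle."""
    if n < 2:
        raise ValueError("need at least two samples to mix operands")
    idx = list(range(n))
    rng.shuffle(idx)
    pi = [0] * n
    for k, i in enumerate(idx):
        pi[i] = idx[(k + 1) % n]  # map each index to the next one in the cycle
    return pi


def chimera_batch(h_left, h_right, t_left, t_right, b_left, b_right, op, rng):
    """Eq. (7): mixed operands; Eq. (8): targets from inherited hard labels."""
    pi = derangement(len(h_left), rng)
    inputs, targets = [], []
    for i, j in enumerate(pi):
        inputs.append(h_left[i] + [float(b_left)] + h_right[j] + [float(b_right)])
        targets.append(int(op(t_left[i], t_right[j])))
    return inputs, targets, pi


def implies(a, b):
    return (not a) or b


# Four normal samples in which A => B always holds (A and B agree).
t_a = [1, 1, 0, 0]
t_b = [1, 1, 0, 0]
h_a = [[0.1 * i, 1.0] for i in range(4)]   # toy left-child features
h_b = [[0.2 * i, -1.0] for i in range(4)]  # toy right-child features

same_image_targets = [int(implies(a, b)) for a, b in zip(t_a, t_b)]
inputs, chimera_targets, pi = chimera_batch(
    h_a, h_b, t_a, t_b, 0, 0, implies, random.Random(0)
)
print(same_image_targets, chimera_targets)
```

In this toy batch the same-image targets are all 1, while the chimera targets contain at least one 0: a cycle through all four samples must pair some sample with $A=1$ with a sample with $B=0$. The single-cycle permutation is only one convenient way to satisfy $\pi(i)\neq i$.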

Finally, to scale across many rules, trained gates are reused through lineage\-aware caching\. A subtree key records the symbolic subtree, edge negations, operator order, gate architecture, feature dimension, and a fingerprint of the upstream encoder\. Hence a cached gate is reused only when both the logical structure and the feature representation match\.
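One way to realize such a key is to hash a canonical serialization of everything that must match. The sketch below is an assumption-laden illustration: the nested-list subtree encoding, the field names, the architecture string, and the use of SHA-256 over JSON are choices made here, not details given in the text.

```python
import hashlib
import json


def subtree_key(subtree, gate_arch, feat_dim, encoder_fingerprint):
    """Deterministic cache key; any change to a field yields a different key."""
    payload = {
        "subtree": subtree,          # nested lists keep operand order and edge negations
        "arch": gate_arch,
        "feat_dim": feat_dim,
        "encoder": encoder_fingerprint,
    }
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


leaf_a = [["LEAF", "obj:baby"], False]            # [child, negated_edge]
leaf_b = [["LEAF", "rel:adult-watch-baby"], False]

forward = ["IMPLIES", leaf_a, leaf_b]
reversed_order = ["IMPLIES", leaf_b, leaf_a]      # a different rule: operands swapped

key_1 = subtree_key(forward, "mlp-1x8", 4, "enc-v1")
key_2 = subtree_key(forward, "mlp-1x8", 4, "enc-v1")
key_swapped = subtree_key(reversed_order, "mlp-1x8", 4, "enc-v1")
key_new_encoder = subtree_key(forward, "mlp-1x8", 4, "enc-v2")
print(key_1 == key_2, key_1 == key_swapped, key_1 == key_new_encoder)
```

Including the encoder fingerprint in the key means a gate trained on one feature space is never silently reused on another.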

##### Inference and anomaly scoring

For each rule $r$, we evaluate the root satisfaction probability $\hat{t}^{(r)}_{\mathrm{root}}(x)\in(0,1)$ by bottom-up propagation through the trained/cached gates on $G_r$ (see Algorithm [4](https://arxiv.org/html/2605.26171#alg4)), where we again use `dgl.topological_nodes_generator` to great advantage.

##### Per\-rule violation\.

We define a basic violation score \(see Algorithm[5](https://arxiv.org/html/2605.26171#alg5)\)

$$v_r(x)=1-\hat{t}^{(r)}_{\mathrm{root}}(x).\tag{9}$$

Because the scoring is rule-decomposable, the detector naturally provides semantic attributions: the top-$k$ rules with the largest $s_r(x)$ explain the anomaly.
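The per-rule score of Eq. (9) and the top-$k$ attribution can be sketched as follows. The root probabilities are made-up numbers for a single hypothetical input, and the rule strings are shortened labels used only for display.

```python
def violation_scores(root_probs):
    """Eq. (9): violation score = 1 - root satisfaction probability, per rule."""
    return {rule: 1.0 - p for rule, p in root_probs.items()}


def top_k_rules(scores, k):
    """Rules with the largest violation scores, most violated first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical root satisfaction probabilities for one input x.
root_probs = {
    "Tableware -> Bottle": 0.12,
    "Land vehicle -> (Bicycle AND Car)": 0.95,
    "obj:toy -> (child-next_to-toy AND toy-in_front_of-child)": 0.60,
}

scores = violation_scores(root_probs)
explanation = top_k_rules(scores, k=2)
print(explanation)
```

The attribution is simply a ranking of rules, so it comes at no extra cost once the root probabilities have been computed.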

### 3.2 Why implications are particularly suited for anomaly detection

Among Boolean connectives, the implication $A\Rightarrow B$ is especially well matched to anomaly detection because it is non-symmetric and its truth table has a *single*, non-trivial falsifying configuration: it is false only when the antecedent holds but the consequent does not, i.e., $(A=1,\,B=0)$. Consequently, each implication directly defines a sharp notion of “violation” (our anomaly signal) that is both sparse and interpretable: an anomaly corresponds precisely to the presence of the contextual precondition $A$ together with the absence of the expected outcome $B$. Equivalently, the implication acts as a *context-gated* detector: when the antecedent is not present ($A=0$), the rule is vacuously satisfied and produces no alarm (it ignores the sample outside its intended regime of applicability), whereas when the antecedent is present ($A=1$), the rule activates and flags an anomaly exactly when the consequent is absent ($B=0$) in that specific context. This “activate-on-context, trigger-on-violation” behavior is precisely what we want in realistic perceptual settings, where we aim to avoid spurious alarms on irrelevant inputs while reliably detecting context-specific inconsistencies.
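The truth-table argument can be checked by enumeration. The snippet below lists the falsifying assignments of each binary connective used in the rule graphs; the dictionary layout is just an illustrative choice.

```python
from itertools import product

CONNECTIVES = {
    "AND": lambda a, b: a and b,
    "OR": lambda a, b: a or b,
    "IMPLIES": lambda a, b: (not a) or b,
    "IFF": lambda a, b: a == b,
}

# For each connective, collect the (A, B) assignments that make it false.
falsifying = {
    name: [(a, b) for a, b in product([0, 1], repeat=2) if not op(a, b)]
    for name, op in CONNECTIVES.items()
}

for name, rows in falsifying.items():
    print(name, rows)
```

Implication is falsified only by $(A=1,\,B=0)$. Disjunction also has a single falsifying row, $(0,0)$, but it is symmetric in its operands, so it carries no notion of a context that activates the rule.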

## 4 Experiments

### 4.1 Datasets and concept vocabularies

We evaluate on three vision benchmarks where \(i\) a multi\-label concept inventory is available \(or can be induced\), and \(ii\) logical constraints over these concepts are meaningful\.

- CLEVR (images). We use CLEVR images and their ground-truth scene annotations, Johnson et al. \[[2017a](https://arxiv.org/html/2605.26171#bib.bib20)\].
- CLEVRER (videos). We use CLEVRER and its structured annotations, Yi et al. \[[2020](https://arxiv.org/html/2605.26171#bib.bib21)\].
- Open Images. We use Open Images V4 annotations, Kuznetsova et al. \[[2018](https://arxiv.org/html/2605.26171#bib.bib22)\].
- VidOR. We evaluate on the VidOR (Video Object Relation) dataset, Shang et al. \[[2019](https://arxiv.org/html/2605.26171#bib.bib29)\].

### 4.2 Evaluation tasks and metrics

##### Consistency anomaly detection\.

We define a binary anomaly label from ground\-truth concepts and rules:

$$y_{\text{anom}}(x)=\mathbb{I}\Big[\min_{r\leq R}\;\text{Truth}_r(x)=0\Big],$$

i.e., an example is anomalous iff it violates *at least one* constraint under hard Boolean evaluation. We report AUROC and (when useful) FPR@95TPR for anomaly detection.
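As a worked illustration of the label definition and the AUROC metric, the sketch below uses made-up truth values and scores. The pairwise (Mann-Whitney) form of AUROC is a standard equivalent of the area under the ROC curve; the function names are hypothetical.

```python
def anomaly_label(rule_truths):
    """y_anom = 1 iff at least one rule evaluates to 0 under hard Boolean semantics."""
    return int(min(rule_truths) == 0)


def auroc(scores, labels):
    """Probability that a random anomaly outscores a random normal sample (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC is undefined unless both classes are present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hard truth values of R = 3 rules on four samples (illustrative numbers).
truths = [[1, 1, 1], [1, 0, 1], [1, 1, 1], [0, 0, 1]]
labels = [anomaly_label(t) for t in truths]

informative = auroc([0.1, 0.8, 0.3, 0.6], labels)
constant = auroc([0.0, 0.0, 0.0, 0.0], labels)
print(labels, informative, constant)
```

A detector that assigns the same score to every sample scores exactly 0.5 under this definition, and a rule with no violating (or no satisfying) test samples has no defined AUROC.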

##### Concept prediction quality\.

We report macro AUROC and macro AUPRC over the $K$ concept heads on the held-out split, since rule performance depends on leaf quality.

### 4.3 Baselines and Comparisons

We consider as baselines only*rule\-aware*detectors \(use the same concept vocabulary and rule set as our method\), since*perception\-only*anomaly detectors ignore rules\. Unless otherwise stated, all baselines use the same backbone/encoder as the leaf concept bank for a controlled comparison\.

#### 4.3.1 Independent-events probabilistic evaluator (IndepProb) (Rule-aware symbolic baseline)

Let $A,B$ denote leaf events (concepts) with predicted probabilities $p_A,p_B$ from the leaf bank. Let $\varphi$ be a rule formula built from leaves using $\neg,\wedge,\vee,\Rightarrow,\Leftrightarrow$. Assuming leaf events are independent *given* $x$, one computes a soft satisfaction probability $P(\varphi)$ by recursion using the independent-events prescriptions in App. [D](https://arxiv.org/html/2605.26171#A4). For a general formula $\varphi$, we evaluate bottom-up on the compiled rule DAG using these local identities (applying edge-level negation by $p\mapsto 1-p$). The anomaly score is then $s_r(x)=1-P(\varphi_r\mid x)$.
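A minimal sketch of such a recursive evaluator is given below. The appendix with the exact prescriptions is not reproduced here, so the identities in the code are the standard ones for independent events and are assumed to match it; the tuple-based formula encoding and the leaf probabilities are hypothetical.

```python
def indep_prob(node, leaf_probs):
    """Soft satisfaction probability under the independent-events assumption.

    A node is either a leaf name or a tuple
    (operator, (left_child, negated), (right_child, negated)).
    """
    if isinstance(node, str):
        return leaf_probs[node]
    op, (left, neg_left), (right, neg_right) = node
    a = indep_prob(left, leaf_probs)
    b = indep_prob(right, leaf_probs)
    if neg_left:
        a = 1.0 - a  # edge-level negation: p -> 1 - p
    if neg_right:
        b = 1.0 - b
    if op == "AND":
        return a * b
    if op == "OR":
        return a + b - a * b
    if op == "IMPLIES":
        return 1.0 - a * (1.0 - b)
    if op == "IFF":
        return a * b + (1.0 - a) * (1.0 - b)
    raise ValueError(f"unknown operator: {op}")


rule = ("IMPLIES", ("Tableware", False), ("Bottle", False))
score = 1.0 - indep_prob(rule, {"Tableware": 0.9, "Bottle": 0.2})  # anomaly score s_r(x)

# In this sketch, repeated leaves are treated as independent copies, so a
# contradiction such as A <=> not A evaluates to 2p(1-p) rather than 0.
contradiction = ("IFF", ("Man", False), ("Man", True))
p_contradiction = indep_prob(contradiction, {"Man": 0.5})
print(score, p_contradiction)
```

The evaluator has no trainable parameters: its output is a fixed function of the leaf probabilities, which is the property the learned gates are meant to relax.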

#### 4.3.2 Semantic-loss (SEM) training baseline (semantic consistency from ordinary data)

The core idea of SEM is to train with semantic structure as supervision \[Xu et al., [2018](https://arxiv.org/html/2605.26171#bib.bib13)\], rather than relying only on independent per-concept losses or on explicit real or synthetic anomalies. In our setting, this means encouraging the evaluator to learn rule-consistent compositions from ordinary samples: the model is exposed only to concept tuples that arise naturally in the data, and it must infer semantic compatibility or incompatibility from those same-image assignments. Thus, SEM captures the idea that logical or semantic structure can itself serve as a training signal, without requiring the explicit counterfactual/chimeric constructions that are central to our method.

##### Our implementation of SEM\.

We instantiate SEM in the variant that, in our view, provides the cleanest and fairest comparison with our full model while preserving the core semantic\-training idea above\. Concretely, SEM uses the same overall compositional evaluator pipeline as our method, but removes chimera\-based supervision: all node concepts in a formula are instantiated from features extracted from the same normal image, and no real or synthesized anomalous or counterfactual compositions are introduced during training \(since the application here is OOD, real anomalous samples are removed from the training\)\. This keeps the rule compilation, evaluator structure, feature extractor, and test\-time scoring protocol aligned with our full system, so that the main difference lies in the training signal itself\. Consequently, the gap between SEM and the full model should be interpreted as quantifying the benefit of chimera\-based compositional supervision beyond what can be learned from same\-image semantic consistency alone\.

#### 4.3.3 Monolithic root-only baseline

To isolate the contribution of our level\-wise, node\-local modular training scheme, we also consider a monolithic root\-only baseline\. In this variant, we remove all internal subtree gates and bottom\-up supervision, and instead place a single MLP on top of the frozen leaf\-bank outputs to predict the root truth value directly for each rule\. The baseline therefore retains the same leaf encoder and rule\-level target, but discards the explicit decomposition into local operators\. We evaluate two versions: one trained only on normal same\-image samples, and one additionally trained with chimera\-based counterfactual compositions\. The latter is the more informative ablation, since it tests whether chimera supervision alone is sufficient, or whether the full benefit of our method also depends on the level\-wise and node\-local modular structure\.

## 5 Results

![Refer to caption](https://arxiv.org/html/2605.26171v1/imgs/fig_all_datasets_neural_vs_indep_andbars.jpg)

Figure 4: Comparison across datasets (left), and by dataset and rule family (right).

| Dataset | Macro ROC-AUC | Macro AP | Macro Acc. |
| --- | --- | --- | --- |
| CLEVRER | 0.921 | 0.682 | 0.867 |
| OpenImages | 0.948 | 0.596 | 0.904 |
| VidOR | 0.679 | 0.057 | 0.976 |

Table 1: Leaf-bank performance on the evaluation split. For VidOR, macro AP is the more informative metric because many relation concepts are extremely sparse, so accuracy is inflated by class imbalance.

| Dataset | Rule family | #Rules | Indep. | SEM | Mono-N | Mono-C | Neur. Eval. | Wins |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CLEVRER | Simple implications | 16 | 0.777 | 0.500 | 0.500 | 0.832 | 0.826 | 13/16 |
| CLEVRER | Chain & Compound | 10 | 0.707 | 0.500 | 0.500 | 0.827 | 0.844 | 10/10 |
| CLEVRER | All rules | 26 | 0.750 | 0.500 | 0.500 | 0.829 | 0.833 | 23/26 |
| CLEVRER | Std\* | - | 0.011 | 0.001 | 0.001 | 0.010 | 0.012 | - |
| OpenImages | Simple implications | 11 | 0.821 | 0.500 | 0.500 | 0.906 | 0.910 | 10/11 |
| OpenImages | Compound rules | 5 | 0.820 | 0.500 | 0.500 | 0.880 | 0.874 | 4/5 |
| OpenImages | All rules | 16 | 0.821 | 0.500 | 0.500 | 0.893 | 0.899 | 14/16 |
| OpenImages | Std\* | - | 0.013 | 0.001 | 0.001 | 0.012 | 0.011 | - |
| VidOR | Simple implications | 25 | 0.476 | 0.500 | 0.500 | 0.702 | 0.722 | 23/25 |
| VidOR | Compound rules | 10 | 0.527 | 0.500 | 0.500 | 0.715 | 0.731 | 10/10 |
| VidOR | All rules | 35 | 0.490 | 0.500 | 0.500 | 0.708 | 0.724 | 33/35 |
| VidOR | Std\* | - | 0.013 | 0.001 | 0.001 | 0.012 | 0.013 | - |

Table 2: Rule-level anomaly AUROC (\*in 'All rules' on 3 runs). Indep. denotes the independent-events probabilistic evaluator; SEM is the same-image semantic-training ablation; Mono-N and Mono-C denote the monolithic root-only baseline trained, respectively, on normal samples only and with chimera-based counterfactual supervision. "Wins" counts how many rules are improved by the full neural evaluator relative to the independent-events baseline. For OpenImages, one degenerate rule with undefined AUROC was excluded from the aggregate.

| Dataset | Rule | Indep. | Neur. Eval. |
| --- | --- | --- | --- |
| CLEVRER | Chain: $\texttt{collide(brown, sphere)} \to \texttt{before\_half} \to \texttt{entered\_then\_collided}$ | 0.561 | 0.895 |
| CLEVRER | OR type: $\texttt{collide(cube,cyan)} \to (\texttt{before\_half} \vee \texttt{entered\_then\_collided})$ | 0.679 | 0.868 |
| OpenImages | $\texttt{Tableware} \to \texttt{Bottle}$ | 0.891 | 0.954 |
| OpenImages | $\texttt{Land vehicle} \to (\texttt{Bicycle} \wedge \texttt{Car})$ | 0.765 | 0.989 |
| VidOR | $\texttt{adult-in\_front\_of-baby} \to \texttt{baby-in\_front\_of-adult}$ | 0.467 | 0.766 |
| VidOR | $\texttt{obj:toy} \to (\texttt{child-next\_to-toy} \wedge \texttt{toy-in\_front\_of-child})$ | 0.526 | 0.829 |

Table 3: Representative rule-level improvements. Full per-rule results are deferred to appendix [C](https://arxiv.org/html/2605.26171#A3).

Table [1](https://arxiv.org/html/2605.26171#S5.T1) first shows that the leaf concept bank is already reasonably strong on CLEVRER and OpenImages, whereas VidOR is substantially more challenging, especially for sparse relation concepts. Despite this, the proposed neural evaluator consistently improves rule-level anomaly AUROC over the independent-events baseline (Table [2](https://arxiv.org/html/2605.26171#S5.T2)). On the currently exported rules, the gain is +0.083 on CLEVRER (0.833 vs. 0.750, 23/26 wins), +0.078 on OpenImages (0.899 vs. 0.821, 14/16 wins after excluding one degenerate rule), and +0.234 on VidOR (0.724 vs. 0.490, 33/35 wins). The gains are especially pronounced on compositional rule families, including chain and mined compound rules on CLEVRER (+0.137) and both pairwise and compound mined rules on VidOR (+0.246 and +0.204, respectively). This is consistent with our hypothesis that feature-aware local composition captures dependencies that fixed probabilistic evaluation misses.
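The gains quoted above are plain differences between the "Neur. Eval." and "Indep." columns of Table 2. As a quick sanity check, the following sketch recomputes them; the dictionary layout and the names `TABLE2` and `gains` are illustrative choices, and only the AUROC values come from the table.

```python
# AUROC pairs (independent-events baseline, neural evaluator) copied from Table 2.
TABLE2 = {
    ("CLEVRER", "All rules"): (0.750, 0.833),
    ("CLEVRER", "Chain & Compound"): (0.707, 0.844),
    ("OpenImages", "All rules"): (0.821, 0.899),
    ("VidOR", "Simple implications"): (0.476, 0.722),
    ("VidOR", "Compound rules"): (0.527, 0.731),
    ("VidOR", "All rules"): (0.490, 0.724),
}


def gains(table):
    """Return the AUROC improvement of the neural evaluator over the baseline."""
    return {key: round(neural - indep, 3) for key, (indep, neural) in table.items()}


GAINS = gains(TABLE2)
for (dataset, family), gain in GAINS.items():
    print(f"{dataset:10s} {family:20s} {gain:+.3f}")
```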

The SEM ablation collapses on the complex anomaly-detection benchmarks. This outcome is less trivial than it may first appear: in the simpler MNIST scenario of Fig. [5](https://arxiv.org/html/2605.26171#A6.F5), the same ablation does retain some signal, likely thanks to the intrinsic uncertainty in the data. In this variant, we keep the same evaluator pipeline but train only on same-image normal samples and remove all chimera-based counterfactual compositions. Empirically, this drives the model to the trivial all-normal solution: it assigns a near-zero anomaly score to essentially every sample, yielding AUROC $\approx 0.5$ across rules on the more complex datasets. This is consistent with our motivating hypothesis that, under realistic concept sparsity, ordinary normal data do not provide sufficient coverage of informative violating truth configurations; explicit chimera-based supervision is needed to learn a nontrivial anomaly signal.
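To see why the all-normal solution lands at an AUROC of 0.5, it helps to write AUROC as a pairwise ranking statistic. The sketch below is a minimal pure-Python illustration, not the evaluation code used for the experiments; the toy scores and labels are made up, and the exactly constant score is an idealization of the near-zero scores described above.

```python
def auroc(scores, labels):
    """Pairwise AUROC: P(anomaly score > normal score) + 0.5 * P(tie)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]  # anomalous samples
    neg = [s for s, y in zip(scores, labels) if y == 0]  # normal samples
    if not pos or not neg:
        # With a single class present, AUROC is undefined.
        raise ValueError("AUROC is undefined without both classes")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [0, 0, 0, 0, 1, 1]                       # toy labels (1 = rule violated)
collapsed = [0.0] * len(labels)                   # all-normal solution
informative = [0.1, 0.2, 0.05, 0.3, 0.9, 0.25]    # a scorer with some signal

AUROC_COLLAPSED = auroc(collapsed, labels)
AUROC_INFORMATIVE = auroc(informative, labels)
```

Because every anomalous/normal pair is tied under a constant score, each pair contributes one half, so the statistic is 0.5 regardless of the class balance.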

##### Ablation: approximate independence vs\. event entanglement\.

To probe whether the main failure mode of the fixed symbolic baseline is specifically its leaf-event independence assumption, we compare two closely related regimes. In *static CLEVR*, the leaves are mostly unary attribute-existence predicates (e.g., `blue_sphere`, `green_cube`, `metal_any`), for which approximate factorization is often a reasonable first-order model. In *event-rich CLEVRER*, by contrast, we retain the same basic shape/color-style vocabulary but add temporally and relationally structured event predicates such as `collide(A,B)`, `collide_before_half(A,B)`, and `entered_then_collided(A,B)`. These predicates are strongly entangled by construction: collisions couple object pairs, temporal predicates depend on collision times, and several events co-occur non-independently.

This distinction is reflected clearly in the behavior of the independent-events evaluator. On static CLEVR, IndepProb is already essentially perfect on the tested rules (mean AUROC $= 0.9995$; all rules in the range $0.999$–$1.000$). In contrast, on the currently exported event-rich CLEVRER rules, the same evaluator drops to mean AUROC $= 0.750$, whereas the learned neural evaluator reaches $0.833$ on the same rule set. Thus, the degradation is not evidence that explicit symbolic structure is itself inadequate; rather, it strongly supports the view that the main weakness of the fixed symbolic baseline is *misspecification of leaf-event independence*. Once collision-driven event structure is introduced, the factorized evaluator becomes structurally mismatched to the data, while the feature-aware local evaluator remains effective.

| Setup | Rule regime | IndepProb | Neur. Eval. |
| --- | --- | --- | --- |
| CLEVR | static attribute rules | 0.9995 | - |
| CLEVRER | event-rich rules | 0.750 | 0.833 |

Table 4: Ablation contrasting a regime where approximate leaf independence is plausible (static CLEVR) with one where collision- and time-derived event predicates induce strong dependence (event-rich CLEVRER). On static CLEVR, the independent-events evaluator is already essentially perfect; on CLEVRER, its performance degrades substantially, while the learned evaluator remains stronger.
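A small numerical sketch can make the misspecification concrete. It assumes the independent-events evaluator scores an implication with the usual product rule, $P(A \to B) = 1 - P(A)\,(1 - P(B))$; the exact definition is given elsewhere in the paper, so this form, the function names, and the toy joint distribution are all assumptions made for illustration.

```python
def indep_implies(p_a, p_b):
    """Satisfaction probability of A -> B if A and B were independent."""
    return 1.0 - p_a * (1.0 - p_b)


def exact_implies(joint):
    """Satisfaction probability of A -> B from a joint table {(a, b): prob}.

    Only the configuration (a=1, b=0) violates the implication.
    """
    return 1.0 - joint.get((1, 0), 0.0)


# Toy entangled events: whenever A holds, B holds too (no mass on (1, 0)),
# as when a temporal predicate can only be true if the collision happened.
joint = {(0, 0): 0.6, (0, 1): 0.1, (1, 1): 0.3}
p_a = sum(p for (a, _), p in joint.items() if a == 1)  # marginal P(A)
p_b = sum(p for (_, b), p in joint.items() if b == 1)  # marginal P(B)

exact = exact_implies(joint)            # the rule cannot be violated here
factorized = indep_implies(p_a, p_b)    # product rule applied to the marginals
```

Under this toy joint the rule is satisfied with probability 1, yet the factorized estimate is about 0.82, so a product-rule evaluator would report a spurious violation mass of about 0.18 purely because it ignores the coupling between the two events.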
##### Ablation of chimera training\.

The SEM baseline is not only a baseline against competing approaches, but also an internal ablation of our own training scheme\. In this variant, we keep the same overall compositional evaluator pipeline, but remove chimera\-based supervision: all node concepts are instantiated using features drawn from the same image, and training uses only normal images, without the synthesized anomalous/counterfactual compositions introduced by chimera sampling\. Thus, the comparison between the full model and SEM isolates the contribution of chimera training in a broad sense\. The resulting performance gap indicates that the gain of the full method does not come merely from the architecture itself, but from the additional compositional supervision provided by chimera examples, which teaches the evaluator to detect semantically inconsistent concept combinations that are rarely or never observed in same\-image normal training data alone\.

##### Ablation of node\-local, level\-wise training\.

The chimera-trained monolithic baseline performs similarly to the full neural evaluator. This suggests that, in the present regime, the dominant source of improvement is the chimera-based counterfactual supervision rather than the level-wise modular training scheme itself. In other words, once the model is exposed to informative synthetic rule violations, even a single root-level predictor can recover most of the anomaly signal. This result does not make the modular formulation unimportant: it still provides a more interpretable and reusable decomposition of rule evaluation, crucially without significant performance loss w.r.t. the monolithic case. It does indicate, however, that the principal performance bottleneck here was the lack of informative violating configurations in the training data.

## 6 Discussion and Conclusion

We presented a method to learn and apply logical rules based on attributes that can be learned from the data. The key novelty is that complex rules can be learned without requiring all the logical combinations of attributes to be present in the data. We showed practical applications to anomaly detection in images and to more complex, causality-constrained anomalies in video.

The approach relies on two assumptions that may be violated in some regimes:

- Concept interface adequacy. If the concept vocabulary is too weak (missing predicates needed to express true constraints), then rules become either vacuous or systematically violated, and anomaly detection degenerates.
- Rule quality. Mined rules can encode spurious correlations, especially in biased datasets. This can inflate anomaly scores for legitimate but underrepresented cases. Antecedent gating helps for implications, but it is not a complete fix for rule bias.

Our central premise is that many practical “anomalies” are not merely rare inputs, but*constraint violations*\. By compiling constraints into explicit DAGs and producing per\-rule satisfaction probabilities via our chimera\-trained evaluator, the detector returns not only a scalar score but also a structured explanation: which rules \(and even which internal subclauses\) are most violated\. This is qualitatively different from perception\-only OOD scores that are often hard to interpret, and it is difficult to obtain with monolithic end\-to\-end models without imposing explicit structure\.

## Appendix

## Appendix A Method (in extended detail) and Algorithms

### A.1 Problem setup and notation

We assume a dataset of inputs and *concept* annotations

$$\mathcal{D}=\{(x_i,y_i)\}_{i=1}^{M},\qquad y_i\in\{0,1\}^{N}, \tag{10}$$

where $y_{i,c}=1$ indicates that concept $c\in\{1,\dots,N\}$ is present in $x_i$ (multi-label). We are also given a set of $R$ *rules* $\{\mathcal{R}_r\}_{r=1}^{R}$. Each rule is compiled into a directed acyclic graph (DAG)

$$G_r=(V_r,E_r), \tag{11}$$

whose leaves correspond to concept IDs and whose internal nodes are logical operators (IFF / IMPLIES / AND / OR). Edges may carry a negation flag, implementing literal negation at the child-to-parent interface.

Our goal is anomaly detection with *semantic attribution*: for a test input $x$, output (i) an anomaly score $s(x)\in[0,1]$ and (ii) a decomposition into per-rule violation scores $\{s_r(x)\}$ (and optionally per-node scores).

### A.2 Leaf concept bank

We first train a leaf concept bank that exposes:

$$z = E_\phi(x)\in\mathbb{R}^{F}, \tag{12}$$

$$\ell(x) = (\ell_1(x),\dots,\ell_N(x))\in\mathbb{R}^{N}, \tag{13}$$

$$p(x) = \sigma(\ell(x))\in(0,1)^{N}, \tag{14}$$

where $E_\phi$ is a shared encoder and $\ell_c(x)$ is a concept-specific logit head. Optionally, we apply post-hoc temperature scaling on logits (one scalar $T>0$) before the sigmoid.

In our implementation, the encoder feature $z$ is the basic evidence carrier for downstream rule evaluation. Concept probabilities $p(x)$ are additionally used for *antecedent gating* in implication-style anomaly scoring (§[3.1](https://arxiv.org/html/2605.26171#S3.SS1.SSS0.Px3)).
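The optional temperature scaling step replaces $\sigma(\ell)$ by $\sigma(\ell/T)$ with a single scalar $T$ fitted on held-out logits by minimizing binary cross-entropy. The sketch below shows one simple way to do this; the grid search, the function names, and the toy logits are assumptions for illustration, since the optimizer used for fitting $T$ is not specified here.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def bce(p, y, eps=1e-12):
    """Binary cross-entropy of probability p against hard label y."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def fit_temperature(logits, labels, grid=None):
    """Pick the scalar T > 0 minimizing mean BCE(sigmoid(logit / T), label).

    A coarse grid search stands in for whatever optimizer is used in practice.
    """
    grid = grid or [0.25 * k for k in range(1, 41)]  # 0.25, 0.50, ..., 10.0

    def mean_bce(temperature):
        losses = [bce(sigmoid(l / temperature), y) for l, y in zip(logits, labels)]
        return sum(losses) / len(losses)

    return min(grid, key=mean_bce)


# Toy over-confident head: logits of magnitude 6, but 2 of 8 predictions are wrong.
held_out_logits = [6.0, 6.0, 6.0, -6.0, -6.0, -6.0, 6.0, -6.0]
held_out_labels = [1, 1, 1, 0, 0, 0, 0, 1]

T = fit_temperature(held_out_logits, held_out_labels)
calibrated = [sigmoid(l / T) for l in held_out_logits]
```

On this toy data the fitted temperature is well above 1, which pulls the confident probabilities down toward the empirical accuracy of 0.75.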

### A.3 Rule compilation as a DAG

Each rule $\mathcal{R}_r$ is converted to a DAG $G_r$ with:

- node attributes: $\texttt{mask}\in\{0,1\}$ (leaf vs. internal), $\texttt{x}\in\{0,1,\dots,N\}$ (concept id for leaves), and $\texttt{y}\in\{1,2,3,4\}$ (operator code at internal nodes: IFF, IMPLIES, AND, OR),
- edge attributes: $\texttt{neg}\in\{-1,+1\}$ (negation) and (optionally) $\texttt{pos}\in\{0,1\}$ to enforce operand order for IMPLIES.

Commutative operators (IFF/AND/OR) treat children as an unordered multiset (canonicalized by sorting); IMPLIES is ordered (canonicalized by `pos` if present).
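The attribute layout above can be mirrored with a small container. The following is a minimal sketch using plain dictionaries rather than a graph library; the class `RuleDAG`, its methods, the choice of `x = 0` for internal nodes, and the concept ids in the example are hypothetical and only meant to illustrate the canonicalization of children.

```python
from dataclasses import dataclass, field

IFF, IMPLIES, AND, OR = 1, 2, 3, 4  # operator codes stored in node attribute y


@dataclass
class RuleDAG:
    mask: dict = field(default_factory=dict)      # node -> 1 (leaf) or 0 (internal)
    x: dict = field(default_factory=dict)         # node -> concept id (0 if internal)
    y: dict = field(default_factory=dict)         # internal node -> operator code
    in_edges: dict = field(default_factory=dict)  # parent -> [(child, neg, pos)]

    def add_leaf(self, node, concept_id):
        self.mask[node] = 1
        self.x[node] = concept_id

    def add_op(self, node, op, children):
        """children: list of (child, neg) pairs with neg in {-1, +1}."""
        self.mask[node] = 0
        self.x[node] = 0
        self.y[node] = op
        if op == IMPLIES:
            if len(children) != 2:
                raise ValueError("IMPLIES expects (antecedent, consequent)")
            # Ordered operator: pos records the operand position.
            edges = [(c, neg, pos) for pos, (c, neg) in enumerate(children)]
        else:
            # Commutative operator: canonicalize the children by sorting.
            edges = [(c, neg, None) for c, neg in sorted(children)]
        self.in_edges[node] = edges


# collide -> (before_half OR entered_then_collided), with made-up concept ids 1..3.
g = RuleDAG()
g.add_leaf(0, concept_id=1)
g.add_leaf(1, concept_id=2)
g.add_leaf(2, concept_id=3)
g.add_op(3, OR, [(2, +1), (1, +1)])        # given in arbitrary order
g.add_op(4, IMPLIES, [(0, +1), (3, +1)])   # order is significant
```

Sorting the children of commutative operators means that two rules listing the same operands in a different order compile to the same subtree, which is what later makes subtree reuse possible.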

##### Hard semantics \(for supervision\)\.

Given a concept label vector $y\in\{0,1\}^{N}$, we compute node-level ground-truth truth values $t_v(y)\in\{0,1\}$ by bottom-up propagation on $G_r$, applying edge negations at the child interface and then applying the exact Boolean operator at the parent. These $t_v(y)$ provide training targets for every internal node, not only the root.

### A.4 Subtree gates: learned, feature-aware logical composition

Each internal node $v$ with arity $a_v$ is assigned a lightweight *subtree gate* $g_{\theta_v}$ that maps child features and edge-negation flags to: (i) a parent feature $h_v\in\mathbb{R}^{F}$ and (ii) a satisfaction probability $\hat{t}_v\in(0,1)$. Concretely, for children $c_1,\dots,c_{a_v}$ with features $h_{c_j}\in\mathbb{R}^{F}$ and negation flags $b_j\in\{0,1\}$,

$$u_v = \big[h_{c_1}\,\|\,b_1\,\|\,\cdots\,\|\,h_{c_{a_v}}\,\|\,b_{a_v}\big]\in\mathbb{R}^{a_v(F+1)}, \tag{15}$$

$$h_v = f_{\theta_v}(u_v)\in\mathbb{R}^{F}, \tag{16}$$

$$\hat{t}_v = \sigma(w_v^{\top}h_v+\beta_v)\in(0,1). \tag{17}$$

Intuitively, $g_{\theta_v}$ is a learned operator specialized to the local connective at $v$, but it operates in a feature space that can represent richer evidence than a scalar leaf probability.
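Equations (15)–(17) can be traced with a tiny forward pass. The sketch below uses pure Python lists and a single hidden ReLU layer without biases for $f_{\theta_v}$; that architecture, the random initialization, and the names `make_gate` and `gate_forward` are assumptions for illustration, since only the input/output interface of the gate is fixed by the equations.

```python
import math
import random


def make_gate(arity, feat_dim, hidden_dim, seed=0):
    """Randomly initialized gate parameters (hypothetical one-hidden-layer MLP)."""
    rng = random.Random(seed)
    in_dim = arity * (feat_dim + 1)  # each child contributes F features + 1 flag

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "W1": mat(hidden_dim, in_dim),
        "W2": mat(feat_dim, hidden_dim),
        "w": [rng.uniform(-0.1, 0.1) for _ in range(feat_dim)],
        "beta": 0.0,
    }


def matvec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def gate_forward(gate, child_feats, neg_flags):
    # Eq. (15): concatenate [h_c1 | b_1 | ... | h_ca | b_a].
    u = []
    for h, b in zip(child_feats, neg_flags):
        u.extend(h)
        u.append(float(b))
    # Eq. (16): parent feature h_v = f_theta(u_v).
    hidden = [max(0.0, a) for a in matvec(gate["W1"], u)]
    h_v = matvec(gate["W2"], hidden)
    # Eq. (17): satisfaction probability from a linear readout of h_v.
    logit = sum(w * h for w, h in zip(gate["w"], h_v)) + gate["beta"]
    t_hat = 1.0 / (1.0 + math.exp(-logit))
    return h_v, t_hat


gate = make_gate(arity=2, feat_dim=4, hidden_dim=8)
h_v, t_hat = gate_forward(
    gate,
    child_feats=[[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]],
    neg_flags=[0, 1],  # second child enters through a negated edge
)
```

Because the parent feature has the same dimension $F$ as the child features, the output of one gate can be fed directly into the gate of its parent node.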

##### Leaf features\.

For a given $x$, we initialize each leaf node $v$ that corresponds to concept $c(v)$ with a feature derived from the leaf-bank encoder:

$$h_v \leftarrow z = E_\phi(x). \tag{18}$$

Although this means all leaves share the same base feature vector, the *rule structure and gate identities* specify which leaf positions correspond to which concepts; gates learn to extract concept-relevant evidence from $z$ in a rule-conditional way. (This choice makes the evaluator robust to noisy scalar concept probabilities, while retaining concept semantics through the graph wiring and supervision.)

### A.5 Bottom-up, level-wise training with internal-node supervision

Directly training all gates jointly can be unstable because higher nodes depend on learned representations of lower subtrees. We therefore train *bottom-up* by depth.

Let $\mathrm{depth}(v)$ be the longest path length from any leaf to $v$, and let $\mathcal{V}_d=\{v:\mathrm{depth}(v)=d\}$. We train levels in increasing depth $d=1,2,\dots$ (see Algorithm [3](https://arxiv.org/html/2605.26171#alg3)):

1. For each mini-batch $\{(x_i,y_i)\}_{i=1}^{B}$, compute hard truth targets $t_v(y_i)$ for all nodes by exact propagation (Algorithm [2](https://arxiv.org/html/2605.26171#alg2)). This is where the choice of representing rules by graphs becomes particularly useful and elegant at the implementation level: we use DGL's `dgl.topological_nodes_generator` to generate node frontiers using topological traversal (each item is a tensor recording the nodes from bottom level to the roots).
2. Compute leaf encoder features $z_i=E_\phi(x_i)$ and initialize leaf node features.
3. Propagate through already-trained lower-depth gates (and keep them fixed) to obtain child features for nodes in $\mathcal{V}_d$.
4. For each $v\in\mathcal{V}_d$, update gate parameters by minimizing node-wise BCE:

   $$\mathcal{L}_v=\frac{1}{B}\sum_{i=1}^{B}\mathrm{BCE}\big(\hat{t}_v(x_i),\ t_v(y_i)\big). \tag{19}$$

This *internal supervision* is crucial: the model learns to implement logical composition locally, rather than only learning a monolithic "rule classifier" at the root.
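The level sets $\mathcal{V}_d$ follow directly from the depth definition. The sketch below computes them for a DAG given as a plain child-list dictionary; it is a library-free illustration with hypothetical names (`node_depths`, `training_levels`) and a made-up example rule, not the DGL-based traversal mentioned above.

```python
def node_depths(children):
    """children: node -> list of child nodes (empty for leaves).

    depth(v) is the longest path length from any leaf to v, so leaves get 0.
    """
    depth = {}

    def visit(v):
        if v not in depth:
            kids = children.get(v, [])
            depth[v] = 0 if not kids else 1 + max(visit(c) for c in kids)
        return depth[v]

    for v in children:
        visit(v)
    return depth


def training_levels(children):
    """Group internal nodes by depth, in the order they would be trained."""
    by_depth = {}
    for v, d in node_depths(children).items():
        if d >= 1:
            by_depth.setdefault(d, []).append(v)
    return [sorted(by_depth[d]) for d in sorted(by_depth)]


# collide -> (before_half OR entered_then_collided)
children = {
    "collide": [],
    "before_half": [],
    "entered_then_collided": [],
    "or": ["before_half", "entered_then_collided"],
    "implies": ["collide", "or"],
}
DEPTH = node_depths(children)
LEVELS = training_levels(children)
```

Using the longest path (rather than the shortest) guarantees that every child of a node at depth $d$ sits at a strictly lower depth, so its gate is already trained and frozen when level $d$ is reached.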

### A.6 Chimera negative training: enforcing compositionality and preventing shortcut learning

A key failure mode is *shortcut learning*: a gate (or an entire rule) can be predicted directly from global visual features, without respecting the intended operator semantics. We introduce *chimera negatives* to force gates to behave compositionally.

Consider a binary node $v$ with (ordered) children $(\ell,r)$ and operator $\mathrm{op}(v)$. For a batch of size $B$, choose a permutation $\pi$ with $\pi(i)\neq i$ (e.g., a random nontrivial cyclic shift). We form *chimera pairs* by combining the left child from sample $i$ with the right child from a different sample $\pi(i)$:

$$(u_v^{\text{chim}})_i := \big[h_\ell(x_i)\,\|\,b_\ell\,\|\,h_r(x_{\pi(i)})\,\|\,b_r\big]. \tag{20}$$

Targets are computed *semantically* from the corresponding hard truths (including edge negations):

$$(t_v^{\text{chim}})_i := \mathrm{op}(v)\big(t_\ell(y_i),\ t_r(y_{\pi(i)})\big)\in\{0,1\}. \tag{21}$$

Training batches can include: (i) same-image pairs, optionally restricted to samples where the full rule holds ("AD-strict"), and (ii) chimera pairs (Algorithm [3](https://arxiv.org/html/2605.26171#alg3)). This construction breaks correlations that enable shortcut solutions, because the two operands are decoupled across samples while the supervision remains the exact logical composition. In practice, chimera training substantially improves transfer of learned subtrees across rules and reduces overfitting to rule-specific visual templates: only the features that are truly relevant to whether the rule holds end up being recognized.
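The construction in Eqs. (20) and (21) can be sketched in a few lines. The example below is an illustrative pure-Python version with toy features and labels; the function names, the XOR used to apply the $\{0,1\}$ negation flags to the hard truths, and the fixed random seed are assumptions of this sketch.

```python
import random

HARD_OPS = {
    "IFF": lambda a, b: int(a == b),
    "IMPLIES": lambda a, b: int((not a) or b),
    "AND": lambda a, b: int(a and b),
    "OR": lambda a, b: int(a or b),
}


def cyclic_shift(batch_size, rng):
    """A permutation pi with pi(i) != i: a random nontrivial cyclic shift."""
    if batch_size < 2:
        raise ValueError("need at least two samples to form chimeras")
    k = rng.randrange(1, batch_size)
    return [(i + k) % batch_size for i in range(batch_size)]


def chimera_pairs(op, h_left, h_right, t_left, t_right, b_left=0, b_right=0, rng=None):
    """Pair the left child of sample i with the right child of sample pi(i)."""
    rng = rng or random.Random(0)
    pi = cyclic_shift(len(h_left), rng)
    inputs, targets = [], []
    for i, j in enumerate(pi):
        # Eq. (20): [h_l(x_i) | b_l | h_r(x_pi(i)) | b_r]
        inputs.append(list(h_left[i]) + [float(b_left)] + list(h_right[j]) + [float(b_right)])
        # Eq. (21): operator applied to the inherited hard truths (after negation).
        targets.append(HARD_OPS[op](t_left[i] ^ b_left, t_right[j] ^ b_right))
    return pi, inputs, targets


# Three normal samples for A -> B: (A, B) is (1, 1), (0, 0), (1, 1).
h_left = [[0.9, 0.1], [0.1, 0.2], [0.8, 0.3]]    # toy child features, F = 2
h_right = [[0.7, 0.6], [0.0, 0.1], [0.9, 0.5]]
t_left, t_right = [1, 0, 1], [1, 0, 1]

same_targets = [HARD_OPS["IMPLIES"](a, b) for a, b in zip(t_left, t_right)]
pi, chim_inputs, chim_targets = chimera_pairs("IMPLIES", h_left, h_right, t_left, t_right)
```

In this toy batch every same-image pair satisfies the rule, so same-image training never sees a violation; either nontrivial cyclic shift pairs an antecedent that holds with a consequent that does not, which yields one supervised counterexample without any anomalous input.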

##### Novel insight\.

Chimera training moves counterfactual mixing to the operand level, supplying informative truth assignments even when specific rule outcomes have little or no support in the observed training data\.

### A.7 Lineage-aware caching: safe reuse of learned subtrees

To scale to many rules, we reuse trained gates across rule graphs whenever the corresponding subtrees match. However, naive caching is unsafe: if the upstream encoder changes, the meaning of features changes. We therefore define a *lineage-aware* cache key $K(v)$ recursively:

$$K(v)=\begin{cases}\texttt{LEAF}(c(v)\,|\,\texttt{enc}=\mathrm{fp}(E_\phi)) & \text{if $v$ is a leaf,}\\ \texttt{OP}\_{\mathrm{op}(v)}\big(\widetilde{K}(c_1),\dots,\widetilde{K}(c_{a_v})\big)\,|\,\texttt{arch}\,|\,F & \text{if $v$ is internal,}\end{cases} \tag{22}$$

$$\widetilde{K}(c_j)=\begin{cases}K(c_j) & \text{if edge $(c_j\to v)$ is non-negated,}\\ \texttt{!}K(c_j) & \text{if edge $(c_j\to v)$ is negated,}\end{cases} \tag{23}$$

where $\mathrm{fp}(E_\phi)$ is a short fingerprint (hash) of encoder weights, `arch` tags the gate architecture version, and $F$ is the feature dimension. We hash $K(v)$ to obtain a stable filename and store the gate state dict. A cached gate is reused iff the key matches exactly.
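The recursive key of Eqs. (22) and (23) maps naturally onto string construction followed by hashing. The sketch below is a hypothetical rendering: the graph encoding, the exact string layout, the SHA-256 truncation, the `.pt` suffix, and the fingerprint and architecture strings in the example are all assumptions; only the ingredients of the key (concept id, encoder fingerprint, operator, negation marks, architecture tag, feature dimension) come from the definition above.

```python
import hashlib

COMMUTATIVE = {"IFF", "AND", "OR"}


def lineage_key(node, graph, enc_fp, arch, feat_dim):
    """graph[node] is ("LEAF", concept_id) or (op_name, [(child, negated), ...])."""
    kind, payload = graph[node]
    if kind == "LEAF":
        return f"LEAF({payload}|enc={enc_fp})"
    parts = [
        ("!" if negated else "") + lineage_key(child, graph, enc_fp, arch, feat_dim)
        for child, negated in payload
    ]
    if kind in COMMUTATIVE:
        parts.sort()  # children of commutative operators are an unordered multiset
    return f"OP_{kind}({','.join(parts)})|{arch}|{feat_dim}"


def cache_filename(key):
    """Stable filename derived from the key (hash length and suffix are arbitrary)."""
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16] + ".pt"


# The same subtree (concept 1 OR NOT concept 2) written with different node
# names and a different child order in two rules.
rule_a = {"a": ("LEAF", 1), "b": ("LEAF", 2), "root": ("OR", [("a", False), ("b", True)])}
rule_b = {"p": ("LEAF", 2), "q": ("LEAF", 1), "root": ("OR", [("p", True), ("q", False)])}

key_a = lineage_key("root", rule_a, "enc123", "mlp-v1", 64)
key_b = lineage_key("root", rule_b, "enc123", "mlp-v1", 64)
key_new_encoder = lineage_key("root", rule_a, "enc456", "mlp-v1", 64)
```

The two rules produce the same key and therefore share one cached gate, while a change in the encoder fingerprint yields a different key and forces retraining rather than silently reusing a gate trained on features with a different meaning.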

### A.8 Algorithms

**Algorithm 1** $\mathrm{HardOp}(op,\textsf{vals})$: exact Boolean semantics used by PropagateHardTruths

**Input:** Operator code $op\in\{1,2,3,4\}$ and child truth list $\textsf{vals}[1..m]$ with $m=|\textsf{vals}|$.

**Output:** Hard truth $t\in\{0,1\}$.

- **if** $op=1$ **then** (IFF)
  - $t\leftarrow 1$ (empty list returns 1)
  - **for** $j=2$ **to** $m$ **do**
    - **if** $\textsf{vals}[j]\neq\textsf{vals}[1]$ **then** $t\leftarrow 0$
- **else if** $op=2$ **then** (IMPLIES)
  - **if** $m\neq 2$ **then** $t\leftarrow 1$
  - **else** $a\leftarrow\textsf{vals}[1]$; $b\leftarrow\textsf{vals}[2]$; $t\leftarrow(1-a)\ \textbf{or}\ b$
- **else if** $op=3$ **then** (AND)
  - $t\leftarrow 1$ (empty list returns 1)
  - **for** $j=1$ **to** $m$ **do**
    - **if** $\textsf{vals}[j]=0$ **then** $t\leftarrow 0$
- **else if** $op=4$ **then** (OR)
  - $t\leftarrow 0$ (empty list returns 0)
  - **for** $j=1$ **to** $m$ **do**
    - **if** $\textsf{vals}[j]=1$ **then** $t\leftarrow 1$
- **else** $t\leftarrow 0$
- **return** $t$.
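Algorithm 1 translates almost line by line into Python. The version below is a sketch that follows the pseudocode's conventions (operator codes 1 to 4, the stated empty-list defaults, 1 for a malformed IMPLIES, 0 for an unknown code); the function name `hard_op` and the `TRUTH_TABLE` helper are illustrative.

```python
IFF, IMPLIES, AND, OR = 1, 2, 3, 4


def hard_op(op, vals):
    """Exact Boolean semantics over a list of 0/1 child truth values."""
    if op == IFF:
        # All children agree with the first one; an empty list yields 1.
        return int(all(v == vals[0] for v in vals[1:]))
    if op == IMPLIES:
        if len(vals) != 2:
            return 1  # a malformed implication is treated as satisfied
        a, b = vals
        return int((1 - a) or b)
    if op == AND:
        return int(all(v != 0 for v in vals))  # empty list yields 1
    if op == OR:
        return int(any(v == 1 for v in vals))  # empty list yields 0
    return 0  # unknown operator code


NAMES = {"IFF": IFF, "IMPLIES": IMPLIES, "AND": AND, "OR": OR}
TRUTH_TABLE = {
    (a, b): {name: hard_op(code, [a, b]) for name, code in NAMES.items()}
    for a in (0, 1)
    for b in (0, 1)
}
```

For binary nodes, `TRUTH_TABLE` lists the targets that chimera training can draw on: the only configuration that violates an implication is `(1, 0)`.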

**Algorithm 2** $\mathrm{PropagateHardTruths}(G_r,y_i)$: hard-truth propagation (core loop)

**Input:** Rule DAG $G_r=(V,E)$ with node attrs: $\texttt{mask}[v]\in\{0,1\}$ (1 = leaf), $\texttt{x}[v]\in\{0,1,\dots,K\}$ (concept id), $\texttt{y}[v]\in\{1,2,3,4\}$ (1 = IFF, 2 = IMPLIES, 3 = AND, 4 = OR); and edge attr $\texttt{neg}[e]\in\{-1,+1\}$ (optional).

**Input:** Concept hard labels $y_i\in\{0,1\}^{K}$ indexed by concept id $1..K$.

**Output:** Node hard truth vector $t\in\{0,1\}^{|V|}$, stored as `G_r.ndata['truth_value']`.

- Initialize $t[v]\leftarrow 0$ for all $v\in V$.
- *Leaf init by concept id*
- **for each** node $v\in V$ **do**
  - **if** $\texttt{mask}[v]=1$ **then**
    - $c\leftarrow\texttt{x}[v]$
    - **if** $c>0$ **then** $t[v]\leftarrow y_i[c]$
- *Bottom-up pass in topological order*
- Let $\pi$ be a topological ordering of $V$ where children precede parents.
- **for each** node $v$ in $\pi$ **do**
  - **if** $\texttt{mask}[v]=0$ **then**
    - $(\texttt{src},\texttt{dst},\texttt{eid})\leftarrow$ `G_r.in_edges(v, form='all')`.
    - $m\leftarrow|\texttt{src}|$; create array $\textsf{vals}[1..m]$.
    - **for** $j=1$ **to** $m$ **do**
      - $u\leftarrow\texttt{src}[j]$; $e\leftarrow\texttt{eid}[j]$; $\textsf{vals}[j]\leftarrow t[u]$.
      - **if** `neg` exists **and** $\texttt{neg}[e]=-1$ **then** $\textsf{vals}[j]\leftarrow 1-\textsf{vals}[j]$.
    - $op\leftarrow\texttt{y}[v]$.
    - $t[v]\leftarrow\mathrm{HardOp}(op,\textsf{vals})$ (Alg. [1](https://arxiv.org/html/2605.26171#alg1))
- `G_r.ndata['truth_value']` $\leftarrow t$.
- **return** $t$.
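The propagation in Algorithm 2 can also be sketched without a graph library. The version below replaces the DGL graph with plain dictionaries and uses the standard-library `graphlib.TopologicalSorter` for the children-before-parents ordering; the data layout, the compact `hard_op` helper, and the example rule with concept ids 1 to 3 are assumptions made so that the block is self-contained.

```python
from graphlib import TopologicalSorter


def hard_op(op, vals):
    """Compact exact Boolean semantics (1=IFF, 2=IMPLIES, 3=AND, 4=OR)."""
    if op == 1:
        return int(all(v == vals[0] for v in vals[1:]))
    if op == 2:
        return 1 if len(vals) != 2 else int((1 - vals[0]) or vals[1])
    if op == 3:
        return int(all(v != 0 for v in vals))
    if op == 4:
        return int(any(v == 1 for v in vals))
    return 0


def propagate_hard_truths(mask, x, y, in_edges, labels):
    """mask/x/y: node attribute dicts; in_edges: parent -> [(child, neg)];
    labels: concept id -> 0/1. Returns node -> hard truth value."""
    t = {v: 0 for v in mask}
    # Leaf init by concept id.
    for v in mask:
        if mask[v] == 1 and x[v] > 0:
            t[v] = labels[x[v]]
    # Bottom-up pass: static_order() yields children before their parents.
    deps = {v: [child for child, _ in in_edges.get(v, [])] for v in mask}
    for v in TopologicalSorter(deps).static_order():
        if mask[v] == 0:
            vals = [1 - t[c] if neg == -1 else t[c] for c, neg in in_edges[v]]
            t[v] = hard_op(y[v], vals)
    return t


# collide -> (before_half OR entered_then_collided), made-up concept ids 1..3.
mask = {"collide": 1, "before_half": 1, "entered": 1, "or": 0, "imp": 0}
x = {"collide": 1, "before_half": 2, "entered": 3, "or": 0, "imp": 0}
y = {"or": 4, "imp": 2}
in_edges = {"or": [("before_half", 1), ("entered", 1)], "imp": [("collide", 1), ("or", 1)]}

normal = propagate_hard_truths(mask, x, y, in_edges, {1: 1, 2: 0, 3: 1})
violating = propagate_hard_truths(mask, x, y, in_edges, {1: 1, 2: 0, 3: 0})
```

The returned dictionary contains a truth value for every node, so the same call supplies the targets for the internal `or` node as well as for the root.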

Algorithm 3Training: Leaf Bank \+ Cached Subtree Gates \(Level\-wise with Chimera Negatives\)0:Training set

𝒟train=\{\(xi,yi\)\}\\mathcal\{D\}\_\{\\text\{train\}\}=\\\{\(x\_\{i\},y\_\{i\}\)\\\}with concept labels

yi∈\{0,1\}Ky\_\{i\}\\in\\\{0,1\\\}^\{K\}; rule set

\{ℛr\}r=1R\\\{\\mathcal\{R\}\_\{r\}\\\}\_\{r=1\}^\{R\}compiled into DAGs

\{Gr\}\\\{G\_\{r\}\\\}; feature dim

$F$; gate architecture tag `arch`; negatives mode `neg_mode`; cache directory $\mathcal{C}$

Ensure: Leaf concept bank parameters $\phi$; cached gate parameters in $\mathcal{C}$

1: (A) Train leaf concept bank

2: Initialize encoder + heads $\phi$ (shared encoder $E_{\phi}$, $K$ sigmoid heads)

3: for epoch $=1$ to $E_{\text{leaf}}$ do

4: for mini-batch $\{(x_i,y_i)\}_{i=1}^{B}$ do

5: $z_i\leftarrow E_{\phi}(x_i)$; logits $\ell_i\leftarrow H_{\phi}(z_i)$

6: Update $\phi$ by minimizing multi-label BCEWithLogits$(\ell_i,y_i)$ (optionally with `pos_weight`)

7: end for

8: end for

9: if `use_temp_scaling` then

10: Fit scalar temperature $T$ on held-out logits by minimizing BCE$(\sigma(\ell/T),y)$; store calibrator

11: end if

12: (B) Train rule evaluators as reusable subtree gates

13: Compute encoder fingerprint $\mathrm{fp}\leftarrow\mathrm{Fingerprint}(E_{\phi})$

14: for each rule graph $G_r$ do

15: Let $D_{\max}\leftarrow$ maximum internal-node depth in $G_r$

16: for depth $d=1$ to $D_{\max}$ do

17: Let $\mathcal{V}_d\leftarrow\{v\in V_r:\mathrm{depth}(v)=d\}$

18: Load or initialize gates $\{g_{\theta_v}\}_{v\in\mathcal{V}_d}$ from cache using lineage key $K(v;\mathrm{fp},\texttt{arch},F)$

19: for mini-batch $\{(x_i,y_i)\}_{i=1}^{B}$ do

20: Hard node targets: $t_{v,i}\leftarrow\mathrm{PropagateHardTruths}(G_r,y_i)\ \forall v$

21: Leaf evidence: $z_i\leftarrow E_{\phi}(x_i)$; $p_i\leftarrow\sigma(\ell_i)$ from leaf heads

22: Initialize leaf node features $h_{\text{leaf}}\leftarrow z_i$ (placed by concept id); leaf probs $\hat{t}_{\text{leaf}}\leftarrow p_i$

23: Propagate already-trained lower-depth gates to compute $(h_{c,i},\hat{t}_{c,i})$ for all children of nodes in $\mathcal{V}_d$

24: for each node $v\in\mathcal{V}_d$ do

25: Build same-image inputs $u^{\text{same}}_{v,i}=\big[h_{c_1,i}\,\|\,b_1\,\|\cdots\|\,h_{c_a,i}\,\|\,b_a\big]$

26: Set targets $t^{\text{same}}_{v,i}=t_{v,i}$

27: if `neg_mode` uses chimera and $v$ is binary then

28: Choose a permutation $\pi$ with $\pi(i)\neq i$

29: Build chimera inputs $u^{\text{chim}}_{v,i}=\big[h_{\ell,i}\,\|\,b_{\ell}\,\|\,h_{r,\pi(i)}\,\|\,b_r\big]$

30: Set targets $t^{\text{chim}}_{v,i}=\mathrm{Op}_v\big(t_{\ell,i},t_{r,\pi(i)}\big)$ (with edge negations applied in the hard truths)

31: Concatenate training pairs $(u_v,t_v)\leftarrow(u^{\text{same}},t^{\text{same}})\cup(u^{\text{chim}},t^{\text{chim}})$ (optionally filter same-image pairs, e.g. “AD-strict”)

32: else

33: $(u_v,t_v)\leftarrow(u^{\text{same}},t^{\text{same}})$

34: end if

35: Update $\theta_v$ by minimizing $\frac{1}{|u_v|}\sum\mathrm{BCE}\big(g_{\theta_v}(u_v),t_v\big)$

36: end for

37: end for

38: Save gates $\{g_{\theta_v}\}$ to cache $\mathcal{C}$ under lineage keys $K(v;\mathrm{fp},\texttt{arch},F)$

39: end for

40: end for

41: return $\phi$ and cache $\mathcal{C}$
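The chimera step for a binary node (steps 27 to 30 above) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `derangement` and `chimera_batch`, the operator table `OPS`, and the array layout are assumptions made for this example. The left operand stays with sample $i$, the right operand is taken from sample $\pi(i)$, and the target is the node operator applied to the inherited hard truths after edge negation.

```python
import numpy as np

# Boolean operators for binary nodes (hypothetical table for this sketch).
OPS = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "IMPLIES": lambda a, b: (~a) | b,
}

def derangement(n, rng):
    """Random permutation pi with pi[i] != i for all i (requires n >= 2).

    Built as one random n-cycle: each index maps to its successor in a shuffle.
    """
    order = rng.permutation(n)
    pi = np.empty(n, dtype=int)
    pi[order] = np.roll(order, -1)
    return pi

def chimera_batch(h_left, h_right, t_left, t_right, op,
                  neg_left=False, neg_right=False, rng=None):
    """Build chimera inputs [h_l,i || b_l || h_r,pi(i) || b_r] and hard targets."""
    if rng is None:
        rng = np.random.default_rng()
    batch = h_left.shape[0]
    pi = derangement(batch, rng)
    b_l = np.full((batch, 1), float(neg_left))   # negation bit of the left edge
    b_r = np.full((batch, 1), float(neg_right))  # negation bit of the right edge
    u = np.concatenate([h_left, b_l, h_right[pi], b_r], axis=1)
    # Each operand keeps the hard truth of the sample it came from;
    # edge negations are applied before the operator.
    a = t_left.astype(bool) ^ neg_left
    b = t_right[pi].astype(bool) ^ neg_right
    t = OPS[op](a, b).astype(np.float32)
    return u, t, pi
```

In a training loop, the returned pairs would be concatenated with the same-image pairs before the BCE update, as in step 31.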

Algorithm 4 PredictRoot$(G_r,z,p;\mathcal{C})$

Require: Rule DAG $G_r=(V,E)$ with node attrs `mask`, `x`, `y` and edge attrs `neg` (and optionally `pos` for ordering IMPLIES children).

Require: Leaf evidence for one sample: feature vector(s) $z$ and leaf concept probabilities $p\in(0,1)^{K}$ from the leaf bank.

Require: Cache/registry $\mathcal{C}$ for subtree gates $g_v$ at internal nodes $v$ (load by lineage key in code).

Ensure: Root probability $\hat{t}_{\text{root}}\in(0,1)$.

1: Initialize node feature tensor $h[v]\leftarrow 0$ and node prob $\hat{t}[v]\leftarrow 0$ for all $v\in V$.

2: (Place leaf probabilities and leaf features by concept id)

3: for each node $v\in V$ do

4: if $\texttt{mask}[v]=1$ then

5: $c\leftarrow\texttt{x}[v]$.

6: if $c>0$ then

7: $\hat{t}[v]\leftarrow p_c$.

8: $h[v]\leftarrow z_c$ (in code: take the feature slice associated with concept $c$)

9: end if

10: end if

11: end for

12: (Bottom-up gate propagation level-by-level)

13: Let $D_{\max}$ be the maximum internal-node depth in $G_r$; let $\mathcal{V}_d$ be nodes at depth $d$ (as precomputed in the trainer).

14: for $d=1$ to $D_{\max}$ do

15: for each node $v\in\mathcal{V}_d$ do

16: Load gate $g_v$ for $v$ (runtime registry if present; else load from $\mathcal{C}$ by the node’s cache key).

17: Query incoming edges: $(\texttt{src},\texttt{eid})\leftarrow G_r.\texttt{in\_edges}(v,\texttt{form='all'})$.

18: If $op=\texttt{y}[v]=2$ (IMPLIES), reorder the two incoming edges using `pos` if available (antecedent first); otherwise keep the incoming-edge order.

19: Build gate input by concatenating each child feature with a negation bit:

20: Create empty vector $u_v$.

21: for $j=1$ to $|\texttt{src}|$ do

22: $c\leftarrow\texttt{src}[j]$; $e\leftarrow\texttt{eid}[j]$.

23: if `neg` exists and $\texttt{neg}[e]=-1$ then

24: $b\leftarrow 1$.

25: else

26: $b\leftarrow 0$.

27: end if

28: Append $\big[h[c]\ \|\ b\big]$ to $u_v$ (concatenate along feature dimension)

29: end for

30: Forward gate: $(h[v],\hat{t}[v])\leftarrow g_v(u_v)$.

31: end for

32: end for

33: Let root be the unique node with out-degree $0$.

34: return $\hat{t}[\text{root}]$.
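The bottom-up propagation of Algorithm 4 can be sketched in a few lines of Python. This is an illustrative simplification under stated assumptions: the DAG is a plain dictionary rather than a DGL graph, and `make_stub_gate` returns an untrained random stand-in for a learned subtree gate. The names `predict_root`, `node_depth`, `children`, and `leaf_concept` are hypothetical.

```python
import numpy as np

def make_stub_gate(in_dim, out_dim, seed):
    """Untrained stand-in for a subtree gate: u -> (parent feature, probability)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(out_dim, in_dim)) / np.sqrt(in_dim)
    w = rng.normal(size=out_dim)
    def gate(u):
        h_v = np.tanh(W @ u)
        return h_v, float(1.0 / (1.0 + np.exp(-(w @ h_v))))
    return gate

def node_depth(v, children, memo):
    """Depth 0 for leaves; 1 + max child depth for internal nodes."""
    if v not in children:
        return 0
    if v not in memo:
        memo[v] = 1 + max(node_depth(c, children, memo) for c, _ in children[v])
    return memo[v]

def predict_root(children, leaf_concept, z, p, gates, root):
    """children: internal node -> ordered list of (child, negated?) pairs.
    leaf_concept: leaf node -> concept index; z: (K, F) features; p: (K,) probs."""
    h, t_hat = {}, {}
    for v, c in leaf_concept.items():          # place leaf evidence by concept id
        h[v], t_hat[v] = z[c], p[c]
    memo = {}
    for v in sorted(children, key=lambda n: node_depth(n, children, memo)):
        # Concatenate each child feature with its edge negation bit.
        parts = [np.concatenate([h[c], [1.0 if neg else 0.0]])
                 for c, neg in children[v]]
        h[v], t_hat[v] = gates[v](np.concatenate(parts))
    return t_hat[root]
```

For an IMPLIES node, the antecedent is simply listed first in `children[v]`, which plays the role of the `pos` edge attribute.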

Algorithm 5 Inference: Rule Satisfaction, Violation Attribution, and Anomaly Scoring

Require: Test input $x$; trained leaf bank $\phi$ (and optional temperature $T$); rule graphs $\{G_r\}$; gate cache $\mathcal{C}$; aggregation mode `Agg`; implication gate threshold $\tau$

Ensure: Anomaly score $s(x)$; per-rule scores $\{s_r(x)\}$; top-$k$ violated rules

1: Compute encoder feature $z\leftarrow E_{\phi}(x)$

2: Compute concept probabilities $p\leftarrow\sigma(\ell/T)$ (use $T{=}1$ if no calibration)

3: for each rule graph $G_r$ do

4: Load any missing gates from cache $\mathcal{C}$ (by lineage key) into runtime registry

5: Predict root satisfaction: $p_r(x)\leftarrow\mathrm{PredictRoot}(G_r,z,p;\mathcal{C})$ // bottom-up gate propagation

6: if $G_r$ is an implication rule with antecedent concept $A_r$ (or antecedent subgraph) then

7: $a_r\leftarrow p(A_r)$ // antecedent probability from leaf bank

8: $g(a_r)\leftarrow\max(0,a_r-\tau)/(1-\tau)$ // $\tau=0$ gives identity

9: Antecedent-weighted violation: $s_r(x)\leftarrow g(a_r)\cdot(1-p_r(x))$

10: else

11: Violation: $s_r(x)\leftarrow 1-p_r(x)$

12: end if

13: end for

14: Aggregate: $s(x)\leftarrow\texttt{Agg}(\{s_r(x)\}_{r=1}^{R})$ // e.g. max/mean/geo/learned

15: Attribution: return top-$k$ rules by descending $s_r(x)$ (and optionally the most-violated internal nodes)

16: return $s(x)$ and $\{s_r(x)\}$
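The scoring arithmetic in steps 7 to 15 is simple enough to write out directly. The sketch below assumes root probabilities have already been computed and covers only the `max` and `mean` aggregation modes; the function names are hypothetical and $\tau<1$ is assumed.

```python
import numpy as np

def antecedent_gate(a, tau):
    """g(a) = max(0, a - tau) / (1 - tau); tau = 0 gives the identity."""
    return max(0.0, a - tau) / (1.0 - tau)

def rule_score(p_root, antecedent_prob=None, tau=0.0):
    """Violation score 1 - p_r(x), weighted by g(a_r) for implication rules."""
    violation = 1.0 - p_root
    if antecedent_prob is None:
        return violation
    return antecedent_gate(antecedent_prob, tau) * violation

def anomaly_score(rule_scores, agg="max"):
    """Aggregate per-rule scores into one scalar (only max/mean shown here)."""
    s = np.asarray(rule_scores, dtype=float)
    if agg == "max":
        return float(s.max())
    if agg == "mean":
        return float(s.mean())
    raise ValueError(f"unsupported aggregation: {agg}")

def top_k_rules(rule_scores, k):
    """Indices of the k rules with the largest violation scores."""
    return np.argsort(-np.asarray(rule_scores, dtype=float))[:k].tolist()
```

With $\tau=0.5$, an implication whose antecedent probability is 0.9 and whose root probability is 0.2 receives the score $(0.4/0.5)\cdot 0.8=0.64$, while any antecedent probability at or below 0.5 zeroes the score.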

## Appendix B Datasets and concept vocabularies

We evaluate on three vision benchmarks where \(i\) a multi\-label concept inventory is available \(or can be induced\), and \(ii\) logical constraints over these concepts are meaningful\.

##### CLEVR \(images\)\.

We use CLEVR images and their ground-truth scene annotations Johnson et al. \[[2017a](https://arxiv.org/html/2605.26171#bib.bib20)\]. Each concept is a unary predicate of the form “$\exists$ object with attributes matching a filter”. Concretely, our base concept bank contains $K{=}6$ attribute filters (e.g., `blue_sphere`, `metal_any`, `gray_cyl`), and an image-level label $y_k\in\{0,1\}$ is positive iff at least one object in the scene satisfies the corresponding filter.

##### CLEVRER \(videos\)\.

We use CLEVRER and its structured annotations Yi et al. \[[2020](https://arxiv.org/html/2605.26171#bib.bib21)\]. We reuse the same $K{=}6$ base attribute-filter concepts for object existence, computed from the per-video object annotations (positive iff the video contains at least one object matching the filter). Optionally (and in the “event-rich” setting), we extend the concept vocabulary with event predicates constructed from CLEVRER event annotations: `enter(A)`, `exit(A)`, `collide(A,B)`, `collide_before_half(A,B)`, and `entered_then_collided(A,B)`, where $A,B$ are attribute filters. We cap the number of attribute-pair instantiations (hyperparameter `max_event_pairs`) to keep the rule set tractable.

##### Open Images\.

We use Open Images V4 annotations Kuznetsova et al. \[[2018](https://arxiv.org/html/2605.26171#bib.bib22)\] and restrict to a manageable concept set by selecting the top-$K$ most frequent detection classes in the validation bounding-box CSV (default $K{=}50$). A concept label is positive iff an image has *at least one* bounding box of that class. Since Open Images does not provide a canonical train/val split for this exact “consistency anomaly” task, we split the validation images into a *rule/gate-train* partition and an *eval* partition (default `val_train_frac=0.9`).

##### VidOR\.

We evaluate on the VidOR (Video Object Relation) dataset Shang et al. \[[2019](https://arxiv.org/html/2605.26171#bib.bib29)\], a large-scale collection of $10{,}000$ user-generated videos (98.6 hours) with spatio-temporal annotations of object trajectories and relation instances (80 object categories, 50 relation predicates), using the official split of 7,000/835/2,165 videos for train/val/test. Since test annotations are not fully available, we follow a train$\rightarrow$val protocol. From training annotations, we construct a multi-label concept vocabulary of $K$ leaves consisting of (i) `obj:`$c$ for the top-$K_{\text{obj}}$ most frequent object categories and (ii) `rel:`$s$-$p$-$o$ for the top-$K_{\text{rel}}$ most frequent relation triplets (subject category $s$, predicate $p$, object category $o$), keeping only triplets with at least a minimum support count. For each video, we uniformly sample $T$ frames and form a multi-hot label vector $y\in\{0,1\}^{K}$: an object leaf is positive if its annotated trajectory is present in any sampled frame; a relation leaf is positive if any annotated relation instance is active in at least one sampled frame (with temporal extent $[t_{\text{begin}},t_{\text{end}})$). Whenever a relation is labeled positive, we additionally mark its subject and object categories as present, enforcing the structural constraint `rel:`$s$-$p$-$o$ $\Rightarrow$ (`obj:`$s$ $\wedge$ `obj:`$o$) at the ground-truth label level. We train a video leaf bank as a multi-label classifier with a shared 2D CNN applied per-frame and mean temporal pooling to obtain a clip embedding, followed by $K$ sigmoid heads; optional temperature scaling is fit on the training split. Rules are mined from training labels (including depth-2 compounds) and compiled into logical DAGs; subtree gates are trained level-wise using internal-node Boolean supervision. At evaluation time, we report (i) concept prediction metrics on the validation set and (ii) consistency anomaly detection, where a clip is labeled anomalous iff it violates at least one rule under hard Boolean evaluation of the ground-truth concept vector.
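The VidOR label construction described above can be sketched as follows. This is a simplified illustration, not the authors' pipeline: it assumes evenly spaced frame sampling, represents each trajectory and relation instance by a half-open frame interval $[t_{\text{begin}},t_{\text{end}})$, and uses hypothetical names (`sample_frames`, `clip_labels`, `vocab`).

```python
import numpy as np

def sample_frames(num_frames, T):
    """Evenly spaced frame indices (one reading of 'uniformly sample T frames')."""
    return np.linspace(0, num_frames - 1, T).round().astype(int)

def clip_labels(tracks, relations, frames, vocab):
    """Multi-hot label vector over the leaf vocabulary.

    tracks: list of (category, t_begin, t_end)
    relations: list of (subject, predicate, object, t_begin, t_end)
    vocab: leaf name ('obj:c' or 'rel:s-p-o') -> index
    """
    y = np.zeros(len(vocab), dtype=np.int8)

    def active(begin, end):
        # True if any sampled frame falls inside the half-open interval.
        return bool(np.any((frames >= begin) & (frames < end)))

    for cat, begin, end in tracks:
        key = f"obj:{cat}"
        if key in vocab and active(begin, end):
            y[vocab[key]] = 1
    for s, p, o, begin, end in relations:
        key = f"rel:{s}-{p}-{o}"
        if key in vocab and active(begin, end):
            y[vocab[key]] = 1
            # Structural constraint: a positive relation marks its
            # subject and object categories as present.
            for cat in (s, o):
                if f"obj:{cat}" in vocab:
                    y[vocab[f"obj:{cat}"]] = 1
    return y
```

A relation whose interval contains no sampled frame stays negative, so the label vector depends on the sampling grid as well as on the annotations.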

### B.1 Rule construction from annotations

Across datasets, our constraints are expressed as small Boolean formula graphs (simple implications and depth-2 compound rules), compiled into DGL graphs and evaluated both (i) *hard* on ground-truth concept labels and (ii) *soft* on predicted concept probabilities.

##### Handwritten seed and compound rules \(CLEVRER\)\.

For CLEVRER we include a small set of *seed* implications and a small set of *compound* depth-2 formulas (conjunctions/disjunctions under an implication). We additionally include event-structure constraints consistent with the event definitions, e.g.

$$\texttt{collide(A,B)}\Rightarrow\texttt{collide\_before\_half(A,B)}.\tag{24}$$

##### Mined pairwise implication rules\.

We mine high-confidence implications directly from the training concept labels by co-occurrence statistics. For each ordered pair $(A,B)$ we estimate

$$\widehat{P}(B{=}1\mid A{=}1)=\frac{\#(A\wedge B)}{\#(A)}.$$

We keep candidates only if $A$ has sufficient support, $\#(A)/N\geq\texttt{support\_thresh}$ (default 0.05), and then add: (i) an implication $A\Rightarrow B$ if $\widehat{P}(B\mid A)\geq\texttt{confidence\_pos}$ (default 0.995), or (ii) an exclusion $A\Rightarrow\neg B$ if $\widehat{P}(B\mid A)\leq\texttt{confidence\_neg}$ (default 0.005). We cap the number of mined rules (default `max_rules=25`).
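A minimal sketch of this mining step on a binary label matrix is given below. The thresholds follow the defaults stated above; the function name `mine_pairwise_rules`, the output tuple format, and the way the cap is applied (simple truncation in enumeration order) are assumptions, since the text does not specify how candidates are ranked before capping.

```python
import numpy as np

def mine_pairwise_rules(Y, support_thresh=0.05, confidence_pos=0.995,
                        confidence_neg=0.005, max_rules=25):
    """Mine A => B and A => not B rules from an (N, K) binary label matrix."""
    Y = np.asarray(Y, dtype=np.int64)
    N, K = Y.shape
    count_a = Y.sum(axis=0)   # #(A) for every concept
    count_ab = Y.T @ Y        # count_ab[a, b] = #(A and B)
    rules = []
    for a in range(K):
        # Skip antecedents with insufficient support.
        if count_a[a] == 0 or count_a[a] / N < support_thresh:
            continue
        for b in range(K):
            if a == b:
                continue
            conf = count_ab[a, b] / count_a[a]   # estimate of P(B=1 | A=1)
            if conf >= confidence_pos:
                rules.append((a, b, "implies", float(conf)))
            elif conf <= confidence_neg:
                rules.append((a, b, "excludes", float(conf)))
    # Assumption: cap by truncation; a real pipeline may rank candidates first.
    return rules[:max_rules]
```

Because the estimate is conditional on $A$, the mined rules are directional: a pair can yield $A\Rightarrow B$ without yielding $B\Rightarrow A$.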

##### Taxonomy\- and part\-based rules \+ mined pairs \(Open Images\)\.

We build implication rules from the Open Images class hierarchy and (optionally) part-of relations, restricted to the top-$K$ selected classes. Because closure assumptions can render some “natural direction” constraints tautological (depending on how labels are completed), we also evaluate an *inverted* constraint direction that is explicitly nontrivial under label closure (e.g., coarse $\Rightarrow$ fine, whole $\Rightarrow$ part). Additionally, we mine high-confidence co-occurrence rules from the bounding-box validation CSV with thresholds `min_support` (default 200 images) and `min_conf` (default 0.99), and we generate a capped number of sibling-based compound rules per parent (`per_parent_pair_limit`).

##### Upward closure for Open Images supervision and its implications\.

Open Images detection annotations are not exhaustive across the taxonomy: an image may be annotated with a fine-grained class while omitting its ancestors. If missing ancestors were treated as negatives, training a multi-label concept bank would inject systematic false negatives for parent concepts (e.g., `Labrador`=1 but `Dog`=0), forcing the model to suppress parent predictions that are logically entailed by labeled descendants. To avoid this pathology, we apply an *upward hierarchical closure* to the ground-truth labels before training: whenever a class is present, all its ancestors in the provided hierarchy are marked present as well. This makes supervision consistent with the intended semantics that ancestors represent coarse presence.
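Upward closure amounts to walking each labeled class up to the root of the hierarchy. The sketch below assumes a single-parent hierarchy stored as a child-to-parent dictionary with no cycles; the function name and the class names in the usage note are illustrative only.

```python
def upward_closure(labels, parent):
    """Return the label set closed under 'class present => all ancestors present'.

    labels: iterable of class names annotated in one image
    parent: dict mapping a class to its parent (roots are absent from the dict)
    """
    closed = set(labels)
    for cls in labels:
        p = parent.get(cls)
        while p is not None:      # walk up until a root is reached
            closed.add(p)
            p = parent.get(p)
    return closed
```

For example, with `parent = {"Labrador": "Dog", "Dog": "Animal"}`, closing `{"Labrador"}` yields `{"Labrador", "Dog", "Animal"}`, so a child label can never appear without its parent in the closed label space.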

A direct consequence is that *child $\rightarrow$ parent implications become tautological in the closed label space* (there are no hard violations by construction), and therefore cannot define informative “consistency anomalies.” For Open Images we instead evaluate constraints that remain nontrivial under upward closure, such as *inverted* implications (coarse $\rightarrow$ fine and whole $\rightarrow$ part) with antecedent gating, and mined high-confidence co-occurrence implications. In this setting, flagged violations should be interpreted as *missing-detail / semantic inconsistency signals* relative to the learned concept bank and the chosen constraint family, rather than contradictions of the closed taxonomy itself.

##### Open\-set mixture in evaluation and label\-space caveats \(Open Images\)\.

Although our Open Images concept inventory is restricted to a Top-$K$ subset (for tractable multi-label learning), the *evaluation distribution* is not purely closed-set: the validation split naturally contains many instances of object subtypes outside the Top-$K$ leaf vocabulary (e.g., unmodeled vehicle subclasses). Consequently, our reported AUROCs are measured on a realistic *mixture* of in-vocabulary and out-of-vocabulary content. This mixed setting introduces a protocol-specific label issue for implication-based constraints: when an image contains an out-of-Top-$K$ descendant of an antecedent parent, the antecedent may be absent from the Top-$K$ ground-truth vector used to compute rule truths (even after upward closure restricted to Top-$K$), rendering the implication vacuously true in the evaluation labels. This creates conservative label noise that can underestimate rule-violation detection performance.

To mitigate this, we filter evaluation samples where the image contains an out-of-vocabulary descendant of a rule antecedent (in the full Open Images label space), but that antecedent is absent in the Top-$K$ ground-truth vector used for rule-truth computation. This situation arises because upward closure is applied only within the Top-$K$ label space: descendants outside Top-$K$ cannot trigger their Top-$K$ ancestors, so the antecedent is spuriously missing and the implication is labeled vacuously true. Removing these cases reduces label noise that would otherwise penalize methods for correctly inferring the antecedent from visual evidence.
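The filtering condition can be stated as a small predicate per (sample, rule) pair. The sketch below is one possible reading of the protocol, with hypothetical data structures: `full_labels` is the set of classes annotated in the full label space, `topk_vector` maps Top-$K$ classes to 0/1 ground truth, and `descendants` maps a class to all of its descendants in the hierarchy.

```python
def keep_for_eval(full_labels, topk_vector, antecedent, descendants, topk_classes):
    """Return False for samples that the filter described above would drop.

    A sample is dropped when an out-of-vocabulary descendant of the rule
    antecedent is present in the full label space while the antecedent
    itself is absent from the Top-K ground-truth vector.
    """
    oov_descendant_present = any(
        d in full_labels and d not in topk_classes
        for d in descendants.get(antecedent, ())
    )
    antecedent_absent = topk_vector.get(antecedent, 0) == 0
    return not (oov_descendant_present and antecedent_absent)
```

Samples where the antecedent is already positive are kept even if out-of-vocabulary descendants are present, since the implication is then not vacuous.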

A separate caveat concerns *hierarchical closure itself*: upward closure can activate coarse ancestors based on labeled descendants, and depending on the ontology semantics, this may not coincide with “whole object visibly present” in the scene (and may introduce its own inconsistencies). This is a generic consequence of adopting closure on a given hierarchy (shared by many closure-based pipelines) rather than a pathology specific to our Top-$K$ implication evaluation protocol; our evaluation correction targets the former protocol-induced vacuity/coverage issue, not the closure assumption.

## Appendix C Detailed Results Tables

### C.1 CLEVRER – Detailed Results Tables

Table 5: CLEVRER leaf-bank evaluation on the validation split. “Prev.” denotes the number of positive validation videos for the class.

| Class | Prev. | ROC-AUC | AP | Acc.@0.5 |
| --- | --- | --- | --- | --- |
| `blue_sphere` | 2263 | 0.991 | 0.989 | 0.956 |
| `red_sphere` | 2210 | 0.988 | 0.984 | 0.940 |
| `green_cube` | 2077 | 0.948 | 0.893 | 0.900 |
| `yellow_cyl` | 2212 | 0.963 | 0.929 | 0.918 |
| `metal_any` | 4837 | 0.544 | 0.971 | 0.852 |
| `gray_cyl` | 2178 | 0.961 | 0.925 | 0.864 |
| `enter(shape=sphere)` | 2817 | 0.818 | 0.832 | 0.694 |
| `exit(shape=sphere)` | 652 | 0.904 | 0.542 | 0.871 |
| `enter(shape=cube)` | 2429 | 0.674 | 0.629 | 0.625 |
| `exit(shape=cube)` | 199 | 0.827 | 0.158 | 0.830 |
| `enter(shape=cylinder)` | 2682 | 0.682 | 0.685 | 0.632 |
| `exit(shape=cylinder)` | 310 | 0.835 | 0.257 | 0.808 |
| `enter(color=red)` | 1150 | 0.957 | 0.801 | 0.927 |
| `exit(color=red)` | 131 | 0.964 | 0.451 | 0.949 |
| `enter(color=green)` | 1122 | 0.954 | 0.782 | 0.915 |
| `exit(color=green)` | 143 | 0.946 | 0.393 | 0.931 |
| `enter(color=blue)` | 1110 | 0.950 | 0.735 | 0.906 |
| `exit(color=blue)` | 173 | 0.956 | 0.411 | 0.932 |
| `enter(color=yellow)` | 1144 | 0.951 | 0.765 | 0.911 |
| `exit(color=yellow)` | 123 | 0.957 | 0.392 | 0.966 |
| `enter(color=gray)` | 1158 | 0.940 | 0.761 | 0.867 |
| `exit(color=gray)` | 170 | 0.955 | 0.434 | 0.949 |
| `enter(color=brown)` | 1130 | 0.947 | 0.761 | 0.914 |
| `exit(color=brown)` | 157 | 0.952 | 0.384 | 0.940 |
| `enter(color=purple)` | 1126 | 0.961 | 0.783 | 0.913 |
| `exit(color=purple)` | 143 | 0.962 | 0.409 | 0.899 |
| `enter(color=cyan)` | 1184 | 0.952 | 0.777 | 0.906 |
| `exit(color=cyan)` | 171 | 0.951 | 0.460 | 0.938 |
| `collide(shape=sphere,shape=sphere)` | 1181 | 0.952 | 0.856 | 0.891 |
| `collide_before_half(shape=sphere,shape=sphere)` | 760 | 0.936 | 0.694 | 0.879 |
| `entered_then_collided(shape=sphere,shape=sphere)` | 863 | 0.915 | 0.643 | 0.868 |
| `collide(shape=sphere,shape=cube)` | 2056 | 0.758 | 0.636 | 0.673 |
| `collide_before_half(shape=sphere,shape=cube)` | 1470 | 0.762 | 0.526 | 0.627 |
| `entered_then_collided(shape=sphere,shape=cube)` | 1506 | 0.709 | 0.462 | 0.612 |
| `collide(shape=sphere,shape=cylinder)` | 2360 | 0.745 | 0.657 | 0.688 |
| `collide_before_half(shape=sphere,shape=cylinder)` | 1669 | 0.742 | 0.509 | 0.633 |
| `entered_then_collided(shape=sphere,shape=cylinder)` | 1689 | 0.698 | 0.476 | 0.624 |
| `collide(shape=sphere,color=red)` | 904 | 0.962 | 0.826 | 0.915 |
| `collide_before_half(shape=sphere,color=red)` | 589 | 0.947 | 0.653 | 0.902 |
| `entered_then_collided(shape=sphere,color=red)` | 636 | 0.930 | 0.602 | 0.887 |
| `collide(shape=sphere,color=green)` | 875 | 0.967 | 0.850 | 0.922 |
| `collide_before_half(shape=sphere,color=green)` | 566 | 0.954 | 0.689 | 0.907 |
| `entered_then_collided(shape=sphere,color=green)` | 636 | 0.935 | 0.597 | 0.890 |
| `collide(shape=sphere,color=blue)` | 929 | 0.962 | 0.838 | 0.914 |
| `collide_before_half(shape=sphere,color=blue)` | 623 | 0.944 | 0.658 | 0.894 |
| `entered_then_collided(shape=sphere,color=blue)` | 650 | 0.927 | 0.586 | 0.885 |
| `collide(shape=sphere,color=yellow)` | 912 | 0.968 | 0.857 | 0.914 |
| `collide_before_half(shape=sphere,color=yellow)` | 604 | 0.952 | 0.695 | 0.881 |
| `entered_then_collided(shape=sphere,color=yellow)` | 641 | 0.935 | 0.613 | 0.886 |
| `collide(shape=sphere,color=gray)` | 894 | 0.956 | 0.821 | 0.906 |
| `collide_before_half(shape=sphere,color=gray)` | 580 | 0.939 | 0.638 | 0.913 |
| `entered_then_collided(shape=sphere,color=gray)` | 642 | 0.926 | 0.608 | 0.898 |

Macro average: ROC-AUC $=0.900$, AP $=0.659$, Acc.@0.5 $=0.859$

Table 6: Per-rule CLEVRER results for the Independent-Events Probabilistic evaluator. For readability, repeated predicate arguments within the same row are shown only on first occurrence.

| ID | Rule | AUROC |
| --- | --- | --- |
| 01 | \[Chain\] $(\text{collide}(\text{shape=sphere},\text{color=brown})\to\text{collide\_before\_half})\wedge(\text{collide\_before\_half}\to\text{entered\_then\_collided})$ | 0.561 |
| 02 | \[Chain\] $(\text{collide}(\text{shape=cube},\text{color=cyan})\to\text{collide\_before\_half})\wedge(\text{collide\_before\_half}\to\text{entered\_then\_collided})$ | 0.795 |
| 03 | \[Chain\] $(\text{collide}(\text{shape=sphere},\text{color=green})\to\text{entered\_then\_collided})\wedge(\text{entered\_then\_collided}\to\text{collide\_before\_half})$ | 0.780 |
| 04 | \[Chain\] $(\text{collide}(\text{shape=cube},\text{color=blue})\to\text{entered\_then\_collided})\wedge(\text{entered\_then\_collided}\to\text{collide\_before\_half})$ | 0.840 |
| 05 | $\text{collide}(\text{shape=sphere},\text{shape=cube})\to(\text{collide\_before\_half}\wedge\text{entered\_then\_collided})$ \[mined-AND; conf=0.51, hits=1041, viol=1015\] | 0.676 |
| 06 | $\text{collide}(\text{shape=sphere},\text{shape=cylinder})\to(\text{collide\_before\_half}\wedge\text{entered\_then\_collided})$ \[mined-AND; conf=0.49, hits=1148, viol=1212\] | 0.688 |
| 07 | $\text{collide}(\text{shape=cube},\text{color=cyan})\to(\text{collide\_before\_half}\vee\text{entered\_then\_collided})$ \[mined-OR; conf=0.95, hits=726, viol=42\] | 0.679 |
| 08 | $\text{collide}(\text{shape=sphere},\text{shape=cube})\to(\text{collide\_before\_half}\vee\text{entered\_then\_collided})$ \[mined-OR; conf=0.94, hits=1935, viol=121\] | 0.515 |
| 09 | $\text{collide}(\text{shape=sphere},\text{shape=sphere})\to(\text{collide\_before\_half}\to\text{entered\_then\_collided})$ \[mined-IMP; conf=0.81, hits=961, viol=220\] | 0.700 |
| 10 | $\text{collide}(\text{shape=sphere},\text{color=green})\to(\text{collide\_before\_half}\to\text{entered\_then\_collided})$ \[mined-IMP; conf=0.80, hits=704, viol=171\] | 0.839 |
| 11 | $\text{collide}(\text{shape=sphere},\text{shape=sphere})\to\text{collide\_before\_half}(\text{shape=sphere},\text{shape=sphere})$ | 0.783 |
| 12 | $\text{collide}(\text{shape=sphere},\text{shape=sphere})\to\text{entered\_then\_collided}(\text{shape=sphere},\text{shape=sphere})$ | 0.734 |
| 13 | $\text{collide}(\text{shape=sphere},\text{shape=cube})\to\text{collide\_before\_half}(\text{shape=sphere},\text{shape=cube})$ | 0.619 |
| 14 | $\text{collide}(\text{shape=sphere},\text{shape=cube})\to\text{entered\_then\_collided}(\text{shape=sphere},\text{shape=cube})$ | 0.658 |
| 15 | $\text{collide}(\text{shape=sphere},\text{shape=cylinder})\to\text{collide\_before\_half}(\text{shape=sphere},\text{shape=cylinder})$ | 0.624 |
| 16 | $\text{collide}(\text{shape=sphere},\text{shape=cylinder})\to\text{entered\_then\_collided}(\text{shape=sphere},\text{shape=cylinder})$ | 0.651 |
| 17 | $\text{collide}(\text{shape=sphere},\text{color=red})\to\text{collide\_before\_half}(\text{shape=sphere},\text{color=red})$ | 0.855 |
| 18 | $\text{collide}(\text{shape=sphere},\text{color=red})\to\text{entered\_then\_collided}(\text{shape=sphere},\text{color=red})$ | 0.861 |
| 19 | $\text{collide}(\text{shape=sphere},\text{color=green})\to\text{collide\_before\_half}(\text{shape=sphere},\text{color=green})$ | 0.851 |
| 20 | $\text{collide}(\text{shape=sphere},\text{color=green})\to\text{entered\_then\_collided}(\text{shape=sphere},\text{color=green})$ | 0.772 |
| 21 | $\text{collide}(\text{shape=sphere},\text{color=blue})\to\text{collide\_before\_half}(\text{shape=sphere},\text{color=blue})$ | 0.830 |
| 22 | $\text{collide}(\text{shape=sphere},\text{color=blue})\to\text{entered\_then\_collided}(\text{shape=sphere},\text{color=blue})$ | 0.820 |
| 23 | $\text{collide}(\text{shape=sphere},\text{color=yellow})\to\text{collide\_before\_half}(\text{shape=sphere},\text{color=yellow})$ | 0.811 |
| 24 | $\text{collide}(\text{shape=sphere},\text{color=yellow})\to\text{entered\_then\_collided}(\text{shape=sphere},\text{color=yellow})$ | 0.789 |
| 25 | $\text{collide}(\text{shape=sphere},\text{color=gray})\to\text{collide\_before\_half}(\text{shape=sphere},\text{color=gray})$ | 0.881 |
| 26 | $\text{collide}(\text{shape=sphere},\text{color=gray})\to\text{entered\_then\_collided}(\text{shape=sphere},\text{color=gray})$ | 0.897 |
Table 7:Per\-rule CLEVRER results for theNeural Evaluator\. For readability, repeated predicate arguments within the same row are shown only on first occurrence\.IDRuleAUROC01\[Chain\]\(collide​\(shape=sphere,color=brown\)→collide\_before\_half\)\(\\text\{collide\}\(\\text\{shape=sphere\},\\text\{color=brown\}\)\\to\\text\{collide\\\_before\\\_half\}\)∧\(collide\_before\_half→entered\_then\_collided\)\\wedge\\;\(\\text\{collide\\\_before\\\_half\}\\to\\text\{entered\\\_then\\\_collided\}\)0\.89502\[Chain\]\(collide​\(shape=cube,color=cyan\)→collide\_before\_half\)\(\\text\{collide\}\(\\text\{shape=cube\},\\text\{color=cyan\}\)\\to\\text\{collide\\\_before\\\_half\}\)∧\(collide\_before\_half→entered\_then\_collided\)\\wedge\\;\(\\text\{collide\\\_before\\\_half\}\\to\\text\{entered\\\_then\\\_collided\}\)0\.89103\[Chain\]\(collide​\(shape=sphere,color=green\)→entered\_then\_collided\)\(\\text\{collide\}\(\\text\{shape=sphere\},\\text\{color=green\}\)\\to\\text\{entered\\\_then\\\_collided\}\)∧\(entered\_then\_collided→collide\_before\_half\)\\wedge\\;\(\\text\{entered\\\_then\\\_collided\}\\to\\text\{collide\\\_before\\\_half\}\)0\.88304\[Chain\]\(collide​\(shape=cube,color=blue\)→entered\_then\_collided\)\(\\text\{collide\}\(\\text\{shape=cube\},\\text\{color=blue\}\)\\to\\text\{entered\\\_then\\\_collided\}\)∧\(entered\_then\_collided→collide\_before\_half\)\\wedge\\;\(\\text\{entered\\\_then\\\_collided\}\\to\\text\{collide\\\_before\\\_half\}\)0\.89305collide​\(shape=sphere,shape=cube\)→\(collide\_before\_half∧entered\_then\_collided\)\\text\{collide\}\(\\text\{shape=sphere\},\\text\{shape=cube\}\)\\to\(\\text\{collide\\\_before\\\_half\}\\wedge\\text\{entered\\\_then\\\_collided\}\)\[mined\-AND; conf=0\.51, hits=1041, 
viol=1015\]0\.76006collide​\(shape=sphere,shape=cylinder\)→\(collide\_before\_half∧entered\_then\_collided\)\\text\{collide\}\(\\text\{shape=sphere\},\\text\{shape=cylinder\}\)\\to\(\\text\{collide\\\_before\\\_half\}\\wedge\\text\{entered\\\_then\\\_collided\}\)\[mined\-AND; conf=0\.49, hits=1148, viol=1212\]0\.72807collide​\(shape=cube,color=cyan\)→\(collide\_before\_half∨entered\_then\_collided\)\\text\{collide\}\(\\text\{shape=cube\},\\text\{color=cyan\}\)\\to\(\\text\{collide\\\_before\\\_half\}\\vee\\text\{entered\\\_then\\\_collided\}\)\[mined\-OR; conf=0\.95, hits=726, viol=42\]0\.86808collide​\(shape=sphere,shape=cube\)→\(collide\_before\_half∨entered\_then\_collided\)\\text\{collide\}\(\\text\{shape=sphere\},\\text\{shape=cube\}\)\\to\(\\text\{collide\\\_before\\\_half\}\\vee\\text\{entered\\\_then\\\_collided\}\)\[mined\-OR; conf=0\.94, hits=1935, viol=121\]0\.76409collide​\(shape=sphere,shape=sphere\)→\(collide\_before\_half→entered\_then\_collided\)\\text\{collide\}\(\\text\{shape=sphere\},\\text\{shape=sphere\}\)\\to\(\\text\{collide\\\_before\\\_half\}\\to\\text\{entered\\\_then\\\_collided\}\)\[mined\-IMP; conf=0\.81, hits=961, viol=220\]0\.85610collide​\(shape=sphere,color=green\)→\(collide\_before\_half→entered\_then\_collided\)\\text\{collide\}\(\\text\{shape=sphere\},\\text\{color=green\}\)\\to\(\\text\{collide\\\_before\\\_half\}\\to\\text\{entered\\\_then\\\_collided\}\)\[mined\-IMP; conf=0\.80, hits=704, 
viol=171\] 0.904

| 11 | collide(shape=sphere,shape=sphere) → collide\_before\_half(shape=sphere,shape=sphere) | 0.837 |
| 12 | collide(shape=sphere,shape=sphere) → entered\_then\_collided(shape=sphere,shape=sphere) | 0.881 |
| 13 | collide(shape=sphere,shape=cube) → collide\_before\_half(shape=sphere,shape=cube) | 0.602 |
| 14 | collide(shape=sphere,shape=cube) → entered\_then\_collided(shape=sphere,shape=cube) | 0.695 |
| 15 | collide(shape=sphere,shape=cylinder) → collide\_before\_half(shape=sphere,shape=cylinder) | 0.585 |
| 16 | collide(shape=sphere,shape=cylinder) → entered\_then\_collided(shape=sphere,shape=cylinder) | 0.686 |
| 17 | collide(shape=sphere,color=red) → collide\_before\_half(shape=sphere,color=red) | 0.875 |
| 18 | collide(shape=sphere,color=red) → entered\_then\_collided(shape=sphere,color=red) | 0.914 |
| 19 | collide(shape=sphere,color=green) → collide\_before\_half(shape=sphere,color=green) | 0.879 |
| 20 | collide(shape=sphere,color=green) → entered\_then\_collided(shape=sphere,color=green) | 0.915 |
| 21 | collide(shape=sphere,color=blue) → collide\_before\_half(shape=sphere,color=blue) | 0.869 |
| 22 | collide(shape=sphere,color=blue) → entered\_then\_collided(shape=sphere,color=blue) | 0.908 |
| 23 | collide(shape=sphere,color=yellow) → collide\_before\_half(shape=sphere,color=yellow) | 0.877 |
| 24 | collide(shape=sphere,color=yellow) → entered\_then\_collided(shape=sphere,color=yellow) | 0.913 |
| 25 | collide(shape=sphere,color=gray) → collide\_before\_half(shape=sphere,color=gray) | 0.875 |
| 26 | collide(shape=sphere,color=gray) → entered\_then\_collided(shape=sphere,color=gray) | 0.905 |
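The CLEVRER rules listed here are implications of the form antecedent → consequent. As a minimal sketch, assuming each concept is available as a hard Boolean label per video, a rule counts as violated exactly when the antecedent holds and the consequent does not. The dictionary layout, the helper names `implies` and `rule_violated`, and the label values are illustrative assumptions, not taken from the paper.

```python
from typing import Dict


def implies(antecedent: bool, consequent: bool) -> bool:
    """Material implication: false only when the antecedent holds and the consequent fails."""
    return (not antecedent) or consequent


def rule_violated(labels: Dict[str, bool], antecedent: str, consequent: str) -> bool:
    """Return True when the rule `antecedent -> consequent` is violated for one sample."""
    return not implies(labels[antecedent], labels[consequent])


# Hypothetical ground-truth concept labels for a single video (illustrative values).
video_labels = {
    "collide(shape=sphere,shape=cube)": True,
    "collide_before_half(shape=sphere,shape=cube)": False,
}

violated = rule_violated(
    video_labels,
    "collide(shape=sphere,shape=cube)",
    "collide_before_half(shape=sphere,shape=cube)",
)
print(violated)  # True: the antecedent holds but the consequent does not
```

A sample whose antecedent is false satisfies the rule vacuously, which is one reason genuine violations can be rare in ordinary data.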
### C\.2 OpenImages – Detailed Results Tables

| Class | Prev. | ROC-AUC | AP | Acc.@0.5 | Class | Prev. | ROC-AUC | AP | Acc.@0.5 |
|---|---|---|---|---|---|---|---|---|---|
| Man | 3734 | 0.931 | 0.691 | 0.885 | Tower | 143 | 0.994 | 0.662 | 0.976 |
| Tree | 3611 | 0.932 | 0.772 | 0.911 | Human nose | 2612 | 0.952 | 0.742 | 0.891 |
| Human face | 3124 | 0.950 | 0.785 | 0.884 | Bicycle wheel | 245 | 0.993 | 0.806 | 0.942 |
| Person | 7144 | 0.827 | 0.637 | 0.771 | Glasses | 258 | 0.970 | 0.548 | 0.935 |
| Woman | 2775 | 0.942 | 0.720 | 0.889 | Dress | 710 | 0.961 | 0.552 | 0.922 |
| Footwear | 1907 | 0.936 | 0.570 | 0.877 | Vehicle | 1545 | 0.639 | 0.118 | 0.800 |
| Window | 904 | 0.937 | 0.414 | 0.868 | Bird | 593 | 0.985 | 0.801 | 0.941 |
| Flower | 2336 | 0.990 | 0.934 | 0.955 | Street light | 30 | 0.991 | 0.347 | 0.980 |
| Wheel | 4473 | 0.964 | 0.794 | 0.879 | Human mouth | 1979 | 0.950 | 0.641 | 0.868 |
| Car | 5095 | 0.990 | 0.960 | 0.923 | Palm tree | 122 | 0.995 | 0.750 | 0.977 |
| Human hair | 4325 | 0.941 | 0.796 | 0.893 | Book | 95 | 0.988 | 0.692 | 0.943 |
| Human arm | 3532 | 0.914 | 0.617 | 0.866 | Tableware | 411 | 0.969 | 0.367 | 0.917 |
| Human head | 3711 | 0.923 | 0.668 | 0.860 | Drink | 356 | 0.967 | 0.365 | 0.892 |
| Girl | 2135 | 0.945 | 0.665 | 0.895 | Bottle | 207 | 0.979 | 0.522 | 0.913 |
| Building | 1884 | 0.906 | 0.544 | 0.919 | Bicycle | 252 | 0.986 | 0.741 | 0.942 |
| House | 570 | 0.977 | 0.572 | 0.957 | Furniture | 626 | 0.944 | 0.314 | 0.908 |
| Chair | 311 | 0.973 | 0.607 | 0.926 | Sculpture | 190 | 0.972 | 0.319 | 0.924 |
| Tire | 2227 | 0.939 | 0.515 | 0.803 | Flag | 74 | 0.951 | 0.541 | 0.884 |
| Suit | 263 | 0.987 | 0.775 | 0.982 | Dog | 1586 | 0.993 | 0.937 | 0.900 |
| Boy | 579 | 0.938 | 0.328 | 0.910 | Dessert | 642 | 0.990 | 0.729 | 0.918 |
| Table | 593 | 0.962 | 0.455 | 0.927 | Skyscraper | 56 | 0.995 | 0.461 | 0.978 |
| Land vehicle | 1720 | 0.791 | 0.198 | 0.802 | Boat | 567 | 0.992 | 0.806 | 0.949 |
| Jeans | 285 | 0.955 | 0.343 | 0.916 | Human eye | 1962 | 0.965 | 0.708 | 0.908 |
| Human hand | 2056 | 0.916 | 0.492 | 0.845 | Human leg | 1799 | 0.925 | 0.508 | 0.858 |
| Toy | 308 | 0.947 | 0.360 | 0.903 | | | | | |

Macro average: ROC-AUC = 0.948, AP = 0.596, Acc.@0.5 = 0.904

Table 8: OpenImages leaf-bank evaluation on the validation split. “Prev.” denotes the number of positive validation images for the class.

| Not-top49 class | Ancestors in Top-49 |
|---|---|
| Truck | Land vehicle, Vehicle |
| Van | Land vehicle, Vehicle, Car |
| Motorcycle | Land vehicle, Vehicle |
| Helicopter | Vehicle |
| Watercraft | Vehicle |
| Cart | Land vehicle, Vehicle |
| Bus | Land vehicle, Vehicle |
| Train | Land vehicle, Vehicle |
| Tank | Land vehicle, Vehicle, Building |
| Wheelchair | Land vehicle, Vehicle |
| Taxi | Land vehicle, Vehicle |
| Limousine | Land vehicle, Vehicle, Car |
| Gondola | Boat, Vehicle |
| Ambulance | Land vehicle, Vehicle |
| Segway | Land vehicle, Vehicle |
| Barge | Boat, Vehicle |
| Golf cart | Land vehicle, Vehicle |
| Snowmobile | Land vehicle, Vehicle |
| Seat belt | Land vehicle, Vehicle |

Table 9: Examples of classes outside the Top-49 whose ancestors are nevertheless present within the Top-49 set. This illustrates how hierarchical closure can introduce parent-level signals into evaluation even when the corresponding fine-grained child class is not itself included in the selected label vocabulary.

| ID | Rule | Indep. Events | Neural Eval. |
|---|---|---|---|
| 001 | Tableware → Bottle | 0.891 | 0.954 |
| 002 | Furniture → Chair | 0.830 | 0.934 |
| 003 | Furniture → Table | 0.819 | 0.934 |
| 004 | Building → Skyscraper | 0.869 | 0.911 |
| 005 | Building → Tower | 0.867 | 0.908 |
| 006 | Building → House | 0.843 | 0.887 |
| 007 | Bicycle → Bicycle wheel | 0.668 | 0.925 |
| 008 | Tree → Palm tree | 0.902 | 0.940 |
| 009 | Vehicle → Land vehicle | 0.837 | 0.798 |
| 010 | Land vehicle → Bicycle | 0.788 | 0.986 |
| 011 | Land vehicle → Car | 0.722 | 0.829 |
| 012 | Bicycle wheel → Bicycle | nan | nan |
| 013 | Furniture → (Chair ∧ Table) | 0.854 | 0.938 |
| 014 | Building → (Skyscraper ∧ Tower) | 0.873 | 0.877 |
| 015 | Building → (Tower ∧ House) | 0.877 | 0.888 |
| 016 | Land vehicle → (Bicycle ∧ Car) | 0.765 | 0.989 |

Table 10: Per-rule OpenImages results (ROC-AUC) comparing the Independent-Events probabilistic baseline and the full Neural Evaluator. Rule 012 has undefined AUROC due to degenerate evaluation labels.
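The `nan` entries for Rule 012 follow from how AUROC is defined: it compares the scores of positive and negative samples, so it has no value when the evaluation labels contain only one class. The pure-Python sketch below uses the pairwise (Mann-Whitney) formulation to illustrate this. The function name `auroc` and the toy labels and scores are illustrative assumptions, not the paper's evaluation code.

```python
import math
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise AUROC: fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Returns NaN when either class is
    absent, because no (positive, negative) pair exists to compare.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return math.nan
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Both classes present: 3 of the 4 (positive, negative) pairs are ordered correctly.
print(auroc([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.6]))  # 0.75

# Degenerate labels (no positives): AUROC is undefined.
print(math.isnan(auroc([0, 0, 0], [0.1, 0.5, 0.9])))  # True
```

When averaging per-rule AUROC, such undefined entries have to be excluded or reported separately, since any arithmetic involving NaN yields NaN.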
### C\.3 VidOR – Detailed Results Tables

Table 11: VidOR leaf-bank evaluation on the validation split. “Prev.” denotes the number of positive validation videos for the class.

| Class | Prev. | ROC-AUC | AP | Acc.@thr |
|---|---|---|---|---|
| **Object leaves** | | | | |
| obj:adult | 607 | 0.483 | 0.716 | 0.727 |
| obj:child | 299 | 0.625 | 0.442 | 0.640 |
| obj:toy | 115 | 0.688 | 0.219 | 0.862 |
| obj:baby | 122 | 0.697 | 0.245 | 0.854 |
| obj:chair | 65 | 0.641 | 0.119 | 0.922 |
| obj:dog | 73 | 0.622 | 0.122 | 0.913 |
| obj:car | 56 | 0.769 | 0.173 | 0.933 |
| obj:table | 75 | 0.709 | 0.212 | 0.910 |
| obj:cup | 44 | 0.696 | 0.104 | 0.947 |
| obj:sofa | 66 | 0.739 | 0.169 | 0.921 |
| obj:ball/sports\_ball | 53 | 0.665 | 0.142 | 0.937 |
| obj:bottle | 35 | 0.650 | 0.071 | 0.958 |
| obj:screen/monitor | 53 | 0.698 | 0.174 | 0.937 |
| obj:guitar | 36 | 0.740 | 0.090 | 0.957 |
| obj:cat | 23 | 0.553 | 0.033 | 0.972 |
| obj:backpack | 30 | 0.657 | 0.068 | 0.964 |
| obj:bicycle | 29 | 0.716 | 0.096 | 0.965 |
| obj:baby\_seat | 28 | 0.668 | 0.071 | 0.966 |
| obj:watercraft | 9 | 0.844 | 0.039 | 0.989 |
| obj:bird | 14 | 0.704 | 0.031 | 0.983 |
| obj:laptop | 15 | 0.758 | 0.075 | 0.982 |
| obj:stool | 13 | 0.733 | 0.051 | 0.984 |
| obj:camera | 22 | 0.490 | 0.034 | 0.974 |
| obj:dish | 8 | 0.806 | 0.078 | 0.990 |
| obj:cellphone | 20 | 0.555 | 0.079 | 0.976 |
| obj:handbag | 25 | 0.694 | 0.122 | 0.970 |
| obj:duck | 6 | 0.800 | 0.027 | 0.993 |
| obj:horse | 11 | 0.683 | 0.029 | 0.987 |
| obj:motorcycle | 9 | 0.747 | 0.024 | 0.989 |
| obj:piano | 6 | 0.838 | 0.026 | 0.993 |
| obj:bench | 10 | 0.647 | 0.025 | 0.988 |
| obj:cake | 6 | 0.754 | 0.016 | 0.993 |
| obj:ski | 12 | 0.755 | 0.031 | 0.986 |
| obj:baby\_walker | 11 | 0.505 | 0.015 | 0.987 |
| obj:elephant | 4 | 0.845 | 0.019 | 0.995 |
| obj:fish | 4 | 0.634 | 0.008 | 0.995 |
| obj:snowboard | 2 | 0.893 | 0.024 | 0.998 |
| obj:bat | 11 | 0.865 | 0.111 | 0.987 |
| obj:penguin | 2 | 0.870 | 0.013 | 0.998 |
| obj:chicken | 2 | 0.559 | 0.004 | 0.998 |
| **Relation leaves** | | | | |
| rel:adult-in\_front\_of-adult | 236 | 0.534 | 0.338 | 0.717 |
| rel:adult-next\_to-adult | 239 | 0.578 | 0.351 | 0.714 |
| rel:adult-watch-adult | 158 | 0.577 | 0.261 | 0.811 |
| rel:adult-behind-adult | 151 | 0.530 | 0.190 | 0.819 |
| rel:child-in\_front\_of-child | 71 | 0.562 | 0.117 | 0.915 |
| rel:child-next\_to-child | 65 | 0.603 | 0.126 | 0.922 |
| rel:child-in\_front\_of-adult | 127 | 0.610 | 0.196 | 0.848 |
| rel:adult-watch-child | 87 | 0.583 | 0.136 | 0.896 |
| rel:adult-away-adult | 98 | 0.605 | 0.181 | 0.883 |
| rel:child-watch-child | 51 | 0.554 | 0.094 | 0.939 |
| rel:adult-in\_front\_of-child | 97 | 0.617 | 0.152 | 0.884 |
| rel:adult-towards-adult | 73 | 0.528 | 0.101 | 0.913 |
| rel:adult-next\_to-child | 103 | 0.616 | 0.167 | 0.877 |
| rel:child-behind-child | 43 | 0.527 | 0.056 | 0.949 |
| rel:adult-behind-child | 75 | 0.547 | 0.105 | 0.910 |
| rel:child-watch-adult | 55 | 0.664 | 0.116 | 0.934 |
| rel:toy-next\_to-toy | 31 | 0.697 | 0.103 | 0.963 |
| rel:child-next\_to-adult | 77 | 0.624 | 0.138 | 0.908 |
| rel:cup-in\_front\_of-adult | 30 | 0.706 | 0.074 | 0.964 |
| rel:adult-watch-baby | 45 | 0.696 | 0.104 | 0.946 |
| rel:adult-next\_to-table | 40 | 0.729 | 0.111 | 0.952 |
| rel:child-next\_to-toy | 56 | 0.699 | 0.125 | 0.933 |
| rel:table-in\_front\_of-adult | 32 | 0.703 | 0.071 | 0.962 |
| rel:toy-in\_front\_of-child | 45 | 0.676 | 0.101 | 0.946 |
| rel:adult-next\_to-cup | 33 | 0.710 | 0.082 | 0.960 |
| rel:baby-in\_front\_of-adult | 59 | 0.694 | 0.124 | 0.929 |
| rel:dog-next\_to-dog | 20 | 0.704 | 0.044 | 0.976 |
| rel:child-watch-toy | 34 | 0.668 | 0.063 | 0.959 |
| rel:guitar-in\_front\_of-adult | 30 | 0.725 | 0.078 | 0.964 |
| rel:adult-speak\_to-adult | 39 | 0.754 | 0.136 | 0.953 |
| rel:child-away-child | 26 | 0.478 | 0.034 | 0.969 |
| rel:dog-in\_front\_of-dog | 22 | 0.686 | 0.044 | 0.974 |
| rel:toy-in\_front\_of-adult | 21 | 0.655 | 0.050 | 0.975 |
| rel:child-towards-child | 28 | 0.558 | 0.057 | 0.966 |
| rel:bottle-in\_front\_of-adult | 21 | 0.706 | 0.059 | 0.975 |
| rel:dog-in\_front\_of-adult | 26 | 0.726 | 0.089 | 0.969 |
| rel:adult-next\_to-toy | 27 | 0.697 | 0.063 | 0.968 |
| rel:child-away-adult | 36 | 0.547 | 0.052 | 0.957 |
| rel:adult-next\_to-bottle | 22 | 0.696 | 0.058 | 0.974 |
| rel:child-towards-adult | 33 | 0.589 | 0.070 | 0.960 |
| rel:baby-watch-adult | 21 | 0.665 | 0.046 | 0.975 |
| rel:baby-next\_to-toy | 26 | 0.749 | 0.073 | 0.969 |
| rel:dog-watch-dog | 13 | 0.640 | 0.030 | 0.984 |
| rel:child-hold-toy | 31 | 0.700 | 0.068 | 0.963 |
| rel:adult-in\_front\_of-baby | 39 | 0.654 | 0.075 | 0.953 |
| rel:child-next\_to-table | 30 | 0.786 | 0.168 | 0.964 |
| rel:cup-next\_to-cup | 10 | 0.696 | 0.026 | 0.988 |
| rel:adult-next\_to-baby | 37 | 0.743 | 0.094 | 0.956 |
| rel:child-behind-adult | 29 | 0.626 | 0.055 | 0.965 |
| rel:adult-watch-guitar | 8 | 0.678 | 0.019 | 0.990 |
| rel:adult-in\_front\_of-screen/monitor | 25 | 0.762 | 0.094 | 0.970 |
| rel:adult-towards-child | 19 | 0.664 | 0.041 | 0.977 |
| rel:toy-in\_front\_of-baby | 18 | 0.699 | 0.037 | 0.978 |
| rel:adult-watch-dog | 17 | 0.832 | 0.094 | 0.980 |
| rel:toy-next\_to-child | 33 | 0.714 | 0.088 | 0.960 |
| rel:adult-away-child | 22 | 0.622 | 0.045 | 0.974 |
| rel:adult-next\_to-guitar | 17 | 0.712 | 0.039 | 0.980 |
| rel:ball/sports\_ball-in\_front\_of-adult | 16 | 0.702 | 0.067 | 0.981 |
| rel:adult-in\_front\_of-dog | 24 | 0.704 | 0.065 | 0.971 |
| rel:adult-play(instrument)-guitar | 28 | 0.735 | 0.071 | 0.966 |
| rel:adult-behind-guitar | 25 | 0.720 | 0.057 | 0.970 |
| rel:table-in\_front\_of-child | 23 | 0.812 | 0.165 | 0.972 |
| rel:adult-next\_to-dog | 20 | 0.647 | 0.036 | 0.976 |
| rel:baby-watch-toy | 16 | 0.767 | 0.050 | 0.981 |
| rel:adult-next\_to-ball/sports\_ball | 17 | 0.622 | 0.046 | 0.980 |
| rel:chair-in\_front\_of-adult | 15 | 0.704 | 0.038 | 0.982 |
| rel:guitar-next\_to-adult | 16 | 0.734 | 0.039 | 0.981 |
| rel:dog-behind-dog | 17 | 0.703 | 0.041 | 0.980 |
| rel:adult-watch-ball/sports\_ball | 12 | 0.687 | 0.049 | 0.986 |
| rel:bottle-next\_to-bottle | 5 | 0.735 | 0.016 | 0.994 |
| rel:adult-next\_to-backpack | 16 | 0.700 | 0.062 | 0.981 |
| rel:cup-next\_to-adult | 15 | 0.694 | 0.032 | 0.982 |
| rel:table-next\_to-adult | 16 | 0.770 | 0.085 | 0.981 |
| rel:adult-speak\_to-child | 24 | 0.716 | 0.068 | 0.971 |
| rel:chair-next\_to-chair | 15 | 0.634 | 0.031 | 0.982 |
| rel:chair-next\_to-adult | 18 | 0.674 | 0.036 | 0.978 |
| rel:adult-lean\_on-sofa | 18 | 0.740 | 0.047 | 0.978 |
| rel:ball/sports\_ball-in\_front\_of-child | 20 | 0.663 | 0.051 | 0.976 |
| rel:adult-behind-baby | 25 | 0.697 | 0.063 | 0.970 |
| rel:child-next\_to-ball/sports\_ball | 22 | 0.605 | 0.043 | 0.974 |
| rel:dog-away-dog | 13 | 0.724 | 0.032 | 0.984 |
| rel:adult-next\_to-chair | 21 | 0.648 | 0.039 | 0.975 |
| rel:child-in\_front\_of-sofa | 31 | 0.758 | 0.153 | 0.963 |
| rel:child-watch-ball/sports\_ball | 20 | 0.689 | 0.056 | 0.976 |
| rel:dog-watch-adult | 10 | 0.792 | 0.045 | 0.988 |
| rel:chair-next\_to-table | 9 | 0.673 | 0.023 | 0.989 |
| rel:sofa-beneath-adult | 18 | 0.756 | 0.058 | 0.978 |
| rel:camera-in\_front\_of-adult | 14 | 0.604 | 0.027 | 0.983 |
| rel:adult-above-sofa | 18 | 0.759 | 0.058 | 0.978 |
| rel:backpack-behind-adult | 13 | 0.737 | 0.116 | 0.984 |
| rel:toy-next\_to-baby | 16 | 0.747 | 0.055 | 0.981 |
| rel:chair-behind-child | 8 | 0.545 | 0.012 | 0.990 |
| rel:cup-in\_front\_of-child | 7 | 0.781 | 0.039 | 0.992 |
| rel:adult-in\_front\_of-chair | 13 | 0.713 | 0.056 | 0.984 |
| rel:table-beneath-cup | 9 | 0.806 | 0.036 | 0.989 |
| rel:cup-above-table | 9 | 0.808 | 0.035 | 0.989 |
| rel:bottle-next\_to-adult | 13 | 0.734 | 0.043 | 0.984 |
| rel:adult-watch-screen/monitor | 8 | 0.725 | 0.025 | 0.990 |
| rel:cat-next\_to-cat | 5 | 0.512 | 0.007 | 0.994 |
| rel:adult-carry-backpack | 22 | 0.753 | 0.105 | 0.974 |
| rel:screen/monitor-in\_front\_of-adult | 11 | 0.718 | 0.038 | 0.987 |
| rel:adult-in\_front\_of-sofa | 8 | 0.753 | 0.021 | 0.990 |
| rel:car-in\_front\_of-adult | 11 | 0.744 | 0.030 | 0.987 |
| rel:adult-in\_front\_of-guitar | 10 | 0.759 | 0.029 | 0.988 |
| rel:cat-in\_front\_of-cat | 3 | 0.465 | 0.004 | 0.996 |
| rel:child-next\_to-cup | 7 | 0.780 | 0.029 | 0.992 |
| rel:chair-behind-adult | 13 | 0.733 | 0.047 | 0.984 |
| rel:dog-next\_to-adult | 14 | 0.621 | 0.035 | 0.983 |
| rel:child-next\_to-chair | 7 | 0.537 | 0.011 | 0.992 |
| rel:dog-towards-dog | 6 | 0.686 | 0.017 | 0.993 |
| rel:chair-next\_to-child | 11 | 0.509 | 0.014 | 0.987 |
| rel:adult-hold-cup | 15 | 0.565 | 0.027 | 0.982 |
| rel:sofa-behind-child | 23 | 0.723 | 0.057 | 0.972 |
| rel:cake-in\_front\_of-adult | 5 | 0.746 | 0.013 | 0.994 |
| rel:chair-in\_front\_of-child | 6 | 0.583 | 0.011 | 0.993 |
| rel:laptop-in\_front\_of-adult | 9 | 0.743 | 0.027 | 0.989 |
| rel:child-watch-baby | 13 | 0.686 | 0.029 | 0.984 |
| rel:cellphone-in\_front\_of-adult | 12 | 0.543 | 0.019 | 0.986 |
| rel:adult-watch-cake | 2 | 0.672 | 0.005 | 0.998 |
| rel:child-watch-cake | 2 | 0.676 | 0.005 | 0.998 |
| rel:child-in\_front\_of-screen/monitor | 16 | 0.715 | 0.226 | 0.981 |
| rel:adult-watch-toy | 5 | 0.622 | 0.013 | 0.994 |
| rel:table-in\_front\_of-chair | 7 | 0.690 | 0.029 | 0.992 |
| rel:baby-lean\_on-adult | 24 | 0.655 | 0.054 | 0.971 |
| rel:sofa-in\_front\_of-adult | 5 | 0.746 | 0.016 | 0.994 |
| rel:adult-next\_to-dish | 6 | 0.843 | 0.080 | 0.993 |
| rel:adult-hug-baby | 20 | 0.639 | 0.042 | 0.976 |
| rel:adult-ride-watercraft | 6 | 0.831 | 0.027 | 0.993 |
| rel:table-next\_to-child | 14 | 0.803 | 0.188 | 0.983 |
| rel:car-behind-adult | 18 | 0.786 | 0.073 | 0.978 |
| rel:baby-next\_to-adult | 13 | 0.634 | 0.024 | 0.984 |
| rel:cat-watch-cat | 3 | 0.370 | 0.004 | 0.996 |
| rel:adult-hit-adult | 3 | 0.592 | 0.006 | 0.996 |
| rel:dish-in\_front\_of-adult | 4 | 0.896 | 0.103 | 0.995 |
| rel:sofa-next\_to-child | 21 | 0.772 | 0.161 | 0.975 |
| rel:guitar-next\_to-guitar | 7 | 0.698 | 0.017 | 0.992 |
| rel:screen/monitor-behind-adult | 15 | 0.730 | 0.050 | 0.982 |
| rel:adult-hold\_hand\_of-adult | 13 | 0.618 | 0.024 | 0.984 |
| rel:child-in\_front\_of-chair | 9 | 0.595 | 0.014 | 0.989 |
| rel:adult-next\_to-handbag | 16 | 0.558 | 0.025 | 0.981 |
| rel:car-in\_front\_of-car | 6 | 0.668 | 0.017 | 0.993 |
| rel:watercraft-beneath-adult | 6 | 0.834 | 0.030 | 0.993 |
| rel:child-away-chair | 5 | 0.501 | 0.008 | 0.994 |
| rel:car-next\_to-car | 5 | 0.758 | 0.014 | 0.994 |
| rel:adult-above-watercraft | 5 | 0.840 | 0.029 | 0.994 |
| rel:adult-next\_to-cellphone | 14 | 0.570 | 0.022 | 0.983 |
| rel:toy-next\_to-adult | 18 | 0.678 | 0.050 | 0.978 |
| rel:baby-hold-toy | 6 | 0.748 | 0.016 | 0.993 |
| rel:sofa-next\_to-adult | 6 | 0.746 | 0.015 | 0.993 |
| rel:child-next\_to-bottle | 6 | 0.626 | 0.015 | 0.993 |
| rel:car-behind-car | 8 | 0.678 | 0.020 | 0.990 |
| rel:adult-hold-ball/sports\_ball | 6 | 0.704 | 0.060 | 0.993 |
| rel:adult-behind-dog | 10 | 0.892 | 0.075 | 0.988 |
| rel:toy-behind-child | 13 | 0.695 | 0.066 | 0.984 |
| rel:adult-hold-toy | 7 | 0.642 | 0.058 | 0.992 |
| rel:adult-lean\_on-table | 14 | 0.754 | 0.047 | 0.983 |
| rel:bird-next\_to-bird | 5 | 0.799 | 0.019 | 0.994 |
| rel:dog-in\_front\_of-child | 11 | 0.523 | 0.014 | 0.987 |
| rel:sofa-in\_front\_of-child | 15 | 0.711 | 0.166 | 0.982 |
| rel:child-watch-dog | 10 | 0.451 | 0.014 | 0.988 |
| rel:screen/monitor-next\_to-adult | 13 | 0.678 | 0.046 | 0.984 |
| rel:sofa-next\_to-sofa | 11 | 0.696 | 0.116 | 0.987 |
| rel:adult-hold\_hand\_of-child | 15 | 0.571 | 0.042 | 0.982 |
| rel:table-behind-child | 7 | 0.855 | 0.067 | 0.992 |
| rel:adult-speak\_to-baby | 8 | 0.633 | 0.014 | 0.990 |
| rel:adult-next\_to-cake | 6 | 0.777 | 0.017 | 0.993 |
| rel:dog-away-adult | 11 | 0.636 | 0.028 | 0.987 |
| rel:adult-ride-bicycle | 12 | 0.870 | 0.084 | 0.986 |
| rel:backpack-in\_front\_of-adult | 7 | 0.670 | 0.014 | 0.992 |
| rel:dog-towards-adult | 9 | 0.525 | 0.013 | 0.989 |
| rel:baby-watch-child | 6 | 0.618 | 0.012 | 0.993 |
| rel:adult-in\_front\_of-laptop | 9 | 0.712 | 0.021 | 0.989 |
| rel:toy-in\_front\_of-sofa | 15 | 0.766 | 0.078 | 0.982 |
| rel:adult-next\_to-sofa | 7 | 0.727 | 0.018 | 0.992 |
| rel:baby-lean\_on-baby\_seat | 19 | 0.674 | 0.057 | 0.977 |
| rel:child-towards-toy | 3 | 0.712 | 0.008 | 0.996 |
| rel:child-towards-chair | 6 | 0.525 | 0.011 | 0.993 |
| rel:handbag-next\_to-adult | 14 | 0.597 | 0.027 | 0.983 |
| rel:piano-in\_front\_of-adult | 5 | 0.835 | 0.025 | 0.994 |
| rel:hamster/rat-next\_to-hamster/rat | 1 | 0.384 | 0.002 | 0.999 |
| rel:table-beneath-bottle | 5 | 0.721 | 0.023 | 0.994 |
| rel:child-lean\_on-adult | 11 | 0.686 | 0.026 | 0.987 |
| rel:ball/sports\_ball-next\_to-ball/sports\_ball | 8 | 0.621 | 0.014 | 0.990 |
| rel:bottle-above-table | 5 | 0.723 | 0.024 | 0.994 |
| rel:screen/monitor-in\_front\_of-child | 9 | 0.653 | 0.236 | 0.989 |
| rel:bottle-in\_front\_of-child | 5 | 0.579 | 0.010 | 0.994 |
| rel:baby-in\_front\_of-child | 15 | 0.713 | 0.035 | 0.982 |
| rel:cake-in\_front\_of-child | 4 | 0.711 | 0.010 | 0.995 |
| rel:toy-away-toy | 6 | 0.598 | 0.013 | 0.993 |
| rel:child-away-toy | 10 | 0.545 | 0.014 | 0.988 |
| rel:child-hold\_hand\_of-child | 1 | 0.560 | 0.003 | 0.999 |
| rel:adult-away-car | 15 | 0.759 | 0.049 | 0.982 |
| rel:adult-caress-dog | 12 | 0.668 | 0.022 | 0.986 |
| rel:adult-hold-camera | 13 | 0.573 | 0.021 | 0.984 |
| rel:dog-next\_to-toy | 9 | 0.500 | 0.011 | 0.989 |
| rel:sofa-next\_to-table | 10 | 0.817 | 0.140 | 0.988 |
| rel:adult-hug-child | 16 | 0.628 | 0.031 | 0.981 |
| rel:adult-hold-bottle | 10 | 0.661 | 0.036 | 0.988 |
| rel:child-lean\_on-table | 10 | 0.832 | 0.069 | 0.988 |
| rel:baby\_seat-beneath-baby | 16 | 0.659 | 0.038 | 0.981 |
| rel:bicycle-beneath-adult | 13 | 0.847 | 0.083 | 0.984 |
| rel:ball/sports\_ball-next\_to-adult | 6 | 0.507 | 0.010 | 0.993 |
| rel:adult-above-bicycle | 13 | 0.846 | 0.084 | 0.984 |
| rel:adult-behind-car | 13 | 0.710 | 0.055 | 0.984 |
| rel:duck-next\_to-duck | 4 | 0.720 | 0.011 | 0.995 |
| rel:child-in\_front\_of-dog | 6 | 0.386 | 0.007 | 0.993 |
| rel:bottle-next\_to-cup | 4 | 0.659 | 0.009 | 0.995 |
| rel:adult-next\_to-car | 7 | 0.724 | 0.026 | 0.992 |
| rel:adult-watch-cup | 2 | 0.672 | 0.005 | 0.998 |
| rel:cup-next\_to-bottle | 4 | 0.665 | 0.009 | 0.995 |
| rel:adult-watch-laptop | 4 | 0.734 | 0.011 | 0.995 |
| rel:guitar-behind-adult | 6 | 0.738 | 0.017 | 0.993 |
| rel:bicycle-in\_front\_of-adult | 9 | 0.553 | 0.014 | 0.989 |
| rel:adult-behind-chair | 8 | 0.680 | 0.021 | 0.990 |
| rel:cat-in\_front\_of-adult | 7 | 0.457 | 0.008 | 0.992 |
| rel:toy-in\_front\_of-dog | 7 | 0.403 | 0.008 | 0.992 |
| rel:duck-in\_front\_of-duck | 5 | 0.799 | 0.035 | 0.994 |
| rel:backpack-next\_to-adult | 7 | 0.714 | 0.022 | 0.992 |
| rel:child-lean\_on-sofa | 16 | 0.766 | 0.108 | 0.981 |
| rel:table-behind-adult | 11 | 0.761 | 0.042 | 0.987 |
| rel:adult-next\_to-camera | 11 | 0.563 | 0.016 | 0.987 |
| rel:adult-push-child | 1 | 0.829 | 0.007 | 0.999 |
| rel:adult-in\_front\_of-camera | 10 | 0.559 | 0.019 | 0.988 |
| rel:ball/sports\_ball-next\_to-toy | 4 | 0.635 | 0.013 | 0.995 |
| rel:child-away-sofa | 5 | 0.683 | 0.013 | 0.994 |
| rel:child-behind-chair | 6 | 0.411 | 0.007 | 0.993 |
| rel:toy-next\_to-ball/sports\_ball | 4 | 0.607 | 0.013 | 0.995 |
| rel:ball/sports\_ball-away-adult | 7 | 0.443 | 0.011 | 0.992 |
| rel:child-next\_to-baby | 9 | 0.792 | 0.029 | 0.989 |
| rel:adult-away-chair | 6 | 0.769 | 0.031 | 0.993 |
| rel:dish-next\_to-dish | 3 | 0.827 | 0.122 | 0.996 |
| rel:child-above-sofa | 14 | 0.777 | 0.121 | 0.983 |
| rel:adult-lift-baby | 6 | 0.827 | 0.031 | 0.993 |
| rel:sofa-beneath-child | 13 | 0.796 | 0.124 | 0.984 |
| rel:adult-in\_front\_of-car | 13 | 0.781 | 0.055 | 0.984 |
| rel:adult-towards-car | 6 | 0.668 | 0.015 | 0.993 |
| rel:child-hold-ball/sports\_ball | 13 | 0.463 | 0.016 | 0.984 |
| rel:child-watch-screen/monitor | 8 | 0.751 | 0.265 | 0.990 |
| rel:baby-next\_to-child | 11 | 0.659 | 0.021 | 0.987 |
| rel:child-speak\_to-child | 2 | 0.842 | 0.170 | 0.998 |
| rel:bird-in\_front\_of-bird | 3 | 0.737 | 0.010 | 0.996 |
| rel:adult-next\_to-laptop | 2 | 0.951 | 0.035 | 0.998 |
| rel:screen/monitor-behind-child | 10 | 0.665 | 0.157 | 0.988 |
| rel:child-next\_to-cake | 4 | 0.701 | 0.009 | 0.995 |
| rel:child-ride-bicycle | 7 | 0.655 | 0.013 | 0.992 |
| rel:child-next\_to-dog | 9 | 0.426 | 0.010 | 0.989 |
| rel:baby-watch-baby | 1 | 0.759 | 0.005 | 0.999 |
| rel:adult-hold-baby | 12 | 0.691 | 0.035 | 0.986 |
| rel:adult-lean\_on-chair | 11 | 0.745 | 0.038 | 0.987 |
| rel:chair-in\_front\_of-chair | 6 | 0.553 | 0.011 | 0.993 |
| rel:baby\_seat-in\_front\_of-adult | 5 | 0.567 | 0.013 | 0.994 |
| rel:watercraft-next\_to-watercraft | 1 | 0.965 | 0.033 | 0.999 |
| rel:guitar-in\_front\_of-child | 5 | 0.894 | 0.054 | 0.994 |
| rel:adult-use-camera | 13 | 0.528 | 0.032 | 0.984 |
| rel:baby-next\_to-baby | 3 | 0.821 | 0.014 | 0.996 |
| rel:adult-carry-handbag | 12 | 0.688 | 0.024 | 0.986 |
| rel:toy-towards-toy | 5 | 0.418 | 0.006 | 0.994 |
| rel:cup-next\_to-dish | 3 | 0.790 | 0.019 | 0.996 |
| rel:cat-behind-cat | 0 | nan | nan | 1.000 |
| rel:dish-in\_front\_of-child | 1 | 0.511 | 0.002 | 0.999 |
| rel:sofa-behind-adult | 5 | 0.776 | 0.015 | 0.994 |
| rel:table-next\_to-chair | 3 | 0.683 | 0.014 | 0.996 |
| rel:cup-next\_to-child | 1 | 0.511 | 0.002 | 0.999 |
| rel:car-next\_to-adult | 8 | 0.760 | 0.022 | 0.990 |
| rel:baby-next\_to-ball/sports\_ball | 7 | 0.653 | 0.016 | 0.992 |
| rel:adult-next\_to-stool | 5 | 0.872 | 0.156 | 0.994 |
| rel:laptop-next\_to-adult | 1 | 0.969 | 0.037 | 0.999 |
| rel:toy-behind-baby | 6 | 0.732 | 0.018 | 0.993 |
| rel:adult-grab-ball/sports\_ball | 1 | 0.950 | 0.023 | 0.999 |
| rel:chicken-next\_to-chicken | 1 | 0.827 | 0.007 | 0.999 |
| rel:child-in\_front\_of-baby | 10 | 0.677 | 0.021 | 0.988 |
| rel:dish-next\_to-cup | 3 | 0.790 | 0.019 | 0.996 |
| rel:table-in\_front\_of-sofa | 12 | 0.818 | 0.088 | 0.986 |
| rel:adult-next\_to-cat | 4 | 0.358 | 0.004 | 0.995 |
| rel:adult-hold-cellphone | 9 | 0.562 | 0.018 | 0.989 |
| rel:ball/sports\_ball-next\_to-child | 12 | 0.492 | 0.018 | 0.986 |
| rel:adult-behind-camera | 11 | 0.602 | 0.039 | 0.987 |
| rel:child-towards-sofa | 6 | 0.697 | 0.013 | 0.993 |
| rel:child-next\_to-sofa | 14 | 0.733 | 0.108 | 0.983 |
| rel:dog-watch-child | 6 | 0.352 | 0.007 | 0.993 |
| rel:adult-above-chair | 10 | 0.749 | 0.045 | 0.988 |
| rel:dish-above-table | 4 | 0.755 | 0.026 | 0.995 |
| rel:dog-behind-adult | 5 | 0.851 | 0.032 | 0.994 |
| rel:chair-beneath-adult | 10 | 0.738 | 0.040 | 0.988 |
| rel:bat-in\_front\_of-adult | 8 | 0.865 | 0.103 | 0.990 |
| rel:adult-away-table | 6 | 0.627 | 0.011 | 0.993 |
| rel:dog-bite-dog | 2 | 0.426 | 0.003 | 0.998 |
| rel:bottle-next\_to-child | 3 | 0.651 | 0.010 | 0.996 |
| rel:ball/sports\_ball-behind-child | 8 | 0.631 | 0.016 | 0.990 |
| rel:table-beneath-dish | 4 | 0.755 | 0.024 | 0.995 |
| rel:child-towards-ball/sports\_ball | 11 | 0.702 | 0.039 | 0.987 |
| rel:dog-next\_to-ball/sports\_ball | 5 | 0.276 | 0.005 | 0.994 |
| rel:child-away-table | 2 | 0.809 | 0.010 | 0.998 |
| rel:cellphone-next\_to-adult | 6 | 0.622 | 0.013 | 0.993 |
| rel:child-hold\_hand\_of-adult | 8 | 0.545 | 0.056 | 0.990 |
| rel:adult-watch-camera | 6 | 0.444 | 0.022 | 0.993 |
| rel:dog-next\_to-child | 8 | 0.336 | 0.008 | 0.990 |
| rel:bicycle-beneath-child | 9 | 0.661 | 0.027 | 0.989 |
| rel:handbag-in\_front\_of-adult | 8 | 0.800 | 0.037 | 0.990 |
| rel:child-above-bicycle | 9 | 0.663 | 0.024 | 0.989 |

Macro average: ROC-AUC = 0.679, AP = 0.057, Acc.@thr = 0.976

Table 12: Per-rule VidOR results for the
Independent-Events Probabilistic evaluator.

| ID | Rule | AUROC |
|---|---|---|
| 001 | rel:adult-watch-adult → rel:adult-next\_to-adult [mined rev; sup=1111, conf=0.792, lift=2.68] | 0.557 |
| 002 | rel:adult-watch-adult → rel:adult-in\_front\_of-adult [mined rev; sup=1155, conf=0.823, lift=2.74] | 0.714 |
| 003 | rel:adult-away-adult → rel:adult-next\_to-adult [mined rev; sup=540, conf=0.770, lift=2.60] | 0.477 |
| 004 | rel:adult-towards-adult → rel:adult-next\_to-adult [mined rev; sup=507, conf=0.791, lift=2.67] | 0.650 |
| 005 | rel:adult-behind-adult → rel:adult-in\_front\_of-adult [mined rev; sup=1203, conf=0.914, lift=3.04] | 0.451 |
| 006 | rel:adult-away-adult → rel:adult-in\_front\_of-adult [mined rev; sup=593, conf=0.846, lift=2.81] | 0.633 |
| 007 | rel:adult-watch-child → rel:child-in\_front\_of-adult [mined rev; sup=601, conf=0.854, lift=5.51] | 0.563 |
| 008 | rel:child-watch-adult → rel:adult-in\_front\_of-child [mined rev; sup=346, conf=0.783, lift=6.89] | 0.297 |
| 009 | rel:adult-speak\_to-adult → rel:adult-in\_front\_of-adult [mined rev; sup=261, conf=0.813, lift=2.71] | 0.629 |
| 010 | rel:child-watch-adult → rel:child-in\_front\_of-adult [mined rev; sup=370, conf=0.837, lift=5.40] | 0.355 |
| 011 | rel:adult-towards-adult → rel:adult-in\_front\_of-adult [mined rev; sup=581, conf=0.906, lift=3.02] | 0.463 |
| 012 | rel:adult-behind-child → rel:child-in\_front\_of-adult [mined rev; sup=597, conf=0.913, lift=5.89] | 0.513 |
| 013 | rel:child-away-child → rel:child-next\_to-child [mined rev; sup=223, conf=0.785, lift=8.40] | 0.596 |
| 014 | rel:child-towards-adult → rel:adult-in\_front\_of-child [mined rev; sup=226, conf=0.843, lift=7.43] | 0.279 |
| 015 | rel:child-away-child → rel:child-behind-child [mined rev; sup=214, conf=0.754, lift=11.83] | 0.422 |
| 016 | rel:child-towards-adult → rel:child-in\_front\_of-adult [mined rev; sup=222, conf=0.828, lift=5.34] | 0.209 |
| 017 | rel:child-behind-child → rel:child-in\_front\_of-child [mined rev; sup=397, conf=0.890, lift=9.53] | 0.438 |
| 018 | rel:adult-next\_to-baby → rel:baby-in\_front\_of-adult [mined rev; sup=292, conf=0.872, lift=10.27] | 0.296 |
| 019 | rel:child-away-adult → rel:child-in\_front\_of-adult [mined rev; sup=247, conf=0.858, lift=5.53] | 0.261 |
| 020 | rel:adult-speak\_to-adult → rel:adult-next\_to-adult [mined rev; sup=283, conf=0.882, lift=2.98] | 0.797 |
| 021 | obj:baby\_seat → obj:baby [mined rev; sup=203, conf=0.857, lift=5.61] | 0.157 |
| 022 | rel:adult-in\_front\_of-baby → rel:baby-in\_front\_of-adult [mined rev; sup=333, conf=0.915, lift=10.78] | 0.467 |
| 023 | rel:child-away-child → rel:child-in\_front\_of-child [mined rev; sup=246, conf=0.866, lift=9.27] | 0.491 |
| 024 | rel:adult-speak\_to-adult → rel:adult-watch-adult [mined rev; sup=293, conf=0.913, lift=4.55] | 0.550 |
| 025 | rel:child-towards-child → rel:child-in\_front\_of-child [mined rev; sup=229, conf=0.895, lift=9.57] | 0.635 |
| 026 | obj:adult → (rel:adult-in\_front\_of-adult ∧ rel:adult-next\_to-adult) [mined inv; ov=0.71] | 0.468 |
| 027 | obj:child → (rel:child-in\_front\_of-adult ∧ rel:adult-next\_to-child) [mined inv; ov=0.72] | 0.540 |
| 028 | rel:adult-in\_front\_of-adult → (rel:adult-behind-adult ∧ rel:adult-towards-adult) [mined inv; ov=0.71] | 0.464 |
| 029 | obj:baby → (rel:baby-in\_front\_of-adult ∧ rel:adult-watch-baby) [mined inv; ov=0.94] | 0.532 |
| 030 | obj:toy → (rel:child-next\_to-toy ∧ rel:toy-in\_front\_of-child) [mined inv; ov=0.91] | 0.526 |
| 031 | obj:dog → (rel:dog-in\_front\_of-adult ∧ rel:adult-in\_front\_of-dog) [mined inv; ov=0.84] | 0.540 |
| 032 | rel:baby-in\_front\_of-adult → (rel:adult-behind-baby ∧ rel:adult-watch-baby) [mined inv; ov=0.64] | 0.597 |
| 033 | obj:table → (rel:adult-next\_to-table ∧ rel:table-in\_front\_of-adult) [mined inv; ov=0.93] | 0.357 |
| 034 | obj:bottle → (rel:adult-next\_to-bottle ∧ rel:bottle-in\_front\_of-adult) [mined inv; ov=0.90] | 0.593 |
| 035 | obj:cup → (rel:adult-next\_to-cup ∧ rel:cup-in\_front\_of-adult) [mined inv; ov=0.92] | 0.650 |

Table 13: Per-rule VidOR results for the Neural Evaluator.

| ID | Rule | AUROC |
|---|---|---|
| 001 | rel:adult-watch-adult → rel:adult-next\_to-adult [mined rev; sup=1111, conf=0.792, lift=2.68] | 0.653 |
| 002 | rel:adult-watch-adult → rel:adult-in\_front\_of-adult [mined rev; sup=1155, conf=0.823, lift=2.74] | 0.771 |
| 003 | rel:adult-away-adult → rel:adult-next\_to-adult [mined rev; sup=540, conf=0.770, lift=2.60] | 0.677 |
| 004 | rel:adult-towards-adult → rel:adult-next\_to-adult [mined rev; sup=507, conf=0.791, lift=2.67] | 0.626 |
| 005 | rel:adult-behind-adult → rel:adult-in\_front\_of-adult [mined rev; sup=1203, conf=0.914, lift=3.04] | 0.674 |
| 006 | rel:adult-away-adult → rel:adult-in\_front\_of-adult [mined rev; sup=593, conf=0.846, lift=2.81] | 0.696 |
| 007 | rel:adult-watch-child → rel:child-in\_front\_of-adult [mined rev; sup=601, conf=0.854, lift=5.51] | 0.778 |
| 008 | rel:child-watch-adult → rel:adult-in\_front\_of-child [mined rev; sup=346, conf=0.783, lift=6.89] | 0.505 |
| 009 | rel:adult-speak\_to-adult → rel:adult-in\_front\_of-adult [mined rev; sup=261, conf=0.813, lift=2.71] | 0.897 |
| 010 | rel:child-watch-adult → rel:child-in\_front\_of-adult [mined rev; sup=370, conf=0.837, lift=5.40] | 0.657 |
| 011 | rel:adult-towards-adult → rel:adult-in\_front\_of-adult [mined rev; sup=581, conf=0.906, lift=3.02] | 0.736 |
| 012 | rel:adult-behind-child → rel:child-in\_front\_of-adult [mined rev; sup=597, conf=0.913, lift=5.89] | 0.570 |
| 013 | rel:child-away-child → rel:child-next\_to-child [mined rev; sup=223, conf=0.785, lift=8.40] | 0.639 |
| 014 | rel:child-towards-adult → rel:adult-in\_front\_of-child [mined rev; sup=226, conf=0.843, lift=7.43] | 0.669 |
| 015 | rel:child-away-child → rel:child-behind-child [mined rev; sup=214, conf=0.754, lift=11.83] | 0.680 |
| 016 | rel:child-towards-adult → rel:child-in\_front\_of-adult [mined rev; sup=222, conf=0.828, lift=5.34] | 0.801 |
| 017 | rel:child-behind-child → rel:child-in\_front\_of-child [mined rev; sup=397, conf=0.890, lift=9.53] | 0.792 |
| 018 | rel:adult-next\_to-baby → rel:baby-in\_front\_of-adult [mined rev; sup=292, conf=0.872, lift=10.27] | 0.898 |
| 019 | rel:child-away-adult → rel:child-in\_front\_of-adult [mined rev; sup=247, conf=0.858, lift=5.53] | 0.698 |
| 020 | rel:adult-speak\_to-adult → rel:adult-next\_to-adult [mined rev; sup=283, conf=0.882, lift=2.98] | 0.659 |
| 021 | obj:baby\_seat → obj:baby [mined rev; sup=203, conf=0.857, lift=5.61] | 0.849 |
| 022 | rel:adult-in\_front\_of-baby → rel:baby-in\_front\_of-adult [mined rev; sup=333, conf=0.915, lift=10.78] | 0.766 |
| 023 | rel:child-away-child → rel:child-in\_front\_of-child [mined rev; sup=246, conf=0.866, lift=9.27] | 0.785 |
| 024 | rel:adult-speak\_to-adult → rel:adult-watch-adult [mined rev; sup=293, conf=0.913, lift=4.55] | 0.710 |
| 025 | rel:child-towards-child → rel:child-in\_front\_of-child [mined rev; sup=229, conf=0.895, lift=9.57] | 0.855 |
| 026 | obj:adult → (rel:adult-in\_front\_of-adult ∧ rel:adult-next\_to-adult) [mined inv; ov=0.71] | 0.529 |
| 027 | obj:child → (rel:child-in\_front\_of-adult ∧ rel:adult-next\_to-child) [mined inv; ov=0.72] | 0.731 |
| 028 | rel:adult-in\_front\_of-adult → (rel:adult-behind-adult ∧ rel:adult-towards-adult) [mined inv; ov=0.71] | 0.717 |
| 029 | obj:baby → (rel:baby-in\_front\_of-adult ∧ rel:adult-watch-baby) [mined inv; ov=0.94] | 0.841 |
| 030 | obj:toy → (rel:child-next\_to-toy ∧ rel:toy-in\_front\_of-child) [mined inv; ov=0.91] | 0.829 |
| 031 | obj:dog → (rel:dog-in\_front\_of-adult ∧ rel:adult-in\_front\_of-dog) [mined inv; ov=0.84] | 0.737 |
| 032 | rel:baby-in\_front\_of-adult → (rel:adult-behind-baby ∧ rel:adult-watch-baby) [mined inv; ov=0.64] | 0.795 |
| 033 | obj:table → (rel:adult-next\_to-table ∧ rel:table-in\_front\_of-adult) [mined inv; ov=0.93] | 0.772 |
| 034 | obj:bottle → (rel:adult-next\_to-bottle ∧ rel:bottle-in\_front\_of-adult) [mined inv; ov=0.90] | 0.612 |
| 035 | obj:cup → (rel:adult-next\_to-cup ∧ rel:cup-in\_front\_of-adult) [mined inv; ov=0.92] | 0.747 |

## Appendix D Independent events formulas

Assume leaf events are independent *given* $x$. We then compute a soft satisfaction probability $P(\varphi)$ by recursion using these independent events formulas:

$$
\begin{align}
P(\neg A) &= 1-p_{A}, \tag{25}\\
P(A\wedge B) &= p_{A}\,p_{B}, \tag{26}\\
P(A\vee B) &= p_{A}+p_{B}-p_{A}p_{B}, \tag{27}\\
P(A\Rightarrow B) &= 1-P(A\wedge\neg B)=1-p_{A}(1-p_{B}), \tag{28}\\
P(A\Leftrightarrow B) &= P(A\wedge B)+P(\neg A\wedge\neg B)=p_{A}p_{B}+(1-p_{A})(1-p_{B}). \tag{29}
\end{align}
$$
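The recursion in Eqs. (25)-(29) can be sketched in a few lines of Python. This is only an illustration: the nested-tuple encoding of formulas and the name `soft_prob` are assumptions made here, not part of the released code.

```python
def soft_prob(formula, leaf_probs):
    """Soft satisfaction probability under the independent-events formulas.

    `formula` is either a leaf name (str) or a tuple (op, child, ...),
    an illustrative encoding chosen for this sketch.
    """
    if isinstance(formula, str):
        return leaf_probs[formula]
    op, *args = formula
    if op == "not":
        return 1.0 - soft_prob(args[0], leaf_probs)          # Eq. (25)
    p_a = soft_prob(args[0], leaf_probs)
    p_b = soft_prob(args[1], leaf_probs)
    if op == "and":
        return p_a * p_b                                      # Eq. (26)
    if op == "or":
        return p_a + p_b - p_a * p_b                          # Eq. (27)
    if op == "implies":
        return 1.0 - p_a * (1.0 - p_b)                        # Eq. (28)
    if op == "iff":
        return p_a * p_b + (1.0 - p_a) * (1.0 - p_b)          # Eq. (29)
    raise ValueError(f"unknown operator: {op}")


rule = ("implies", "A", ("and", "B", "C"))
print(soft_prob(rule, {"A": 0.9, "B": 0.8, "C": 0.5}))
```

Because every connective treats its operands as independent, a formula that repeats a leaf is not evaluated exactly. For instance, `soft_prob(("iff", "A", ("not", "A")), {"A": 0.5})` returns 0.5, although the formula is a contradiction under exact Boolean semantics.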

## Appendix E Neural Algebra of Classifiers (NAC) and similar methods, and relation to our evaluator (extended discussion)

Neural Algebra of Classifiers (NAC; Santa Cruz et al. [[2018](https://arxiv.org/html/2605.26171#bib.bib28)]; similar methods include [Misra et al., [2017](https://arxiv.org/html/2605.26171#bib.bib31), Nagarajan and Grauman, [2018](https://arxiv.org/html/2605.26171#bib.bib32), Yang et al., [2020](https://arxiv.org/html/2605.26171#bib.bib33), Li et al., [2021](https://arxiv.org/html/2605.26171#bib.bib34)]) is the closest conceptual precedent to our work in that it learns neural modules intended to implement Boolean connectives and composes them along an expression tree. However, NAC composes classifier parameters (e.g., weight vectors for primitive concept classifiers) to synthesize a new classifier for a composed expression, and is trained primarily with expression-level supervision (labels for the whole composed concept). In contrast, our evaluator composes sample-level evidence through an explicit rule DAG to output clause- and rule-satisfaction probabilities, is trained with internal-node logical supervision obtained by hard propagation from ground-truth concept labels, and uses chimera operand mixing to discourage shortcut solutions and enforce operator-level compositionality locally. Finally, we introduce lineage-aware subtree caching (keyed by symbolic structure and encoder fingerprint) to reuse learned submodules safely across large rule sets and across runs, addressing scalability and representation drift in a way orthogonal to NAC's global operator design.

The practical difference is as follows. NAC learns to evaluate a given set of rules under supervision that requires each training rule to have nonzero support, and can then compose the learned operators to evaluate new rules not used during training. In our approach, one can reuse only sub-rules of a larger rule that the evaluator was trained on, but neither the larger rule nor its sub-rules need any nonzero support in the training set to be learned. The strategy is therefore to train our system on one very large rule that contains all rules of interest as sub-rules. The limiting factor is the trade-off between the size of the total rule and the training time, not how well the training dataset supports the rules during supervision. At test time, given any sub-rule, the forward pass is simply an inference along the corresponding graph of classifiers.

One may speculate that, in NAC, the learned signal (from a fixed set of supervised training rules) degrades at some point along a long chain of compositions, since novel rules may be sensitive to intricate correlations in uncertainty that the initial set of training rules did not capture. By contrast, our local training signal at every depth of the graph should prevent this degradation.

## Appendix F Qualitative sanity-check experiment: MNIST contradiction rule

##### Goal\.

This experiment is designed as a *qualitative* diagnostic of the core ideas (rule-graphs, negation handling, learned subtree gates, and caching), not as a benchmark result. We deliberately choose a rule whose truth value is *identically false* for all inputs. The only acceptable behavior is that the learned root satisfaction probability stays near zero across the dataset, without spuriously correlating with visual styles or digit morphology.

##### Setup: concepts and leaf bank\.

We use MNIST digits as a simple controlled perceptual domain\. The concept vocabulary is the 10\-way one\-hot digit identity:

$$y\in\{0,1\}^{10},\qquad y_{d}=\mathbb{I}[\text{digit}=d],\quad d\in\{0,\dots,9\}.$$

We train a lightweight convolutional leaf concept bank with a shared encoder and $10$ sigmoid heads using multi-label BCE (even though labels are one-hot). This yields (i) encoder features $z=E_{\phi}(x)$ and (ii) leaf probabilities $p(x)\in(0,1)^{10}$.

##### Rule graph: $A\Leftrightarrow\neg A$.

Fix a target digit $n\in\{0,\dots,9\}$ and define the atomic proposition

$$A\equiv(\text{digit}=n).$$

We compile a 3-node DGL DAG with two leaf nodes referencing the *same* concept ID (digit $n$) and a single root IFF node:

$$\text{root}\;=\;A\Leftrightarrow\neg A.$$

Concretely, the graph has edges (leaf $\to$ root) with negation flags $(+1,-1)$ so that the second child is negated. Under exact Boolean semantics, this formula is false for every input:

$$\forall x,\quad(A(x)\Leftrightarrow\neg A(x))=0.$$

Therefore, the ground-truth root label is constant: $t_{\text{root}}(x)=0$ for all $x$.
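As a quick check of the claim above, the following sketch enumerates every (digit, target) pair and evaluates the root with the $(+1,-1)$ negation flags. The helper names `iff_root` and `root_target` are hypothetical and serve only this illustration.

```python
def iff_root(child_truths, neg_flags):
    """Exact Boolean IFF over two children; a flag of -1 negates that child."""
    a, b = [t if flag == +1 else 1 - t for t, flag in zip(child_truths, neg_flags)]
    return int(a == b)


def root_target(digit, n):
    """Hard root label of A <=> not A, where A = (digit == n)."""
    t_a = int(digit == n)
    # Both leaves reference the same concept, so they carry the same truth value.
    return iff_root((t_a, t_a), (+1, -1))


targets = [root_target(d, n) for n in range(10) for d in range(10)]
print(set(targets))  # the root label never leaves {0} on real, same-image data
```

The root can only evaluate to 1 if the two children carry different truth values, for example `iff_root((1, 0), (+1, -1))`, which cannot happen when both leaves are read from the same image.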

##### Training: single\-level gate learning\.

Since the rule has depth 1, training reduces to learning *one* subtree gate at the root. We use the standard level-wise training procedure:

1. For each mini-batch, compute the hard truths for all nodes by bottom-up propagation from concept labels (so the root target is always 0).
2. Initialize both leaves with the encoder feature vector $z$ (in this construction both leaves point to the same concept and thus carry the same base evidence), and pass the two child features plus negation flags into the root gate.
3. Optimize root-gate BCE loss for a small number of epochs and store the trained gate in the subtree cache keyed by the rule structure and encoder fingerprint.

A useful qualitative difference emerges when one sorts the test images of a fixed digit class by the score assigned to this contradiction rule. In this sanity-check experiment, we use the rule output itself as the anomaly score for visualization, rather than the usual $1-p$ transformation used for rule-satisfaction scores elsewhere in the paper. This makes the comparison especially revealing. Under the SEM variant, training sees only real, normal, same-image samples. In this case that is the full dataset, since the rule is a logical impossibility and no real image has labels that would make its truth value nonzero. As a result, SEM receives no explicit supervisory signal that would force it to organize within-class variation according to visual abnormality; any such ordering can only arise indirectly from residual classifier uncertainty. Chimera training, by contrast, augments the same normal data with synthetic contradictory examples that are impossible in real life but are semantically ‘true’ for the rule. This provides an explicit counterfactual signal for what “abnormal” should look like at the rule level.

Qualitatively, this changes the ranking behavior: although all test images shown in the figure have the same class label, the samples assigned the largest contradiction scores are visually more distorted, less prototypical, or harder to parse than those assigned the smallest scores. The chimera-trained evaluator produces a noticeably cleaner progression from normal-looking digits (left) to abnormal-looking digits (right) than SEM, suggesting that the synthetic contradictory supervision helps the model detect within-class visual abnormality rather than merely reproducing the nominal dataset label.

![Refer to caption](https://arxiv.org/html/2605.26171v1/imgs/mnistchims.jpg)

Figure 5: Qualitative visualization of results for the MNIST contradiction rule $A\Leftrightarrow\neg A$. In this experiment, the anomaly score is the model output for the contradiction rule itself (not $1-p$). SEM is trained only on real, same-image normal samples, whereas chimera training additionally introduces synthetic contradictory examples that cannot occur in real data. For a fixed test class, we sort samples by anomaly score and show the 10 smallest on the left and the 10 largest on the right. Although all shown images share the same dataset label, the high-score samples are visually more distorted and less prototypical. Chimera produces a visibly cleaner separation, tending to place more normal-looking digits on the left and more abnormal-looking ones on the right, which suggests that the synthetic contradictory supervision helps the evaluator detect within-class visual abnormality more effectively.

## Appendix G Qualitative results in images

![Refer to caption](https://arxiv.org/html/2605.26171v1/imgs/anomscorehistos.jpg)

Figure 6: Anomaly score histograms for the experiments shown in Figs. [7](https://arxiv.org/html/2605.26171#A7.F7)-[8](https://arxiv.org/html/2605.26171#A7.F8) (OpenImages, rule: Land vehicle $\to$ (Bicycle $\wedge$ Car)). Left, Indep. Events; right, Neural Evaluator.

![Refer to caption](https://arxiv.org/html/2605.26171v1/imgs/adexampleOIindEvents.jpg)

Figure 7: Images from a random but balanced subset of the test set (i.e., the same number of normal and abnormal images), sorted by the anomaly score, $s_{r}(x)=1-\hat{t}^{(r)}_{\mathrm{root}}(x)$, from low at the top left to high at the bottom right. Only one in every 10 images is displayed, starting from index 0. A perfect detector would place the normal images (green framing) in the top half of the total panel and the anomalies (red framing) in the bottom half.

![Refer to caption](https://arxiv.org/html/2605.26171v1/imgs/adexampleOIneuraleval.jpg)

Figure 8: Same as the previous figure.

## Appendix H Training protocol and architecture details

We use the implementation on the VidOR dataset as an example. The implementations for the other datasets use the same scripts with minimal adaptations.

### H.1 Experimental Reproducibility Details

##### Codebase and implementation\.

All experiments use a PyTorch/DGL implementation of a feature\-aware neural rule evaluator\. Each logical rule is represented as a directed acyclic graph whose leaves are semantic concepts and whose internal nodes are Boolean operators\. The evaluator is trained bottom\-up from leaves to root: each internal gate maps child features, together with edge\-level negation flags, to a parent feature and a parent truth probability\. The implementation uses DGL graph traversal for the rule structure, PyTorch modules for the leaf bank and subtree gates, and scikit\-learn for AUROC/AP evaluation\.

##### Rule evaluator and caching\.

For each rule graph, internal modules are trained level\-wise by node depth\. The implementation uses lineage\-aware subtree caching: cache keys include the exact symbolic subtree, edge order/negations, gate architecture tag, feature dimensionality, and a fingerprint of the upstream encoder\. This prevents accidental reuse of subtree gates when the feature\-producing encoder changes\. At inference, the trained or cached gates are applied level\-by\-level until the root truth probability is obtained\.
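One way to build such a cache key is sketched below. The dictionary layout of a rule node, the field names, and the use of SHA-256 over canonical JSON are assumptions for illustration; the source only states which ingredients enter the key.

```python
import hashlib
import json


def subtree_signature(node):
    """Canonical nested description of a symbolic subtree.

    Children are kept in edge order together with their negation flags,
    so reordering or re-negating operands changes the signature.
    """
    if "concept" in node:
        return {"leaf": node["concept"]}
    return {
        "op": node["op"],
        "children": [[subtree_signature(child), int(flag)] for child, flag in node["children"]],
    }


def cache_key(node, gate_arch, feat_dim, encoder_fingerprint):
    """Hash the subtree together with gate architecture, feature size, and encoder identity."""
    payload = {
        "subtree": subtree_signature(node),
        "arch": gate_arch,
        "F": feat_dim,
        "encoder": encoder_fingerprint,
    }
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


leaf_a = {"concept": "obj:baby_seat"}
leaf_b = {"concept": "obj:baby"}
rule = {"op": "IMPLIES", "children": [(leaf_a, +1), (leaf_b, +1)]}
key = cache_key(rule, gate_arch="mlp-v1", feat_dim=256, encoder_fingerprint="enc-abc123")
```

In this sketch the encoder fingerprint is an opaque string; in practice it could be, for example, a hash of the encoder weights, so that retraining the encoder invalidates all dependent gates.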

### H.2 Dataset Construction and Splits

##### VidOR data layout\.

The VidOR experiments assume the following directory layout:

`<vidor_root>/annotations/{train,val}/*.json`, `<vidor_root>/videos/<video_path>`.

If the annotation does not directly provide a resolvable video path, the code falls back to a video identifier lookup. The train split is used for concept mining, rule mining, leaf-bank training, rule-gate training, optional calibration, and optional learned aggregation. The validation split is used for final anomaly scoring and reported AUROC.

##### Semantic leaf vocabulary\.

The VidOR leaf vocabulary consists of object atoms and relation atoms:

$$\mathrm{obj}:\langle\mathrm{category}\rangle,\qquad\mathrm{rel}:\langle\mathrm{subject}\rangle\text{-}\langle\mathrm{predicate}\rangle\text{-}\langle\mathrm{object}\rangle.$$

The object vocabulary is selected from the most frequent training categories, and the relation vocabulary is selected from the most frequent training subject–predicate–object triples after applying the minimum relation-support threshold. In the reported configuration, the default budgets are

$$K_{\mathrm{obj}}=40,\qquad K_{\mathrm{rel}}=200,\qquad\mathrm{min\_rel\_support}=20,$$

unless otherwise stated in the experiment table.
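The selection step can be sketched with frequency counts as follows. The function name `select_vocab`, the flat input lists, and the counting unit (one count per annotated occurrence) are assumptions of this sketch; the source specifies only the budgets and the support threshold.

```python
from collections import Counter


def select_vocab(object_labels, relation_triples, k_obj=40, k_rel=200, min_rel_support=20):
    """Pick the most frequent object categories and sufficiently supported relation triples."""
    obj_counts = Counter(object_labels)
    rel_counts = Counter(relation_triples)

    objects = [f"obj:{category}" for category, _ in obj_counts.most_common(k_obj)]

    # Apply the support threshold first, then keep the top-K remaining triples.
    supported = [(triple, n) for triple, n in rel_counts.most_common() if n >= min_rel_support]
    relations = [f"rel:{s}-{p}-{o}" for (s, p, o), _ in supported[:k_rel]]
    return objects, relations


# Toy counts, with small budgets so the effect of each parameter is visible.
objs = ["adult"] * 5 + ["child"] * 3 + ["dog"]
rels = [("adult", "watch", "child")] * 4 + [("child", "next_to", "dog")] * 2
objects, relations = select_vocab(objs, rels, k_obj=2, k_rel=5, min_rel_support=3)
print(objects, relations)
```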

##### Video preprocessing\.

Each video is represented by uniformly sampled frames. The default VidOR configuration uses

$$T=8\quad\text{frames per clip},\qquad 224\times 224\quad\text{spatial resolution}.$$

At training time, optional augmentation consists of resizing, color jitter, random affine perturbations, tensor conversion, and random erasing. At validation time, only resizing and tensor conversion are applied.
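A minimal sketch of uniform frame sampling is given below. The source does not state the exact indexing scheme, so taking the midpoint of each of $T$ equal segments is an assumption, as is the helper name `uniform_frame_indices`.

```python
def uniform_frame_indices(num_frames, T=8):
    """Return T frame indices spread evenly over a clip of `num_frames` frames.

    Each index is the midpoint of one of T equal-length segments; short clips
    simply repeat frames.
    """
    if num_frames <= 0:
        raise ValueError("clip must contain at least one frame")
    return [min(num_frames - 1, int((i + 0.5) * num_frames / T)) for i in range(T)]


print(uniform_frame_indices(80))  # [5, 15, 25, 35, 45, 55, 65, 75]
print(uniform_frame_indices(3))   # [0, 0, 0, 1, 1, 2, 2, 2]
```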

### H.3 Model Architecture

##### Leaf bank\.

The VidOR leaf bank is a multi-label video classifier. For each clip, a frame-level CNN backbone is applied to the sampled frames, the resulting frame embeddings are projected to a feature dimension $F$, and the projected frame embeddings are mean-pooled across time. The pooled feature is fed into $K$ independent linear heads, one per semantic leaf. The default backbone is ImageNet-pretrained ResNet-18, with a lightweight convolutional alternative available. The default feature dimension is $F=256$.

##### Subtree gate\.

Each internal node of arity $a$ uses a local gate

$$g_{\theta}:\mathbb{R}^{a(F+1)}\rightarrow\mathbb{R}^{F}\times(0,1),$$

where the input concatenates each child feature with a scalar edge-negation flag. The gate is an MLP with ReLU nonlinearities followed by a sigmoid truth head:

$$(h_{v},\hat{p}_{v})=g_{\theta_{v}}\left([h_{c_{1}},b_{1},\ldots,h_{c_{a}},b_{a}]\right).$$

The output $h_{v}$ is used as the parent feature for higher nodes, and $\hat{p}_{v}$ is the predicted truth probability for the subformula rooted at $v$.
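The input and output shapes of such a gate can be illustrated with a dependency-free forward pass. This is a sketch under several assumptions not stated in the source: a single hidden layer of width 16, no bias terms, random untrained weights, and the names `make_gate` and `gate_forward`. The actual implementation uses PyTorch modules.

```python
import math
import random


def make_gate(arity, F, hidden=16, seed=0):
    """Randomly initialised weights for a gate R^{a(F+1)} -> R^F x (0,1)."""
    rng = random.Random(seed)
    d_in = arity * (F + 1)

    def init(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {"W1": init(hidden, d_in), "Wh": init(F, hidden), "wp": init(1, hidden)}


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def gate_forward(gate, child_feats, neg_flags):
    """Concatenate [h_c1, b_1, ..., h_ca, b_a], then apply ReLU MLP and sigmoid truth head."""
    x = []
    for h, b in zip(child_feats, neg_flags):
        x.extend(h)
        x.append(float(b))
    hidden = [max(0.0, v) for v in matvec(gate["W1"], x)]   # ReLU
    h_v = matvec(gate["Wh"], hidden)                         # parent feature in R^F
    logit = matvec(gate["wp"], hidden)[0]
    p_v = 1.0 / (1.0 + math.exp(-logit))                     # truth probability in (0,1)
    return h_v, p_v


F = 4
gate = make_gate(arity=2, F=F)
h_left = [0.3, -0.1, 0.7, 0.0]
h_right = [0.5, 0.2, -0.4, 0.9]
h_v, p_v = gate_forward(gate, [h_left, h_right], neg_flags=[+1, -1])
```

Because $h_{v}$ has the same dimension $F$ as the child features, the output of one gate can be fed directly into a gate one level higher.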

### H.4 Training Protocol and Hyperparameters

##### Optimization\.

The leaf bank is trained with binary cross-entropy with logits. When class-count statistics are enabled, positive-class weights are computed from training-set prevalence. The default optimizer for the leaf bank is Adam with learning rate

$$\eta_{\mathrm{leaf}}=10^{-3},$$

weight decay $0$ unless otherwise stated, and default training duration of $3$ epochs.

##### Level\-wise evaluator training\.

For every rule graph, subtree gates are trained bottom\-up by depth\. At each level, hard truth targets for all graph nodes are computed from ground\-truth leaf labels using exact Boolean propagation\. The gate at each internal node is optimized with binary cross\-entropy against its node\-level Boolean truth target\. The default evaluator training hyperparameters are:

$$\eta_{\mathrm{level}}=10^{-3},\qquad\mathrm{epochs\_level}=2,\qquad\mathrm{batch\_train}=64.$$
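The hard-target computation used at each level can be sketched as follows. The dictionary encoding of the rule graph, the node names, and the use of node height (longest distance to a leaf) as the processing order are assumptions of this sketch; the implementation itself traverses a DGL graph. The example rule is obj:baby → (rel:baby-in_front_of-adult ∧ rel:adult-watch-baby).

```python
OPS = {
    "AND": lambda a, b: a and b,
    "OR": lambda a, b: a or b,
    "IMPLIES": lambda a, b: (not a) or b,
    "IFF": lambda a, b: a == b,
}


def node_height(nodes, v, memo):
    """Longest distance from node v down to a leaf (leaves have height 0)."""
    if v not in memo:
        kids = nodes[v].get("children", [])
        memo[v] = 0 if not kids else 1 + max(node_height(nodes, c, memo) for c, _ in kids)
    return memo[v]


def hard_truths(nodes, leaf_labels):
    """Exact Boolean truth value of every node, computed bottom-up from leaf labels."""
    memo = {}
    order = sorted(nodes, key=lambda v: node_height(nodes, v, memo))
    truth = {}
    for v in order:
        node = nodes[v]
        if "concept" in node:
            truth[v] = bool(leaf_labels[node["concept"]])
            continue
        # A flag of -1 on an edge negates that child before the operator is applied.
        operands = [truth[c] if flag == +1 else not truth[c] for c, flag in node["children"]]
        truth[v] = bool(OPS[node["op"]](*operands))
    return truth


rule = {
    "baby": {"concept": "obj:baby"},
    "front": {"concept": "rel:baby-in_front_of-adult"},
    "watch": {"concept": "rel:adult-watch-baby"},
    "and": {"op": "AND", "children": [("front", +1), ("watch", +1)]},
    "root": {"op": "IMPLIES", "children": [("baby", +1), ("and", +1)]},
}
labels = {"obj:baby": 1, "rel:baby-in_front_of-adult": 1, "rel:adult-watch-baby": 0}
truth = hard_truths(rule, labels)
print(truth["and"], truth["root"])  # False False: the rule is violated
```

Each internal node's entry in `truth` is the BCE target for the gate at that node, which is what provides supervision at every depth rather than only at the root.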

##### Chimera\-only operand training\.

All reported runs, except the ablations that use only the normal class during training, use

`--negatives chimeras_only`.

In this mode, the operands of each binary gate are constructed from different samples in the mini-batch. For a batch permutation $\pi$ with $\pi(i)\neq i$, the left operand is taken from sample $i$ and the right operand from sample $\pi(i)$. The target is computed by applying the Boolean operator to the corresponding hard child truth values:

$$t^{\mathrm{chim}}_{i}=\operatorname{op}\!\left(t_{\ell}(y_{i}),\,t_{r}(y_{\pi(i)})\right).$$

Both positive and negative chimera cases are used, depending on whether the operator evaluates to true or false.
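A minimal sketch of the chimera construction for one binary gate follows. Several details are assumptions made for illustration: features are plain Python lists, the fixed-point-free permutation is drawn as a random cyclic shift of a shuffled order, the child truth values are taken after any edge negation has been applied, and the names `derangement` and `chimera_batch` are hypothetical.

```python
import random

OPS = {
    "AND": lambda a, b: a and b,
    "OR": lambda a, b: a or b,
    "IMPLIES": lambda a, b: (not a) or b,
    "IFF": lambda a, b: a == b,
}


def derangement(n, rng):
    """Random permutation pi with pi(i) != i, built as one cycle over a shuffled order."""
    if n < 2:
        raise ValueError("need at least two samples to mix operands")
    order = list(range(n))
    rng.shuffle(order)
    pi = [0] * n
    for k in range(n):
        pi[order[k]] = order[(k + 1) % n]
    return pi


def chimera_batch(left_feats, right_feats, left_truth, right_truth, op, rng):
    """Pair the left operand of sample i with the right operand of sample pi(i)."""
    n = len(left_feats)
    pi = derangement(n, rng)
    inputs = [left_feats[i] + right_feats[pi[i]] for i in range(n)]
    targets = [int(OPS[op](bool(left_truth[i]), bool(right_truth[pi[i]]))) for i in range(n)]
    return inputs, targets, pi


rng = random.Random(0)
left = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
right = [[1.0, 1.1], [1.2, 1.3], [1.4, 1.5], [1.6, 1.7]]
lt = [1, 1, 0, 1]   # hard truth of the left subtree per sample
rt = [1, 1, 1, 1]   # hard truth of the right subtree per sample
inputs, targets, pi = chimera_batch(left, right, lt, rt, "IMPLIES", rng)
```

The targets depend only on the inherited hard labels, never on the mixed features, so a gate can be shown operand configurations that do not co-occur in any single real sample. In this sketch the permutation is restricted to single cycles, which is one simple way to guarantee $\pi(i)\neq i$ but does not sample uniformly from all such permutations.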

##### Temperature calibration\.

When enabled, post\-hoc temperature scaling is fit on the training split using binary cross\-entropy on the leaf logits\. The learned scalar temperature is saved and reloaded at evaluation time\.
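For intuition, fitting a single scalar temperature by minimising BCE on logits can be sketched without any framework. The search procedure (ternary search over $\log T$), the search bounds, and the function names are assumptions of this sketch; the source does not specify how the temperature is optimised.

```python
import math


def bce_with_logits(logits, labels, T):
    """Mean binary cross-entropy of sigmoid(z / T) against labels in {0, 1}."""
    total = 0.0
    for z, y in zip(logits, labels):
        s = z / T
        # softplus(s) - y * s, written in a numerically stable form
        total += max(s, 0.0) + math.log1p(math.exp(-abs(s))) - y * s
    return total / len(logits)


def fit_temperature(logits, labels, lo=0.05, hi=100.0, iters=200):
    """Ternary search for the scalar T in [lo, hi] that minimises the BCE."""
    a, b = math.log(lo), math.log(hi)
    for _ in range(iters):
        m1 = a + (b - a) / 3.0
        m2 = b - (b - a) / 3.0
        if bce_with_logits(logits, labels, math.exp(m1)) < bce_with_logits(logits, labels, math.exp(m2)):
            b = m2
        else:
            a = m1
    return math.exp(0.5 * (a + b))


# Overconfident toy logits: two of the eight confident predictions are wrong.
logits = [4.0, 4.0, 4.0, -4.0, -4.0, -4.0, 4.0, -4.0]
labels = [1, 1, 1, 0, 0, 0, 0, 1]
T = fit_temperature(logits, labels)
print(T)  # larger than 1, i.e. the logits are softened
```

The search is valid here because the loss is convex in $1/T$ and therefore has a single minimum along $\log T$. Dividing logits by a positive scalar preserves their ordering, so this kind of calibration changes probabilities but not ranking-based metrics computed on a single leaf.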

##### Default run configuration\.

Unless explicitly overridden in the experiment table, the VidOR runs use:

$$
\begin{array}{ll}
\text{Backbone} & \text{ResNet-18}\\
\text{Feature dimension} & 256\\
\text{Frames per clip} & 8\\
\text{Resize} & 224\times 224\\
\text{Leaf epochs} & 3\\
\text{Rule-gate epochs per level} & 2\\
\text{Leaf learning rate} & 10^{-3}\\
\text{Rule-gate learning rate} & 10^{-3}\\
\text{Batch size, train/eval} & 64/64\\
\text{Rule aggregation} & \texttt{min}\\
\text{Implication gate threshold} & 0.0\\
\text{Random seed} & 123.
\end{array}
$$

### H.5 Evaluation Protocol

##### Leaf\-level evaluation\.

For the leaf bank, we report per-class ROC-AUC, average precision, and accuracy at threshold $0.5$ on the validation split. Macro summaries are computed over valid classes, ignoring degenerate classes for which ROC-AUC or AP is undefined.

##### Rule\-level scores\.

For each rule $r$, the evaluator produces a predicted rule-satisfaction probability

$$\hat{p}_{r}(x)\in(0,1).$$

The per-rule violation score is

$$v_{r}(x)=1-\hat{p}_{r}(x).$$

##### AUROC computation\.

The main anomaly metric is AUROC against the pseudo-anomaly label

$$Y_{\mathrm{anom}}(x)=1-\min_{r}T_{r}(y),$$

where $T_{r}(y)\in\{0,1\}$ is the exact Boolean truth value of rule $r$ under ground-truth leaves. Per-rule AUROC is also computed by comparing $1-\hat{p}_{r}(x)$ against $1-T_{r}(y)$. If the validation labels are all normal or all anomalous under the pseudo-ground-truth, AUROC is reported as undefined rather than forced to a numeric value.
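The evaluation above can be sketched as follows. The pairwise AUROC below is a dependency-free stand-in for the scikit-learn routine used in the implementation, and scoring a sample by one minus the minimum predicted rule probability is an assumption based on the `min` rule aggregation listed in the default configuration.

```python
def auroc(scores, labels):
    """Pairwise AUROC; returns None when only one class is present (undefined)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def anomaly_eval(rule_probs, rule_truths):
    """rule_probs[i][r] is the predicted satisfaction of rule r on sample i;
    rule_truths[i][r] is its exact Boolean truth under ground-truth leaves."""
    n_rules = len(rule_probs[0])
    y_anom = [1 - min(t) for t in rule_truths]      # 1 if any rule is violated
    scores = [1.0 - min(p) for p in rule_probs]     # assumed 'min' aggregation
    overall = auroc(scores, y_anom)
    per_rule = [
        auroc([1.0 - p[r] for p in rule_probs], [1 - t[r] for t in rule_truths])
        for r in range(n_rules)
    ]
    return overall, per_rule


rule_probs = [[0.9, 0.8], [0.2, 0.9], [0.7, 0.3], [0.95, 0.9]]
rule_truths = [[1, 1], [0, 1], [1, 0], [1, 1]]
overall, per_rule = anomaly_eval(rule_probs, rule_truths)
print(overall, per_rule)
```

Returning `None` for the single-class case mirrors the convention of reporting AUROC as undefined rather than substituting a default such as 0.5.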

### H.6 Compute and Software Environment

##### Software dependencies\.

The experiments require Python with PyTorch, torchvision, DGL, OpenCV, Pillow, scikit\-learn, NumPy, and tqdm\. The code selects CUDA automatically when available and otherwise falls back to CPU\.

##### Hardware\.

The experiments were run on an NVIDIA H100 GPU.

##### Runtime controls\.

The most important runtime controls are the number of sampled frames, image resize, batch size, number of workers, number of leaf epochs, number of evaluator epochs per level, and the number of retained rules. The code supports `--train_frac` for controlled training-set subsampling and `--train_missing_only` for reusing existing cached gates when applicable.

### H.7 Randomness and Determinism

The code sets the Python `random` seed and the PyTorch CPU seed, and also sets the CUDA seed when CUDA is available. The default seed is $123$ for the VidOR experiments. Remaining nondeterminism can arise from GPU kernels, data-loader worker scheduling, video decoding, and randomized data augmentation.

### H.8 Data, Licenses, and Ethics

##### Dataset access\.

We use existing third\-party datasets and do not introduce a new dataset\. The VidOR files are expected to be obtained from the official dataset distribution and arranged in the directory structure described above\.

##### Annotations and labels\.

The anomaly labels used in our experiments are pseudo\-labels derived from logical inconsistency under retained semantic rules, not human judgments of abnormality\. Therefore, the reported anomaly\-detection results should be interpreted as rule\-consistency detection under a selected concept vocabulary and selected rule set\.

##### Privacy and human subjects\.

The experiments use pre\-existing public video annotations and do not involve newly collected human\-subject data\. No attempt is made to identify individuals\. If any dataset split contains people, the analysis is restricted to the dataset\-provided object and relation categories\.

##### Limitations and potential misuse\.

The method can flag violations of explicit rules, but it inherits errors from the leaf bank, biases from dataset annotations, and biases from mined rule selection\. A high anomaly score should therefore be interpreted as evidence of semantic inconsistency relative to the selected rule set, not as a general\-purpose safety or surveillance judgment\. The system should not be deployed for consequential decisions without validating the rule set, concept vocabulary, calibration, and false\-positive/false\-negative behavior in the target domain\.
