# Visual Perceptual to Conceptual First-Order Rule Learning Networks
Source: [https://arxiv.org/html/2604.07897](https://arxiv.org/html/2604.07897)
Kun Gao, Zhongguancun Academy, Beijing, China (gaokun@bza.edu.cn); Davide Soldà and Thomas Eiter, Vienna University of Technology (TU Wien), Vienna, Austria (davide.solda@tuwien.ac.at, eiter@kr.tuwien.ac.at)
###### Abstract
Learning rules plays a crucial role in deep learning, particularly in explainable artificial intelligence and in enhancing the reasoning capabilities of large language models. While existing rule learning methods are primarily designed for symbolic data, learning rules from image data without supporting image labels and automatically inventing predicates remains a challenge. In this paper, we tackle these inductive rule learning problems from images with a framework called $\gamma$ILP, which provides a fully differentiable pipeline from image constant substitution to rule structure induction. Extensive experiments demonstrate that $\gamma$ILP achieves strong performance not only on classical symbolic relational datasets but also on relational image data and pure image datasets, such as Kandinsky patterns.
## 1 Introduction
Automatically learning rules is becoming increasingly important with the development of artificial intelligence. The learned rules serve as interpretable representations that enable systems to generalize better (Liu et al., [2023](https://arxiv.org/html/2604.07897#bib.bib35); Xie et al., [2025](https://arxiv.org/html/2604.07897#bib.bib36)) and provide transparent explanations for the input data (Kaur et al., [2023](https://arxiv.org/html/2604.07897#bib.bib37); Gao et al., [2025](https://arxiv.org/html/2604.07897#bib.bib24)). Beyond propositional rules, first-order rules allow one to describe properties of and relations between constants at a general level; such expressiveness is in high demand in trustworthy applications (Dwivedi et al., [2023](https://arxiv.org/html/2604.07897#bib.bib62)). In the first-order rule learning domain, most existing methods (Gao et al., [2024](https://arxiv.org/html/2604.07897#bib.bib23); Hocquette et al., [2024](https://arxiv.org/html/2604.07897#bib.bib67); Cropper and Muggleton, [2016](https://arxiv.org/html/2604.07897#bib.bib68)) are designed for learning from relational symbolic data. Despite their efficiency, the growing availability of multimodal data makes learning rules from knowledge graphs with image constants (Cunnington et al., [2023](https://arxiv.org/html/2604.07897#bib.bib56); Shindo et al., [2023](https://arxiv.org/html/2604.07897#bib.bib29)) increasingly important.
However, a challenge for inductive rule learning from relational image domains is *symbol grounding* without label leakage: the inability to ground visual inputs to symbolic variables in formal systems without explicit supervision (Topan et al., [2021](https://arxiv.org/html/2604.07897#bib.bib39); Harnad, [1990](https://arxiv.org/html/2604.07897#bib.bib54)). Hence, when inductively constructing rules from image inputs, existing methods are assumed to have access to the label information of image constants (Evans et al., [2021](https://arxiv.org/html/2604.07897#bib.bib33); Evans and Grefenstette, [2018](https://arxiv.org/html/2604.07897#bib.bib28); Shindo et al., [2023](https://arxiv.org/html/2604.07897#bib.bib29)), which is regarded as *label leakage*. In this paper, we assume that image symbolic labels are neither required nor leaked during inductive learning, reducing human effort and enabling fully automated rule learning from raw data. Moreover, the absence of relational descriptions for target events often necessitates introducing new predicates, a fundamental challenge known as *predicate invention* in inductive logic programming (ILP) (Muggleton and Buntine, [1988](https://arxiv.org/html/2604.07897#bib.bib49); Kok and Domingos, [2007](https://arxiv.org/html/2604.07897#bib.bib38)).
In this paper, we propose a novel inductive rule learning framework, $\gamma$ILP, which learns rules from image-based constants with both predefined constant relations (e.g., relational image data) and implicit or undefined constant relations (e.g., Kandinsky image data). When learning from data without relations, we further create suitable concepts as relations in the learned rules for describing image instance classes. The proposed method is fully differentiable: it takes constant embeddings for neural networks as input and learns rules by analyzing the parameters of the well-trained neural networks. In more detail, we use a pre-trained encoder to embed the image constants and, when they are defined, the relations. In case relations are missing, we first generate rules using predicate placeholders. We interpret the semantics of a predicate placeholder by analyzing the image constants represented by the variables and the order of the variables in the output of $\gamma$ILP. Furthermore, we employ multimodal LLMs to translate the semantics of these placeholder predicates into natural language, thereby capturing the relations between constants.
Briefly summarized, the main contributions of this work are: Firstly, we develop an inductive reasoning process that is fully differentiable and operates in latent space, where constant substitution and rule structure induction are performed via tensor operations on GPUs. Secondly, we present the $\gamma$ILP framework for learning rules from relational image data without symbolic image constant labels, avoiding label leakage and enabling symbol grounding. Thirdly, we tackle predicate invention by analyzing the learned image constants represented by variables in the learned rules, and utilize LLMs as translators to generate symbolic predicate semantics.
To the best of our knowledge, $\gamma$ILP is the first framework providing all of the above features. Our experiments show strong performance of $\gamma$ILP not only on classical symbolic relational datasets, but also on relational image data and the pure image dataset of Kandinsky patterns (Müller and Holzinger, [2021](https://arxiv.org/html/2604.07897#bib.bib34)).
**Organization.** We review related work on rule learning in Sec. [2](https://arxiv.org/html/2604.07897#S2), followed by preliminaries on logic programs, ILP, and encoder architectures in Sec. [3](https://arxiv.org/html/2604.07897#S3). In Sec. [4](https://arxiv.org/html/2604.07897#S4), we present the proposed method, including the knowledge base generator, the differentiable substitution mechanism, and predicate invention tasks. We present experimental results in Sec. [5](https://arxiv.org/html/2604.07897#S5), conclusions and future work in Sec. [6](https://arxiv.org/html/2604.07897#S6), and the code at [drive.google.com/drive/folders/10x-TXo2nJuoZTPKDz-sbybgBnC-Rvcwo?usp=sharing](https://drive.google.com/drive/folders/10x-TXo2nJuoZTPKDz-sbybgBnC-Rvcwo?usp=sharing).
## 2 Related Work
#### ILP methods.
Inductive logic programming (ILP) was introduced by Muggleton ([1991](https://arxiv.org/html/2604.07897#bib.bib25)) to induce rules that, combined with background knowledge, derive positive examples. Symbolic ILP methods (Cropper and Dumancic, [2022](https://arxiv.org/html/2604.07897#bib.bib52)) typically adopt top-down strategies (e.g., FOIL (Quinlan, [1990](https://arxiv.org/html/2604.07897#bib.bib26))), bottom-up approaches (e.g., CIGOL (Muggleton and Buntine, [1988](https://arxiv.org/html/2604.07897#bib.bib49))), or hybrids like Aleph (Srinivasan, [2001](https://arxiv.org/html/2604.07897#bib.bib27)) to discover logical rules. These systems are not integrated with neural networks for scalable learning on GPUs. Learning from interpretation transition (Inoue et al., [2014](https://arxiv.org/html/2604.07897#bib.bib20)) is an ILP framework that learns propositional rules from input-output pairs, and it has been integrated into neural networks (Gao et al., [2022b](https://arxiv.org/html/2604.07897#bib.bib21)). Baugh et al. ([2023](https://arxiv.org/html/2604.07897#bib.bib55); [2025](https://arxiv.org/html/2604.07897#bib.bib57)) proposed a neural network that learns propositional rules to describe multiclass data. The challenge here lies in learning first-order rules. To leverage GPU computation, Evans and Grefenstette ([2018](https://arxiv.org/html/2604.07897#bib.bib28)) proposed $\partial$ILP, which learns rules from symbolic inputs using logic templates in differentiable operations. DFORL (Gao et al., [2024](https://arxiv.org/html/2604.07897#bib.bib23)) learns first-order rules from symbolic data via bottom-up propositionalization (França et al., [2014](https://arxiv.org/html/2604.07897#bib.bib69)), but its non-differentiable substitution process prevents end-to-end training with the rule learning network.
NeurRL (Gao et al., [2025](https://arxiv.org/html/2604.07897#bib.bib24)) extends this network to learn rules from raw time series in a differentiable way, yet it overlooks relations between raw image data (Evans and Grefenstette, [2018](https://arxiv.org/html/2604.07897#bib.bib28)). $\gamma$ILP learns rules bottom-up without any predefined logic templates, in a fully differentiable manner from ground substitution to rule induction.
#### Symbol Grounding.
$\alpha$ILP (Shindo et al., [2023](https://arxiv.org/html/2604.07897#bib.bib29)) induces logic programs from visual inputs, comprising a trained perception module and a symbolic fact converter. Cunnington et al. ([2024](https://arxiv.org/html/2604.07897#bib.bib58)) replaced the converter with an LLM, while Evans and Grefenstette ([2018](https://arxiv.org/html/2604.07897#bib.bib28)) and Evans et al. ([2021](https://arxiv.org/html/2604.07897#bib.bib33)) used predicted symbolic image labels for differentiable reasoning models. Wang et al. ([2019](https://arxiv.org/html/2604.07897#bib.bib40)) applied neural networks to solve maximum satisfiability with symbolic image labels. All these approaches rely on symbolic image labels as inputs to the reasoning module. In the satisfiability setting, Topan et al. ([2021](https://arxiv.org/html/2604.07897#bib.bib39)) emphasized symbol grounding and showed that the expected performance cannot be achieved without explicit supervision. Aware of this, $\gamma$ILP induces rules from images without their symbolic labels, using only their representations and thereby preventing label leakage.
#### LLMs and ILP.
Creswell and Shanahan ([2022](https://arxiv.org/html/2604.07897#bib.bib42)) and Han et al. ([2024](https://arxiv.org/html/2604.07897#bib.bib43)) discussed the deductive reasoning abilities of LLMs in natural language. Li et al. ([2025](https://arxiv.org/html/2604.07897#bib.bib71)) tested the inductive reasoning abilities of LLMs on observed facts that are not formally described in a first-order language. In the ILP setting, de Souza et al. ([2025](https://arxiv.org/html/2604.07897#bib.bib70)) proposed a systematic methodology to analyze the ILP capabilities and limitations of LLMs. We further test the ILP abilities of LLMs with state-of-the-art reasoning abilities. Gentili et al. ([2025](https://arxiv.org/html/2604.07897#bib.bib73)) utilized LLMs to rename predicate placeholders with natural language semantics solely based on the provided logic rules with predicate placeholders. In contrast, $\gamma$ILP invents the semantics of relations by analyzing the learned constants represented by variables, and we utilize LLMs to translate these semantics into a natural language format.
## 3 Preliminaries
### 3.1 Logic Programs
We consider a *first-order* language $L=(R,F,C,V)$ (Lloyd, [1984](https://arxiv.org/html/2604.07897#bib.bib18)), where $R$, $F$, $C$, and $V$ denote (countable) sets of predicate symbols, function symbols, constants, and variables, respectively. A *term* $t$ is a constant, a variable, or an expression $f(t_1,\dots,t_n)$, where $f$ is an $n$-ary function symbol and $t_1,\dots,t_n$ are terms. An *atom* is of the form $p(t_1,\dots,t_n)$, where $p$ is an $n$-ary predicate symbol. A *literal* is an atom or its negation. A *clause* is a finite disjunction of literals. A *rule* (or *definite clause*) is a clause with exactly one positive literal, written as $\alpha_0\lor\neg\alpha_1\lor\dots\lor\neg\alpha_n$ or equivalently in implication form as $\alpha_0\leftarrow\alpha_1,\alpha_2,\dots,\alpha_n$, where $\alpha_0$ is called the *head* of the rule (denoted $\text{head}(r)$) and $\{\alpha_1,\dots,\alpha_n\}$ is the *body* (denoted $\text{body}(r)$). Each $\alpha_i$ in the body is referred to as a *body atom*. Variables in the head atom are *head variables*; those occurring only in the body are *auxiliary variables*. A *fact* is a rule with an empty body. A *logic program* $P$ is a set of rules.
In first-order logic, a term, atom, clause, etc. is *ground* if it contains no variables. A *substitution* is a finite set $\theta=\{V_1/t_1,\dots,V_n/t_n\}$, where each $V_i$ is a distinct variable and each $t_i$ is a term different from $V_i$. A *ground substitution* contains only ground terms. For an atom $\alpha$, the expression $\alpha\theta$ denotes the ground atom obtained by applying a ground substitution $\theta$ to the variables in $\alpha$. Additionally, the set of all ground instances of rules in a logic program $P$ is denoted as $\text{ground}(P)$. The *Herbrand base* $B_P$ of a logic program $P$ is the set of all ground atoms constructable from the predicate symbols and constants in $P$. An *interpretation* is a subset $I\subseteq B_P$ containing the ground atoms regarded as true. The semantics of $P$ is based on the *immediate consequence operator* (van Emden and Kowalski, [1976](https://arxiv.org/html/2604.07897#bib.bib50); Lloyd, [1984](https://arxiv.org/html/2604.07897#bib.bib18)) $T_P:2^{B_P}\rightarrow 2^{B_P}$, defined as $T_P(I)=\{\text{head}(r)\mid r\in\text{ground}(P),\ \text{body}(r)\subseteq I\}$.
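As a concrete illustration of $T_P$, the following minimal sketch computes the operator for a ground program and iterates it to a fixpoint; the string encoding of atoms and the example program are our own illustrative choices, not part of the paper:

```python
# Minimal sketch of the immediate consequence operator T_P for a ground
# logic program. Atoms are strings; each rule is (head, [body atoms]).
def t_p(rules, interpretation):
    """One application of T_P: heads of rules whose body is satisfied."""
    return {head for head, body in rules if set(body) <= interpretation}

def least_model(rules):
    """Iterate T_P from the empty interpretation up to a fixpoint."""
    current = set()
    while True:
        nxt = t_p(rules, current)
        if nxt == current:
            return current
        current = nxt

rules = [
    ("edge(a,b)", []),                       # facts: rules with empty body
    ("edge(b,c)", []),
    ("path(a,b)", ["edge(a,b)"]),
    ("path(a,c)", ["path(a,b)", "edge(b,c)"]),
]
model = least_model(rules)
```

Three applications of $T_P$ suffice here: the facts appear first, then `path(a,b)`, then `path(a,c)`.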
### 3.2 Inductive Logic Programming
In the setting of learning from entailment (Muggleton and De Raedt, [1994](https://arxiv.org/html/2604.07897#bib.bib51); Evans and Grefenstette, [2018](https://arxiv.org/html/2604.07897#bib.bib28)), a specific ILP learning task seeks to generate a logic program $P$ that derives a goal concept represented by a *target predicate* $p_t$, given a tuple $(B,\mathcal{P},\mathcal{N})$. Here, $B$ denotes a set of ground atoms called *background knowledge*, and $\mathcal{P}$ and $\mathcal{N}$ are sets of ground atoms $p_t(c_1,\dots,c_n)$ representing true instances (*positive examples*) and false instances (*negative examples*), respectively. A logic program $P$ is a *solution* of $(B,\mathcal{P},\mathcal{N})$ if $B\cup P$ entails all positive examples in $\mathcal{P}$ and none of the negative examples in $\mathcal{N}$. An atom with the target predicate $p_t$ is called a *target atom*.
In *propositional* logic, each atom amounts to a Boolean variable. When learning propositional logic programs (Inoue et al., [2014](https://arxiv.org/html/2604.07897#bib.bib20)), an interpretation $I$ contains the Boolean values of all atoms that appear in the body of any rule in the logic program, while another interpretation $J$ contains the Boolean values of the head atoms. The learned logic program $P$ satisfies $T_P(I)=J$ for all pairs $(I,J)\in E$, where $E$ is a set of interpretation pairs.
Gao et al. ([2022a](https://arxiv.org/html/2604.07897#bib.bib22)) used propositionalization to build interpretation transitions by grounding all possible body atoms and the target atom with substitutions for learning first-order logic programs. An input interpretation vector $\mathbf{x}$ represents the Boolean values of all possible non-ground body atoms under a substitution $\theta$, and the output $y$ represents the Boolean value of the target atom under $\theta$. Based on this, Gao et al. ([2025](https://arxiv.org/html/2604.07897#bib.bib24)) proposed NeurRL, a neural architecture that learns the immediate consequence operator $T_P$ and extracts a first-order logic program $P$ from its trained parameters, ensuring that $P$ satisfies the ILP learning setting. The neural network is designed as follows:
$$\hat{y}=\text{RuleNetwork}(\mathbf{x})=\widetilde{\bigvee}\big(f_m\cdot f_{m-1}\cdots f_1(\mathbf{x})\big),\tag{1}$$
where $\hat{y}$ is the predicted Boolean value of the target atom, $\widetilde{\bigvee}(x_1,\dots,x_n)=1-(1-x_1)\cdots(1-x_n)$ is the fuzzy disjunction operator, and the $i$-th layer $f_i$ is:
$$f_i(\mathbf{x})=\frac{1}{1-d}\,\text{ReLU}(\mathbf{M}_i\mathbf{x}-d),$$
with $d$ a fixed bias. To interpret the neural network as a set of rules, the sum of the weights connected to the same hidden node in the next layer is constrained to 1 in each layer (Gao et al., [2024](https://arxiv.org/html/2604.07897#bib.bib23)). Hence, the softmax activation function is applied to each row of every trainable layer $\tilde{\mathbf{M}}_i$:
$$\mathbf{M}_i[j,k]=\frac{e^{\tilde{\mathbf{M}}_i[j,k]}}{\sum_{u=1}^{n_\text{in}}e^{\tilde{\mathbf{M}}_i[j,u]}},\quad j\in[1,n_\text{out}],\ k\in[1,n_\text{in}],$$
where $n_\text{in}$ and $n_\text{out}$ denote the number of columns and rows of $\tilde{\mathbf{M}}_i$, respectively. After training, rules are extracted from the *logic program tensor* $\mathbf{M}_P=\prod_{i=1}^{m}\mathbf{M}_i$; each row corresponds to a rule, the elements of the row correspond to atoms, and atoms whose value exceeds a threshold are included in that rule.
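A NumPy sketch of this forward pass, under illustrative shapes and a bias $d=0.5$ (the layer sizes and the random input are hypothetical), shows the row-softmax normalization, the layers $f_i$, and the fuzzy disjunction; the product of the row-normalized matrices gives the logic program tensor $\mathbf{M}_P$:

```python
import numpy as np

def row_softmax(m):
    """Softmax over each row, so weights into a hidden node sum to 1."""
    e = np.exp(m - m.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def fuzzy_or(v):
    """Fuzzy disjunction: 1 - prod_i (1 - v_i)."""
    return 1.0 - np.prod(1.0 - v)

def rule_network(x, raw_layers, d=0.5):
    """Forward pass of Eq. (1): stacked layers f_i, then fuzzy OR."""
    h = x
    for raw in raw_layers:
        m = row_softmax(raw)                          # normalized M_i
        h = np.maximum(m @ h - d, 0.0) / (1.0 - d)    # f_i with ReLU
    return fuzzy_or(h)

rng = np.random.default_rng(0)
x = rng.random(6)                         # truth values of 6 body atoms
layers = [rng.normal(size=(4, 6)), rng.normal(size=(2, 4))]
y_hat = rule_network(x, layers)
# logic program tensor: product of the normalized layer matrices
m_p = row_softmax(layers[1]) @ row_softmax(layers[0])
```

Since each normalized matrix is row-stochastic, so is their product, which is what makes the rows of $\mathbf{M}_P$ interpretable as weight distributions over body atoms.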
*Generalization* (Plotkin, [1970](https://arxiv.org/html/2604.07897#bib.bib59); Buntine, [1988](https://arxiv.org/html/2604.07897#bib.bib66)) applies a substitution to replace specific terms, such as constants, with more general ones, such as variables, within a logic program. Through it, the rules can be regarded as knowledge applicable to a wider range of examples.
Throughout the paper, let $\mathbb{E}$ represent the image constant embeddings under the background knowledge $B$, and let $\mathbb{Z}$ represent a latent space. The symbol grounding problem targets establishing a mapping from an input $\mathbf{e}\in\mathbb{E}$ to some latent state $\mathbf{z}\in\mathbb{Z}$ that is fed into a predefined symbolic reasoning procedure to produce the final output $y$. The training data contains only the image constants $\mathbf{e}$ and the corresponding outputs $y$ (Li et al., [2023](https://arxiv.org/html/2604.07897#bib.bib61)). The labels of the latent states $\mathbf{z}$ are not leaked, putting the problem into a weakly-supervised setting (Zhou, [2017](https://arxiv.org/html/2604.07897#bib.bib65)). Predicate invention (Kok and Domingos, [2007](https://arxiv.org/html/2604.07897#bib.bib38)) refers to the discovery of new concepts, properties, and relations from data, expressed in terms of observable predicates.
### 3.3 Encoders
Encoders transform raw data into embeddings. Vision transformers (ViT) (Dosovitskiy et al., [2021](https://arxiv.org/html/2604.07897#bib.bib14)) generate embeddings for images. A variational autoencoder (VAE) (Kingma and Welling, [2014](https://arxiv.org/html/2604.07897#bib.bib74)) is a type of generative model in deep learning that learns to encode data into a compressed latent representation and decode it back to reconstruct the original data.
Clustering methods group similar embeddings into the same clusters. We adopt the differentiable clustering approach proposed by Fard et al. ([2020](https://arxiv.org/html/2604.07897#bib.bib16)). Let $K$ be the number of clusters, let $\mathbf{c}_i\in\mathbb{R}^d$ represent the $i$-th cluster center, where $d$ is the embedding dimension, and let $\mathcal{C}=\{\mathbf{c}_1,\mathbf{c}_2,\dots,\mathbf{c}_K\}$ denote the set of cluster representations. Then the clustering objective is defined as:
$$L_\text{cluster}=\sum_{\mathbf{e}\in\mathbb{E}}\sum_{i=1}^{K}f(h(\mathbf{e}),\mathbf{c}_i)\cdot G_{i,f}(h(\mathbf{e}),\alpha;\mathcal{C}),\tag{2}$$
where $h$ is the encoder function, $f$ is a distance metric (e.g., mean squared error), and $G$ is a differentiable weighting function assigning maximum weight to the minimal distance (Jang et al., [2017](https://arxiv.org/html/2604.07897#bib.bib17)):
$$G_{i,f}(h(\mathbf{e}),\alpha;\mathcal{C})=\frac{\exp(-\alpha f(h(\mathbf{e}),\mathbf{c}_i))}{\sum_{i'=1}^{K}\exp(-\alpha f(h(\mathbf{e}),\mathbf{c}_{i'}))},$$
with $\alpha>0$. A larger $\alpha$ makes $G$ closer to the discrete minimum, while a smaller $\alpha$ smooths training.
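The objective in Eq. (2) together with the weighting $G$ can be sketched in a few lines of NumPy; squared Euclidean distance stands in for $f$, and the encoder $h$ is assumed to have been applied already (the toy points and centers are illustrative):

```python
import numpy as np

def clustering_loss(embeddings, centers, alpha=10.0):
    """Differentiable clustering loss: sum over points and clusters of
    the distance f times the soft-min weighting G (softmax of -alpha*f)."""
    # pairwise squared Euclidean distances, shape (n_points, K)
    dist = ((embeddings[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    weights = np.exp(-alpha * dist)
    weights = weights / weights.sum(axis=1, keepdims=True)  # G_{i,f}
    return (dist * weights).sum()

centers = np.array([[0.0, 0.0], [5.0, 5.0]])
points = np.array([[0.1, 0.0], [5.0, 5.1]])
loss = clustering_loss(points, centers)
```

With points near their centers and a large $\alpha$, nearly all weight falls on the minimal distance, so the loss is close to the sum of nearest-center distances.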
## 4 Method
Figure 1: The pipeline of the learning framework. $E$ and $E'$ indicate the encoder functions for image and text, respectively.

We propose $\gamma$ILP, a fully differentiable ILP framework depicted in Figure [1](https://arxiv.org/html/2604.07897#S4.F1), that learns first-order logic programs from relational image data or from Kandinsky patterns, where explicit relations are undefined. Learning involves generalizing from image constants to cluster indices, then constructing non-ground atoms to describe target atoms or image classes. $\gamma$ILP consists of a deep clustering module serving as a generalization function, a latent knowledge base generator, and a rule learning neural network with a novel differentiable substitution method. The output of $\gamma$ILP is a logic program which, in case the relations between the constants are not well-defined, contains predicate placeholders whose semantics can be inferred from the images represented by the variables. Additionally, $\gamma$ILP incorporates LLMs to obtain the semantics of the predicate placeholders in symbolic format.
### 4.1 Deep Clustering Module and Knowledge Base Generator
In this paper, each constant is an image, and relations are predicates. When the relations $r$ between image constants $e$ are predefined, each instance in a background knowledge $B$ is represented as $r(e_1,e_2)$, and the data is called relational image data. We induce logic programs to describe the target atom. When such relations are undefined but essential for characterizing the target atom, we set each image instance as background knowledge $B$, and each object inside the image instance is regarded as an image constant. Such data is regarded as pure image data, with Kandinsky patterns as the representative benchmark. We induce the logic program $P$ based on $B$ to describe the image class, and the variables in $P$ can be substituted with the image object constants.
In ILP, a generalization function $g$ replaces specific terms with generalized terms, e.g., constants with variables. Clustering groups similar constants under a centroid, so clustering serves as a generalization function $g:\mathbb{E}\rightarrow\mathcal{C};\ \mathbf{e}\mapsto\mathbf{c}$, where $\mathbf{c}$ is the centroid of the cluster containing the image constant $\mathbf{e}$. To adaptively learn the clusters of constants, we use the differentiable clustering defined in Eq. ([2](https://arxiv.org/html/2604.07897#S3.E2)). Then, we transfer $B$ to a *latent knowledge base* denoted as $K_B$:
$$K_B=\begin{cases}\{\mathbf{r}\oplus g(\mathbf{e}_1)\oplus g(\mathbf{e}_2)\mid r(e_1,e_2)\in B\},&\text{when relations are defined}\\ \{g(\mathbf{e})\mid e\in B\},&\text{when relations are not defined,}\end{cases}$$
where $\oplus$ indicates the concatenation of vectors, and $\mathbf{e}$ and $\mathbf{r}$ denote the embeddings of constant $e$ and relation $r$ produced by the encoders, respectively.
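A toy sketch of the $K_B$ construction for the predefined-relation case; the two-dimensional embeddings, the nearest-centroid $g$, and the relation name are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def latent_kb(background, rel_emb, g):
    """Latent knowledge base K_B for predefined relations: concatenate
    the relation embedding with the generalized (centroid) embeddings
    of both constants, r ⊕ g(e1) ⊕ g(e2)."""
    return [np.concatenate([rel_emb[r], g(e1), g(e2)])
            for (r, e1, e2) in background]

# toy generalization: snap a constant embedding to its nearest centroid
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
def g(e):
    return centers[np.argmin(((centers - e) ** 2).sum(axis=1))]

rel_emb = {"left_of": np.array([1.0, 0.0])}
background = [("left_of", np.array([0.1, -0.1]), np.array([0.9, 1.2]))]
kb = latent_kb(background, rel_emb, g)
```

Each entry of the resulting list is one latent fact: the relation embedding followed by the two centroid embeddings.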
### 4.2 Differentiable Substitution
The differentiable substitution is implemented at the batch level. In the sequel, we confine ourselves to unary and binary predicates, but the method can be extended to predicates of arbitrary arity. Let $N$ be the dimension of the input $\mathbf{x}$ and $d$ the number of variables in the learned logic program. We describe the substitution methods for both defined and undefined relations as follows.
#### Relations are predefined.
Let $R_b$ and $R_u$ denote the sets of binary and unary relations, respectively. When the relations are defined,
$$N=|R_b|\times P(d,2)+|R_u|\times d-1,$$
where $P(d,2)=d(d-1)$ is the number of permutations of $d$ distinct elements taken two at a time. The subtraction of 1 indicates that the target atom is excluded from being considered as a possible body atom. Algorithm [1](https://arxiv.org/html/2604.07897#alg1) outlines the differentiable substitution procedure when relations are defined, where positive and negative substitution sets ($\Theta^+$, $\Theta^-$) are constructed for supervised learning with labeled data. If an instance $r(e_1,e_2)$ exists in $B$, we consider the embeddings $\mathbf{e}_1$ and $\mathbf{e}_2$ to be *connected*. We define the function $\text{Connected}(\mathbf{X},\mathbf{Y})$ to retrieve all constant embeddings connected to the constant embeddings $\mathbf{X}$ and $\mathbf{Y}$. The random selection function, denoted as $\text{Random}(\mathbb{E})$, returns a randomly chosen element from all image embeddings $\mathbb{E}$. For each substitution in $\Theta^+$, we substitute each head variable with the constant embeddings corresponding to a constant pair that appears in the positive examples. The auxiliary variables are then replaced with randomly selected embeddings from $\mathbb{E}$. For each substitution in $\Theta^-$, we assign the head variables to embeddings of a constant pair not present in the positive examples by replacing the variable $X$ with a random embedding. The auxiliary variables are similarly replaced with randomly selected embeddings from $\mathbb{E}$.
In addition, when the target predicate $p_t$ is binary and the number of variables $d=3$, the auxiliary variable $V_3$ in the rules connects the head variables $X$ and $Y$ following a forward-chaining pattern (Kaminski et al., [2018](https://arxiv.org/html/2604.07897#bib.bib53)). This introduces a language bias of the form:
$$p_t(X,Y)\leftarrow p_1(X,V_3),\ p_2(V_3,Y).$$
Note that the variables in the body atoms of this forward-chaining bias can be interchanged. Consequently, we replace the random function in Line [12](https://arxiv.org/html/2604.07897#alg1.l12) of Algorithm [1](https://arxiv.org/html/2604.07897#alg1) with $\mathbf{e}_1\in\text{Connected}(\mathbf{e}_x,\mathbf{e}_y)$, selecting embeddings that satisfy the forward-chaining pattern to enhance the learning process.
#### Relations are undefined.
When relations are not explicitly defined, the possible body atoms consist of one assigned predicate placeholder for each term list. Then,
$$N=C(d,2)+d-1,$$
where $C(d,2)=d(d-1)/2$ is the number of combinations of $d$ elements taken two at a time. In addition, each image instance corresponds to a knowledge base $B$, and the objects represented in an image instance constitute the constant embedding set $\mathbb{E}$. We introduce a *variable constraint*: the number of variables in the learned logic program is set equal to the number of clusters. As a result, *each variable, denoted as $\tilde{V}$, corresponds to a specific group of similar constants*. Logic programs with variables under constraints are called *constrained logic programs*. Hence, we can use a function $\text{RetrieveConstants}(\tilde{V}_i)$ to retrieve the constants represented by $\tilde{V}_i$ from the constrained logic program. This enables us to interpret the predicate placeholders $p_i$ in atoms $p_i(\tilde{V}_1,\dots,\tilde{V}_n)$ by analyzing the constants under each constrained variable. Then, we use the substitution set $\tilde{\theta}=\{V_1/\mathbf{c}_{V_1},\dots,V_d/\mathbf{c}_{V_d}\}$, where $\mathbf{c}_{V_i}$ indicates the representation of the centroid corresponding to the constrained variable $V_i$. Substitution here refers to replacing variables with cluster centroids, which is regarded as the symbolic assignment for random image constants derived from the clustering-based generalization function $g$.
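The input dimension $N$ in both settings reduces to two short formulas, sketched below (the helper names are ours):

```python
def n_predefined(num_binary, num_unary, d):
    """N = |R_b| * P(d,2) + |R_u| * d - 1, with P(d,2) = d(d-1),
    for the case of predefined relations."""
    return num_binary * d * (d - 1) + num_unary * d - 1

def n_undefined(d):
    """N = C(d,2) + d - 1, with C(d,2) = d(d-1)/2, for the case of
    undefined relations (one predicate placeholder per term list)."""
    return d * (d - 1) // 2 + d - 1
```

For example, with two binary and three unary relations over $d=3$ variables, $N = 2\cdot 6 + 3\cdot 3 - 1 = 20$; with undefined relations and $d=3$, $N = 3 + 2 = 5$.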
**Algorithm 1: Differentiable substitution method**

**Input:** Variables $X\,(V_1), Y\,(V_2), V_3, \dots, V_d$ with $d \geq 1$; the binary or unary target atom $p_t(X, Y)$ resp. $p_t(X)$; the background knowledge $B$; and the set $\mathbb{E}$ of all constant embeddings.
**Output:** Positive substitution set $\Theta^{+}$ and negative substitution set $\Theta^{-}$.

1. Initialize the substitution sets $\Theta^{+}$ and $\Theta^{-}$ as empty.
2. Update the clustering module and obtain the centroid embedding $g(\mathbf{e})$ for each constant embedding $\mathbf{e} \in \mathbb{E}$.
3. **while** batch size is not reached **do**
4. &nbsp;&nbsp;Initialize two substitutions $\theta^{+}$ and $\theta^{-}$ as empty sets.
5. &nbsp;&nbsp;**if** $p_t$ is a binary predicate **then**
6. &nbsp;&nbsp;&nbsp;&nbsp;Randomly select the constant pair $(\mathbf{e}_x, \mathbf{e}_y)$ for a positive example $p_t(e_x, e_y) \in B$, and another embedding $\mathbf{e}_x^{-} \in \mathbb{E} \setminus \{\mathbf{e}_x\}$.
7. &nbsp;&nbsp;&nbsp;&nbsp;Add $X/g(\mathbf{e}_x), Y/g(\mathbf{e}_y)$ to $\theta^{+}$ and $X/g(\mathbf{e}_x^{-}), Y/g(\mathbf{e}_y)$ to $\theta^{-}$.
8. &nbsp;&nbsp;**else**
9. &nbsp;&nbsp;&nbsp;&nbsp;Randomly select the constant embedding $\mathbf{e}_x$ for a positive example $p_t(e_x) \in B$, and another embedding $\mathbf{e}_x^{-} \in \mathbb{E} \setminus \{\mathbf{e}_x\}$.
10. &nbsp;&nbsp;&nbsp;&nbsp;Add $X/g(\mathbf{e}_x), Y/g(\text{Random}(\mathbb{E}))$ to $\theta^{+}$ and $X/g(\mathbf{e}_x^{-}), Y/g(\text{Random}(\mathbb{E}))$ to $\theta^{-}$.
11. &nbsp;&nbsp;**end if**
12. &nbsp;&nbsp;Randomly choose the constant representations $\mathbf{e}_3^{+}, \dots, \mathbf{e}_d^{+}, \mathbf{e}_3^{-}, \dots, \mathbf{e}_d^{-} \in \text{Random}(\mathbb{E})$.
13. &nbsp;&nbsp;Add $V_3/g(\mathbf{e}_3^{+}), \dots, V_d/g(\mathbf{e}_d^{+})$ to $\theta^{+}$ and $V_3/g(\mathbf{e}_3^{-}), \dots, V_d/g(\mathbf{e}_d^{-})$ to $\theta^{-}$.
14. &nbsp;&nbsp;Add the substitutions $\theta^{+}$ and $\theta^{-}$ to $\Theta^{+}$ and $\Theta^{-}$, respectively.
15. **end while**
16. **return** $\Theta^{+}$ and $\Theta^{-}$.
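The binary-predicate branch of Algorithm 1 can be sketched as below. This is a minimal illustration, not the paper's implementation: `g` stands for the clustering-based generalization function, `embeddings` for $\mathbb{E}$, and negatives are built by corrupting the first argument of a sampled positive fact:

```python
import random

def differentiable_substitutions(positive_pairs, embeddings, g, extra_vars, batch_size):
    """Sample the substitution sets Theta+ / Theta- for a binary target atom.
    positive_pairs: (e_x, e_y) constant pairs with p_t(e_x, e_y) in B.
    embeddings: dict constant -> embedding; g: generalization (clustering) map."""
    theta_pos, theta_neg = [], []
    for _ in range(batch_size):
        e_x, e_y = random.choice(positive_pairs)
        # corrupt the first argument to obtain a negative substitution
        e_x_neg = random.choice([c for c in embeddings if c != e_x])
        pos = {"X": g(embeddings[e_x]), "Y": g(embeddings[e_y])}
        neg = {"X": g(embeddings[e_x_neg]), "Y": g(embeddings[e_y])}
        # remaining variables V3..Vd are grounded by random constants
        for v in extra_vars:
            pos[v] = g(embeddings[random.choice(list(embeddings))])
            neg[v] = g(embeddings[random.choice(list(embeddings))])
        theta_pos.append(pos)
        theta_neg.append(neg)
    return theta_pos, theta_neg
```

A toy call such as `differentiable_substitutions([("a", "b")], {"a": 0.1, "b": 1.1, "c": 2.1}, round, ["V3"], 5)` yields five positive and five negative substitutions whose `X` entries differ.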
### 4.3 Differentiable Rule Learning Process
Each substitution in the substitution set can be regarded as a tensor of constant embeddings corresponding to all variables in the learned program $P$. Moreover, each substitution generates a training example $(\mathbf{x}, y)$, where $\mathbf{x}$ encodes the Boolean values of all possible body atoms. When relations are defined, $y$ indicates the Boolean value of the target atom; when relations are not defined, $y$ indicates the class label of the image instance. We describe below how the training examples for the differentiable rule learning module are generated from each substitution.
#### Relations are predefined.
For each non-ground atom $\alpha$ with a predicate $r$ and term list $\alpha_T$, we concatenate the constant embeddings to build the embedding of the ground atom $\alpha\theta$ as follows:

$$\alpha\theta = \mathbf{r} \oplus V_{i_1}\theta \oplus V_{i_2}\theta \oplus \dots \oplus V_{i_n}\theta, \quad V_{i_j} \in \alpha_T,\ \theta \in \Theta^{+} \cup \Theta^{-}. \tag{3}$$

Then, we determine the ground truth of $\alpha\theta$ using a *lookup function* $L$, where $L(\alpha\theta) = 1$ if $\alpha\theta \in K_B$, and $L(\alpha\theta) = 0$ otherwise. Given all possible body atoms $\alpha_1, \dots, \alpha_N$, for each positive substitution $\theta^{+} \in \Theta^{+}$ we generate the positive input $\mathbf{x}^{+} = [L(\alpha_1\theta^{+}), \dots, L(\alpha_N\theta^{+})]$ with label $y^{+} = 1$. Similarly, for each negative substitution $\theta^{-} \in \Theta^{-}$ we generate the negative input $\mathbf{x}^{-} = [L(\alpha_1\theta^{-}), \dots, L(\alpha_N\theta^{-})]$ with label $y^{-} = 0$. With tensor operations, we can look up the ground-truth values of all possible body atoms under a batch of substitutions in $\Theta^{+}$ and $\Theta^{-}$, generate training examples $(\mathbf{x}, y)$, and train rule networks on GPUs in parallel.
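The lookup-based construction of a training example can be sketched as follows; this is a toy reconstruction (atoms as predicate/term-list tuples, the knowledge base as a set of ground facts), not the paper's tensorized GPU implementation:

```python
def lookup(atom, theta, kb):
    """L(alpha theta): 1 if the grounded atom is in the knowledge base K_B."""
    pred, terms = atom
    grounded = (pred, tuple(theta[t] for t in terms))
    return 1 if grounded in kb else 0

def make_example(body_atoms, theta, kb, label):
    """x encodes Boolean values of all candidate body atoms; y is the label."""
    x = [lookup(a, theta, kb) for a in body_atoms]
    return x, label

kb = {("succ", ("1", "2")), ("succ", ("2", "3"))}
atoms = [("succ", ("X", "Y")), ("succ", ("Y", "X"))]
x, y = make_example(atoms, {"X": "1", "Y": "2"}, kb, 1)
# x is [1, 0]: succ(1,2) holds in K_B, succ(2,1) does not
```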
#### Relations are undefined.
When relevant predicates are not present in the training data, $\gamma$ILP can still learn first-order rules with predicate placeholders. Let $\alpha$ be a possible body atom and $\alpha_T$ its term list. The lookup function $L(\alpha\tilde{\theta})$ for the Boolean value of the ground atom $\alpha\tilde{\theta}$ is defined as:

$$L(\alpha\tilde{\theta}) = L(\tilde{V}_{i_1}\tilde{\theta}) \land L(\tilde{V}_{i_2}\tilde{\theta}) \land \dots \land L(\tilde{V}_{i_n}\tilde{\theta}), \quad \tilde{V}_{i_j} \in \alpha_T,$$

which states that the ground atom $\alpha\tilde{\theta}$ with a placeholder predicate is true if all symbolic assignments of the image constants substituting the variables in $\alpha$ under $\tilde{\theta}$ are in $K_B$ simultaneously; otherwise, it is false. Applying the lookup function to all body-atom Boolean values under a substitution yields one training instance $\mathbf{x}$, whose label $y$ indicates the class of the image instance. For Kandinsky patterns, an image instance includes multiple image constants; in each epoch, the substitution grounds the variables into random constant representations within an image instance.
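The conjunction-based lookup for a placeholder atom can be sketched in a few lines; this is an illustrative reconstruction, with the symbolic assignments in $K_B$ modeled as a simple set:

```python
def placeholder_lookup(term_vars, theta, kb):
    """L(alpha theta~) for a placeholder atom: true only if every symbolic
    assignment under theta~ lies in K_B simultaneously (a conjunction)."""
    return int(all(theta[v] in kb for v in term_vars))

kb = {"c1", "c3"}                 # symbolic assignments currently in K_B
theta = {"V1": "c1", "V2": "c2"}  # cluster-centroid assignments (toy values)
# p(V1, V2) is false because c2 is not in K_B, while p(V1) alone is true
```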
Overall, for both the defined-relations and undefined-relations learning settings, the loss function $H$ can be summarized as follows:

$$H = \text{MSE}(y, \text{RuleNetwork}(\mathbf{x})) + \lambda \cdot L_{\text{cluster}}, \tag{4}$$

where $\text{RuleNetwork}(\mathbf{x})$ is defined in Eq. ([1](https://arxiv.org/html/2604.07897#S3.E1)) and $L_{\text{cluster}}$ in Eq. ([2](https://arxiv.org/html/2604.07897#S3.E2)). We jointly train the rule learning network and the clustering module, enabling simultaneous adjustment of the generalized embeddings of the constants and of the rule structures. The rules are extracted from the trained rule networks.
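Eq. (4) combines the two loss terms in a single scalar; a minimal numeric sketch (toy values for the rule network outputs and the clustering loss, not the paper's modules):

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error between labels and RuleNetwork outputs."""
    return float(np.mean((y - y_hat) ** 2))

def joint_loss(y, y_hat, cluster_loss, lam=1.0):
    """H = MSE(y, RuleNetwork(x)) + lambda * L_cluster (Eq. 4)."""
    return mse(y, y_hat) + lam * cluster_loss

y = np.array([1.0, 0.0])          # example labels
y_hat = np.array([0.8, 0.2])      # toy RuleNetwork outputs
h = joint_loss(y, y_hat, cluster_loss=0.05, lam=1.0)  # 0.04 + 0.05 = 0.09
```

Minimizing `h` with gradient descent updates the rule network and the clustering module jointly, which is what allows the generalized constant embeddings and the rule structure to co-adapt.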
After training the model, the rules are extracted from the logic program tensor $\mathbf{M}_P$. When the relations are not predefined, we can induce the semantics of a predicate placeholder in an atom $\alpha$ from the order of its constrained variables and the constants in the clusters corresponding to those variables. As shown by (Gubelmann, [2024](https://arxiv.org/html/2604.07897#bib.bib60)), LLMs can infer linguistic meaning from their pre-trained knowledge without extra labels. We therefore use LLMs as a function $\text{QueryLLM}(\text{prompt}, C)$ that translates the semantics of predicate placeholders, presented as the constant images $C$ under their constrained variables, into natural language via a well-designed prompt, as described in Algorithm [2](https://arxiv.org/html/2604.07897#alg2). Moreover, Algorithm [2](https://arxiv.org/html/2604.07897#alg2) constructs the final logic program $P$ (Line [11](https://arxiv.org/html/2604.07897#alg2.l11)) by merging LLM-induced predicates into a generalized predicate whose variables represent arbitrary constants.
**Algorithm 2: Inducing semantics of predicate placeholders**

**Input:** Constrained logic program $P_t$ with predicate placeholders.
**Output:** The learned logic program $P$.

1. **for** each rule $r$ in $P_t$ **do**
2. &nbsp;&nbsp;**for** each binary atom $\_(V_i, V_j)$ **do**
3. &nbsp;&nbsp;&nbsp;&nbsp;$C_i, C_j = \text{RetrieveConstants}(V_i, V_j)$.
4. &nbsp;&nbsp;&nbsp;&nbsp;$\_ \leftarrow \text{QueryLLMs}($“What is the relation between the two ordered sets of images?”$, C_i, C_j)$.
5. &nbsp;&nbsp;**end for**
6. &nbsp;&nbsp;**for** each unary atom $\_(V_i)$ **do**
7. &nbsp;&nbsp;&nbsp;&nbsp;$C_i = \text{RetrieveConstants}(V_i)$.
8. &nbsp;&nbsp;&nbsp;&nbsp;$\_ \leftarrow \text{QueryLLMs}($“What is the common property of the set of images?”$, C_i)$.
9. &nbsp;&nbsp;**end for**
10. **end for**
11. Generalize the rules in $P_t$ using the induced predicate semantics to obtain the final logic program $P$.
12. **return** $P$.
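The query loop of Algorithm 2 can be sketched as below; `query_llm` is a hypothetical stand-in for the paper's QueryLLM function (an actual system would send the constant images and prompt to a multimodal LLM), and the rule representation is ours:

```python
def induce_semantics(rules, retrieve_constants, query_llm):
    """For each placeholder atom, retrieve the constants behind its
    constrained variables and ask the LLM for the relation (binary atoms)
    or common property (unary atoms)."""
    semantics = {}
    for rule in rules:
        for pred, variables in rule["body"]:
            const_sets = [retrieve_constants(v) for v in variables]
            if len(variables) == 2:
                prompt = "What is the relation between the two ordered sets of images?"
            else:
                prompt = "What is the common property of the set of images?"
            semantics[pred] = query_llm(prompt, const_sets)
    return semantics

# toy run with a mocked LLM in place of a real multimodal model
rules = [{"body": [("p1", ["V1", "V2"]), ("p2", ["V3"])]}]
mock_llm = lambda prompt, sets: ("same shape, different color"
                                 if "relation" in prompt else "red")
semantics = induce_semantics(rules, lambda v: [v], mock_llm)
```

The returned `semantics` dictionary corresponds to the per-placeholder answers that Line 11 then merges into generalized predicates.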
## 5 Experimental Results
Rules are evaluated by *precision* and *recall*: precision is the fraction of substitutions satisfying both the body and the head among those satisfying the body, and recall is the fraction of ground-truth positives correctly induced (Gao et al., [2024](https://arxiv.org/html/2604.07897#bib.bib23)). Precision reflects the correctness of a rule or logic program in avoiding false positives, while recall reflects its completeness in classifying all target labels by avoiding false negatives. We use the AdamW optimizer (Loshchilov and Hutter, [2019](https://arxiv.org/html/2604.07897#bib.bib45)) to train $\gamma$ILP. We run $\gamma$ILP on classical ILP datasets (Evans and Grefenstette, [2018](https://arxiv.org/html/2604.07897#bib.bib28)) with explicit constant and relation labels to assess its ILP capability, and compare $\gamma$ILP with $\partial$ILP and DFORL. We also validate the inductive learning abilities of LLMs (GPT-5 and Gemini 2.5 Pro) on the classical ILP datasets and compare their results with $\gamma$ILP. All non-LLM experiments were run on a Linux server (7-core Intel 8362, 245 GB RAM, NVIDIA A100). The results show that Gemini 2.5 Pro learns correct (precision 1) and complete (recall 1) logic programs, $\gamma$ILP learns correct rules with three variables efficiently, and GPT-5 learns correct rules for a reduced number of input instances. Detailed results and average running times of $\gamma$ILP are given in Table [4](https://arxiv.org/html/2604.07897#A1.T4) of Appendix [A](https://arxiv.org/html/2604.07897#A1).
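The substitution-based precision and recall definitions above can be made concrete with a small sketch; the rule, its body/head tests, and the constant domain are toy choices of ours:

```python
def rule_metrics(substitutions, body, head):
    """Precision: fraction of body-satisfying substitutions that also satisfy
    the head. Recall: fraction of ground-truth positives (head-satisfying
    substitutions) covered by the rule."""
    body_sat = [s for s in substitutions if body(s)]
    both_sat = [s for s in body_sat if head(s)]
    positives = [s for s in substitutions if head(s)]
    precision = len(both_sat) / len(body_sat) if body_sat else 0.0
    recall = len(both_sat) / len(positives) if positives else 0.0
    return precision, recall

# toy rule even(X) <- divisible_by_four(X) over constants 0..9:
# every body match is a true positive (precision 1), but the body misses
# the even numbers not divisible by four (recall 3/5)
subs = list(range(10))
p, r = rule_metrics(subs, body=lambda x: x % 4 == 0, head=lambda x: x % 2 == 0)
```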
### 5.1 Reasoning on Relational Image Datasets
We evaluate $\gamma$ILP's inductive reasoning ability on relational images using the benchmarks of Evans and Grefenstette ([2018](https://arxiv.org/html/2604.07897#bib.bib28)), with MNIST digits as constants and without leaking their labels. We build the datasets by replacing the constants in the classical ILP datasets with two MNIST images of the corresponding label; one relational fact is used for training and another for testing. Relations describing image constants are given in text format, and we use a pre-trained VAE as their encoder. For the image constants, we use a pre-trained ViT or VAE as the encoder, denoting the resulting models $\gamma$ILP-ViT and $\gamma$ILP-VAE, respectively. Since $\gamma$ILP is the first model to learn rules from relational image datasets without symbolic label leakage, and LLMs can also act as inductive rule learners without image labels as inputs, we compare $\gamma$ILP with state-of-the-art multimodal LLMs, including Gemini 2.5 Pro, GPT-o3, and GPT-5. To prevent label leakage when testing the LLMs as symbolic solvers, and to avoid revealing relation semantics, we replace each relation with a random string. We set $\lambda = 1$ and use 10 clusters (matching the number of digit classes). We retain only rules with a precision of 1 in the learned logic program, and the average recall of the learned logic program over ten runs is reported in Table [1](https://arxiv.org/html/2604.07897#S5.T1). The results show that GPT-5 learns complete rules, while Gemini 2.5 Pro and GPT-o3 learn incorrect rules on some tasks when the relation semantics are hidden. $\gamma$ILP learns complete rules except on the Fizz and Buzz datasets.
On the Fizz and Buzz datasets, the correct rules are $\texttt{fizz}(X) \leftarrow \texttt{fizz}(Y), \texttt{succ}(Y, Z_1), \texttt{succ}(Z_1, Z_2), \texttt{succ}(Z_2, X)$ and $\texttt{buzz}(X) \leftarrow \texttt{buzz}(Y), \texttt{succ}(Y, Z_1), \texttt{succ}(Z_1, Z_2), \texttt{succ}(Z_2, Z_3), \texttt{succ}(Z_3, Z_4), \texttt{succ}(Z_4, X)$, respectively. These rules require at least 4 and 6 variables, respectively, which exceeds the rule length that $\gamma$ILP can effectively induce within the time limit.
Table 1: Recall of the learned logic program on relational image datasets. Best results are shown in bold. A '–' indicates that the rules were learned correctly but incompletely.

We evaluate the rule-learning ability of $\gamma$ILP on the temporal MNIST sequence task (Evans et al., [2021](https://arxiv.org/html/2604.07897#bib.bib33)), using unlabeled MNIST images as constants in place of annotated relations. In MNIST sequences, each image is assigned a positional index. Let $\texttt{succ}(e_1, e_2)$ denote that the label of image $e_2$ is the successor of the label of image $e_1$, $\texttt{start}(e) = \text{True}$ indicate that $e$ is the image at the first index, and $\texttt{before}_n(e_1, e_2)$ indicate that image $e_1$ precedes image $e_2$ by $n$ indices. The input image sequence is 0, 5, 1, 5, 2, 5, 3, 5, 4, 5, 5, 5, …, with training limited to the first 12 images (Figure [4](https://arxiv.org/html/2604.07897#A2.F4), Appendix [B](https://arxiv.org/html/2604.07897#A2)). Since the Apperception Engine (Evans et al., [2021](https://arxiv.org/html/2604.07897#bib.bib33)) requires a pre-trained MNIST model and a logic template, its rule format differs; thus, we present only the rules generated by $\gamma$ILP. Let the index of the first image be 1, and set $\texttt{target}(e)$ to True for every image $e$ with an even index. Then the learned rule with precision 1, identical whether VAE or ViT is used as the encoder, is: $\texttt{target}(X) \leftarrow \texttt{before}_8(X, Y) \land \texttt{before}_{10}(X, Y) \land \texttt{target}(Y)$. This captures either variable $X$ or $Y$ representing two images labeled 5 at different indices, where the distance between the two images is 2; it also captures regularities at distances 8 and 10 among the images labeled 5.
Next, set $\texttt{target}(e)$ to True for every image $e$ with an odd index. The rule learned by $\gamma$ILP with precision and recall 1, identical for the ViT and VAE encoders, is $\texttt{target}(X) \leftarrow \texttt{succ}(X, Y) \land \texttt{before}_2(X, Y) \land \texttt{target}(Y)$, stating that if two images are two indices apart, the latter's label succeeds the former's.
### 5.2 Reasoning with Predicate Invention
We apply $\gamma$ILP to classify binary Kandinsky patterns (Müller and Holzinger, [2021](https://arxiv.org/html/2604.07897#bib.bib34)) and assess its predicate invention ability without leaking constant labels. Each Kandinsky image instance includes multiple image objects as constants. Constant relations, though undefined in the instance, are essential for describing positive instances in first-order logic. Each constant in a Kandinsky instance has a color (red, blue, yellow) and a shape (circle, square, triangle). Three patterns are illustrated in Figs. [2(a)](https://arxiv.org/html/2604.07897#S5.F2.sf1)–[2(c)](https://arxiv.org/html/2604.07897#S5.F2.sf3): two-pair (two disjoint object pairs of the same shape, one pair sharing color and the other differing), one-red (at least one red object), and one-triangle (at least one triangle-shaped object). We extract all non-grey subareas as image constants and learn first-order rules with placeholder predicates to represent the class of a Kandinsky image instance.
Figure 2: Kandinsky patterns for tasks (a) two-pair (TP), (b) one-red (OR), and (c) one-triangle (OT). Constants represented by variables in the learned rules for TP are shown in (d).

We used 30 instances per Kandinsky pattern for training and 30 for testing, with balanced positive and negative instances. Classification accuracy across models is shown in Table [2](https://arxiv.org/html/2604.07897#S5.T2), including the CNN-based model ResNet (He et al., [2016](https://arxiv.org/html/2604.07897#bib.bib46)), ViT, YOLO v5 (Redmon et al., [2016](https://arxiv.org/html/2604.07897#bib.bib47)) with an MLP layer, and prominent LLMs. Since we also evaluate LLMs, all learning strategies follow a few-shot setting. Each experiment was run ten times, and we report the highest accuracy for the best interpretability. Learned rules from baselines and the sensitivity of $\gamma$ILP are discussed in Appendices [E](https://arxiv.org/html/2604.07897#A5) and [F](https://arxiv.org/html/2604.07897#A6), respectively.
Table 2: Classification accuracy on Kandinsky tasks: two-pair (TP), one-red (OR), and one-triangle (OT). VM, RM, GM, YL, G, and G-o4 refer to vision models, reasoning models, Gemini 2.5 Pro, YOLO, GPT, and GPT-o4-mini.

When interpreting the learned rules, we found that both the ViT-based and the VAE-based encoders recover the correct rules at their best accuracy. For the two-pair task, we obtained two rules with predicate placeholders $p_1$ and $p_2$: $\texttt{Positive} \leftarrow p_1(V, R)$¹ and $\texttt{Positive} \leftarrow p_2(X, Y)$. Figure [2(d)](https://arxiv.org/html/2604.07897#S5.F2.sf4) shows the constants represented by the clusters $V$, $R$, $X$, and $Y$. The relation between the constants under clusters $V$ and $R$ (or $X$ and $Y$) is having the same shape but different colors. More generated rules are given in Appendix [C](https://arxiv.org/html/2604.07897#A3). To translate the semantics of the placeholder predicates into natural language, we first randomly choose 20 constants from all constants, then input the constants under the constrained variables of the learned rules to the LLMs, along with a well-defined prompt. Specifically, the LLMs generated the semantics for predicates $p_1$ and $p_2$ as "same shape (triangle) with different colors" and "same shape (circle) with different colors", respectively.

¹For simplicity, we rewrote the first-order rule $\texttt{Positive}(I) \leftarrow p_1(V, R) \land \texttt{Include}(V, R, I)$, where the variable $I$ can be substituted by an image instance that includes the image objects substituting the variables $V$ and $R$.
Generalizing all predicate semantics induced by the LLMs, as instructed by Line [11](https://arxiv.org/html/2604.07897#alg2.l11) of Algorithm [2](https://arxiv.org/html/2604.07897#alg2), we obtain the final rule $\texttt{Positive} \leftarrow \texttt{same\_shape\_and\_different\_color}(X, Y)$, where $X$ and $Y$ denote any two constants in the image instance. The rule states that if two constants in an instance share the same shape but differ in color, the instance is a two-pair pattern. The rule achieves a recall of 1 but a precision below 1, as it ignores the other pair of constants with the same color and shape.
For one-red, the two learned rules are $\texttt{Positive} \leftarrow p_1(T)$ and $\texttt{Positive} \leftarrow p_2(U)$, with the constants represented by $T$ and $U$ shown in Figure [3(a)](https://arxiv.org/html/2604.07897#S5.F3.sf1). The LLM-translated semantics for the placeholders are $p_1 =$ "shape in square, color in red" and $p_2 =$ "shape in circle or triangle, color in red". Generalizing these yields the final rule $\texttt{Positive} \leftarrow \texttt{color\_in\_red}(U)$: if any red constant occurs in an instance, the instance is a one-red pattern. The precision and recall of the rule are both 1.
Figure 3: Constants represented by clusters for the (a) one-red and (b) one-triangle patterns.

For the one-triangle task, a learned rule is $\texttt{Positive} \leftarrow p(Z)$, where the constants represented by $Z$ are shown in Figure [3(b)](https://arxiv.org/html/2604.07897#S5.F3.sf2). The LLM-translated semantics of the unary predicate $p$ is "all constants are in the shape of a triangle". Generalizing the rule yields $\texttt{Positive} \leftarrow \texttt{shape\_in\_triangle}(Z)$. The precision and recall of the rule are both 1.
The LLMs used for translating semantics are Gemini 2.5 Pro, GPT-5, and GPT-o3. As shown in Table [5](https://arxiv.org/html/2604.07897#A4.T5) of Appendix [D](https://arxiv.org/html/2604.07897#A4), they output the same semantics for each predicate placeholder, which indicates that the predicate semantics learned by $\gamma$ILP are easily translated by current LLMs.
## 6 Conclusion
In this work, we presented $\gamma$ILP, a fully differentiable rule-based inductive learning pipeline for relational images and pure images that requires no symbolic labels for image constants. First, $\gamma$ILP transforms the symbol grounding process by employing encoders and a clustering module to assign representations to image constants. Second, the differentiable ground substitution enables first-order rule learning on GPUs. Third, it tackles predicate invention by interpreting the constants represented by the variables in the learned first-order rules. In our experimental evaluation, we considered classical ILP datasets, relational image datasets, and pure image datasets such as Kandinsky patterns. The results show that $\gamma$ILP effectively learns first-order logic rules, achieves strong classification performance, and successfully induces predicate semantics. For future work, we believe that learning rules that explain images with spatial information (Zhang et al., [2019](https://arxiv.org/html/2604.07897#bib.bib72)), introducing a simple language bias to learn longer rules, and handling multimodal inputs are promising directions.
## References
- K. G. Baugh, N. Cingillioglu, and A. Russo (2023). Neuro-symbolic rule learning in real-world classification tasks. In Proceedings of the AAAI 2023 Spring Symposium on Challenges Requiring the Combination of Machine Learning and Knowledge Engineering (AAAI-MAKE 2023), San Francisco, CA, USA, CEUR Workshop Proceedings, Vol. 3433.
- K. G. Baugh, L. Dickens, and A. Russo (2025). Neural DNF-MT: a neuro-symbolic approach for learning interpretable and editable policies. In Proceedings of the 24th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2025), Detroit, MI, USA, pp. 252–260.
- W. L. Buntine (1988). Generalized subsumption and its applications to induction and redundancy. Artificial Intelligence 36(2), pp. 149–176.
- W. W. Cohen (1995). Fast effective rule induction. In Proceedings of the Twelfth International Conference on Machine Learning (ICML 1995), Tahoe City, CA, USA, pp. 115–123.
- A. Creswell and M. Shanahan (2022). Faithful reasoning using large language models. CoRR abs/2208.14271.
- A. Cropper and S. Dumancic (2022). Inductive logic programming at 30: a new introduction. Journal of Artificial Intelligence Research 74, pp. 765–850.
- A. Cropper and S. H. Muggleton (2016). Learning higher-order logic programs through abstraction and invention. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI 2016), New York, NY, USA, pp. 1418–1424.
- D. Cunnington, M. Law, J. Lobo, and A. Russo (2023). Neuro-symbolic learning of answer set programs from raw data. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI 2023), Macao, SAR, China, pp. 3586–3596.
- D. Cunnington, M. Law, J. Lobo, and A. Russo (2024). The role of foundation models in neuro-symbolic learning and reasoning. In Neural-Symbolic Learning and Reasoning – 18th International Conference (NeSy 2024), Barcelona, Spain, Lecture Notes in Computer Science, Vol. 14979, pp. 84–100.
- J. P. G. de Souza, D. S. Carvalho, and A. Freitas (2025). Inductive learning of logical theories with LLMs: an expressivity-graded analysis. In AAAI-25, Philadelphia, PA, USA, pp. 23752–23759.
- A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby (2021). An image is worth 16x16 words: transformers for image recognition at scale. In 9th International Conference on Learning Representations (ICLR).
- R. Dwivedi, D. Dave, H. Naik, S. Singhal, O. F. Rana, P. Patel, B. Qian, Z. Wen, T. Shah, G. Morgan, and R. Ranjan (2023). Explainable AI (XAI): core ideas, techniques, and solutions. ACM Computing Surveys 55(9), pp. 194:1–194:33.
- R. Evans, M. Bosnjak, L. Buesing, K. Ellis, D. P. Reichert, P. Kohli, and M. J. Sergot (2021). Making sense of raw input. Artificial Intelligence 299, 103521.
- R. Evans and E. Grefenstette (2018). Learning explanatory rules from noisy data. Journal of Artificial Intelligence Research 61, pp. 1–64.
- M. M. Fard, T. Thonet, and É. Gaussier (2020). Deep *k*-means: jointly clustering with *k*-means and learning representations. Pattern Recognition Letters 138, pp. 185–192.
- M. V. M. França, G. Zaverucha, and A. S. d'Avila Garcez (2014). Fast relational learning using bottom clause propositionalization with artificial neural networks. Machine Learning 94(1), pp. 81–104.
- K. Gao, K. Inoue, Y. Cao, H. Wang, and Y. Feng (2025). Differentiable rule induction from raw sequence inputs. In 13th International Conference on Learning Representations (ICLR).
- K. Gao, K. Inoue, Y. Cao, and H. Wang (2022a). Learning first-order rules with differentiable logic program semantics. In Proceedings of the 31st International Joint Conference on Artificial Intelligence (IJCAI), pp. 3008–3014.
- K. Gao, K. Inoue, Y. Cao, and H. Wang (2024). A differentiable first-order rule learner for inductive logic programming. Artificial Intelligence 331, 104108.
- K. Gao, H. Wang, Y. Cao, and K. Inoue (2022b). Learning from interpretation transition using differentiable logic programming semantics. Machine Learning 111(1), pp. 123–145.
- E. Gentili, T. Ribeiro, F. Riguzzi, and K. Inoue (2025). Predicate renaming via large language models. CoRR abs/2510.25517.
- R. Gubelmann (2024). Pragmatic norms are all you need – why the symbol grounding problem does not apply to LLMs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024), Miami, FL, USA, pp. 11663–11678.
- S. Han, H. Schoelkopf, Y. Zhao, Z. Qi, M. Riddell, W. Zhou, J. Coady, D. Peng, Y. Qiao, L. Benson, L. Sun, A. Wardle-Solano, H. Szabó, E. Zubova, M. Burtell, J. Fan, Y. Liu, B. Wong, M. Sailor, A. Ni, L. Nan, J. Kasai, T. Yu, R. Zhang, A. R. Fabbri, W. Kryscinski, S. Yavuz, Y. Liu, X. V. Lin, S. Joty, Y. Zhou, C. Xiong, R. Ying, A. Cohan, and D. Radev (2024). FOLIO: natural language reasoning with first-order logic. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024), Miami, FL, USA, pp. 22017–22031.
- S. Harnad (1990). The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1), pp. 335–346.
- K. He, X. Zhang, S. Ren, and J. Sun (2016). Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, pp. 770–778.
- C. Hocquette, A. Niskanen, R. Morel, M. Järvisalo, and A. Cropper (2024). Learning big logical rules by joining small rules. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI 2024), Jeju, South Korea, pp. 3430–3438.
- K. Inoue, T. Ribeiro, and C. Sakama (2014). Learning from interpretation transition. Machine Learning 94(1), pp. 51–79.
- E. Jang, S. Gu, and B. Poole (2017). Categorical reparameterization with Gumbel-softmax. In 5th International Conference on Learning Representations (ICLR).
- T. Kaminski, T. Eiter, and K. Inoue (2018). Exploiting answer set programming with external sources for meta-interpretive learning. Theory and Practice of Logic Programming 18(3–4), pp. 571–588.
- D. Kaur, S. Uslu, K. J. Rittichier, and A. Durresi (2023). Trustworthy artificial intelligence: a review. ACM Computing Surveys 55(2), pp. 39:1–39:38.
- D. P. Kingma and M. Welling (2014). Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations (ICLR 2014), Banff, AB, Canada.
- S. Kok and P. M. Domingos (2007). Statistical predicate invention. In Proceedings of the Twenty-Fourth International Conference on Machine Learning (ICML 2007), Corvallis, OR, USA, pp. 433–440.
- J. Li, P. Cao, Z. Jin, Y. Chen, K. Liu, and J. Zhao (2025). MIRAGE: evaluating and explaining inductive reasoning process in language models. In 13th International Conference on Learning Representations (ICLR).
- Z. Li, Y. Yao, T. Chen, J. Xu, C. Cao, X. Ma, and J. Lü (2023). Softened symbol grounding for neuro-symbolic systems. In The Eleventh International Conference on Learning Representations (ICLR 2023), Kigali, Rwanda.
- H. Liu, Z. Teng, L. Cui, C. Zhang, Q. Zhou, and Y. Zhang (2023). LogiCoT: logical chain-of-thought instruction tuning. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, pp. 2908–2921.
- J. W. Lloyd (1984). Foundations of Logic Programming, 1st edition. Springer.
- I. Loshchilov and F. Hutter (2019). Decoupled weight decay regularization. In 7th International Conference on Learning Representations (ICLR 2019), New Orleans, LA, USA.
- S. H. Muggleton and W. L. Buntine (1988). Machine invention of first-order predicates by inverting resolution. In Proceedings of the Fifth International Conference on Machine Learning (ICML 1988), Ann Arbor, MI, USA, pp. 339–352.
- S. H. Muggleton and L. De Raedt (1994). Inductive logic programming: theory and methods. Journal of Logic Programming 19/20, pp. 629–679.
- S. Muggleton (1991). Inductive logic programming. New Generation Computing 8, pp. 295–318.
- H. Müller and A. Holzinger (2021). Kandinsky patterns. Artificial Intelligence 300, 103546.
- G. D. Plotkin (1970). A note on inductive generalization. Machine Intelligence 5, pp. 153–163.
- J. R. Quinlan (1990). Learning logical definitions from relations. Machine Learning 5, pp. 239–266.
- J\. R\. Quinlan \(1993\)C4\.5: programs for machine learning\.Morgan Kaufmann\.Cited by:[Appendix E](https://arxiv.org/html/2604.07897#A5.p2.2)\.
- J\. Redmon, S\. K\. Divvala, R\. B\. Girshick, and A\. Farhadi \(2016\)You only look once: unified, real\-time object detection\.In2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27\-30, 2016,pp\. 779–788\.Cited by:[§5\.2](https://arxiv.org/html/2604.07897#S5.SS2.p2.1)\.
- H\. Shindo, V\. Pfanschilling, D\. S\. Dhami, and K\. Kersting \(2023\)α\\alphaILP: thinking visual scenes as differentiable logic programs\.Mach\. Learn\.112\(5\),pp\. 1465–1497\.Cited by:[§1](https://arxiv.org/html/2604.07897#S1.p1.1),[§1](https://arxiv.org/html/2604.07897#S1.p2.1),[§2](https://arxiv.org/html/2604.07897#S2.SS0.SSS0.Px2.p1.2)\.
- A\. Srinivasan \(2001\)The ALEPH manual\.Machine Learning at the Computing Laboratory, Oxford University\.Cited by:[§2](https://arxiv.org/html/2604.07897#S2.SS0.SSS0.Px1.p1.2)\.
- S\. Topan, D\. Rolnick, and X\. Si \(2021\)Techniques for symbol grounding with SATNet\.InAdvances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6\-14, 2021, virtual,pp\. 20733–20744\.Cited by:[§1](https://arxiv.org/html/2604.07897#S1.p2.1),[§2](https://arxiv.org/html/2604.07897#S2.SS0.SSS0.Px2.p1.2)\.
- M\. H\. van Emden and R\. A\. Kowalski \(1976\)The semantics of predicate logic as a programming language\.J\. ACM23\(4\),pp\. 733–742\.External Links:[Document](https://dx.doi.org/10.1145/321978.321991)Cited by:[§3\.1](https://arxiv.org/html/2604.07897#S3.SS1.p2.17)\.
- P\. Wang, P\. L\. Donti, B\. Wilder, and J\. Z\. Kolter \(2019\)SATNet: bridging deep learning and logical reasoning using a differentiable satisfiability solver\.InProceedings of the 36th International Conference on Machine Learning, ICML 2019, 9\-15 June 2019, Long Beach, California, USA,Vol\.97,pp\. 6545–6554\.Cited by:[§2](https://arxiv.org/html/2604.07897#S2.SS0.SSS0.Px2.p1.2)\.
- T\. Xie, Z\. Gao, Q\. Ren, H\. Luo, Y\. Hong, B\. Dai, J\. Zhou, K\. Qiu, Z\. Wu, and C\. Luo \(2025\)Logic\-RL: unleashing LLM reasoning with rule\-based reinforcement learning\.CoRRabs/2502\.14768\.Cited by:[§1](https://arxiv.org/html/2604.07897#S1.p1.1)\.
- C\. Zhang, F\. Gao, B\. Jia, Y\. Zhu, and S\. Zhu \(2019\)RAVEN: A dataset for relational and analogical visual reasoning\.InIEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16\-20, 2019,pp\. 5317–5327\.Cited by:[§6](https://arxiv.org/html/2604.07897#S6.p1.3)\.
- Z\. Zhou \(2017\)A brief introduction to weakly supervised learning\.National Science Review5\(1\),pp\. 44–53\.Cited by:[§3\.2](https://arxiv.org/html/2604.07897#S3.SS2.p5.9)\.
## Appendix A Statistical Information of ILP Datasets
To assess our differentiable substitution method, we evaluate γILP on classical ILP datasets (Evans and Grefenstette, [2018](https://arxiv.org/html/2604.07897#bib.bib28)). Constants are textual; thus, we set g(**x**) = **x** and λ = 0 in Eq. ([4](https://arxiv.org/html/2604.07897#S4.E4)). We use a pre-trained VAE as the encoder for textual relations and constants. We report results for the baselines ∂ILP (Evans and Grefenstette, [2018](https://arxiv.org/html/2604.07897#bib.bib28)), DFORL (Gao et al., [2024](https://arxiv.org/html/2604.07897#bib.bib23)), Gemini 2.5 Pro, and GPT-5. Each experiment was run ten times under different random seeds. The maximum running time for γILP is set to 5 minutes. We report the average recall of the learned rules whose precision equals 1.
For ∂ILP, we calculated recall based on the rules reported in its paper. To eliminate predicate-semantics leakage, we replaced each predicate name with a random string, kept consistent across runs for all models. The number of constants and relations in the training set of each classical inductive logic programming (ILP) task is shown in Table [3](https://arxiv.org/html/2604.07897#A1.T3). When test data is available, as in the Husband and Uncle tasks, we compute the recall of the learned rules on the test set; otherwise, recall is computed on facts involving constants not seen during training.
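To make this evaluation protocol concrete, the following sketch (ours, not the paper's code; the helper names are illustrative) computes precision and recall for the Predecessor rule pre(X, Y) ← succ(Y, X) on facts over constants unseen during training:

```python
# Illustrative sketch (not the paper's code): evaluating a learned rule's
# precision and recall, here for the Predecessor rule pre(X, Y) <- succ(Y, X).

def apply_pre_rule(succ_facts):
    """Apply pre(X, Y) <- succ(Y, X) to the background succ/2 facts."""
    return {(x, y) for (y, x) in succ_facts}

succ_facts = {(i, i + 1) for i in range(20)}       # background knowledge
all_pre = {(i + 1, i) for i in range(20)}          # ground-truth pre/2 facts
test_pre = {p for p in all_pre if p[0] > 10}       # constants unseen in training

derived = apply_pre_rule(succ_facts)
precision = len(derived & all_pre) / len(derived)  # rules count only if this is 1
recall = len(derived & test_pre) / len(test_pre)
print(precision, recall)  # 1.0 1.0
```

A rule set with precision below 1 would simply be excluded before the recall average is taken.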
Table 3: Statistical information of the ILP datasets.

The expected rules learned by γILP on the classical ILP datasets are the same as in (Gao et al., [2024](https://arxiv.org/html/2604.07897#bib.bib23)). The recall comparison on the ILP datasets is shown in Table [4](https://arxiv.org/html/2604.07897#A1.T4). γILP finds the expected rules in all classical ILP tasks except Fizz and Buzz, matching the performance of DFORL. In the Fizz and Buzz datasets, the rules with recall 1 contain four and six variables, respectively. Consequently, the search space for γILP and DFORL becomes enormous without constraints such as the logical templates used in ∂ILP. The rules learned by γILP in the Fizz and Buzz tasks are:
fizz(X) ← zero(X).  (5)
fizz(X) ← succ(Z,Y), fizz(Y), succ(Z,X).  (6)
buzz(X) ← zero(X).  (7)
buzz(X) ← succ(Z,Y), buzz(Y), succ(Z,X).  (8)

In γILP, the learned rules only describe the constant zero and fail to capture the numbers divisible by three and five in the Fizz and Buzz datasets, respectively. The rules reported for ∂ILP cover most tasks, except the Husband task, where the rules are incomplete. When given all relational facts as input, Gemini 2.5 Pro can also induce the expected logic rules in all tasks. Without any data preprocessing, however, GPT-5 cannot learn from the large Husband and Uncle datasets, showing limited scalability; when we select only the first 50 relational facts as input, GPT-5 also induces the correct rules on both datasets.
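The failure mode of rules (5) and (6) can be checked with a small forward-chaining sketch (ours, not the paper's code): because succ is functional, succ(Z, Y) and succ(Z, X) force X = Y, so the recursive rule never derives anything beyond zero.

```python
# Sketch: naive forward chaining over the learned Fizz rules (5)-(6),
# illustrating why they cover only the constant zero.

def forward_chain_fizz(n_max):
    succ = {(i, i + 1) for i in range(n_max)}
    fizz = {0}  # rule (5): fizz(X) <- zero(X)
    changed = True
    while changed:
        changed = False
        # rule (6): fizz(X) <- succ(Z, Y), fizz(Y), succ(Z, X)
        for (z, y) in succ:
            if y in fizz:
                for (z2, x) in succ:
                    if z2 == z and x not in fizz:
                        fizz.add(x)
                        changed = True
    return fizz

# succ(Z, Y) and succ(Z, X) with functional succ force X = Y,
# so the recursive rule adds no new facts:
print(forward_chain_fizz(30))  # {0}
```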
Table 4: Recall of the learned rules on the ILP datasets. For GPT and Gemini, we use the latest reasoning models GPT-5 and Gemini 2.5 Pro, respectively. “Pre”, “GP”, “Rel”, “UE”, “AdjR”, “TC”, “GC”, “Con”, and “RT” denote Predecessor, Grandparent, Relatedness, Undirected Edge, Adjacent to Red, Two Children, Graph Coloring, Connectedness, and the average running time in seconds for γILP when it reaches its maximum accuracy, respectively. The best results are in bold when some models fail to achieve the top recall. The notation “–” indicates that the model achieves a recall between 0 and 1.
## Appendix B Rules Learned by LLMs from Relational Image Datasets
Figure 4: The MNIST sequence with labels 0, 5, 1, 5, 2, 5, 3, 5, 4, 5, 5, 5, …

When learning from relational image datasets, we replace each predicate with a unique random string and annotate each image with a random identifier. Then, we replace each constant in the classical ILP benchmarks with the annotation of an image whose label matches the constant’s value. Next, we use a fixed-format prompt to induce logic programs with LLMs. For example, in the Fizz task, we have the facts: zero(0, 0), succ(0, 1), succ(1, 2), succ(2, 3), succ(3, 4), succ(4, 5), succ(5, 6), fizz(0, 0), fizz(3, 3), fizz(6, 6). Note that atoms with unary predicates are rewritten by duplicating the constant to form a binary structure. We replace the zero predicate with “4WY”, the succ predicate with “7h0”, and the fizz predicate with “xoh”. We then randomly select MNIST images and assign each one a random annotation. The formatted prompt for the LLMs is: “If you have the following images and their annotations, you also have the fact set in r(a1, a2), where r indicates the relation between the image annotated with a1 and the image annotated with a2. All facts are: 4WY(KI5, fRB), 7hO(t01, 1kn), 7hO(Yjp, NRS), 7hO(ySZ, 4bI), 7hO(bL1, qfI), 7hO(4md, VY4), 7hO(qOg, IdR), xoh(4Qu, kUw), xoh(uNT, 3HN), xoh(99c, Fwv). Can you learn a first-order logic program to describe the “xoh” relation with existing relations and images?” We now present the rules learned by Gemini 2.5 Pro and GPT-o3 on the relational MNIST image datasets.
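A minimal sketch of this anonymization scheme (ours, not the paper's code; the helper names and three-character token format are illustrative) might look as follows:

```python
# Sketch of the anonymization scheme described above: predicates get
# consistent random strings, images get random annotations, and unary
# atoms are rewritten with a duplicated constant.
import random
import string

def random_token(rng, k=3):
    return "".join(rng.choices(string.ascii_letters + string.digits, k=k))

def anonymize(facts, rng):
    """facts: list of (predicate, constants) over labeled constants."""
    pred_map, const_map = {}, {}
    out = []
    for pred, args in facts:
        if len(args) == 1:              # duplicate unary constants -> binary form
            args = (args[0], args[0])
        p = pred_map.setdefault(pred, random_token(rng))
        cs = [const_map.setdefault(a, random_token(rng)) for a in args]
        out.append(f"{p}({', '.join(cs)})")
    return out, pred_map, const_map

rng = random.Random(0)
facts = [("zero", (0,)), ("succ", (0, 1)), ("succ", (1, 2)),
         ("fizz", (0,)), ("fizz", (3,))]
anonymized, pred_map, const_map = anonymize(facts, rng)
prompt = ("If you have the following images and their annotations, "
          "all facts are: " + ", ".join(anonymized) + ". Can you learn a "
          f"first-order logic program to describe the {pred_map['fizz']!r} relation?")
```

The same predicate always maps to the same token, so relational structure is preserved while the predicate semantics are hidden from the LLM.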
#### Even Task\.
The logic program P learned by Gemini 2.5 Pro for the Even task is:
even(A) ← succ(A,B), succ(B,C).

The rule learned by Gemini 2.5 Pro is not the expected one because its precision is not 1.
However, GPT\-o3 can learn the expected rules for the Even task\.
even(X) ← zero(X).
even(C) ← succ(A,B), succ(B,C), even(A).
#### Odd Task\.
When we use Gemini 2.5 Pro to learn the odd predicate in the Odd task, the learned rules are as expected:
odd(A) ← succ(B,A), zero(B).
odd(A) ← succ(B,A), succ(C,B), zero(C).
However, GPT\-o3 fails to learn any generalized rules to describe the Odd task, and it only outputs the facts:
odd(W8h). odd(Pay). odd(UhR). odd(8km). odd(7Oh).

Here, W8h, Pay, UhR, 8km, and 7Oh are the annotations of the images with labels 1, 3, 5, 7, and 9, respectively. Moreover, these facts already occur in the prompt.
When we indicate the predicate name for each relation annotation, GPT\-o3 can learn the expected rules for the Odd task:
odd(X) ← succ(Z,X), zero(Z).
odd(X) ← succ(Z,X), succ(Y,Z), odd(Y).
#### Predecessor Task\.
When we use Gemini 2\.5 Pro to learn the predecessor predicate in the Predecessor task, the learned rules are:
pre(A,B) ← digit(A, Val1), digit(B, Val2), Val1 is Val2 + 1.

Considering the generalization ability of LLMs, we believe this rule is correct, as it can generate all relevant examples despite lacking the formalized predicates observed during training.
For GPT\-o3, the learned rule is also correct:
pre(X,Y) ← succ(Y,X).
#### Lessthan Task\.
When we use Gemini 2.5 Pro to learn the lessthan predicate in the Lessthan task, the learned rule is:
lessthan(X,Y) ← is digit(X, Dx), is digit(Y, Dy), Dy is Dx + 1,

which is marked as incomplete.
For GPT\-o3, the model learns the following correct rules for the Lessthan task, demonstrating the generalization ability of LLMs:
same(A,B) ← A = B.
same(A,B) ← zero(A,B).
bigger(A,B) ← succ(A,B).
bigger(A,B) ← succ(A,C), bigger(C,B).
lessthan(X,Y) ← same(X,X1), same(Y,Y1), bigger(X1,Y1).
#### Fizz Task\.
We also use Gemini 2\.5 Pro to learn the Fizz predicate in the Fizz task\. The learned rules are incorrect, where “not” is negation:
fizz(A) ← not relation_participant(A).
relation_participant(A) ← succ(A,_).
relation_participant(A) ← succ(_,A).
relation_participant(A) ← zero(A).
In addition, GPT\-o3 learns an incorrect rule to describe the Fizz task:
fizz(A) ← succ(A,_), succ(_,A), zero(A),

where the placeholder _ indicates any constant.
#### Buzz Task\.
When using Gemini 2\.5 Pro to learn the Buzz predicate in the Buzz task, the resulting rule is incorrect:
Reg(A) ← not zero(A), not succ(A,_), not succ(_,A).

where _ is a placeholder for any image annotation.
However, the output by GPT\-o3 for the Buzz task is correct:
plus5(X,Y) ← succ(X,A1), succ(A1,A2), succ(A2,A3), succ(A3,A4), succ(A4,Y).
buzz(X) ← plus5(Y,X), zero(Y).
buzz(X) ← zero(X).
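A quick sanity check (ours, not the paper's code) confirms that GPT-o3's invented plus5 predicate makes these Buzz rules cover exactly the multiples of five on a finite succ chain:

```python
# Sketch: closure of the buzz rules above. plus5(X, Y) holds when Y = X + 5
# via a chain of five succ/2 steps, so buzz is seeded at zero and advanced
# in steps of five.

def buzz_closure(n_max):
    buzz = {0}      # buzz(X) <- zero(X)
    frontier = [0]
    while frontier:
        y = frontier.pop()
        x = y + 5   # buzz(X) <- plus5(Y, X), buzz(Y)
        if x <= n_max and x not in buzz:
            buzz.add(x)
            frontier.append(x)
    return buzz

print(sorted(buzz_closure(30)))  # [0, 5, 10, 15, 20, 25, 30]
```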
## Appendix C More γILP-Generated Rules from the Two-Pair Task
In this section, we present additional rules generated by γILP for the two-pair task in the Kandinsky patterns.
positive ← p(Y,R).  (9)
positive ← p1(X), p2(Y).  (10)

In these rules, the constants represented by the variables R, X, and Y are shown in Figure [5](https://arxiv.org/html/2604.07897#A3.F5). By querying the semantics of the predicate placeholders with LLMs, we obtain the following interpretations: p indicates “different color, shape in triangle”, p1 indicates “shape in triangle and color in yellow”, and p2 indicates “shape in triangle and color in red”. These rules capture only the pair with the same shape but different colors; the other required pair, with the same shape and color, is not covered. Hence, the precision of rule ([9](https://arxiv.org/html/2604.07897#A3.E9)) is 0.5 and that of rule ([10](https://arxiv.org/html/2604.07897#A3.E10)) is 0.75, and the recall of both rules is less than 1.
Figure 5: The image constants represented by the clusters R, X, and Y.

In addition, we also generate the following rule from the two-pair task:
positive ← p1(Y), p2(Z), p3(T).  (11)

For this rule, the constants represented by the clusters T, Y, and Z are shown in Figure [6](https://arxiv.org/html/2604.07897#A3.F6). The rule covers the case where the constants in one pair have the same color but different shapes (see the constants under the clusters T and Y) and the constants in the other pair have the same color and shape (see the image constants under the cluster Z). Its precision is 1, so no image is classified as a false positive; its recall, however, is less than 1, as it captures only one specific pattern combination of the two-pair task.
Figure 6: The constants represented by the clusters T, Y, and Z.
## Appendix D LLMs as Translators
We test the translation of the predicate placeholders’ semantics into natural language using LLMs, namely GPT-5, GPT-o3, and Gemini 2.5 Pro; the results are listed in Table [5](https://arxiv.org/html/2604.07897#A4.T5). We access these LLMs through their official user interfaces. The translations are identical for GPT-5, GPT-o3, and Gemini 2.5 Pro, which indicates that the semantics of the predicate placeholders learned by γILP are not sensitive to the choice of LLM.
Table 5: The semantics translation results from LLMs.
## Appendix E Explanations Obtained by LLMs and Other Learning Models on Kandinsky Patterns
This section evaluates predictions from reasoning\-capable LLMs \(Gemini 2\.5 Pro, GPT\-o3, GPT\-5\) alongside hybrid methods that pair clustering with RIPPER or C4\.5\.
When an instance contains no well-defined relationships, we also use RIPPER (Cohen, [1995](https://arxiv.org/html/2604.07897#bib.bib63)) and C4.5 (Quinlan, [1993](https://arxiv.org/html/2604.07897#bib.bib64)) with a ViT encoder to replace the first-order rule learning module in γILP. These methods classify instances based on the centroid indices of the image constants they contain. The accuracies are presented in Table [6](https://arxiv.org/html/2604.07897#A5.T6). The learned propositional rules are less interpretable than the rules learned by γILP.
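A minimal sketch of this baseline (ours, not the paper's implementation; the toy data and helper names are illustrative) snaps object embeddings to their nearest centroid, builds centroid-presence features, and picks the single best literal, mimicking a one-literal RIPPER rule such as positive ← #k:

```python
# Sketch of the clustering + propositional rule baseline: each image is a
# set of object embeddings; objects are assigned to their nearest centroid,
# and a single-literal rule is chosen by training accuracy.
import math

def nearest_centroid(vec, centroids):
    return min(range(len(centroids)),
               key=lambda j: math.dist(vec, centroids[j]))

def presence_features(objects, centroids):
    """Binary vector: does the image contain an object near centroid j?"""
    hit = {nearest_centroid(o, centroids) for o in objects}
    return [int(j in hit) for j in range(len(centroids))]

def best_single_literal(X, y):
    """Pick the centroid index whose presence best predicts the label."""
    def acc(j):
        return sum(int(x[j] == label) for x, label in zip(X, y)) / len(y)
    return max(range(len(X[0])), key=acc)

# Toy data: centroid 0 plays the role of the "red object" cluster;
# positive images contain one such object.
centroids = [(0.0, 0.0), (5.0, 5.0), (9.0, 0.0)]
images = [([(0.1, 0.2), (5.1, 4.9)], 1),
          ([(4.8, 5.2), (8.9, 0.1)], 0),
          ([(0.0, 0.1), (9.1, 0.2)], 1),
          ([(5.0, 5.0)], 0)]
X = [presence_features(objs, centroids) for objs, _ in images]
y = [label for _, label in images]
print(best_single_literal(X, y))  # 0, i.e. the rule "positive <- #0"
```

RIPPER and C4.5 learn richer conjunctions (including negated literals), but the feature construction is the same: the rule body refers only to centroid indices, which is why the resulting explanations are hard to interpret.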
#### Rules generated from Gemini 2.5 Pro.
The inductive explanation from Gemini 2\.5 Pro for the two\-pair task is: “The number of constants of each shape type is an even number \(i\.e\., 2 or 4 of each shape present\), and the image contains constants of at least two different colors\.” However, this explanation is incorrect with very low precision\. In contrast, for the one\-red and one\-triangle tasks, the explanations are correct and complete\. These results suggest that while the state\-of\-the\-art reasoning model Gemini 2\.5 Pro performs well on simpler reasoning tasks, it struggles with more complex tasks, such as two\-pair, which require analyzing the composition of multiple constants\. In such cases, the model fails to generate correct inductive explanations\.
#### Rules generated from GPT\-5\.
For GPT-5, the latest reasoning model from OpenAI, the explanation for the one-red task is: “The rule is simple — a panel is true if it contains any red shape.” For the one-triangle task, the learned rule is: “An image is labeled true if it contains at least one triangle; it’s labeled false if it contains no triangles.” For the two-pair task, the rule induced by GPT-5 is: “An image is true if it contains an even number of triangles (0, 2, 4, …). It’s false if the number of triangles is odd.” We conclude that GPT-5 can induce correct and complete rules in the one-red and one-triangle tasks. However, its ability to use this knowledge directly to classify the test images is not yet mature (see Table [2](https://arxiv.org/html/2604.07897#S5.T2)). In addition, for more complex tasks, such as two-pair, the rule induced by GPT-5 has low precision and low recall. Hence, GPT-5 cannot classify two-pair Kandinsky patterns with 100% accuracy.
#### Rules generated from GPT\-o3\.
For GPT\-o3, the model generates the following explanations for the two\-pair task: “Positive = all foreground shapes share exactly one colour\. Negative = two or more different foreground colours appear\.” However, this explanation has precision 0 and recall 0\. For the one\-red task, it outputs: “Positive examples satisfy the ‘all\-four\-shapes\-same\-size’ condition; negatives include any other case \(different sizes and/or a count different from 4\)\.” This rule also has low precision and recall\. For the one\-triangle task, the model explains: “An image is positive when it contains the same number of circles and squares; otherwise, it is negative\.” This explanation also has low precision and recall\.
#### Rules generated from RIPPER\.
Table 6: Classification accuracy on the Kandinsky tasks: two-pair, one-red, and one-triangle.

For the clustering method with RIPPER, the rule learned for the one-red task is:
positive ← #6.  (12)
positive ← #7.  (13)
positive ← #3.  (14)

where each body atom represents a cluster centroid index. The image constants represented by these centroids are shown in Figure [7](https://arxiv.org/html/2604.07897#A5.F7). We can infer that the color red, regardless of shape, is the key information for determining the classes of the one-red pattern images.
Figure 7: The constants represented by the cluster centroids #3, #6, and #7.

The rules learned for the one-triangle task are:
positive ← #7.  (15)
positive ← #6.  (16)

where the constants represented by the centroids are shown in Figure [8](https://arxiv.org/html/2604.07897#A5.F8). Hence, the triangle shape, regardless of color, is the key information for determining the classes of the one-triangle pattern images.
Figure 8: The constants represented by the cluster centroids #6 and #7.

In the two-pair task, RIPPER generates two rules:
positive ← not #8 ∧ not #6 ∧ not #0.  (17)
positive ← not #1 ∧ not #7 ∧ #8.  (18)

The image constants represented by the body atoms of rule ([17](https://arxiv.org/html/2604.07897#A5.E17)) are shown in Figure [9(a)](https://arxiv.org/html/2604.07897#A5.F9.sf1), and those of rule ([18](https://arxiv.org/html/2604.07897#A5.E18)) in Figure [9(b)](https://arxiv.org/html/2604.07897#A5.F9.sf2). However, from these propositional rules we can neither induce any candidate predicates nor explain the two-pair pattern.
(a) The constants represented by the cluster centroids #0, #6, and #8.
(b) The constants represented by the cluster centroids #1, #7, and #8.
Figure 9: The constants represented by the cluster centroids #0, #1, #6, #7, and #8.
#### Rules generated from C4\.5\.
We use the same clustering assignment to run C4.5 as for RIPPER. The rules for the one-red task are the same as rules ([12](https://arxiv.org/html/2604.07897#A5.E12)) to ([14](https://arxiv.org/html/2604.07897#A5.E14)), with the constants represented by the body atoms shown in Figure [7](https://arxiv.org/html/2604.07897#A5.F7). The rules for the one-triangle task are the same as rules ([15](https://arxiv.org/html/2604.07897#A5.E15)) and ([16](https://arxiv.org/html/2604.07897#A5.E16)), with the constants represented by the cluster centroids shown in Figure [8](https://arxiv.org/html/2604.07897#A5.F8). For the two-pair task, the learned rule is:
positive ← not #1 ∧ not #7 ∧ not #8 ∧ not #6 ∧ #2,

where the cluster centroids are shown in Figure [10](https://arxiv.org/html/2604.07897#A5.F10). Given this rule, we are unable to induce the knowledge required to classify two-pair patterns.
Figure 10: The constants represented by the cluster centroids #1, #2, #6, #7, and #8.
## Appendix F Ablation and Analysis
In this section, we test the sensitivity of our model on the Kandinsky patterns to various hyperparameters: the number of clusters, the learning rate of the rule learning network, the learning rate of the differentiable clustering method, the hyperparameter α in the differentiable clustering method of Eq. ([2](https://arxiv.org/html/2604.07897#S3.E2)), and the hyperparameter λ in Eq. ([4](https://arxiv.org/html/2604.07897#S4.E4)). In this setting, we increase the number of instances in both the training and testing datasets: since each Kandinsky pattern task contains 100 instances in total, we assign 80 to training and 20 to testing, each with balanced positive and negative instances. For the one-red and one-triangle tasks, the base values of the number of clusters, the rule learning network learning rate, the differentiable clustering learning rate, α, and λ are 10, 0.05, 0.5, 20, and 4, respectively. For the two-pair task, the base values are 10, 0.5, 0.1, 20, and 4, respectively. We compare accuracy across different values of a single hyperparameter while keeping all others fixed. Each experiment was run five times per setting, and we record the best result. The accuracies are presented in Figure [11](https://arxiv.org/html/2604.07897#A6.F11). The results show that the learning rate of the differentiable clustering method and α have a smaller impact on accuracy than the number of clusters, the learning rate of the rule learning network, and λ.
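Since Eq. (2) is not reproduced in this appendix, the following sketch (ours, under that stated assumption) uses a common differentiable-clustering form, a softmax over negative squared distances with sharpness α, to illustrate why large α drives soft assignments toward hard k-means-style assignments:

```python
# Hedged sketch: an assumed softmax-based soft cluster assignment with
# sharpness alpha, illustrating the role such a hyperparameter typically
# plays in differentiable clustering (Eq. (2) itself is not shown here).
import math

def soft_assign(x, centroids, alpha):
    logits = [-alpha * sum((xi - ci) ** 2 for xi, ci in zip(x, c))
              for c in centroids]
    m = max(logits)                      # numerically stable softmax
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

centroids = [(0.0,), (1.0,)]
for alpha in (1.0, 20.0):
    w = soft_assign((0.3,), centroids, alpha)
    print(alpha, [round(v, 3) for v in w])
# As alpha grows, the weight concentrates on the nearest centroid (0.0,).
```

Because the assignment stays a smooth function of both the point and the centroids, gradients can flow into the centroids during training, which is what the λ > 0 setting below exploits.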
Moreover, adjusting the centroids during training (λ > 0) leads to higher accuracy than training with fixed centroids (λ = 0) on both the one-red and two-pair tasks.
(a) Different DCM LR.
(b) Different cluster counts.
(c) Different rule learning network LR.
(d) Different α values.
(e) Different λ values.
Figure 11: The accuracies under different hyperparameters. DCM and LR indicate the differentiable clustering method and learning rate, respectively.

Furthermore, we analyze the stability of γILP’s accuracy on the Kandinsky patterns. We collect accuracies from 10 runs with different seeds. For the one-red and one-triangle tasks, the cluster number is 8, the rule learning rate is 0.05, the differentiable clustering learning rate is 0.5, α is 20, and λ is 4. For the two-pair task, the cluster number is 10, the rule learning rate is 0.5, the differentiable clustering learning rate is 0.1, α is 5, and λ is 4. The stability results are presented in Figure [12](https://arxiv.org/html/2604.07897#A6.F12).
Figure 12: The stability of γILP on the Kandinsky patterns.