Data-driven Machine Learning Cannot Reach Symbolic-level Logical Reasoning -- The Limit of the Scaling Law

arXiv cs.AI 06/26/26, 04:00 AM Papers
Summary
The paper argues that data-driven machine learning systems, including GPT-5, cannot achieve symbolic-level logical reasoning through scaling alone, due to inherent limitations in distinguishing logical structures from statistical regularities.
arXiv:2606.26454v1 Announce Type: new Abstract: Sphere neural networks have achieved symbolic level syllogistic reasoning without training data, raising the question of where the limit of the scaling law for logical reasoning lies, i.e., whether data-driven machine learning systems can achieve the same level by increasing training data and training time. We show two methodological limitations that prevent supervised deep learning from reaching the symbolic-level syllogistic reasoning: (1) training data can not distinguish all 24 types of valid syllogistic reasoning; (2) end-to-end mapping from premises to conclusion introduces contradictory training targets between neural components for pattern recognition and logical reasoning. Beside theoretical analysis, we experimentally illustrate that Euler Net cannot achieve rigorous syllogistic reasoning. We further challenge the most recent ChatGPTs (GPT-5-nano and GPT-5) to determine the satisfiability of syllogistic statements in four surface forms (patterns): words, double words, simple symbols, and long random symbols, showing that surface forms affect the reasoning performance and that ChatGPT GPT-5 may reach 100% accuracy but still provide incorrect explanations. As empirical training processes are stopped after achieving 100% accuracy, we conclude that supervised machine learning systems will not attain the rigour of symbolic logical reasoning.
Cached at: 06/26/26, 05:12 AM
# Data-driven Machine Learning Cannot Reach Symbolic-level Logical Reasoning – The Limit of the Scaling Law
Source: [https://arxiv.org/html/2606.26454](https://arxiv.org/html/2606.26454)
[1] The Alan Turing Institute, 96 Euston Road, London, NW1 2DB, UK

[2] Department of Computer Science and Technology, University of Cambridge, 15 JJ Thomson Ave., Cambridge, CB3 0FD, UK

###### Abstract

By promoting vectors to spheres and adopting explicit model construction as the reasoning mechanism, Sphere Neural Networks \(SphNNs\) achieve symbolic\-level syllogistic reasoning without relying on training data\. This raises a fundamental question: can data\-driven machine learning systems attain the same level of rigour merely by increasing the amount of training data and training time? More specifically, can the scaling law guarantee the emergence of rigorous syllogistic reasoning—the foundation of logical reasoning and a microcosm of human rationality? Supervised deep learning systems may learn syllogistic reasoning from either symbolic or image\-based inputs\. In the symbolic setting, however, no theoretical guarantee exists that a trained model will generalise its reasoning ability to out\-of\-distribution symbols\. This limitation is consistent with recent evaluations of large language models on syllogistic reasoning tasks\. Alternatively, several recent approaches have treated syllogistic reasoning as an image recognition problem and reported accuracies approaching 100%\. We therefore focus on whether this less\-explored paradigm can eventually achieve symbolic\-level reasoning by further scaling training data and computation\. Regardless of the underlying neural architecture, we identify two methodological obstacles that prevent image\-input supervised learning systems from reaching symbolic\-level syllogistic reasoning\. First, training data alone cannot distinguish all 24 valid forms of Aristotelian syllogistic reasoning, as correctness is determined by logical structure rather than statistical regularities\. Second, the end\-to\-end mapping from premises to conclusions introduces a conflict of objectives: it is inevitable that the pattern\-recognition component recognises the whole from its parts, whereas it is forbidden for the reasoning component to do so\. These objectives are not necessarily aligned\. 
To further investigate this issue, we challenge the latest generation of ChatGPT \(GPT\-5\) to determine the satisfiability of syllogistic statements and to justify its decisions\. While GPT\-5 may achieve 100% decision accuracy on a benchmark, it can still produce incorrect or inconsistent explanations\. Since empirical training procedures typically terminate once perfect accuracy is reached, no further optimisation pressure exists to correct such reasoning defects\. We therefore argue that supervised machine learning systems, whether trained on symbolic or image\-based inputs, cannot be guaranteed to attain the rigour of symbolic syllogistic reasoning through scaling alone\. Consequently, achieving more sophisticated forms of logical reasoning remains an even greater challenge\.

## 1 Introduction

Neural networks, particularly LLMs, have been strikingly successful in applications such as human-like communication [chatgpt_nature2023], playing games [alphaGo2017,alphaGo2020], predicting gene structures [AlphaFold3], and solving mathematical tasks [Davies21,alphaproof2024]. By increasing the amount of training data and training time [scalinglaw2020,scalinglaw24] and breaking complex tasks into multiple steps [creswell2022selectioninference,wei2023COT,lightman2023lets], data-driven machine learning systems may steadily enhance their reasoning capabilities. However, their reasoning abilities remain limited, even for simple logical reasoning [chatgpt_nature2023] such as syllogistic reasoning [Eisape2024,syllogism24,kim2025], where the reasoning process is primitive and cannot be broken into multiple steps. Recently, by promoting vector embeddings to spheres and adopting the method of reasoning through explicit model construction and inspection [LairdByrne91,knauf03,GoodwinLaird05,Knauff09], Sphere Neural Networks (SphNNs) have stepped outside the paradigm of data-driven machine learning and achieved the rigour of symbolic syllogistic reasoning [djl2024sphere,djl2025]. This is not surprising, as RNNs are Turing complete [turingcomp23,transformerREP24] and SphNN is a special RNN. However, it raises the question of whether data-driven machine learning systems can reach (or be infinitely close to) the same performance by increasing the amount of training data and training time.

This paper is structured as follows. Section [2](https://arxiv.org/html/2606.26454#S2) introduces the criterion of symbolic-level syllogistic reasoning. Section [3](https://arxiv.org/html/2606.26454#S3) surveys supervised neural syllogistic reasoning, recent assessments of LLMs in syllogistic reasoning, and neural logical proving, and ends with our research question. Section [4](https://arxiv.org/html/2606.26454#S4) presents two limitations that prevent image-input supervised learning systems from reaching symbolic-level syllogistic reasoning: (1) training data cannot distinguish every valid type of syllogistic reasoning; (2) end-to-end mapping introduces contradictory targets between the neural components of pattern recognition and logical reasoning. Using Euler Net as a representative image-input supervised neural network for syllogistic reasoning, Section [6](https://arxiv.org/html/2606.26454#S6) shows that composition tables cannot distinguish syllogistic reasonings with the same premises but different conclusions, and which kinds of unintended inputs an end-to-end mapping process will generate. With the recent GPT-5-nano and GPT-5, we experimentally demonstrate their unstable performance in syllogistic reasoning across four surface forms: words, double words, simple symbols, and random symbols. Experiments with Euler Net and the two GPT versions consistently show that they follow the Scaling Law in improving syllogistic reasoning performance, but cannot achieve the symbolic level. Section [7](https://arxiv.org/html/2606.26454#S7) concludes the work and lists several research directions.

## 2 Syllogistic Reasoning: The Foundation of Logical Reasoning

The central notion of logical reasoning, from the origin of logic research until now, is the notion of “following from” or, more formally, “logical consequence from the premises”: what can we know from the premises? Syllogistic reasoning, developed by Aristotle over 2,000 years ago, marks the start of the history of logical reasoning [historyLogic17]. From syllogistic reasoning, logicians developed propositional logic in the Medieval period and later first-order logic.

Aristotelian syllogistic reasoning is a deduction consisting of two premises and one conclusion. A syllogistic deduction contains only three terms ($X$, $Y$, and $Z$) and four possible relations: (1) universal affirmative: all $X$ are $Y$; (2) particular affirmative: some $X$ are $Y$; (3) universal negative: no $X$ are $Y$; (4) particular negative: some $X$ are not $Y$. For example, if the two premises are *some lawyers are presidents* and *no presidents are scientists*, the conclusion and its negation will be *some lawyers are not scientists* and *all lawyers are scientists*, as shown in Figure [1](https://arxiv.org/html/2606.26454#S2.F1)(e).

The four syllogistic relations can be interpreted through set relations in Euler diagrams, shown in Figure [1](https://arxiv.org/html/2606.26454#S2.F1)(a-d). For example, *some X are Y* can be interpreted as the relation “set X ($\mathcal{O}_X$) intersects with set Y ($\mathcal{O}_Y$)”, which corresponds to three possible diagrammatic relations: (1) $\mathcal{O}_X$ partially overlaps with $\mathcal{O}_Y$, (2) $\mathcal{O}_X$ contains $\mathcal{O}_Y$, (3) $\mathcal{O}_Y$ contains $\mathcal{O}_X$. We can merge the three possible relations into one relation: $\mathcal{O}_X$ does not disconnect from $\mathcal{O}_Y$, $\neg\mathbf{D}(\mathcal{O}_X,\mathcal{O}_Y)$, as shown in Figure [1](https://arxiv.org/html/2606.26454#S2.F1)(c). Formally, we define $\mathcal{O}_X$ disconnecting from $\mathcal{O}_Y$ to mean that there is no $\mathcal{O}_Z$ that is part of both $\mathcal{O}_X$ and $\mathcal{O}_Y$.

$$\mathbf{D}(\mathcal{O}_X,\mathcal{O}_Y)\triangleq\nexists\,\mathcal{O}_Z\;\mathbf{P}(\mathcal{O}_Z,\mathcal{O}_X)\land\mathbf{P}(\mathcal{O}_Z,\mathcal{O}_Y)$$

We can define syllogistic relations through the primitive diagrammatic relation $\mathbf{P}$ [Smith96] and establish a one-to-one mapping ($\Leftrightarrow$) between syllogistic and diagrammatic relations as follows.

- “all $X$ are $Y$” $\Leftrightarrow$ “Circle $\mathcal{O}_X$ is part of Circle $\mathcal{O}_Y$”, $\mathbf{P}(\mathcal{O}_X,\mathcal{O}_Y)$;
- “some $X$ are $Y$” $\Leftrightarrow$ “Circle $\mathcal{O}_X$ does not disconnect from Circle $\mathcal{O}_Y$”, $\neg\mathbf{D}(\mathcal{O}_X,\mathcal{O}_Y)$;
- “no $X$ are $Y$” $\Leftrightarrow$ “Circle $\mathcal{O}_X$ disconnects from Circle $\mathcal{O}_Y$”, $\mathbf{D}(\mathcal{O}_X,\mathcal{O}_Y)$;
- “some $X$ are not $Y$” $\Leftrightarrow$ “Circle $\mathcal{O}_X$ is not part of Circle $\mathcal{O}_Y$”, $\neg\mathbf{P}(\mathcal{O}_X,\mathcal{O}_Y)$.
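
The mapping above can be illustrated with a minimal Python sketch for circles in the plane. The `Circle` type, the function names, and the example coordinates are illustrative assumptions, not part of the original work. The sketch also assumes that a shared part must be a circle of positive radius, so two tangent circles count as disconnected.

```python
import math
from typing import NamedTuple


class Circle(NamedTuple):
    x: float  # centre, x coordinate
    y: float  # centre, y coordinate
    r: float  # radius


def part_of(a: Circle, b: Circle) -> bool:
    """P(a, b): circle a lies entirely inside circle b."""
    return math.dist((a.x, a.y), (b.x, b.y)) + a.r <= b.r


def disconnected(a: Circle, b: Circle) -> bool:
    """D(a, b): no circle of positive radius is part of both a and b."""
    return math.dist((a.x, a.y), (b.x, b.y)) >= a.r + b.r


# The four syllogistic statements expressed through P and D.
STATEMENTS = {
    "all": lambda x, y: part_of(x, y),
    "some": lambda x, y: not disconnected(x, y),
    "no": lambda x, y: disconnected(x, y),
    "some_not": lambda x, y: not part_of(x, y),
}

# One hand-made configuration for the lawyers/presidents/scientists example.
lawyers = Circle(0.0, 0.0, 1.0)
presidents = Circle(1.2, 0.0, 1.0)
scientists = Circle(3.5, 0.0, 1.0)

premises_hold = (STATEMENTS["some"](lawyers, presidents)
                 and STATEMENTS["no"](presidents, scientists))
conclusion_holds = STATEMENTS["some_not"](lawyers, scientists)
print(premises_hold, conclusion_holds)
```

A single configuration like this only witnesses satisfiability; it does not establish validity, which requires that no configuration satisfies the premises together with the negated conclusion.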

![Refer to caption](https://arxiv.org/html/2606.26454v1/figures/bsyllogism_intro.png)

Figure 1: (a-d) Four syllogistic relations and their spatial relations; (e) from the two premises, the logical conclusion is *some lawyers are not scientists*, and its negation is *all lawyers are scientists*; (f) spatial statements of the syllogistic statements; (g) no sphere configuration satisfies the premises and the conclusion $\mathbf{P}$(lawyers, scientists); there is a sphere configuration that satisfies the premises and the conclusion $\neg\mathbf{P}$(lawyers, scientists).

A syllogistic reasoning can be *satisfiable*, *unsatisfiable*, *valid*, or *invalid*. Being *satisfiable* means there is a case in which both the premises and the conclusion are true. Being *valid* means the conclusion is true in every case in which its premises are true [jeffrey81]. For a *valid* reasoning, the negation of its conclusion is *unsatisfiable*; for an *invalid* reasoning, the negation of its conclusion is *satisfiable*. Diagrammatically, a syllogistic reasoning is *satisfiable* if and only if we can construct an Euler diagram, e.g., three circles satisfying the diagrammatic relations of the premises and the conclusion; otherwise, the reasoning is *unsatisfiable*. In Figure [1](https://arxiv.org/html/2606.26454#S2.F1)(g), we successfully construct an Euler diagram of the premises and the conclusion *some lawyers are not scientists*, so this reasoning is *satisfiable*. However, we cannot construct an Euler diagram of the premises and the conclusion *all lawyers are scientists*, so this conclusion is *unsatisfiable* and, therefore, its negation is *valid*.

If we allow the two terms in each premise to change positions and fix the order of terms in the conclusion, there are 256 different forms of Aristotelian syllogistic reasoning, among which 24 types (listed in Table [3](https://arxiv.org/html/2606.26454#A1.T3) in the Appendix) are *valid* [laird2012]. A reasoning network reaches the rigour of syllogistic reasoning if it can determine with certainty every *valid* syllogistic reasoning and construct counter-examples for *invalid* ones. This criterion also applies to neural networks reasoning with out-of-distribution data (unintended inputs).
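
The counting above can be checked by brute force. The following Python sketch is an illustration, not the method of the paper. It assumes that every term denotes a non-empty set, and it represents a model by which of the seven Venn regions of the three terms are inhabited. The names `S`, `M`, `P`, `FIGURES`, and `is_valid` are illustrative choices. A form is counted as valid when no model satisfies the premises while falsifying the conclusion.

```python
from itertools import combinations, product

TERMS = ("S", "M", "P")  # subject, middle term, predicate

# The seven Venn regions: each region is the set of terms that contain it.
REGIONS = [frozenset(c) for n in (1, 2, 3) for c in combinations(TERMS, n)]

# A model is a set of inhabited regions in which every term is non-empty.
MODELS = [
    model
    for n in range(1, len(REGIONS) + 1)
    for model in combinations(REGIONS, n)
    if all(any(t in region for region in model) for t in TERMS)
]


def holds(mood, a, b, model):
    """Truth of a statement about terms a and b in a model."""
    both = any(a in r and b in r for r in model)            # something is a and b
    a_without_b = any(a in r and b not in r for r in model)  # something is a, not b
    return {"A": not a_without_b,  # all a are b
            "I": both,             # some a are b
            "E": not both,         # no a are b
            "O": a_without_b}[mood]  # some a are not b


# Term order of the two premises in the four figures; the conclusion is S-P.
FIGURES = {
    1: (("M", "P"), ("S", "M")),
    2: (("P", "M"), ("S", "M")),
    3: (("M", "P"), ("M", "S")),
    4: (("P", "M"), ("M", "S")),
}


def is_valid(figure, m1, m2, mc):
    """Valid iff the conclusion holds in every model of the premises."""
    (a1, b1), (a2, b2) = FIGURES[figure]
    return all(
        holds(mc, "S", "P", model)
        for model in MODELS
        if holds(m1, a1, b1, model) and holds(m2, a2, b2, model)
    )


forms = [(f, m1, m2, mc) for f in FIGURES
         for m1, m2, mc in product("AEIO", repeat=3)]
valid = [form for form in forms if is_valid(*form)]
print(len(forms), len(valid))
```

In this encoding, the lawyers example corresponds to `is_valid(1, "E", "I", "O")`, while the same premises with the conclusion *all lawyers are scientists* correspond to `is_valid(1, "E", "I", "A")`.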

## 3 Research Questions and State of the Art

As a basic logical deduction, syllogistic reasoning is straightforward for symbolic methods\[VukmirovicBCS19,BentkampBTV21\]\. However, developing neural syllogistic models is extremely challenging, to the point that it was regarded as utopian a decade ago\[laird2012\]\. Considering

1. the scaling law [scalinglaw2020,scalinglaw24];
2. the huge training costs (in terms of data, GPUs, and training time) of LLMs;
3. the Turing Completeness of recurrent neural nets [SIEGELMANN1995,turingcomp23];
4. Sphere Neural Networks [djl2024sphere,djl2025] that achieve symbolic-level syllogistic reasoning without using training data,

our research question can be stated as follows: Can data\-driven neural networks reach or be infinitely close to this level if the amount of training data increases to infinity? If the answer is negative, it follows that supervised neural networks cannot achieve the rigour of symbolic\-level logical reasoning, for syllogistic reasoning underpins more complex forms of logical reasoning\.

![Refer to caption](https://arxiv.org/html/2606.26454v1/figures/EN1.png)

Figure 2: An overview of the Euler Net for syllogistic reasoning [WangJL18,WangJL20]. Inputs are two simple images, each consisting of two circles; the output is a vector representing all possible relations between the subject and the predicate.

Supervised deep learning systems may learn syllogistic reasoning from either symbolic or image-based inputs. In the symbolic setting, the large family of data-driven neural networks known as Large Language Models (LLMs) can be applied to syllogistic reasoning [MaTengyu2025,goedelprover2025,MaTengyu2025ICLR], e.g., Goedel-Prover [goedelprover2025] and the Self-play LLM Theorem Provers [MaTengyu2025]. However, in these systems, the correctness of formal statements is not determined by LLMs, but by humans or symbolic provers such as Isabelle or Lean. Moreover, no theoretical guarantee exists that a trained model will generalise its reasoning ability to out-of-distribution symbols. This limitation is consistent with recent evaluations of large language models on syllogistic reasoning tasks.
Several studies have explored the syllogistic reasoning (as single-step reasoning) performance of LLMs. [Eisape2024] evaluated PaLM 2 family LLMs [palm22023] and Llama 2 family LLMs [llama22023], showing that PaLM 2-Small achieved the best accuracy, about 75%, better than PaLM 2-Large, which does not strictly follow the Scaling Law. [syllogism24] evaluated PaLM 2 LLMs and GPT-3.5 [openAI23], concluding that LLMs may achieve above-chance performance in familiar situations but exhibit numerous imperfections in abstract reasoning, including syllogism. [syllobio2025] examined Mistral LLMs [mistral7b,mistral23], Gemma LLMs [gemma2024], Llama-3 LLMs [llama3], and BioMistral LLMs [biomistral2024], concluding that zero-shot LLMs achieved an average accuracy between 70% on *generalised modus ponens* and 23% on *disjunctive syllogism*, and that both zero- and few-shot LLMs are sensitive to surface-level lexical variations. Thus, they are far from achieving the reliability required for high-stakes biomedical applications, let alone attaining the rigour of symbolic-level reasoning. [djl2025] evaluated GPT-3.5-turbo and GPT-4 in determining the validity of all types of classic syllogistic reasoning in three lexical forms: (1) meaningful words, (2) simple symbols, and (3) long random symbols, showing that ChatGPT (GPT-3.5-turbo) reached its best performance (correct decision and explanation) of $46.9\%$ using statements with simple symbols, and ChatGPT (GPT-4o) reached its best performance of $82.4\%$ with long random symbols. Chain-of-Thought (CoT) [wei2023COT,MaTengyu2025ICLR] is a strategy to improve the reasoning performance of neural networks by breaking a task into several intermediate steps. This, however, does not affect the performance of single-step reasoning.

![Refer to caption](https://arxiv.org/html/2606.26454v1/figures/syllogism_ed.jpg)

Figure 3: Training data for Euler Net is generated by this combination table. Each column and each row are images representing premises. The intersection cell lists all possible conclusions. Euler Net learns syllogistic reasoning by remembering the combinations of Euler diagrams.

Alternatively, a less popular supervised learning approach, namely Euler Net [WangJL18], used image inputs for syllogistic reasoning and achieved almost 100% accuracy, as illustrated in Figure [2](https://arxiv.org/html/2606.26454#S3.F2). Euler Net is a supervised convolutional deep learning network that solves syllogistic reasoning by using Euler diagrams [WangJL18,WangJL20]. The inputs of Euler Net are two Euler diagrams, each representing one premise of a syllogism; the output is a vector representing the syllogistic conclusion(s) following the combination table, as shown in Figure [3](https://arxiv.org/html/2606.26454#S3.F3). For example, if the two premises are (1) “all blue are green” and (2) “all green are red”, the conclusion will be “all blue are red”, that is, the red circle contains the blue circle, represented as $[1,0,0,0]$ (each element represents a syllogistic relation, as illustrated in Figure [2](https://arxiv.org/html/2606.26454#S3.F2)). With 80,000 training patterns and 8,000 validation patterns, Euler Net reached 99.8% accuracy on 8,000 test patterns, leaving a $0.2\%$ gap to the determinacy of neural syllogistic deduction. Can Euler Net attain symbolic-level syllogistic reasoning by continuously increasing the amount of training data and extending the training time indefinitely?

## 4 Image-Input Supervised Learning Cannot Achieve Symbolic-Level Syllogistic Reasoning

In this section, we disclose two methodological deficits that prevent Euler Net from achieving the symbolic level of syllogistic reasoning\.

![Refer to caption](https://arxiv.org/html/2606.26454v1/figures/euler_diagram.jpg)

Figure 4: Euler diagrams representing 4 possible relationships between non-empty sets $W$ and $V$.

![Refer to caption](https://arxiv.org/html/2606.26454v1/figures/syl_table.png)

Figure 5: The combination table establishes associations between inputs (premises) and output (conclusion). Premises of *Some … are (not) …* occupy three columns or rows. The cell with the green boundary hosts 5 valid types of syllogistic reasoning.

### 4.1 Training data cannot distinguish each valid syllogism

Euler Net interprets syllogistic relations as four basic set relations in the form of Euler diagrams [hammer98]: (1) $W$ is contained by $V$ ($W\subset V$), (2) $W$ contains $V$ ($V\subset W$), (3) $W$ partially overlaps with $V$ ($W\cap V\neq\emptyset$), and (4) $W$ is disjoint from $V$ ($W\cap V=\emptyset$), as shown in Figure [4](https://arxiv.org/html/2606.26454#S4.F4). Consequently, the syllogistic relation *some $W$ are $V$* has three possible set-theoretic relations: $W\subset V$, $V\subset W$, $W\cap V\neq\emptyset$; and *some $W$ are not $V$* has another three possible set-theoretic relations: $W\cap V\neq\emptyset$, $V\subset W$, $W\cap V=\emptyset$. As a result, the combination table cannot distinguish each valid type of syllogistic reasoning, as shown in Figure [5](https://arxiv.org/html/2606.26454#S4.F5): the two syllogistic relations *some … are …* and *some … are not …* each occupy three rows, columns, and cells. Nine table cells contain various syllogistic conclusions. Although Euler Net demonstrates close to 100% accuracy on the benchmark datasets, its performance in determining the correctness of each valid syllogistic reasoning ranges from 50% to 100%, as listed in Table [1](https://arxiv.org/html/2606.26454#S4.T1). This deficit is independent of the neural architecture.

Table 1: Performances of SupEN (after 19 loops of improvements) for each valid type of syllogistic reasoning.

| Valid Type | Accuracy | Valid Type | Accuracy | Valid Type | Accuracy |
| --- | --- | --- | --- | --- | --- |
| BARBARA | 100% | BARBARI | 50% | BAROCO | 66.7% |
| BAMALIP | 50% | BOCARDO | 75% | CALEMES | 100% |
| CAMESTROS | 50% | CELARENT | 100% | CESARO | 50% |
| CALEMO | 50% | CESARE | 100% | CELARONT | 50% |
| DARAPTI | 100% | DARII | 75% | DISAMIS | 75% |
| FESAPO | 100% | DATISI | 75% | DIMATIS | 75% |
| FELAPTON | 100% | FERIO | 83.3% | FERISON | 83.3% |
| CAMESTRES | 100% | FRESISON | 83.3% | FESTINO | 83.3% |
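
The ambiguity described in this subsection can be made concrete with a small Python sketch. The dictionary below restates the correspondence between statements and Euler-diagram relations given in the text; the relation labels and the function name are illustrative assumptions. The sketch lists, for each of the four diagram types, the statements that the diagram depicts.

```python
# Euler-diagram relations that can depict each syllogistic statement,
# following the correspondence stated in Section 4.1.
POSSIBLE_DIAGRAMS = {
    "all W are V": {"W inside V"},
    "some W are V": {"W inside V", "W contains V", "W overlaps V"},
    "no W are V": {"W disjoint from V"},
    "some W are not V": {"W overlaps V", "W contains V", "W disjoint from V"},
}

DIAGRAMS = ("W inside V", "W contains V", "W overlaps V", "W disjoint from V")


def statements_true_of(diagram):
    """All syllogistic statements that a given diagram can depict."""
    return sorted(s for s, ds in POSSIBLE_DIAGRAMS.items() if diagram in ds)


for diagram in DIAGRAMS:
    print(diagram, "->", statements_true_of(diagram))
```

Under this correspondence every diagram depicts two statements, so a pair of premise images does not determine a unique syllogistic form. For instance, a diagram of “W inside V” serves both *all W are V* and *some W are V*.
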

### 4.2 End-to-end learning introduces contradictory training targets

The architecture of image-input supervised neural networks is an end-to-end pipeline from a pattern recognition component to a reasoning component. The pattern recognition component recognises objects in input images. The reasoning component integrates the objects recognised in the two input images into one model and predicts the relation between target objects from the model. A well-trained deep-learning pattern recognition system can recognise an object from its parts, which is a desired feature in Computer Vision [MAE2022]. For example, a Siamese architecture was used to recover frames in video recognition [siammae2023]; it can recover the whole image of the next frame, given the current image and only 5% of the image of the next frame, as illustrated in Figure [6](https://arxiv.org/html/2606.26454#S4.F6). However, logical deduction identifies information that is implicit in the premises [Simon19]; thus, injecting new objects into the premises is not allowed. This is the second deficit: an end-to-end pipeline that maps the premises to the conclusion introduces contradictory training targets between the neural components of pattern recognition and logical reasoning. The pattern recognition component may inject new objects that do not exist in the input images, and the reasoning component can neither stop nor notice this. For example, the Siamese networks (pattern recognition components) of Euler Net may inject red and blue circles into the input images, causing the reasoning component to output $[0.0051,0.9906,0.0153,0.0051]$, which means “blue circle *contains* red circle”, while each input image has only a single *green* circle, as shown in Figure [7](https://arxiv.org/html/2606.26454#S4.F7).

![Refer to caption](https://arxiv.org/html/2606.26454v1/figures/pw_image.png)

Figure 6: (a) The MAE architecture can reconstruct the whole image from its part, even when the part is only 25% of the original image; (b) The two Encoders are Siamese neural networks, sharing the same parameters. Two frames are similar snapshots with a very small temporal interval in a video. This Siamese neural architecture can reconstruct the whole of the second frame, even when 95% of it is covered. The pictures are partially copied from [MAE2022,siammae2023].

![Refer to caption](https://arxiv.org/html/2606.26454v1/figures/bluered.png)

Figure 7: Euler Net may inject blue and red circles into the inputs and predict “blue *contains* red”.

## 5 Self-Attention is a Learned Combination Table

![Refer to caption](https://arxiv.org/html/2606.26454v1/figures/qkv.jpg)

Figure 8: (a) The self-attention mechanism of transformers. $Q$, $K$, and $V$ structure a combination table of words in a sentence; (b) The self-attention values among words in the sentence “Your journey starts with one step” [build-llms24].

The self-attention mechanism is the heart of Transformers [Vaswani17] and LLMs [build-llms24]. It learns a combination table (a relation matrix) among the words in a sentence. Each entry in the matrix quantifies the probability that the row word attends to, or is associated with, the column word. The softmax operation converts attention scores into a probability distribution whose values sum to one, as illustrated in Figure [8](https://arxiv.org/html/2606.26454#S5.F8)(a,b). By learning statistical regularities from large-scale text corpora, LLMs generate text through next-token prediction, selecting words or tokens according to their estimated probabilities. For example, given *all A are B. all B are C. Therefore, \_\_ A are C*, a well-trained LLM will complete it as *all A are C*. This is a well-known type (BARBARA) of syllogistic reasoning. Another valid type (BARBARI) has the same premises but a weaker conclusion: *some A are C*. (All valid types of syllogistic reasoning and their names are listed in Appendix [A](https://arxiv.org/html/2606.26454#A1).) Because the softmax operation constrains all candidate words to share a total probability of one, the self-attention mechanism alone cannot assign a probability of 100% to both *all* and *some* in that context, and therefore will not reach the symbolic level of syllogistic reasoning.
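
The normalisation constraint used in this argument can be illustrated numerically. The sketch below is a plain-Python softmax applied to hypothetical scores for three candidate completions; the scores and the candidate list are made up for illustration and are not taken from any model.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to one."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for the blank in "Therefore, __ A are C".
candidates = ["all", "some", "no"]
scores = [4.0, 4.0, -2.0]  # "all" and "some" are scored as equally good
probs = softmax(scores)

for word, p in zip(candidates, probs):
    print(f"{word}: {p:.4f}")
print("sum:", sum(probs))
```

However the scores are chosen, the probabilities of *all* and *some* add up to at most one, so when both completions are treated as equally correct neither can exceed one half.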

![Refer to caption](https://arxiv.org/html/2606.26454v1/figures/exEN.jpg)

Figure 9: The architecture of Super Euler Net.

## 6 Experiments

In addition to the theoretical analysis above, we conducted a series of experiments to examine whether increasing the training data would enable Euler Net\[WangJL18,WangJL20\]to improve its performance to levels arbitrarily close to the symbolic level\.

In the setting of Euler Net, all image inputs are circles; therefore, the correct outputs of syllogistic reasoning can be computed. We extend Euler Net into Super Euler Net (SupEN), which can automatically identify incorrect outputs of Euler Net and create new training data, as illustrated in Figure [9](https://arxiv.org/html/2606.26454#S5.F9): SupEN randomly generates images and checks whether the output of Euler Net is correct (that is, whether the binary cross-entropy loss between the network output and the correct output is less than a threshold). If not, a new input-output pair for training is created. The main procedure is outlined in Algorithm [1](https://arxiv.org/html/2606.26454#algorithm1).
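
The search procedure described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the input is reduced to a pair of circles rather than two images, `relation_vector` is a toy ground truth whose element order is inferred from the examples in the text, and `toy_network` is a hypothetical stand-in for a trained Euler Net.

```python
import math
import random
import time


def relation_vector(w, v):
    """Toy ground truth for circles (cx, cy, r), assumed order:
    [w inside v, w contains v, partial overlap, disjoint]."""
    d = math.hypot(w[0] - v[0], w[1] - v[1])
    if d + w[2] <= v[2]:
        return [1, 0, 0, 0]
    if d + v[2] <= w[2]:
        return [0, 1, 0, 0]
    if d >= w[2] + v[2]:
        return [0, 0, 0, 1]
    return [0, 0, 1, 0]


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between a predicted and a target vector."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(target)


def collect_unintended(network, generate_input, correct_output,
                       max_size, max_time, threshold):
    """Keep randomly generated inputs whose loss exceeds the threshold."""
    new_data = []
    start = time.monotonic()
    while len(new_data) < max_size and time.monotonic() - start < max_time:
        x = generate_input()
        y = correct_output(x)
        if bce(network(x), y) > threshold:
            new_data.append((x, y))
    return new_data


# Hypothetical stand-ins, for illustration only.
rng = random.Random(0)


def random_circle():
    r = rng.uniform(0.1, 0.5)  # radius of at least 0.1
    # Centre chosen so that the circle lies fully inside the unit image.
    return (rng.uniform(r, 1 - r), rng.uniform(r, 1 - r), r)


def random_pair():
    return (random_circle(), random_circle())


def toy_network(_pair):
    return [0.01, 0.01, 0.97, 0.01]  # always predicts "partial overlap"


new_data = collect_unintended(toy_network, random_pair,
                              lambda pair: relation_vector(*pair),
                              max_size=20, max_time=5.0, threshold=0.1)
print(len(new_data))
```

Because the stand-in network always predicts partial overlap, every collected pair has a different target, which is the kind of failure case the search is meant to surface for retraining.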

Algorithm 1: Automatically generating new training data

Input: Euler Net $\mathrm{EN}$; the maximum size of the unintended data set, maxSize; the timer, Timer; the maximum time, maxTime; the threshold to be unintended, Threshold

Output: new training data, newData

1. newData $\leftarrow\emptyset$;
2. DataSize $\leftarrow 0$;
3. Timer $\leftarrow$ set\_timer();
4. while DataSize $<$ maxSize $\land$ Timer $<$ maxTime do
   1. Input $\leftarrow$ randomly\_generate\_one\_input()
   2. ENOutput $\leftarrow$ output\_of\_network($\mathrm{EN}$, Input);
   3. Output $\leftarrow$ compute\_correct\_output(Input)
   4. if loss(ENOutput, Output) $>$ Threshold then
      1. newData $\leftarrow$ newData $\cup\ \{(\text{Input},\text{Output})\}$
      2. DataSize $\leftarrow$ DataSize $+1$
5. return newData

![Refer to caption](https://arxiv.org/html/2606.26454v1/figures/new_class.png)

Figure 10: (a) Super Euler Net may automatically complete a half circle into a full circle (with the output $[0.0002,0.9317,0.0003,0]$). As we decrease the length of the arc to $120^{\circ}$, $60^{\circ}$, and $0^{\circ}$, it decreases this value accordingly. (b) Super Euler Net may automatically ignore the half circle and only take one green circle as input (with the output $[0.0001,0.0029,0.0009,0.0002]$). As we increase the length of the arc to $240^{\circ}$, $300^{\circ}$, and $360^{\circ}$, it increases this value accordingly.

### 6.1 Experiment 1

#### The aim

The pattern recognition component of Euler Net may automatically recognise the whole from its parts (Figure [7](https://arxiv.org/html/2606.26454#S4.F7)). We introduce a new class, *unintended inputs*, for all such parts and check whether increasing the training data can exhaust them.

#### Setting of the experiment

We define a new output vector $[0,0,0,0]$ representing unintended inputs and train SupEN to classify single-circle inputs into $[0,0,0,0]$ until it reaches 100% accuracy. In the end, SupEN can perform syllogistic reasoning for regular inputs and classify single-circle inputs as unintended. Then, we create a new dataset, in which one input image consists of a green circle and a half red circle, and the other image consists of a half blue circle and a green circle.

#### Experiment results

Experiments show that SupEN sometimes completes the two half-circles into two whole circles and concludes $[0,1,0,0]$, that is, the blue circle contains the red circle, as shown in Figure [10](https://arxiv.org/html/2606.26454#S6.F10)(a). In this case, if we decrease the arc length of the two half-circles, the confidence value flagging *the blue circle contains the red circle* decreases, as shown in the input images and outputs from Figure [10](https://arxiv.org/html/2606.26454#S6.F10)(a) to (b). At other times, SupEN reduces one green circle and a half circle to one green circle (the half circles are neglected) and concludes that the inputs are unintended, $[0,0,0,0]$, as shown in Figure [10](https://arxiv.org/html/2606.26454#S6.F10)(c). In this case, if we increase the arc length of the two half-circles, the confidence value flagging *the blue circle contains the red circle* increases correspondingly, as shown in the input images and outputs from Figure [10](https://arxiv.org/html/2606.26454#S6.F10)(c) to (d). This, however, automatically creates another kind of unintended pattern: one green circle and a partial circle with an arc of $(180^{\circ}+360^{\circ})/2=270^{\circ}$. This loop will never end.

#### Conclusion

Training data cannot exhaust unintended inputs, because new training data generates new unintended inputs. Thus, SupEN will not reach symbolic-level reasoning if we do not restrict inputs.

### 6.2 Experiment 2

#### The aim

We restrict all inputs to be intended (either two complete circles or one single circle), repeatedly increase the training data, and check whether SupEN improves its performance and becomes *infinitely close* to the symbolic level.

#### Setting of the experiment

As SupNN can automatically identify reasoning errors and generate new training data, we let it repeatedly switch between searching for new training data and training on the new training data. In the search procedure, the central point and the radius of a circle are random, with two restrictions: (1) circles are fully inside the boundary of an image; (2) the minimum radius is 0.1. We allow all possible combinations between two circles. Based on these criteria, SupEN creates a new test data set $\mathcal{D}_1$.
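The two sampling restrictions can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a unit-square image, radii measured in units of the image width, and uniform sampling. The names `sample_circle` and `relation`, and the relation labels, are hypothetical.

```python
import random

MIN_RADIUS = 0.1  # minimum radius stated in the text


def sample_circle(rng, size=1.0):
    """Sample (cx, cy, r) so that the circle lies fully inside a size x size image."""
    r = rng.uniform(MIN_RADIUS, size / 2)
    cx = rng.uniform(r, size - r)  # keep the circle inside the left/right borders
    cy = rng.uniform(r, size - r)  # keep the circle inside the top/bottom borders
    return cx, cy, r


def relation(c1, c2):
    """Classify the geometric relation between two circles."""
    (x1, y1, r1), (x2, y2, r2) = c1, c2
    d = ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5
    if d >= r1 + r2:
        return "disjoint"
    if d + r2 <= r1:
        return "first contains second"
    if d + r1 <= r2:
        return "second contains first"
    return "overlap"


rng = random.Random(0)
pair = sample_circle(rng), sample_circle(rng)
print(relation(*pair))
```

Because the two circles are sampled independently, all four relations can occur, which matches the statement that all combinations between two circles are allowed.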

![Refer to caption](https://arxiv.org/html/2606.26454v1/figures/exp1.jpg)

Figure 11: $\mathcal{D}_1$ is a newly created dataset; $\mathcal{D}_2$ is randomly selected from the standard training dataset.

In the training stage, SupEN creates a training dataset $\mathcal{D}_T$ that consists of newly created training data $\mathcal{D}_1$ and original training data $\mathcal{D}_2$, $\mathcal{D}_T=\mathcal{D}_1\cup\mathcal{D}_2$, as shown in Figure [11](https://arxiv.org/html/2606.26454#S6.F11). For example, $\mathcal{D}_1$ is a newly created dataset with one-circle images. The size of $\mathcal{D}_2$ is 9 times larger than that of $\mathcal{D}_1$.
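A minimal sketch of this mixing step is given below. It reads "9 times larger" as $|\mathcal{D}_2| = 9\,|\mathcal{D}_1|$, which is an assumption, and it samples $\mathcal{D}_2$ without replacement from the standard training set. `build_training_set` and its parameters are hypothetical names.

```python
import random


def build_training_set(new_samples, standard_pool, ratio=9, seed=0):
    """Return D_T = D_1 + D_2, where D_2 is a random subset of the standard pool.

    Assumption: |D_2| = ratio * |D_1|.
    """
    rng = random.Random(seed)
    k = ratio * len(new_samples)
    if k > len(standard_pool):
        raise ValueError("standard pool is too small for the requested ratio")
    replay = rng.sample(standard_pool, k)  # D_2: drawn without replacement
    return list(new_samples) + replay


d1 = ["new_0", "new_1"]
pool = [f"std_{i}" for i in range(100)]
d_t = build_training_set(d1, pool)
print(len(d_t))  # 20
```

Mixing original data back in is a common way to limit forgetting of earlier cases while training on newly found errors; whether that is the authors' motivation is not stated in the text.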

#### Testing with random dataset

Unlike the standard deep learning paradigm, in which testing data and training data should have the same distribution [Bengio22], we need to evaluate whether SupEN can reach (or be infinitely close to) 100% accuracy for new testing data. Thus, in this experiment, the testing data are randomly generated. We let SupEN loop 20 times through the searching-training process to improve its reasoning performance.

#### Experiment result

![Refer to caption](https://arxiv.org/html/2606.26454v1/figures/Accuracy.jpg)

Figure 12: The accuracy of SupEN at each iteration with newly generated training data.

Tested with randomly generated testing data, SupEN reaches $56.0\%$ before the loop (this SupEN corresponds to a well-trained Euler Net). Through repeated training on newly created datasets, the accuracy improves steadily and reaches a peak value of $97.8\%$ after the 19th iteration, as illustrated in Figure [12](https://arxiv.org/html/2606.26454#S6.F12). The increase in accuracy supports the scaling law. The oscillation arises because we search randomly instead of searching with gradual descent operations.

#### Testing with data covering all valid types of syllogistic reasoning

Reaching symbolic-level syllogistic reasoning requires reaching, or being infinitely close to, 100% accuracy for each valid type of syllogistic reasoning. Thus, we created another testing dataset covering all 24 valid types as follows. We group equivalent syllogistic statements into the same group; for example, *no x are y* and *no y are x* are in the same group. This reduces the 24 *valid* syllogism types to 14 groups. For each group, we created 500 different premises by extracting hypernym relations from WordNet-3.0 [MillerWordNet95]. For each premise, we deduce its valid logical conclusion and its negation, totalling $14 \times 500 \times 2 = 14000$ syllogistic reasoning tasks. In the hypernym structure, *elementary_particle.n.01* is a descendant of *natural_object.n.01* and *artifact.n.01* is not a descendant of *natural_object.n.01*. So, we create the valid syllogistic reasoning and its negation, as shown below.

all elementary_particle.n.01 are natural_object.n.01. no artifact.n.01 are natural_object.n.01. ∴ no elementary_particle.n.01 are artifact.n.01.

all elementary_particle.n.01 are natural_object.n.01. some artifact.n.01 are natural_object.n.01. ∴ some elementary_particle.n.01 are artifact.n.01.

We use the pre-processing tool of the original Euler Net to transform premises into coloured circles, and conclusions into vectors.
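The reduction from 24 types to 14 groups can be checked with a short script. The sketch below is one reading of the grouping rule, not the authors' code: it treats *no x are y* and *some x are y* as symmetric in their two terms, keeps the roles of S, M and P fixed, and uses the standard table of the 24 valid moods (with existential import). The names `canonical` and `group_key` are hypothetical.

```python
# Term order of (major premise, minor premise) per figure; the conclusion is always (S, P).
FIGURES = {
    1: (("M", "P"), ("S", "M")),
    2: (("P", "M"), ("S", "M")),
    3: (("M", "P"), ("M", "S")),
    4: (("P", "M"), ("M", "S")),
}

# Valid moods per figure. A: all, E: no, I: some, O: some ... are not.
VALID_MOODS = {
    1: ["AAA", "EAE", "AII", "EIO", "AAI", "EAO"],
    2: ["EAE", "AEE", "EIO", "AOO", "EAO", "AEO"],
    3: ["AAI", "IAI", "AII", "EAO", "OAO", "EIO"],
    4: ["AAI", "AEE", "IAI", "EAO", "EIO", "AEO"],
}


def canonical(quantifier, terms):
    """E and I statements are symmetric, so their terms are sorted."""
    if quantifier in "EI":
        terms = tuple(sorted(terms))
    return (quantifier, terms)


def group_key(figure, mood):
    """Canonical form of (major premise, minor premise, conclusion)."""
    major, minor = FIGURES[figure]
    return (
        canonical(mood[0], major),
        canonical(mood[1], minor),
        canonical(mood[2], ("S", "P")),
    )


syllogisms = [(fig, mood) for fig, moods in VALID_MOODS.items() for mood in moods]
groups = {group_key(fig, mood) for fig, mood in syllogisms}

print(len(syllogisms), len(groups))  # 24 14
print(len(groups) * 500 * 2)         # 14000
```

Under this reading, CELARENT and CESARE fall into one group, as do the four types with moods EIO, which is consistent with the 14 groups and 14000 tasks reported above.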

#### Experiment result

We fed the new dataset to a well-trained SupEN (after the 19th loop of improvement). It works very well if a task falls into a valid syllogistic structure: for 8 syllogistic structures, it reaches 100% accuracy, namely, BARBARA, CELARENT, CESARE, DARAPTI, CALEMES, CAMESTRES, FELAPTON, and FESAPO. Accuracies of the remaining 16 types range from $50\%$ to $83.3\%$. The overall accuracy is 76%, as shown in Table [1](https://arxiv.org/html/2606.26454#S4.T1) (in Section 4.1). This performance is consistent with the evaluation of the PaLM 2 and Llama 2 families of LLMs in [Eisape2024], where the best performance was achieved by PaLM 2-Small with an accuracy of about 75%, better than PaLM 2-Large. These results suggest that the reasoning performance may not become infinitely close to the symbolic level solely by increasing training data and training time (in terms of the number of loops).

### 6.3 Experiment 3

#### The aim

When Euler Net reasons correctly, it is taken for granted that the Siamese Neural Network has learned the geometric relations between the two circles. As shown in Figure [13](https://arxiv.org/html/2606.26454#S6.F13)(a), it is well accepted that the latent feature vectors encode geometric relations, such as *a green circle contains a red circle*, and that Euler Net is learning one syllogistic reasoning type. The aim of this experiment is to show that this may be a mirage, an unrequited longing of human observers.

![Refer to caption](https://arxiv.org/html/2606.26454v1/figures/image2vec.png)

Figure 13: (a) We may believe that the two vectors encode geometric structures in the input images; (b) the accuracy of a well-trained Euler Net drops when the green circle is changed to other colours. The greater the difference between the new colour and the original green, the lower the accuracy.

#### Setting of the experiment

We suppose that Euler Net learns geometric relations between circles, encodes them in the latent feature vectors, and uses them to derive syllogistic results. Under this assumption, changing the colour of the two green circles should not significantly affect performance.
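The perturbation can be sketched as below. The text does not say how the images are stored or how the colour difference is measured, so this example assumes images given as rows of RGB tuples, pure green as `(0, 255, 0)`, and Euclidean distance in RGB space. `recolour` and `colour_distance` are hypothetical helpers.

```python
GREEN = (0, 255, 0)  # assumed colour of the shared circle


def recolour(image, old=GREEN, new=(0, 200, 55)):
    """Return a copy of `image` (rows of RGB tuples) with `old` pixels replaced by `new`."""
    return [[new if px == old else px for px in row] for row in image]


def colour_distance(a, b):
    """Euclidean distance in RGB space, a simple stand-in for 'difference from green'."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


image = [[GREEN, (255, 0, 0)], [(255, 255, 255), GREEN]]
shifted = recolour(image, new=(0, 0, 255))
print(shifted[0][0], colour_distance(GREEN, (0, 0, 255)))
```

Only the colour of the pixels changes. The geometry of the circles is untouched, so a model that had abstracted the geometric relation would be expected to give the same answer on `shifted` as on `image`.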

#### Results

Our experimental results show that the performance drops as the difference between the new colour and the original green colour increases, as illustrated in Figure [13](https://arxiv.org/html/2606.26454#S6.F13)(b). When the colour differs greatly from green, the accuracy drops to $8.1\%$. This result refutes our assumption and suggests that even if human observers can easily identify spatial structures in pixel images, supervised deep learning systems may not be able to abstract geometric relations from pixel images; however, this does not prevent them from delivering correct outputs within the benchmark dataset. This result suggests that we should take a more critical attitude in analysing and interpreting latent feature vectors of traditional neural networks.

### 6.4 Experiment 4

#### The aim

We evaluate two versions of the most recent OpenAI GPT models, GPT-5 and GPT-5-nano, on syllogistic reasoning, to examine whether the scaling law can guarantee that performance reaches (or becomes infinitely close to) the symbolic level. Concretely, we examine whether surface lexical patterns can affect their reasoning performance (deficit 2).

#### Setting of the experiment

We follow the method in [djl2025], which used syllogistic statements with four surface lexical patterns: (1) meaningful words, e.g. *Socrates*; (2) double words, e.g. *City_Socrates*; (3) simple symbols, e.g. *S*; and (4) long random symbols, e.g. *VnWKvqcBsEh1*, to determine the satisfiability of all 256 types of classic Aristotelian syllogistic reasoning. The motivation for introducing the new pattern of double words is to enable the meaning of words to support reasoning, while reducing the bias inherent in single words.
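The count of 256 types and the four surface patterns can be made concrete with a small generator. This is a sketch under assumptions, not the procedure of [djl2025]: the prefix `City`, the 12-character length of the random symbols (matching the example *VnWKvqcBsEh1*), the sentence templates, and all function names are illustrative.

```python
import random
import string
from itertools import product

# Term order of (major premise, minor premise) per figure; the conclusion is always (S, P).
FIGURES = {
    1: (("M", "P"), ("S", "M")),
    2: (("P", "M"), ("S", "M")),
    3: (("M", "P"), ("M", "S")),
    4: (("P", "M"), ("M", "S")),
}
TEMPLATES = {
    "A": "all {x} are {y}",
    "E": "no {x} are {y}",
    "I": "some {x} are {y}",
    "O": "some {x} are not {y}",
}


def all_types():
    """4 figures x 4^3 moods = 256 classic syllogism types."""
    return [(fig, "".join(mood)) for fig in FIGURES for mood in product("AEIO", repeat=3)]


def surface_terms(pattern, words, rng, prefix="City"):
    """Map base words to one of the four surface forms (hypothetical recipe)."""
    if pattern == "words":
        return list(words)
    if pattern == "double_words":
        return [f"{prefix}_{w}" for w in words]
    if pattern == "simple_symbols":
        return [string.ascii_uppercase[i] for i in range(len(words))]
    if pattern == "long_random_symbols":
        alphabet = string.ascii_letters + string.digits
        return ["".join(rng.choice(alphabet) for _ in range(12)) for _ in words]
    raise ValueError(f"unknown pattern: {pattern}")


def render(figure, mood, s, m, p):
    """Return [major premise, minor premise, conclusion] as text."""
    names = {"S": s, "M": m, "P": p}
    major, minor = FIGURES[figure]
    slots = (major, minor, ("S", "P"))
    return [TEMPLATES[q].format(x=names[x], y=names[y]) for q, (x, y) in zip(mood, slots)]


rng = random.Random(0)
s, m, p = surface_terms("double_words", ["Greek", "human", "mortal"], rng)
print(len(all_types()))  # 256
print(render(1, "AAA", s, m, p))
```

The logical form returned by `render` is identical across the four patterns; only the term names change, which is what allows surface effects to be isolated.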

Table 2: Syllogistic reasoning performance of OpenAI GPT-5-nano and GPT-5. ‘✓ EXPL’ stands for a correct explanation, ‘✗ H’ for a hallucinative explanation. The ‘#correct decision-✗ H’ column means a correct decision with a wrong explanation; the ‘#wrong decision-✓ EXPL’ column means a wrong decision with a correct explanation.

#### Evaluation metrics

We use two evaluation metrics: (1) the normal metric of accuracy (the #simple acc column in Table [2](https://arxiv.org/html/2606.26454#S6.T2)); (2) the metric for reaching symbolic-level logical reasoning, under which a response is correct only if both the decision and the explanation are correct (the #correct decision-✓ EXPL column in Table [2](https://arxiv.org/html/2606.26454#S6.T2)).
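The difference between the two metrics can be stated in a few lines. The sketch below assumes each response has already been judged by hand on two boolean fields; the field names `decision_ok` and `explanation_ok` and the four toy records are hypothetical and are not data from Table 2.

```python
def score(responses):
    """Return (simple accuracy, strict accuracy) for a list of judged responses.

    simple: the decision is correct.
    strict: the decision and the explanation are both correct.
    """
    n = len(responses)
    simple = sum(r["decision_ok"] for r in responses) / n
    strict = sum(r["decision_ok"] and r["explanation_ok"] for r in responses) / n
    return simple, strict


# Toy records for illustration only.
toy = [
    {"decision_ok": True, "explanation_ok": True},
    {"decision_ok": True, "explanation_ok": True},
    {"decision_ok": True, "explanation_ok": False},  # right answer, wrong reason
    {"decision_ok": True, "explanation_ok": True},
]
print(score(toy))  # (1.0, 0.75)
```

The strict score can never exceed the simple one, so a system can sit at 100% simple accuracy while still falling short on the second metric.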

#### Results

Measured by the normal metric, the results range from 97.7% to 100.0%, confirming the high performance of both OpenAI's GPT-5-nano and GPT-5 in syllogistic reasoning. Meanwhile, the eight experiments show that surface lexical patterns can affect reasoning performance, and each version made at least 15 correct decisions with incorrect explanations. Fed with double-word statements, GPT-5-nano achieved 89.4% correct decisions with correct explanations, better than with the other forms. With single-word statements, GPT-5 achieved 93.4% correct decisions with correct explanations, better than with the other forms. In particular, GPT-5 achieved 100% correct decisions with long random symbols, but 25 of these correct decisions were supported by wrong explanations. Since 100% accuracy is the maximum performance that the scaling law can target, and training usually stops once it is reached, symbolic-level performance lies beyond this limit.

## 7 Conclusion

Given the simplicity of syllogistic reasoning, one might assume that deep neural networks can readily master it. Recent studies of neural syllogistic reasoning have primarily focused on symbolic-input systems, such as large language models, and have investigated their internal reasoning mechanisms. The emerging consensus is that, despite impressive empirical performance, these systems do not attain the rigour of symbolic-level syllogistic reasoning. In contrast, we examine a less popular method for syllogistic reasoning: supervised deep learning using image inputs. We identify two methodological limitations that prevent such systems from achieving symbolic-level syllogistic reasoning. Our analysis focuses on training data, the fuel of all supervised learning systems, and concludes that no training data, however desirable, can provide sufficient information to attain symbolic-level syllogistic reasoning. Since syllogistic reasoning constitutes the foundation of logical inference, this limitation has broader implications for the reasoning ability of supervised neural networks and reveals a fundamental boundary of the scaling law.

## Acknowledgement

We gratefully acknowledge Björn Gintzel for his coding and implementation support\.

## References

## Appendix A The list of 24 valid types of syllogistic reasoning

Table 3: List of all 24 valid syllogisms, each having a name whose vowels indicate types of moods, e.g., vowels in ‘CELARENT’ indicate *universal negative* (E), *universal affirmative* (A), and *universal negative* (E).