ChaosBench-Logic v2: Evaluating LLM Logical Reasoning over Dynamical Systems at Scale

arXiv cs.LG 05/26/26, 04:00 AM Papers
llm-reasoning dynamical-systems logical-reasoning benchmark evaluation calibration chaos-theory
Summary
ChaosBench-Logic v2 is a large-scale benchmark of 40,886 questions over 165 dynamical systems that evaluates LLMs' logical reasoning abilities, revealing near-random performance on regime transition reasoning and systematic failure modes even in frontier models.
arXiv:2605.24305v1 Announce Type: new Abstract: Standard accuracy on binary reasoning benchmarks hides critical failure modes: prior collapse, inconsistency under paraphrase, and inability to reason about parameter-dependent dynamics. We present ChaosBench-Logic v2, a 40,886-question benchmark over 165 dynamical systems with 27 FOL predicates and 78 axiom edges, together with CARE (Calibration- and Adversarial-Robust Evaluation), a protocol that surfaces these pathologies. Evaluating 14 models, we find that regime-transition reasoning remains near random (MCC = 0.05) even for frontier models, whereas FOL deduction with given premises reaches MCC = 0.52. Per-family decomposition shows that the proprietary-model advantage concentrates on cross-indicator (+0.40) and consistency tasks, while open-source Qwen 2.5-32B dominates indicator diagnostics (0.91 vs. 0.45). Two models exhibit negative MCC on bifurcation questions, confirmed as systematic anti-correlation via confusion-matrix analysis.
# Evaluating LLM Logical Reasoning over Dynamical Systems at Scale
Source: [https://arxiv.org/html/2605.24305](https://arxiv.org/html/2605.24305)
Noel Thomas, Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE \(noel\.thomas@mbzuai\.ac\.ae\)

###### Abstract

Standard accuracy on binary reasoning benchmarks hides critical failure modes: prior collapse, inconsistency under paraphrase, and inability to reason about parameter\-dependent dynamics\. We present ChaosBench\-Logic v2, a 40,886\-question benchmark over 165 dynamical systems with 27 FOL predicates and 78 axiom edges, together with CARE \(Calibration\- and Adversarial\-Robust Evaluation\), a protocol that surfaces these pathologies\. Evaluating 14 models, we find that regime transition reasoning remains near\-random \(MCC = 0\.05\) even for frontier models, while FOL deduction with given premises reaches MCC = 0\.52; per\-family decomposition shows the proprietary advantage concentrates on cross\-indicator \(\+0\.40\) and consistency tasks, while open\-source Qwen 2\.5\-32B dominates indicator diagnostics \(0\.91 vs\. 0\.45\)\. Two models exhibit negative MCC on bifurcation questions, confirmed as systematic anti\-correlation via confusion matrix analysis\.

## 1 Introduction

Large language models achieve strong performance on mathematical and logical reasoning benchmarks \(Wei et al\., [2022](https://arxiv.org/html/2605.24305#bib.bib8); Cobbe et al\., [2021](https://arxiv.org/html/2605.24305#bib.bib3)\), yet their capacity for *logically consistent* reasoning over scientific domains remains poorly understood\. Dynamical systems present a particularly demanding testbed: chaos is deterministic rather than random, exhibits sensitive dependence on initial conditions, and requires a positive Lyapunov exponent \(Strogatz, [2018](https://arxiv.org/html/2605.24305#bib.bib14)\)\. These formal distinctions must be maintained across multi\-step inferences, a requirement that probes deeper than pattern matching\.

Existing benchmarks target mathematical problem\-solving \(Hendrycks et al\., [2021](https://arxiv.org/html/2605.24305#bib.bib4); Cobbe et al\., [2021](https://arxiv.org/html/2605.24305#bib.bib3)\), scientific QA \(Clark et al\., [2018](https://arxiv.org/html/2605.24305#bib.bib5); Wang et al\., [2023a](https://arxiv.org/html/2605.24305#bib.bib12)\), propositional logic \(Liu et al\., [2020](https://arxiv.org/html/2605.24305#bib.bib6); Han et al\., [2022](https://arxiv.org/html/2605.24305#bib.bib7)\), or synthetic FOL reasoning \(Saparov and He, [2023](https://arxiv.org/html/2605.24305#bib.bib20)\), but none combine \(i\) a domain\-specific FOL ontology with \(ii\) ground\-truth labels derived from axiom entailment over \(iii\) real scientific systems at scale\.

We introduce ChaosBench\-Logic v2, a 66× scale\-up of v1 \(Thomas, [2026](https://arxiv.org/html/2605.24305#bib.bib1)\): from 621 to 40,886 questions, 27 to 165 systems, and 7 to 11 task families\. Our contributions:

1. Benchmark\. 40,886 questions, 11 task families, 27 predicates, 78 FOL axiom edges, and 165 dynamical systems \(135 from dysts; Gilpin, [2021](https://arxiv.org/html/2605.24305#bib.bib19)\)\.
2. CARE protocol\. A calibration\- and robustness\-aware evaluation framework \(MCC, macro\-family MCC, calibration diagnostics, consistency, coverage\) that exposes failure modes hidden by accuracy\.
3. Diagnostic findings\. A knowledge\-type boundary between rule\-following and parameter\-dependent reasoning, per\-family decomposition of the proprietary–OSS gap, and systematic prediction biases including negative MCC\.

## 2 Related Work

#### LLM reasoning benchmarks\.

GSM8K \(Cobbe et al\., [2021](https://arxiv.org/html/2605.24305#bib.bib3)\) and MATH \(Hendrycks et al\., [2021](https://arxiv.org/html/2605.24305#bib.bib4)\) test mathematical reasoning; ARC \(Clark et al\., [2018](https://arxiv.org/html/2605.24305#bib.bib5)\) tests science QA; LogiQA \(Liu et al\., [2020](https://arxiv.org/html/2605.24305#bib.bib6)\) and FOLIO \(Han et al\., [2022](https://arxiv.org/html/2605.24305#bib.bib7)\) test logical reasoning; BIG\-Bench \(Srivastava et al\., [2023](https://arxiv.org/html/2605.24305#bib.bib2)\) includes some logical tasks\. PrOntoQA \(Saparov and He, [2023](https://arxiv.org/html/2605.24305#bib.bib20)\) is closest to our work: it tests compositional FOL reasoning over synthetic ontologies, finding that LLMs struggle with longer inference chains\. LogicBench \(Parmar et al\., [2024](https://arxiv.org/html/2605.24305#bib.bib21)\) evaluates 25 logical reasoning patterns and confirms failures on complex reasoning with negation\. Our benchmark differs by grounding FOL axioms in a real scientific domain where ground\-truth labels derive from physical properties rather than synthetic constructions\.

#### Consistency and robustness\.

TruthfulQA \(Lin et al\., [2022](https://arxiv.org/html/2605.24305#bib.bib11)\) measures truthfulness; self\-consistency decoding \(Wang et al\., [2023b](https://arxiv.org/html/2605.24305#bib.bib9)\) improves CoT reliability; ReClor \(Yu et al\., [2020](https://arxiv.org/html/2605.24305#bib.bib10)\) tests logical reading comprehension\. Our consistency\_paraphrase and perturbation families extend these ideas to a scientific domain with formal ground truth\.

#### Scientific reasoning and dynamical systems\.

SciBench \(Wang et al\., [2023a](https://arxiv.org/html/2605.24305#bib.bib12)\) evaluates college\-level scientific problem\-solving\. The dysts library \(Gilpin, [2021](https://arxiv.org/html/2605.24305#bib.bib19)\) provides 135 standardized dynamical systems originally designed for forecasting benchmarks; we build on it for system diversity, preserving provenance and verifying predicate annotations against our axiom system\. ChaosBench\-Logic v1 \(Thomas, [2026](https://arxiv.org/html/2605.24305#bib.bib1)\) introduced a 621\-question benchmark; we scale it roughly 66\-fold, to 40,886 questions\.

## 3 Benchmark Design

### 3\.1 Ontology

The benchmark is grounded in a first\-order logic ontology of 27 unary predicates in three tiers: 11 core predicates characterizing dynamical regimes \(Chaotic, Deterministic, PosLyap, Sensitive, StrangeAttr, PointUnpredictable, StatPredictable, QuasiPeriodic, Random, FixedPointAttr, Periodic\), 4 topological predicates \(Dissipative, Bounded, Mixing, Ergodic\), and 12 structural predicates \(HyperChaotic, Conservative, ContinuousTime, DiscreteTime, etc\.\)\.

These predicates are connected by 78 directed axiom edges \(31 implication, 47 exclusion\), enabling reasoning chains of up to 5–6 hops\. For example, the Chaotic predicate entails:

$$\forall s:\; \text{Chaotic}(s) \Rightarrow \text{Deterministic}(s) \land \text{PosLyap}(s) \land \text{Sensitive}(s) \land \text{Mixing}(s) \land \neg\text{Random}(s) \land \neg\text{Periodic}(s) \land \neg\text{QuasiPeriodic}(s) \tag{1}$$

The full specification is in Appendix [F](https://arxiv.org/html/2605.24305#A6)\.

### 3\.2 Systems

The benchmark covers 165 dynamical systems: 30 manually curated \(Lorenz\-63 \(Lorenz, [1963](https://arxiv.org/html/2605.24305#bib.bib15)\), Rössler, Hénon, logistic map, Brusselator, Ornstein\-Uhlenbeck, etc\.\) and 135 from the dysts library \(Gilpin, [2021](https://arxiv.org/html/2605.24305#bib.bib19)\)\. Regime transition and FOL inference families additionally use synthetic parameterizations not counted in this total\. Each system carries ground\-truth values for all 27 predicates, verified against the axiom system\.

### 3\.3 Task Families

Table 1: Dataset composition by task family, ordered by difficulty\. Families with $N < 100$ are interpreted qualitatively\.

Multi\-hop questions chain 2–6 steps through the axiom graph \(e\.g\., “FluidTrampoline is strongly mixing ⇒ weakly mixing ⇒ ergodic ⇒ bounded\. Is it bounded?”\)\. Regime transition questions require specific bifurcation thresholds \(e\.g\., “At α = 15\.6, is Chua’s circuit chaotic?”\)\. FOL inference questions present premises for deductive conclusions\. Adversarial questions include misleading premises irrelevant to the answer\. Indicator diagnostic questions require interpreting numerical chaos indicators \(0\-1 test \(Gottwald and Melbourne, [2009](https://arxiv.org/html/2605.24305#bib.bib17)\), permutation entropy \(Bandt and Pompe, [2002](https://arxiv.org/html/2605.24305#bib.bib18)\), MEGNO \(Cincotta and Simó, [2000](https://arxiv.org/html/2605.24305#bib.bib23)\)\)\. Representative examples with model predictions are in Appendix [E](https://arxiv.org/html/2605.24305#A5)\.

### 3\.4 Evaluation Protocol: CARE

Standard accuracy on binary classification benchmarks can be misleading\. Since $\text{Acc} = p \cdot \text{TPR} + (1 - p) \cdot \text{TNR}$, where $p$ is the TRUE prevalence, a model with high TNR but low TPR inflates accuracy by exploiting class priors\. LLaMA 3\.1\-8B illustrates this: its TPR = 0\.32 and TNR = 0\.88 yield 60\.2% accuracy \(vs\. 50\.5% for always\-FALSE\), but MCC = 0\.24 and balanced accuracy = 0\.60 correctly signal near\-chance performance\. To surface such pathologies, we propose CARE \(Calibration\- and Adversarial\-Robust Evaluation\), a protocol for reasoning benchmarks\.
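The decomposition above can be checked numerically\. The following is a minimal sketch, not the released evaluation code: it rebuilds a normalized confusion matrix from the rounded rates quoted for LLaMA 3\.1\-8B and derives accuracy, balanced accuracy, predicted\-TRUE rate, and MCC\. The function name `metrics_from_rates` is an illustrative assumption, and because the inputs are rounded to two decimals the outputs match the reported figures only approximately\.

```python
import math


def metrics_from_rates(tpr: float, tnr: float, p: float) -> dict:
    """Derive summary metrics from TPR, TNR and the TRUE prevalence p.

    Hypothetical helper: works on confusion-matrix *fractions*, so the
    four cells sum to 1 rather than to a question count.
    """
    tp = p * tpr
    fn = p * (1.0 - tpr)
    tn = (1.0 - p) * tnr
    fp = (1.0 - p) * (1.0 - tnr)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": tp + tn,                 # equals p*TPR + (1-p)*TNR
        "balanced_accuracy": 0.5 * (tpr + tnr),
        "pred_true": tp + fp,                # fraction of TRUE predictions
        "mcc": (tp * tn - fp * fn) / denom if denom > 0 else 0.0,
    }


# Rounded rates quoted in the text for LLaMA 3.1-8B; p = 0.495 is the TRUE prevalence.
m = metrics_from_rates(tpr=0.32, tnr=0.88, p=0.495)
# Degenerate baseline that always answers FALSE (TPR = 0, TNR = 1).
always_false = metrics_from_rates(tpr=0.0, tnr=1.0, p=0.495)

for name, res in (("LLaMA 3.1-8B (rounded rates)", m), ("always-FALSE", always_false)):
    print(name, {k: round(v, 3) for k, v in res.items()})
```

The always\-FALSE baseline gets 50\.5% accuracy but an MCC of 0 \(by the usual convention for an undefined denominator\), which is the kind of prior exploitation the protocol is meant to surface\.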

CARE reports five diagnostics:

1. MCC \(primary\) \(Matthews, [1975](https://arxiv.org/html/2605.24305#bib.bib13); Chicco and Jurman, [2020](https://arxiv.org/html/2605.24305#bib.bib16)\): penalizes prior collapse and is robust to class imbalance\. Ranges from $-1$ \(anti\-correlation\) through $0$ \(random\) to $+1$ \(perfect\):

   $$\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} \tag{2}$$
2. Macro\-family MCC: mean MCC across task families, preventing dominant families \(atomic: 62% of questions\) from masking failures on hard families\.
3. Calibration: predicted TRUE rate vs\. ground\-truth TRUE rate\. Flags prior collapse \(e\.g\., LLaMA 3\.1\-8B predicts TRUE only 21\.6% of the time vs\. 49\.5% ground truth\)\.
4. Consistency: MCC on the consistency\_paraphrase and perturbation families, measuring whether predictions survive surface\-level variation\.
5. Coverage: invalid rate \(unparseable responses counted as incorrect\) and per\-family coverage, flagging instruction\-following failures\.
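To illustrate why macro\-family MCC is reported alongside pooled MCC, here is a small sketch on toy data\. The record layout `(family, gold, pred)`, the helper names, and the choice to map an invalid response \(`None`\) to the opposite of the gold label are assumptions made for this example; the source only states that unparseable responses count as incorrect\.

```python
import math
from collections import defaultdict


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def confusion(pairs):
    """Count (TP, FP, TN, FN) from (gold, pred) pairs; pred=None means invalid."""
    tp = fp = tn = fn = 0
    for gold, pred in pairs:
        if pred is None:
            pred = not gold  # assumption: an invalid answer is scored as wrong
        if gold and pred:
            tp += 1
        elif gold and not pred:
            fn += 1
        elif not gold and pred:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def macro_family_mcc(records):
    """Unweighted mean of per-family MCC over (family, gold, pred) records."""
    by_family = defaultdict(list)
    for family, gold, pred in records:
        by_family[family].append((gold, pred))
    scores = {fam: mcc(*confusion(pairs)) for fam, pairs in by_family.items()}
    return sum(scores.values()) / len(scores), scores


# Toy data: a large easy family answered perfectly, a small hard family at chance.
records = [("atomic", g, g) for g in [True] * 4 + [False] * 4] + [
    ("regime_transition", g, p)
    for g, p in [(True, True), (True, False), (False, True), (False, False)]
]
macro, per_family = macro_family_mcc(records)
pooled = mcc(*confusion([(g, p) for _, g, p in records]))
print(f"pooled MCC = {pooled:.3f}, macro-family MCC = {macro:.3f}", per_family)
```

In this toy setting the pooled score is dominated by the larger family, while the macro average weights the chance\-level family equally\.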

Table [2](https://arxiv.org/html/2605.24305#S3.T2) demonstrates CARE on four models\. Accuracy ranks LLaMA 3\.1\-8B at 60\.2% \(above chance\), but CARE flags prior collapse \(21\.6% predicted TRUE vs\. 49\.5% ground truth\) and asymmetric recall \(TPR = 0\.32, TNR = 0\.88\): the model achieves accuracy by defaulting to FALSE, not by reasoning\. Mistral\-7B’s 61\.3% accuracy masks a 1\.1% invalid rate and consistency MCC below 0\.30\. Only Claude Sonnet 4\.6 triggers no CARE flags\.

Table 2: CARE diagnostics for four models\. Flags: pc = prior collapse \($|\text{pred TRUE\%} - 49.5| > 10$\); ar = asymmetric recall \($|\text{TPR} - \text{TNR}| > 0.3$\); cv = coverage \(more than 0\.5% invalid\); ic = inconsistent \(consistency MCC below 0\.30\)\.

All standard models use temperature = 0; reasoning models \(o3\-mini, GPT\-5\.2\) do not accept a temperature parameter and are deterministic by design \(Appendix [I](https://arxiv.org/html/2605.24305#A9)\)\. Most models receive max\_tokens = 16; reasoning models and Gemini use 1024 because their architectures consume output tokens for internal reasoning\.
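The four flag thresholds in the Table 2 caption translate directly into a small rule set\. The sketch below is an illustration under stated assumptions: the `CareInputs` container and `care_flags` function are hypothetical names, and the consistency check is skipped when no consistency MCC is supplied, since the source text does not quote that value for every model\.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CareInputs:
    """Per-model summary statistics (hypothetical container for this sketch)."""
    pred_true_pct: float                      # predicted TRUE rate, in percent
    tpr: float
    tnr: float
    invalid_pct: float                        # unparseable responses, in percent
    consistency_mcc: Optional[float] = None   # None = value not available


def care_flags(x: CareInputs, gt_true_pct: float = 49.5) -> List[str]:
    """Apply the thresholds from the Table 2 caption and return the raised flags."""
    flags = []
    if abs(x.pred_true_pct - gt_true_pct) > 10:
        flags.append("pc")   # prior collapse
    if abs(x.tpr - x.tnr) > 0.3:
        flags.append("ar")   # asymmetric recall
    if x.invalid_pct > 0.5:
        flags.append("cv")   # coverage
    if x.consistency_mcc is not None and x.consistency_mcc < 0.30:
        flags.append("ic")   # inconsistent
    return flags


# Values quoted in the text for LLaMA 3.1-8B. The invalid rate is reported only
# as "< 0.01%", so 0.01 is used here as an upper bound; consistency MCC is omitted.
llama_8b = CareInputs(pred_true_pct=21.6, tpr=0.32, tnr=0.88, invalid_pct=0.01)
print(care_flags(llama_8b))
```

For these inputs the sketch raises the prior\-collapse and asymmetric\-recall flags, matching the two flags the text attributes to LLaMA 3\.1\-8B\.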

## 4 Experiments

We evaluate 14 models: 7 proprietary \(Claude Sonnet 4\.6, GPT\-5\.2, GPT\-4o, GPT\-4o\-mini, o3\-mini, Gemini 2\.5 Flash, DeepSeek\-Chat\) and 7 open\-source \(Qwen 2\.5\-\{7B,14B,32B\}, LLaMA 3\.3\-70B, LLaMA 3\.1\-8B, Gemma2\-9B, Mistral\-7B\) served via Ollama\. Ten models complete the full dataset \(N = 40,886\); four use subsets \(5k or 1k\) due to compute constraints, reported in a separate table to avoid ranking across different $N$\.

## 5 Results

### 5\.1 Overall Performance

Table 3: Full canonical \(N = 40,886\), ranked by MCC\.

Claude Sonnet 4\.6 leads with MCC = 0\.601 \(Table [3](https://arxiv.org/html/2605.24305#S5.T3)\)\. The proprietary–OSS gap, measured between Claude Sonnet 4\.6 and Qwen 2\.5\-32B, is 0\.12 MCC, but Qwen 2\.5\-32B \(0\.478\) outperforms GPT\-4o \(0\.450\) and Gemini 2\.5 Flash \(0\.458\)\. Subset evaluations \(o3\-mini MCC = 0\.608 on 5k; Gemma2\-9B 0\.280 on 5k; GPT\-4o\-mini 0\.272 on 1k; Qwen 2\.5\-7B 0\.268 on 1k\) are reported in Appendix [A](https://arxiv.org/html/2605.24305#A1)\.

### 5\.2 Task Family Hardness

![Refer to caption](https://arxiv.org/html/2605.24305v1/x1.png)

Figure 1: Mean MCC by task family across 10 full\-canonical models\. Error bars: min–max\.

Figure [1](https://arxiv.org/html/2605.24305#S5.F1) ranks families by mean MCC\. Easy \(MCC above 0\.5\): extended\_systems \(0\.81, ceiling effect\), indicator\_diagnostic \(0\.59\), fol\_inference \(0\.52\)\. Medium \(0\.3–0\.5\): multi\_hop \(0\.48\), adversarial families \(0\.44–0\.45\), atomic \(0\.32\)\. Hard \(below 0\.3\): perturbation \(0\.26\), consistency\_paraphrase \(0\.25\), cross\_indicator \(0\.18\), regime\_transition \(0\.05\)\. Regime transition is near\-random for all models: these questions require specific bifurcation thresholds \(e\.g\., logistic map at $r \approx 3.57$\) not recoverable from logical rules\.

### 5\.3 Per\-Model Family Analysis

![Refer to caption](https://arxiv.org/html/2605.24305v1/x2.png)

Figure 2: Per\-family MCC for 10 models\. Families ordered by hardness \(left = hardest\)\. Red cells indicate negative MCC \(anti\-correlation\)\.

Figure [2](https://arxiv.org/html/2605.24305#S5.F2) reveals that family\-level performance is not monotonic with overall MCC\. Qwen 2\.5\-32B achieves MCC = 0\.91 on indicator\_diagnostic \(exceeding GPT\-4o at 0\.89 and Claude Sonnet at 0\.45\), while Claude Sonnet leads on multi\_hop \(0\.64\) and atomic \(0\.62\)\. LLaMA 3\.1\-8B scores perfectly on extended\_systems \(45 factual\-recall questions, ceiling effect\) but near\-zero on cross\_indicator\.

Two models produce negative MCC on regime\_transition: LLaMA 3\.3\-70B \(−0\.17; TP = 9, FP = 17, TN = 20, FN = 22; balanced accuracy 0\.42\) and Mistral\-7B \(−0\.10\)\. Negative MCC means systematic anti\-correlation: these models have learned heuristics that are reliably wrong on bifurcation questions\. Confusion matrices are in Appendix [G](https://arxiv.org/html/2605.24305#A7)\.
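As a worked check, substituting the LLaMA 3\.3\-70B confusion counts quoted above into Equation \(2\) reproduces the reported values\. The symbol $\text{BalAcc}$ is shorthand introduced here for balanced accuracy, the mean of TPR and TNR\.

```latex
\begin{aligned}
\text{MCC} &= \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} \\
           &= \frac{9 \cdot 20 - 17 \cdot 22}{\sqrt{26 \cdot 31 \cdot 37 \cdot 42}}
            = \frac{-194}{\sqrt{1\,252\,524}}
            \approx \frac{-194}{1119.2} \approx -0.17, \\
\text{BalAcc} &= \tfrac{1}{2}\left(\frac{9}{31} + \frac{20}{37}\right)
            \approx \tfrac{1}{2}\,(0.290 + 0.541) \approx 0.42 .
\end{aligned}
```

The four counts sum to 68, which agrees with the regime\_transition family size given in Appendix G\.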

### 5\.4 Where the Proprietary Advantage Concentrates

![Refer to caption](https://arxiv.org/html/2605.24305v1/x3.png)

Figure 3: Per\-family ΔMCC \(Claude Sonnet 4\.6 − Qwen 2\.5\-32B\)\. Green: Sonnet leads\. Red: Qwen leads\.

The 0\.12 overall gap is not uniform \(Figure [3](https://arxiv.org/html/2605.24305#S5.F3)\)\. Sonnet’s advantages concentrate on cross\_indicator \(Δ = \+0\.40\), consistency\_paraphrase \(\+0\.22\), and regime\_transition \(\+0\.19\): families requiring integration of quantitative signals or robustness to surface variation\. The two models are at near\-parity on perturbation \(0\.00\), multi\_hop \(\+0\.01\), and fol\_inference \(−0\.01\)\. Qwen 2\.5\-32B leads decisively on indicator\_diagnostic \(−0\.46\)\.

### 5\.5 Prediction Bias

![Refer to caption](https://arxiv.org/html/2605.24305v1/x4.png)

Figure 4: Predicted vs\. ground\-truth TRUE rate \(49\.5%\)\. LLaMA 3\.1\-8B predicts TRUE only 21\.6% of the time\.

The ground\-truth TRUE rate is 49\.5%\. Most models predict TRUE 42–50% of the time, but LLaMA 3\.1\-8B predicts TRUE only 21\.6% of the time \(TNR = 0\.88, TPR = 0\.32\), explaining its low MCC despite moderate balanced accuracy\. Mistral\-7B shows the opposite: 56\.2% predicted TRUE \(Figure [4](https://arxiv.org/html/2605.24305#S5.F4)\)\.

## 6 Discussion

#### A knowledge\-type boundary\.

The central finding is a dissociation between two types of reasoning\. FOL inference \(MCC = 0\.52\) tests whether models can apply deductive rules when premises are explicitly stated; regime transition \(MCC = 0\.05\) tests whether they can supply numerical premises themselves \(e\.g\., the logistic map transitions to chaos at $r \approx 3.57$\)\. This is not a scaling problem: within the Qwen 2\.5 family, increasing from 7B to 32B parameters improves multi\-hop and FOL inference but leaves regime transition near\-random \(Appendix [B](https://arxiv.org/html/2605.24305#A2)\)\. The gap identifies a precise boundary between what LLMs can learn from text \(logical rule\-following\) and what requires numerical grounding \(parameter\-dependent dynamics\)\.

This complements findings from PrOntoQA\(Saparov and He,[2023](https://arxiv.org/html/2605.24305#bib.bib20)\), which showed that LLMs struggle with longer synthetic reasoning chains\. Our results show that even short chains succeed when premises are given \(FOL inference\), but the difficulty shifts from chain length to premise availability: models cannot generate the quantitative facts needed for bifurcation reasoning\.
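To make concrete what "numerical grounding" means for a regime\-transition question, the sketch below estimates the Lyapunov exponent of the logistic map $x_{n+1} = r x_n (1 - x_n)$ by averaging $\ln |r(1 - 2x_n)|$ along an orbit\. This is a standard textbook estimator, shown only as an illustration; it is not how the benchmark labels were produced, and the function name, initial condition, and iteration counts are assumptions\.

```python
import math


def logistic_lyapunov(r: float, x0: float = 0.2, burn_in: int = 500, steps: int = 5000) -> float:
    """Estimate the Lyapunov exponent of x -> r*x*(1-x) as the orbit average
    of log|f'(x)|, where f'(x) = r*(1 - 2x)."""
    x = x0
    for _ in range(burn_in):          # discard the transient
        x = r * x * (1.0 - x)
    total = 0.0
    for _ in range(steps):
        deriv = abs(r * (1.0 - 2.0 * x))
        total += math.log(max(deriv, 1e-300))  # guard against log(0) at x = 0.5
        x = r * x * (1.0 - x)
    return total / steps


# Parameter values on either side of the onset of chaos near r ~ 3.57.
for r in (3.2, 3.5, 3.9):
    lam = logistic_lyapunov(r)
    regime = "positive exponent (chaotic)" if lam > 0 else "non-positive exponent (not chaotic)"
    print(f"r = {r}: lambda ~ {lam:+.3f} -> {regime}")
```

The sign of the estimate is exactly the kind of parameter\-dependent fact \(the PosLyap predicate\) that cannot be derived from the axiom graph alone\.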

#### Consistency failures expose fragile retrieval\.

Consistency\_paraphrase \(MCC = 0\.25\) and perturbation \(MCC = 0\.26\) are not knowledge gaps\. Models answer “Is Lorenz\-63 chaotic?” correctly at the atomic level \(MCC = 0\.32–0\.62\) but flip when the same fact is rephrased\. The knowledge exists; the retrieval is sensitive to surface form\.

#### Proprietary vs\. open\-source: not a monolithic gap\.

The per\-family decomposition \(Figure [3](https://arxiv.org/html/2605.24305#S5.F3)\) shows the gap concentrates on cross\-indicator reasoning \(\+0\.40\) and consistency \(\+0\.22\), while formal deduction and perturbation robustness show near\-parity\. Qwen 2\.5\-32B’s MCC = 0\.91 on indicator\_diagnostic \(interpreting Lyapunov exponents, permutation entropy, MEGNO\) exceeds every proprietary model, suggesting that training data composition matters more than the proprietary/open\-source divide for quantitative threshold reasoning\.

#### MaxSAT axiom repair\.

We apply a MaxSAT post\-processor \(RC2 \(Ye et al\., [2023](https://arxiv.org/html/2605.24305#bib.bib22)\)\) that repairs per\-system predicate assignments to satisfy all 78 axiom edges with minimal prediction flips\. On atomic questions, repair eliminates all FOL violations and improves MCC substantially for weak models: LLaMA 3\.1\-8B gains \+0\.11 MCC \(0\.22 → 0\.33\) with 11\.2% of predictions flipped; Mistral\-7B gains \+0\.11 \(0\.17 → 0\.28\) with 19\.0% flipped\. Claude Sonnet 4\.6 is barely affected \(−0\.006 MCC\)\. When we propagate the repaired predicates to compositional families \(multi\_hop, fol\_inference\), the picture reverses: repair *degrades* MCC on these families \(e\.g\., −0\.20 on multi\_hop for LLaMA\), because compositional questions encode reasoning that goes beyond per\-predicate consistency\. This reveals a separation between two error types: *axiom\-inconsistent* errors \(fixable by constraint enforcement\) and *reasoning* errors \(requiring deeper inference\)\. Solver augmentation helps the first type but not the second, identifying a precise boundary for hybrid LLM\-solver approaches\.
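The repair objective \(satisfy every axiom edge while flipping as few predictions as possible\) can be illustrated without a solver\. The sketch below swaps the MaxSAT solver for plain brute\-force enumeration over a six\-predicate subset, using implication and exclusion edges taken from Appendix F\. It is a stand\-in for exposition only: the names are hypothetical, and exhaustive search would not scale to all 27 predicates\.

```python
from itertools import product

# Small subset of the ontology; edges are taken from the Appendix F axioms.
PREDICATES = ["Chaotic", "Deterministic", "PosLyap", "Sensitive", "Random", "Periodic"]
IMPLIES = [("Chaotic", "Deterministic"), ("Chaotic", "PosLyap"),
           ("Chaotic", "Sensitive"), ("PosLyap", "Sensitive")]
EXCLUDES = [("Chaotic", "Random"), ("Chaotic", "Periodic"),
            ("Deterministic", "Random"), ("Random", "Periodic")]


def consistent(assign: dict) -> bool:
    """True if the assignment violates no implication or exclusion edge."""
    ok_implies = all((not assign[p]) or assign[q] for p, q in IMPLIES)
    ok_excludes = all(not (assign[p] and assign[q]) for p, q in EXCLUDES)
    return ok_implies and ok_excludes


def repair(predicted: dict):
    """Return (assignment, flips): a consistent assignment with the fewest flips.

    Brute force over all 2^n assignments; ties are broken by enumeration order.
    """
    best, best_flips = None, None
    for values in product([False, True], repeat=len(PREDICATES)):
        cand = dict(zip(PREDICATES, values))
        if not consistent(cand):
            continue
        flips = sum(cand[p] != predicted[p] for p in PREDICATES)
        if best_flips is None or flips < best_flips:
            best, best_flips = cand, flips
    return best, best_flips


# Hypothetical model output: claims Chaotic and PosLyap but denies Sensitive.
predicted = {"Chaotic": True, "Deterministic": True, "PosLyap": True,
             "Sensitive": False, "Random": False, "Periodic": False}
repaired, flips = repair(predicted)
print(f"flips = {flips}, repaired = {repaired}")
```

Here the cheapest consistent assignment flips only Sensitive to TRUE\. Note that the repair restores axiom consistency without checking whether the flipped value is actually correct for the system, which is consistent with the observation above that repair can hurt compositional families\.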

#### Chain\-of\-thought preliminary\.

We tested CoT prompting on the two hardest families \(regime\_transition, cross\_indicator; 135 questions\) using LLaMA 3\.1\-8B\. CoT dramatically increased the invalid rate: 66/68 regime\_transition responses and 57/67 cross\_indicator responses were unparseable \(the model generates reasoning text but fails to produce a final TRUE/FALSE\)\. Among the few parseable CoT responses, regime\_transition remained at MCC = 0\.0\. This suggests that for small models, CoT on hard scientific reasoning families introduces a format\-compliance problem without solving the underlying reasoning gap\.

### 6\.1 Limitations

Three families have $N < 100$ \(regime\_transition, cross\_indicator, extended\_systems\); results on these carry high variance\. The CoT experiment uses only one small model \(8B\); larger models with CoT may behave differently\. The TRUE/FALSE format cannot distinguish correct reasoning from correct guessing\. Well\-known systems \(Lorenz\-63\) may be memorized from pretraining rather than reasoned about; the extended\_systems ceiling effect is consistent with this\.

## 7 Conclusion

ChaosBench\-Logic v2 and the CARE evaluation protocol together reveal that apparent LLM reasoning performance hides three pathologies: prior collapse \(LLaMA 3\.1\-8B achieves 60% accuracy with only 32% TPR\), surface\-form fragility \(consistency MCC = 0\.25\), and inability to reason about parameter\-dependent dynamics \(regime transition MCC = 0\.05\)\. The knowledge\-type boundary between rule\-following and parametric reasoning does not close with scale\. Our MaxSAT repair experiment reveals two distinct error types: axiom\-inconsistent errors \(fixable by constraint enforcement, \+0\.11 MCC on atomic questions for weak models\) and reasoning errors on compositional families \(degraded by repair\), identifying a precise boundary for solver\-augmented approaches\.

Three directions follow\. First, chain\-of\-thought evaluation: preliminary results from v1 of this benchmark \(Thomas, [2026](https://arxiv.org/html/2605.24305#bib.bib1)\) found that CoT *decreased* overall accuracy by 2–6 percentage points for both GPT\-4 and LLaMA\-3, suggesting that explicit reasoning introduces errors on scientific questions where zero\-shot retrieval is more reliable\. Whether this pattern holds on v2’s harder families \(regime\_transition, cross\_indicator\) remains an open question\. Second, deeper solver integration: our MaxSAT repair fixes axiom\-inconsistent errors but not reasoning errors; coupling LLMs with numerical integrators could address the compositional gap\. Third, fine\-tuning on scientific corpora could test whether consistency failures reflect missing knowledge or architectural limitations\.

The benchmark, CARE protocol, and released artifacts are publicly available at [https://github\.com/11NOel11/ChaosBench\-Logic](https://github.com/11NOel11/ChaosBench-Logic) and [https://huggingface\.co/datasets/11NOel11/ChaosBench\-Logic](https://huggingface.co/datasets/11NOel11/ChaosBench-Logic)\.

## Reproducibility Statement

All evaluations use deterministic inference \(temperature = 0 where supported; reasoning models are deterministic by design\)\. Metrics are computed from raw confusion matrices in prediction logs; every number was verified against these logs\. The dataset is generated deterministically from the FOL axiom system\. Code, canonical dataset files, and released artifacts are available at [https://github\.com/11NOel11/ChaosBench\-Logic](https://github.com/11NOel11/ChaosBench-Logic) and [https://huggingface\.co/datasets/11NOel11/ChaosBench\-Logic](https://huggingface.co/datasets/11NOel11/ChaosBench-Logic)\.

## References

- C\. Bandt and B\. Pompe \(2002\) Permutation entropy: a natural complexity measure for time series\. Physical Review Letters 88\(17\), pp\. 174102\.
- D\. Chicco and G\. Jurman \(2020\) The advantages of the Matthews correlation coefficient \(MCC\) over F1 score and accuracy in binary classification evaluation\. BMC Genomics 21\(1\), pp\. 6\.
- P\. M\. Cincotta and C\. Simó \(2000\) Simple tools to study global dynamics in non\-axisymmetric galactic potentials – I\. Astronomy and Astrophysics Supplement Series 147, pp\. 205–228\.
- P\. Clark, I\. Cowhey, O\. Etzioni, T\. Khot, A\. Sabharwal, C\. Schoenick, and O\. Tafjord \(2018\) Think you have solved question answering? Try ARC, the AI2 reasoning challenge\. arXiv preprint arXiv:1803\.05457\.
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, et al\. \(2021\) Training verifiers to solve math word problems\. arXiv preprint arXiv:2110\.14168\.
- W\. Gilpin \(2021\) Chaos as an interpretable benchmark for forecasting and data\-driven modelling\. In Advances in Neural Information Processing Systems \(Datasets and Benchmarks Track\)\.
- G\. A\. Gottwald and I\. Melbourne \(2009\) On the implementation of the 0\-1 test for chaos\. SIAM Journal on Applied Dynamical Systems 8\(1\), pp\. 129–145\.
- S\. Han, H\. Schoelkopf, Y\. Zhao, Z\. Qi, M\. Schmitt, H\. Schütze, V\. Tresp, and N\. Peng \(2022\) FOLIO: natural language reasoning with first\-order logic\. arXiv preprint arXiv:2209\.00840\.
- D\. Hendrycks, C\. Burns, S\. Kadavath, A\. Arora, S\. Basart, E\. Tang, D\. Song, and J\. Steinhardt \(2021\) Measuring mathematical problem solving with the MATH dataset\. Advances in Neural Information Processing Systems 34, pp\. 37914–37927\.
- S\. Lin, J\. Hilton, and O\. Evans \(2022\) TruthfulQA: measuring how models mimic human falsehoods\. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, pp\. 3214–3252\.
- J\. Liu, L\. Cui, H\. Liu, D\. Huang, Y\. Wang, and Y\. Zhang \(2020\) LogiQA: a challenge dataset for machine reading comprehension with logical reasoning\. In Proceedings of the Twenty\-Ninth International Joint Conference on Artificial Intelligence, pp\. 3622–3628\.
- E\. N\. Lorenz \(1963\) Deterministic nonperiodic flow\. Journal of the Atmospheric Sciences 20\(2\), pp\. 130–141\.
- B\. W\. Matthews \(1975\) Comparison of the predicted and observed secondary structure of T4 phage lysozyme\. Biochimica et Biophysica Acta \(BBA\) – Protein Structure 405\(2\), pp\. 442–451\.
- M\. Parmar, N\. Patel, N\. Varshney, M\. Nakamura, M\. Luo, S\. Mashetty, A\. Mitra, and C\. Baral \(2024\) LogicBench: towards systematic evaluation of logical reasoning ability of large language models\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics\.
- A\. Saparov and H\. He \(2023\) Language models are greedy reasoners: a systematic formal analysis of chain\-of\-thought\. In International Conference on Learning Representations\.
- A\. Srivastava, A\. Rastogi, A\. Rao, A\. A\. M\. Shoeb, A\. Abid, A\. Fisch, A\. R\. Brown, A\. Santoro, A\. Gupta, A\. Garriga\-Alonso, et al\. \(2023\) Beyond the imitation game: quantifying and extrapolating the capabilities of language models\. Transactions on Machine Learning Research\.
- S\. H\. Strogatz \(2018\) Nonlinear dynamics and chaos: with applications to physics, biology, chemistry, and engineering\. 2nd edition, CRC Press\.
- N\. Thomas \(2026\) ChaosBench\-logic: a benchmark for logical and symbolic reasoning on chaotic dynamical systems\.
- X\. Wang, Z\. Hu, P\. Lu, Y\. Zhu, J\. Zhang, S\. Subramaniam, A\. R\. Loomba, S\. Zhang, Y\. Sun, and W\. Wang \(2023a\) SciBench: evaluating college\-level scientific problem\-solving abilities of large language models\. arXiv preprint arXiv:2307\.10635\.
- X\. Wang, J\. Wei, D\. Schuurmans, Q\. Le, E\. Chi, S\. Narang, A\. Chowdhery, and D\. Zhou \(2023b\) Self\-consistency improves chain of thought reasoning in language models\. In International Conference on Learning Representations\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, B\. Ichter, F\. Xia, E\. Chi, Q\. Le, and D\. Zhou \(2022\) Chain\-of\-thought prompting elicits reasoning in large language models\. Advances in Neural Information Processing Systems 35, pp\. 24824–24837\.
- X\. Ye, Q\. Chen, I\. Dillig, and G\. Durrett \(2023\) SatLM: satisfiability\-aided language models using declarative prompting\. In Advances in Neural Information Processing Systems, Vol\. 36\.
- W\. Yu, Z\. Jiang, Y\. Dong, and J\. Feng \(2020\) ReClor: a reading comprehension dataset requiring logical reasoning\. In International Conference on Learning Representations\.

## Appendix A Full Leaderboard and Subset Evaluations

![Refer to caption](https://arxiv.org/html/2605.24305v1/x5.png)

Figure 5: MCC by model\. Blue: proprietary\. Orange: open\-source\.

Table 4: Subset evaluations\. 5k tracks full\-canonical MCC within 0\.01 \(validated on 3 models with both\); 1k has higher variance\.

## Appendix B Model Size Scaling

![Refer to caption](https://arxiv.org/html/2605.24305v1/x6.png)

Figure 6: Scaling within Qwen 2\.5\. MCC increases from 0\.27 \(7B, 1k subset\) to 0\.48 \(32B\), but regime\_transition remains near\-random at all scales\.

## Appendix C Subset Validation

![Refer to caption](https://arxiv.org/html/2605.24305v1/x7.png)

Figure 7: 5k vs\. full canonical per\-family MCC\. Circles: $N \geq 30$\. Crosses: $N < 30$ \(high variance\)\.

For models with both evaluations \(Qwen 2\.5\-32B, Qwen 2\.5\-14B, Mistral\-7B\), mean $|\Delta\text{MCC}| < 0.01$ overall; per\-family mean $|\Delta\text{MCC}| = 0.050$ for families with $N \geq 30$ in the 5k subset\.

## Appendix D Family Discrimination

![Refer to caption](https://arxiv.org/html/2605.24305v1/x8.png)

Figure 8: Mean MCC vs\. MCC range across 10 models\. Upper\-left: hard and discriminating\. Lower\-left: floor effects\.

## Appendix E Question Examples

Table 5: Example questions with ground truth \(GT\) and predictions\. Son\. = Claude Sonnet 4\.6; Qw\. = Qwen 2\.5\-32B; Ll\. = LLaMA 3\.3\-70B\. Incorrect predictions in bold\.

## Appendix F Axiom Specification

The six primary regime axioms:

1. Chaotic ⇒ Deterministic ∧ PosLyap ∧ Sensitive ∧ PointUnpredictable ∧ StatPredictable ∧ Mixing; excludes Random, Periodic, QuasiPeriodic, FixedPointAttr\.
2. Random excludes Deterministic, Chaotic, QuasiPeriodic, Periodic\.
3. QuasiPeriodic ⇒ Deterministic ∧ Bounded; excludes Chaotic, Random, Periodic, FixedPointAttr\.
4. Periodic ⇒ Deterministic ∧ Bounded; excludes Chaotic, Random, QuasiPeriodic, StrangeAttr\.
5. FixedPointAttr ⇒ Deterministic; excludes Chaotic, Random, QuasiPeriodic, Periodic, StrangeAttr\.
6. Deterministic excludes Random\.

Additional edges: PosLyap ⇒ Sensitive ⇒ PointUnpredictable; StrangeAttr ⇒ Dissipative ∧ Bounded; Mixing ⇒ Ergodic ⇒ Bounded; HyperChaotic ⇒ Chaotic ∧ StrangeAttr ∧ Dissipative; Conservative ⇒ Bounded ∧ Ergodic; StrongMixing ⇒ WeakMixing ⇒ Ergodic; ContinuousTime ↔ ¬DiscreteTime; Forced ↔ ¬Autonomous; DelaySystem ⇒ ContinuousTime\.
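Multi\-hop ground truth follows from chaining the implication edges above\. As a rough sketch \(not the released generator\), the snippet below encodes a subset of those edges as an adjacency map and uses breadth\-first search to list every predicate entailed by a starting predicate, together with the number of hops needed\. The dictionary layout and function name are illustrative assumptions, and exclusion edges are left out for brevity\.

```python
from collections import deque

# Subset of the implication edges listed in this appendix (exclusions omitted).
IMPLIES = {
    "Chaotic": ["Deterministic", "PosLyap", "Sensitive",
                "PointUnpredictable", "StatPredictable", "Mixing"],
    "PosLyap": ["Sensitive"],
    "Sensitive": ["PointUnpredictable"],
    "StrangeAttr": ["Dissipative", "Bounded"],
    "Mixing": ["Ergodic"],
    "Ergodic": ["Bounded"],
    "StrongMixing": ["WeakMixing"],
    "WeakMixing": ["Ergodic"],
}


def entailed(start: str) -> dict:
    """Breadth-first search over implication edges.

    Returns {predicate: minimum number of hops from `start`}, including
    `start` itself at distance 0.
    """
    hops = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in IMPLIES.get(current, []):
            if nxt not in hops:
                hops[nxt] = hops[current] + 1
                queue.append(nxt)
    return hops


# The chain from the Section 3.3 example: strongly mixing => ... => bounded.
print(entailed("StrongMixing"))
print("Chaotic entails Bounded in", entailed("Chaotic")["Bounded"], "hops")
```

Under these edges, Bounded is reached from StrongMixing in three hops, which mirrors the FluidTrampoline example question\.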

## Appendix G Confusion Matrices

Table 6: Regime\_transition \(N = 68\) confusion matrices\.

## Appendix H Invalid Rates

Most models produce zero invalids\. Mistral\-7B has the highest rate at 1\.1%; LLaMA 3\.1\-8B is below 0\.01%; all others 0\.0%\.

## Appendix I Prompt Template

> Answer the following question about the dynamical system\. Reply with only TRUE or FALSE\. Question: \[question text\] Answer:

No system prompt, few\-shot examples, or CoT instructions\. Temperature = 0 for all models that accept the parameter\. Reasoning models \(o3\-mini, GPT\-5\.2\) do not accept a temperature argument; their outputs are internally deterministic via the reasoning process\. Max tokens: 16 for most models; 1024 for o3\-mini and GPT\-5\.2 \(reasoning token budget\), and Gemini 2\.5 Flash \(thinking process\)\.
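For readers who want to reproduce the setup, the sketch below fills the template quoted above and maps a raw response to TRUE, FALSE, or invalid\. The parsing rule \(accept a response only if it contains exactly one of the two labels, case\-insensitively\) is an assumption made for this example; the source does not specify how responses were parsed, and the exact whitespace of the template is also assumed\.

```python
import re
from typing import List, Optional

# Template text as quoted in this appendix; line breaks/whitespace are assumed.
PROMPT_TEMPLATE = (
    "Answer the following question about the dynamical system. "
    "Reply with only TRUE or FALSE. Question: {question} Answer:"
)


def build_prompt(question: str) -> str:
    """Fill the zero-shot template with a single question."""
    return PROMPT_TEMPLATE.format(question=question)


def parse_answer(response: str) -> Optional[bool]:
    """Map a raw response to True/False, or None if it is unparseable.

    Hypothetical rule: exactly one of the labels TRUE/FALSE must appear
    as a whole word; anything else counts as invalid.
    """
    labels = set(re.findall(r"\b(TRUE|FALSE)\b", response.upper()))
    if labels == {"TRUE"}:
        return True
    if labels == {"FALSE"}:
        return False
    return None


def invalid_rate(responses: List[str]) -> float:
    """Fraction of responses that cannot be parsed into a label."""
    if not responses:
        return 0.0
    return sum(parse_answer(r) is None for r in responses) / len(responses)


print(build_prompt("Is Lorenz-63 chaotic?"))
print(parse_answer("TRUE"), parse_answer(" false."), parse_answer("It depends on the parameters."))
```

A stricter or looser parser would change the reported invalid rates, so any reproduction should state the rule it uses\.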