On Semantic Loss Fine-Tuning Approach for Preventing Model Collapse in Causal Reasoning
Summary
This paper identifies a critical 'model collapse' issue in standard fine-tuning for causal reasoning and proposes a semantic loss function with graph-based logical constraints to prevent it.
# On Semantic Loss Fine-Tuning Approach for Preventing Model Collapse in Causal Reasoning
Source: [https://arxiv.org/html/2605.05438](https://arxiv.org/html/2605.05438)
###### Abstract
Standard fine-tuning of transformer models on causal reasoning tasks leads to catastrophic model collapse, in which models learn trivial solutions such as always predicting "Yes" or "No" regardless of input structure. We demonstrate that fine-tuning Gemma 270M on transitivity and d-separation tasks without semantic loss results in a 100% collapse rate, with models achieving misleadingly high accuracy (73.9%) while learning no causal reasoning. We propose a semantic loss function with graph-based logical constraints and dynamic lambda scheduling that prevents this collapse. Our approach achieves 70.4% accuracy on transitivity tasks and 68.6% on d-separation tasks with stable, context-dependent predictions, a 42.7 percentage-point improvement over the collapsed transitivity baseline (27.7%). Adversarial evaluation on 1,000 structural reasoning samples shows semantic models achieve 67-70% accuracy while collapsed models fail catastrophically at 43-71%. We validate our findings through comprehensive benchmarking on 200,000+ evaluation samples across five model variants, demonstrating that semantic loss is essential, not optional, for stable causal reasoning in transformers.
## 1 Introduction
Causal reasoning, the ability to understand and reason about cause-and-effect relationships, is fundamental to human cognition and increasingly critical for developing robust AI systems [[2](https://arxiv.org/html/2605.05438#bib.bib2)]. Recent advances have shown that transformers can learn causal reasoning through axiomatic training on synthetic demonstrations of causal axioms [[1](https://arxiv.org/html/2605.05438#bib.bib1)]. However, through systematic experimentation, we identify a critical and previously undocumented failure mode: standard fine-tuning on causal reasoning tasks causes catastrophic model collapse with 100% occurrence rate.
### 1.1 The Collapse Problem
We define model collapse as a degenerate learning outcome where a model's prediction distribution $P(y|x)$ becomes independent of input structure $x$, converging to fixed outputs (always "Yes" or always "No") regardless of causal graph topology. Through comprehensive experiments with Gemma 270M models [[4](https://arxiv.org/html/2605.05438#bib.bib4)], we demonstrate:
- Transitivity collapse: Models output "Yes" for all inputs (10,000/10,000 predictions), achieving 27.7% accuracy
- D-separation collapse: Models output "No" for nearly all inputs, achieving misleadingly high accuracy (73.9%) but critically low F1 score (7.6%)

This collapse occurs in 100% of fine-tuning attempts without semantic loss, rendering standard approaches fundamentally unreliable for causal reasoning tasks.
### 1.2 Our Contributions
1. Problem identification: First systematic documentation of catastrophic model collapse in causal reasoning fine-tuning, with 100% occurrence rate across both transitivity and d-separation tasks
2. Theoretical framework: Formal definition of prediction bias collapse and analysis of why cross-entropy loss alone fails for causal reasoning
3. Solution methodology: Semantic loss function incorporating graph-based logical constraints with dynamic lambda scheduling ($\lambda: 0.05 \rightarrow 0.30$)
4. Comprehensive evaluation: Benchmarking across 200,000+ samples demonstrating a 42.7 percentage-point improvement over collapsed baselines and validation across two distinct causal reasoning tasks
5. Adversarial validation: Novel test suite showing that semantic models learn structural reasoning (67-70% accuracy) while collapsed models fail catastrophically (43-71%)
## 2 Related Work
### 2.1 Causal Reasoning in Neural Networks
Causal reasoning has been extensively studied in the context of causal discovery [[2](https://arxiv.org/html/2605.05438#bib.bib2)], effect estimation, and counterfactual inference. Recent work has explored teaching causal concepts to neural networks through various approaches: symbolic demonstrations [[1](https://arxiv.org/html/2605.05438#bib.bib1)], causal graph generation, and intervention-based learning.
Vashishtha et al. [[1](https://arxiv.org/html/2605.05438#bib.bib1)] demonstrated that 67M parameter transformers trained from scratch on axiomatic demonstrations can generalize to complex causal structures. Their work showed strong performance on transitivity and d-separation tasks when training from scratch with sufficient architectural capacity. Our work extends this by identifying a critical failure mode when fine-tuning pretrained models and developing solutions to prevent collapse.
### 2.2 Semantic Loss and Neuro-Symbolic Integration
Semantic loss functions incorporate symbolic knowledge into neural network training through differentiable constraint satisfaction\[[3](https://arxiv.org/html/2605.05438#bib.bib3)\]\. The core approach uses weighted model counting to compute gradients with respect to logical formula satisfaction\. Applications include semi\-supervised learning, structured prediction, and knowledge base completion\.
Our work adapts semantic loss specifically for causal graph constraints, developing a dynamic scheduling mechanism to balance stability and structural learning during fine\-tuning\.
### 2.3 Model Collapse Phenomena
Mode collapse has been extensively studied in generative adversarial networks (GANs) [[5](https://arxiv.org/html/2605.05438#bib.bib5)], where generators learn to produce limited diversity. Representation collapse occurs in contrastive learning [[6](https://arxiv.org/html/2605.05438#bib.bib6)] when embeddings converge to constant vectors. Recent work has identified collapse in large language models during instruction tuning and reinforcement learning from human feedback (RLHF) [[7](https://arxiv.org/html/2605.05438#bib.bib7)].
Our identified collapse differs fundamentally: it occurs during supervised fine\-tuning on well\-defined reasoning tasks with clear ground truth, and manifests as extreme prediction bias rather than representational degeneration\. To our knowledge, this is the first systematic documentation of collapse in causal reasoning fine\-tuning\.
### 2.4 Evaluation of Causal Reasoning
Recent benchmarks evaluate causal reasoning capabilities in language models, including CLADDER [[8](https://arxiv.org/html/2605.05438#bib.bib8)] for causal ladder questions and Corr2Cause [[9](https://arxiv.org/html/2605.05438#bib.bib9)] for inferring causation from correlation. These benchmarks primarily assess pretrained or prompted models rather than fine-tuned systems.
Our adversarial evaluation methodology specifically targets the distinction between structural understanding and superficial heuristics, providing a diagnostic tool for identifying collapse\.
## 3 Problem Formulation
### 3.1 Causal Reasoning Tasks
We focus on two fundamental causal reasoning tasks based on Pearl's causal framework [[2](https://arxiv.org/html/2605.05438#bib.bib2)]:
##### Transitivity
Given a directed acyclic graph (DAG) $G=(V,E)$ representing causal relationships, determine if there exists a directed path from node $A$ to node $B$. Formally, the transitivity axiom states:

$$\forall A,B,C\in V:\ (A\rightarrow C)\wedge(C\rightarrow B)\implies(A\rightarrow B) \tag{1}$$

##### D-Separation
Determine if nodes $X$ and $Y$ are conditionally independent given conditioning set $Z$ in causal DAG $G$, following Pearl's d-separation criterion. Nodes $X$ and $Y$ are d-separated by $Z$ if all paths between $X$ and $Y$ are blocked by $Z$.
### 3.2 Formal Problem Setup
Let $\mathcal{D}=\{(p_i,h_i,y_i)\}_{i=1}^{N}$ denote a training dataset where:
- $p_i$: Textual premise describing causal graph structure
- $h_i$: Binary hypothesis query about causal relationship
- $y_i\in\{\text{Yes},\text{No}\}$: Ground truth label

A model $f_\theta:(p,h)\rightarrow\mathbb{R}^2$ maps premise-hypothesis pairs to logits, from which we compute prediction probabilities via softmax: $P_\theta(y|p,h)=\text{softmax}(f_\theta(p,h))$.
### 3.3 Model Collapse: Formal Definition
###### Definition 1 (Prediction Bias Collapse).
A model $f_\theta$ exhibits prediction bias collapse on task $\mathcal{T}$ if there exists a fixed prediction $\bar{y}$ such that for evaluation dataset $\mathcal{D}_{\text{eval}}$:

$$\frac{1}{|\mathcal{D}_{\text{eval}}|}\sum_{(p,h,y)\in\mathcal{D}_{\text{eval}}}\mathbb{1}\left[\arg\max P_\theta(y|p,h)=\bar{y}\right]>0.95 \tag{2}$$

Collapse indicators:
- Extreme prediction bias: $>95\%$ of predictions are the same class
- Distribution independence: Predictions invariant to graph structure changes
- Metric divergence: High accuracy on biased datasets, near-zero F1 score
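The threshold test in Definition 1 can be checked directly from a list of predicted labels. The following is a minimal sketch, not the authors' code; the function names `majority_fraction` and `is_collapsed` are hypothetical, and the example prediction counts are taken from the Yes-counts reported elsewhere in the paper.

```python
from collections import Counter


def majority_fraction(predictions):
    """Return (most common predicted label, its share of all predictions)."""
    if not predictions:
        raise ValueError("predictions must be non-empty")
    label, count = Counter(predictions).most_common(1)[0]
    return label, count / len(predictions)


def is_collapsed(predictions, threshold=0.95):
    """Definition 1: one fixed label accounts for more than `threshold` of predictions."""
    _, share = majority_fraction(predictions)
    return share > threshold


# A model that answers "Yes" on all 10,000 inputs is flagged as collapsed.
always_yes = ["Yes"] * 10_000
# A model with 6,464 "Yes" and 3,536 "No" predictions is not.
varied = ["Yes"] * 6_464 + ["No"] * 3_536

print(is_collapsed(always_yes))
print(is_collapsed(varied))
```

This covers only the first indicator (extreme prediction bias); the other two require structural perturbations of the input and F1 computation.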
## 4 Methodology
### 4.1 Semantic Loss for Causal Graphs
We augment standard cross-entropy loss with a semantic component that enforces logical consistency with causal graph structure:

$$\mathcal{L}_{\text{total}}=\mathcal{L}_{\text{CE}}(y,\hat{y})+\lambda(t)\cdot\mathcal{L}_{\text{semantic}}(p,h,\hat{y}) \tag{3}$$

where $\mathcal{L}_{\text{CE}}$ is cross-entropy, $\hat{y}=P_\theta(y|p,h)$ are predicted probabilities, and $\lambda(t)$ is a time-dependent weighting factor.
#### 4.1.1 Graph-Based Consistency
For transitivity tasks, we parse premise $p$ to extract causal graph $G=(V,E)$ and compute logical consistency:

$$c(p,h,\hat{y})=\begin{cases}P_\theta(y=\text{Yes}|p,h)&\text{if path exists in }G\\ P_\theta(y=\text{No}|p,h)&\text{otherwise}\end{cases} \tag{4}$$

The semantic loss penalizes inconsistency with graph structure:

$$\mathcal{L}_{\text{semantic}}=-\frac{1}{N}\sum_{i=1}^{N}\log\left(c(p_i,h_i,\hat{y}_i)+\epsilon\right) \tag{5}$$

where $\epsilon=10^{-8}$ prevents numerical instability.
For d-separation, consistency is computed based on path blocking: $c(p,h,\hat{y})=P(y=\text{Yes})$ if nodes are not d-separated, $P(y=\text{No})$ otherwise.
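Equations (4) and (5) can be sketched in plain Python as follows. This is an illustrative version under stated assumptions, not the authors' implementation: the names `consistency` and `semantic_loss` are hypothetical, the model output is reduced to a single probability `p_yes` (so that $P(\text{No}) = 1 - P(\text{Yes})$ for the two-class softmax), and the graph check is abstracted into a boolean `yes_is_entailed` that is true whenever the case in which $c$ equals $P(\text{Yes})$ applies.

```python
import math

EPS = 1e-8  # epsilon from Eq. (5)


def consistency(p_yes, yes_is_entailed):
    """Eq. (4): probability mass placed on the answer the graph entails."""
    return p_yes if yes_is_entailed else 1.0 - p_yes


def semantic_loss(p_yes_batch, yes_is_entailed_batch):
    """Eq. (5): mean negative log-consistency over a batch."""
    terms = [
        -math.log(consistency(p, entailed) + EPS)
        for p, entailed in zip(p_yes_batch, yes_is_entailed_batch)
    ]
    return sum(terms) / len(terms)


# Two examples: one where "Yes" is entailed and the model says 0.9,
# one where "No" is entailed and the model puts 0.2 on "Yes".
print(semantic_loss([0.9, 0.2], [True, False]))
# A confidently wrong prediction is penalized heavily but stays finite thanks to EPS.
print(semantic_loss([1.0], [False]))
```

A differentiable version would compute the same quantity on framework tensors so that gradients flow back into the logits.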
#### 4.1.2 Dynamic Lambda Scheduling
To prevent collapse while maintaining training stability, we employ dynamic lambda scheduling:

$$\lambda(t)=\lambda_{\text{start}}+\frac{t}{T}\left(\lambda_{\text{end}}-\lambda_{\text{start}}\right) \tag{6}$$

where $t$ is the current training step, $T$ is total steps, $\lambda_{\text{start}}=0.05$, and $\lambda_{\text{end}}=0.30$.
Design rationale:
- Low initial $\lambda$: Prevents conflict with cross-entropy signal during early training
- Gradual increase: Allows model to learn basic patterns before enforcing strict structural constraints
- Final strength: Sufficient to prevent degenerate solutions while maintaining gradient flow
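The linear schedule in Eq. (6) and its use in the combined objective of Eq. (3) can be sketched as below. The function names are hypothetical, and clamping the step fraction to $[0, 1]$ is an added safety assumption that the paper does not state.

```python
def lambda_schedule(step, total_steps, lam_start=0.05, lam_end=0.30):
    """Eq. (6): linear interpolation from lam_start to lam_end over training."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Assumption: clamp so that steps outside [0, total_steps] stay in range.
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return lam_start + fraction * (lam_end - lam_start)


def total_loss(ce_loss, sem_loss, step, total_steps):
    """Eq. (3): cross-entropy plus the scheduled semantic term."""
    return ce_loss + lambda_schedule(step, total_steps) * sem_loss


for step in (0, 500, 1000):
    print(step, round(lambda_schedule(step, 1000), 4))
```

Halfway through training the weight is the midpoint of the two endpoints, $0.05 + 0.5 \cdot 0.25 = 0.175$.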
Algorithm 1: Training with Semantic Loss

1. Input: Dataset $\mathcal{D}$, model $f_\theta$, epochs $E$, batch size $B$
2. Parameters: $\lambda_{\text{start}}=0.05$, $\lambda_{\text{end}}=0.30$
3. $T\leftarrow$ total training steps
4. for epoch $e=1$ to $E$ do
5. for each batch $(p,h,y)$ in $\mathcal{D}$ do
6. $t\leftarrow$ current step
7. $\lambda\leftarrow\lambda_{\text{start}}+\frac{t}{T}(\lambda_{\text{end}}-\lambda_{\text{start}})$
8. $\hat{y}\leftarrow f_\theta(p,h)$
9. $\mathcal{L}_{\text{CE}}\leftarrow-\sum y\log\hat{y}$
10. $\mathcal{L}_{\text{sem}}\leftarrow$ ComputeSemanticLoss$(p,h,\hat{y})$
11. $\mathcal{L}\leftarrow\mathcal{L}_{\text{CE}}+\lambda\cdot\mathcal{L}_{\text{sem}}$
12. Update $\theta$ via gradient descent on $\mathcal{L}$
13. end for
14. end for
### 4.2 Training Configuration
Table 1: Training hyperparameters
### 4.3 Evaluation Methodology
We evaluate models across six test distributions, each containing 10,000 samples except adversarial \(1,000 samples\):
##### Standard Generalization Tests
- Length: Causal chains of 7-15 nodes (training: 3-6 nodes)
- Branching: DAGs with branching factor 1.4-2.0
- Reversed: All directed edges reversed
- Shuffled: Premise statements in random order
- Long Names: Variable names of 8-10 characters (training: 1-3 chars)
##### Adversarial Structural Tests
Novel evaluation set (1,000 samples) designed to distinguish structural understanding from heuristics:
- Irrelevant nodes (30%): Additional nodes with no path to query variables
- Broken chains (30%): Transitivity chains with single missing edge
- Longer chains (40%): Extended transitivity requiring multiple axiom applications
##### Evaluation Metrics
Beyond standard accuracy, we compute:
- F1 score, precision, and recall
- Prediction distribution analysis (Yes/No counts)
- Confusion matrices
- Per-task performance breakdown
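The metrics listed above can be computed from label lists as in the following sketch. The helper `binary_metrics` is hypothetical, "Yes" is assumed to be the positive class, and the 74/26 label split in the usage example is made-up toy data chosen only to show how an always-"No" predictor combines high accuracy with zero F1.

```python
def binary_metrics(y_true, y_pred, positive="Yes"):
    """Accuracy, precision, recall, F1, and prediction counts for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_count": tp + fp,   # prediction distribution
        "no_count": tn + fn,
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
    }


# Toy data (hypothetical): 26 "Yes" labels, 74 "No" labels, constant "No" predictions.
labels = ["Yes"] * 26 + ["No"] * 74
constant_no = ["No"] * 100
metrics = binary_metrics(labels, constant_no)
print(metrics["accuracy"], metrics["f1"], metrics["yes_count"])
```

Reporting the Yes/No counts alongside accuracy makes a constant predictor visible at a glance.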
## 5 Experimental Results
### 5.1 Experimental Setup
All experiments use the Gemma 3 270M instruction-tuned model as the base. We evaluate five model variants:
1. Standard Gemma: Zero-shot baseline (no fine-tuning)
2. Transitivity V1: Fine-tuned on transitivity without semantic loss
3. D-separation V1: Fine-tuned on d-separation without semantic loss
4. Transitivity Semantic V4: Fine-tuned with dynamic semantic loss
5. D-separation Semantic V2: Fine-tuned with dynamic semantic loss

Training data consists of 50,000 synthetically generated examples per task, following the axiomatic training methodology of [[1](https://arxiv.org/html/2605.05438#bib.bib1)] with enhanced diversity in graph structures.
### 5.2 Model Collapse in Standard Fine-Tuning
Table [2](https://arxiv.org/html/2605.05438#S5.T2) demonstrates catastrophic collapse in 100% of models trained without semantic loss.

Table 2: Model collapse evidence across 50,000 evaluation samples per model. Prediction patterns show Yes/No counts (in thousands). V1 models exhibit 100% collapse rate with extreme prediction bias.

#### 5.2.1 Collapse Analysis: Transitivity V1
Transitivity V1 exhibits complete collapse to always predicting "Yes":
- Prediction distribution: 10,000 Yes / 0 No across all five test sets
- Accuracy variance: 0.15% (shuffled) to 100% (length), entirely determined by label distribution
- Structural independence: Predictions unchanged by graph topology, edge reversal, or node addition
- F1 paradox: 31.9% average F1 despite 27.7% accuracy, indicating 100% recall but poor precision
#### 5.2.2 Collapse Analysis: D-separation V1
D-separation V1 exhibits the opposite collapse (always "No"):
- Prediction distribution: 0-1,889 Yes / 8,111-10,000 No
- Misleading accuracy: 73.9% average accuracy masks catastrophic failure
- F1 reveals truth: 7.6% F1 score exposes extreme recall failure (8.6% average)
- Test set bias: High accuracy results from No-heavy label distribution, not learned reasoning

Key insight: Accuracy alone is insufficient. F1, precision, recall, and prediction distribution analysis are essential for detecting collapse.
### 5.3 Semantic Loss Prevents Collapse
Table [3](https://arxiv.org/html/2605.05438#S5.T3) shows comprehensive results demonstrating collapse prevention.

Table 3: Per-task accuracy comparison (10,000 samples each). Transitivity task shown; d-separation results in Section 5.4. Semantic loss achieves a 42.7 percentage-point average improvement, with massive gains on challenging tasks (branching: +95.9 points).

#### 5.3.1 Quantitative Analysis
1. Collapse prevention: Zero instances of extreme prediction bias across all test sets
2. Prediction diversity: Yes predictions range from 17 (branching) to 6,464 (length) per 10,000 samples
3. Task-specific adaptation: Prediction distribution varies appropriately with task difficulty
4. Balanced metrics: Precision (38.8%) and recall (43.5%) show reasonable trade-offs vs. V1's 100% recall
5. Branching breakthrough: 1.96% → 97.9% demonstrates learning complex graph structures
### 5.4 D-separation Results
D-separation Semantic V2 achieves 68.6% average accuracy with stable performance:
- Per-task: Length 62.8%, Branching 97.8%, Reversed 54.1%, Shuffled 65.0%, Long Names 63.6%
- F1 score: 25.0% (vs. 7.6% for collapsed V1)
- Prediction balance: 27-6,283 Yes predictions across tasks
- Generalization: Successful transfer to complex graph structures
### 5.5 Adversarial Evaluation
Table [4](https://arxiv.org/html/2605.05438#S5.T4) validates that semantic models learn structural reasoning while collapsed models fail.

Table 4: Adversarial evaluation (1,000 samples testing structural understanding). Collapsed models show fixed predictions and catastrophic failure. Semantic models demonstrate balanced, context-dependent reasoning.

#### 5.5.1 Key Adversarial Findings
1. Collapse persistence: Transitivity V1 maintains 100% "Yes" bias even on adversarial distribution
2. Catastrophic failure: D-separation V1 achieves only 43% accuracy (below random baseline for balanced dataset)
3. Semantic robustness: Both semantic models achieve 67-70% with balanced predictions
4. Heuristic exposure: Standard Gemma's 66.7% suggests superficial pattern matching rather than genuine reasoning
### 5.6 Semantic Loss Version Progression
Table [5](https://arxiv.org/html/2605.05438#S5.T5) documents the iterative development of semantic loss.

Table 5: Iterative development showing dynamic lambda scheduling as critical innovation

The progression demonstrates that dynamic scheduling is essential: neither too-weak ($\lambda=0.05$) nor too-strong fixed values ($\lambda=0.1$) achieve optimal performance.
## 6 Analysis
### 6.1 Why Does Collapse Occur?
We identify three contributing mechanisms:
##### Label Distribution Bias
Test sets exhibit natural imbalance (e.g., d-separation is predominantly "No"). Models exploit this statistical regularity rather than learning causal structure.
##### Cross-Entropy Shortcut Learning
Standard CE loss permits trivial solutions that minimize loss without structural understanding. A model predicting constant "No" on No-heavy datasets achieves high accuracy despite zero reasoning.
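The shortcut can be made concrete with a short worked calculation. It is a standard property of cross-entropy rather than a result from the paper; the symbols $\pi$ (the fraction of "Yes" labels) and $q$ (the constant probability a collapsed model assigns to "Yes") are introduced here for illustration.

```latex
% Cross-entropy of a constant predictor that ignores the input (p, h):
\mathcal{L}_{\text{CE}}(q) = -\pi \log q - (1-\pi)\log(1-q)

% Setting the derivative to zero gives the best constant predictor:
\frac{d\mathcal{L}_{\text{CE}}}{dq} = -\frac{\pi}{q} + \frac{1-\pi}{1-q} = 0
\quad\Longrightarrow\quad q^{*} = \pi

% Its loss is the entropy of the label marginal, which is small when labels are skewed:
\mathcal{L}_{\text{CE}}(q^{*}) = -\pi\log\pi - (1-\pi)\log(1-\pi)

% If \pi < 1/2, the argmax prediction is always "No", so
\text{accuracy} = 1-\pi, \qquad \text{recall}_{\text{Yes}} = 0, \qquad F_1 = 0
```

The more skewed the labels, the lower the loss of this input-independent solution, which is consistent with the high-accuracy, low-F1 pattern reported for the collapsed d-separation model.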
##### Absence of Structural Constraints
Without explicit penalties for violating causal axioms, gradient descent finds degenerate local minima that ignore input graph topology\.
### 6.2 Why Does Semantic Loss Prevent Collapse?
Dynamic lambda scheduling provides three critical properties:
##### Early Training Stability
Low initial $\lambda=0.05$ prevents catastrophic interference between CE and semantic gradients, allowing stable optimization.
##### Gradual Constraint Enforcement
Linear increase enables the model to first learn basic input-output mappings, then progressively incorporate structural constraints.
##### Degenerate Solution Prevention
Final $\lambda=0.30$ provides sufficient penalty to prevent collapse while maintaining reasonable gradient magnitudes.
### 6.3 Comparison with Standard Gemma
Standard Gemma achieves 70.1% standard accuracy and 66.7% adversarial accuracy without fine-tuning. However, key differences emerge:
- Mechanism: Gemma uses task-specific heuristics learned during pretraining, not structural causal reasoning
- Evidence: 0% F1 on branching tasks reveals blind "No" predictions
- Adversarial performance: Similar accuracy (66.7%) but through pattern matching rather than graph analysis
- Semantic models: Achieve comparable accuracy (69.8-70.4%) via genuine structural understanding

The adversarial evaluation successfully distinguishes these mechanisms: semantic models maintain performance through reasoning, while Gemma's heuristics coincidentally succeed on standard tests.
## 7 Limitations and Future Directions
### 7.1 Current Limitations
##### Model Scale
Experiments limited to 270M parameter models\. Larger models may exhibit different collapse characteristics or resistance\.
##### Task Scope
Evaluation restricted to transitivity and d\-separation\. Other causal axioms \(e\.g\., conditional independence, faithfulness\) remain unexplored\.
##### Performance Gap
Semantic models achieve 67\-70% adversarial accuracy, indicating room for improvement toward theoretical optimum\.
##### Computational Overhead
Graph parsing and semantic loss computation add 15% training time vs\. standard fine\-tuning\.
### 7.2 Future Directions
- Scaling studies: Investigate collapse behavior in 1B+ parameter models
- Axiom expansion: Extend to full Pearl's causal hierarchy (association, intervention, counterfactuals)
- Adaptive scheduling: Learn the $\lambda(t)$ schedule from validation performance
- Real-world evaluation: Test on CLADDER [[8](https://arxiv.org/html/2605.05438#bib.bib8)], Corr2Cause [[9](https://arxiv.org/html/2605.05438#bib.bib9)], and causal discovery benchmarks
- Theoretical analysis: Formal characterization of collapse conditions and prevention guarantees
## 8 Conclusion
We have identified, characterized, and solved catastrophic model collapse in causal reasoning fine-tuning. Our key contributions:
1. Problem: 100% collapse rate in standard fine-tuning across transitivity and d-separation tasks
2. Diagnosis: Comprehensive analysis showing accuracy can be misleading; F1, precision, recall, and prediction distribution are essential
3. Solution: Semantic loss with graph-based constraints and dynamic lambda scheduling
4. Validation: 42.7 percentage-point improvement over collapsed baselines across 200,000+ evaluation samples
5. Generalization: Success on both transitivity (70.4%) and d-separation (68.6%) tasks
6. Robustness: Adversarial tests confirm structural learning (67-70%) vs. catastrophic failure (43-71%)

Practical impact: Semantic loss transforms causal reasoning fine-tuning from fundamentally broken (100% collapse) to reliably stable. This is not an optimization; it is essential for any practical deployment.
Broader implications: Our findings suggest that fine\-tuning on complex reasoning tasks may require task\-specific inductive biases beyond standard cross\-entropy loss\. Future work on mathematical reasoning, logical inference, and other structured tasks should carefully monitor for similar collapse phenomena\.
## References
- [1] Aniket Vashishtha, Abhinav Kumar, Atharva Pandey, Abbavaram Gowtham Reddy, Kabir Ahuja, Vineeth N Balasubramanian, and Amit Sharma. Teaching Transformers Causal Reasoning through Axiomatic Training. arXiv preprint arXiv:2407.07612, 2024.
- [2] Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2nd edition, 2009.
- [3] Jingyi Xu, Zilu Zhang, Tal Friedman, Yitao Liang, and Guy Van den Broeck. A Semantic Loss Function for Deep Learning with Symbolic Knowledge. In International Conference on Machine Learning (ICML), 2018.
- [4] Gemma Team, Google DeepMind. Gemma: Open Models Based on Gemini Research and Technology. Technical report, Google DeepMind, 2024.
- [5] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 2672–2680, 2014.
- [6] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A Simple Framework for Contrastive Learning of Visual Representations. In International Conference on Machine Learning (ICML), pages 1597–1607, 2020.
- [7] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training Language Models to Follow Instructions with Human Feedback. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
- [8] Jinfa Huang, Yongqi Leng, Weitong Zhang, Xinyu Yang, Xiaowu Zhang, and Dahua Lin. CLADDER: A Benchmark to Assess Causal Reasoning Capabilities of Language Models. In Advances in Neural Information Processing Systems (NeurIPS) Track on Datasets and Benchmarks, 2023.
- [9] Stephanie Long, Tibor Schuster, and Alexandre Piché. Can Large Language Models Distinguish Cause from Effect? arXiv preprint arXiv:2310.17961, 2023.
## Appendix A Implementation Details
### A.1 Data Generation Pipeline
We implement a comprehensive synthetic data generation framework for causal reasoning tasks, consisting of two primary modules: a base generator for standard training and evaluation data, and a specialized adversarial generator for robustness testing.
#### A.1.1 Graph Generation Algorithms
Our framework employs two distinct graph generation strategies based on task requirements:
##### Sequential Chain Generation
For transitivity reasoning tasks, we generate directed acyclic chains of length $\ell$ where nodes $V=\{v_1,v_2,\ldots,v_\ell\}$ are connected by edges $E=\{(v_i,v_{i+1})\mid i\in[1,\ell-1]\}$. To introduce structural variation, we apply edge flipping with probability $p_{\text{flip}}\in\{0.0,0.3,0.5\}$, reversing the direction of individual edges while maintaining overall connectivity.
Node names are randomly generated strings of length $n\sim\mathcal{U}(n_{\min},n_{\max})$ from the alphabet $\Sigma=\{a\text{-}z,A\text{-}Z,0\text{-}9\}$, where:
- Training distribution: $n\in[1,3]$
- Evaluation distribution: $n\in[8,10]$ (for name length generalization)
##### DAG Generation with Controlled Branching
For d-separation tasks requiring more complex graph structures, we implement a topologically-ordered DAG generator. Given parameters $(|V|,\rho)$ where $\rho$ is edge density:
Algorithm 2: Controlled DAG Generation

1. Initialize nodes $V=\{v_1,\ldots,v_{|V|}\}$ with random names
2. $E\leftarrow\emptyset$
3. for $i=1$ to $|V|$ do
4. $k\leftarrow\min(\lfloor|V|\cdot\rho\rfloor,5)$
5. $T\leftarrow$ sample $k$ nodes from $\{v_{i+1},\ldots,v_{|V|}\}$
6. $E\leftarrow E\cup\{(v_i,v_j)\mid v_j\in T\}$
7. end for
8. if $|E|<|V|-1$ then
9. Add backbone chain edges
10. end if
11. return $G=(V,E)$
Edge density ranges are task-specific:
- Training: $\rho\sim\mathcal{U}(0.3,0.6)$
- Evaluation: $\rho\sim\mathcal{U}(0.7,1.2)$ (for branching complexity)
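Algorithm 2 can be sketched in Python as follows. This is an illustrative reading, not the authors' generator: the function name `generate_dag` is hypothetical, nodes are named `v1, v2, ...` instead of random strings for readability, and when fewer than $k$ later nodes remain the sketch samples all of them (the algorithm does not say how that case is handled).

```python
import random


def generate_dag(num_nodes, density, rng=None, max_out=5):
    """Topologically ordered DAG: each node only links to later nodes."""
    rng = rng or random.Random()
    nodes = [f"v{i}" for i in range(1, num_nodes + 1)]
    edges = set()
    # k = min(floor(|V| * rho), 5): number of outgoing edges attempted per node.
    k = min(int(num_nodes * density), max_out)
    for i, src in enumerate(nodes):
        later = nodes[i + 1:]
        # Assumption: cap the sample size at the number of remaining nodes.
        targets = rng.sample(later, min(k, len(later)))
        edges.update((src, dst) for dst in targets)
    # Backbone chain v1 -> v2 -> ... guarantees at least |V| - 1 edges.
    if len(edges) < num_nodes - 1:
        edges.update(zip(nodes, nodes[1:]))
    return nodes, edges


nodes, edges = generate_dag(6, 0.5, rng=random.Random(0))
print(len(nodes), len(edges))
```

Because every edge points from an earlier node to a later one in the fixed ordering, the result is acyclic by construction.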
#### A.1.2 Natural Language Template Generation
Graphs are converted to natural language premises using deterministic templates:
```
# E is the collection of (cause, effect) edge pairs
premise = " ".join(f"{a} causes {b}." for (a, b) in E)
```
For transitivity tasks, hypotheses query direct or transitive causation:
```
hypothesis = f"Does {v_i} cause {v_j}?"
```
For d-separation tasks, hypotheses include optional conditioning sets $Z\subset V$:
```
hypothesis = f"Are {v_i} and {v_j} d-separated given {Z}?"
```
where $|Z|\leq 3$ is sampled uniformly.
### A.2 Causal Reasoning Algorithms
#### A.2.1 Transitivity Label Generation
Labels are computed via depth-first search (DFS) for directed path existence:

Algorithm 3: Path Existence Check: FindPath$(E,v_{\text{start}},v_{\text{end}})$

1. if $v_{\text{start}}=v_{\text{end}}$ then
2. return True
3. end if
4. $\text{visited}\leftarrow\emptyset$
5. $\text{stack}\leftarrow[v_{\text{start}}]$
6. while $\text{stack}\neq\emptyset$ do
7. $v\leftarrow\text{stack.pop()}$
8. if $v=v_{\text{end}}$ then
9. return True
10. end if
11. if $v\in\text{visited}$ then
12. continue
13. end if
14. $\text{visited}\leftarrow\text{visited}\cup\{v\}$
15. $\text{stack.extend}(\{u\mid(v,u)\in E\})$
16. end while
17. return False
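Algorithm 3 translates almost line for line into Python. The sketch below assumes edges are given as `(parent, child)` tuples; the function name `find_path` and the adjacency dictionary are choices made here for illustration.

```python
def find_path(edges, start, end):
    """Iterative DFS: True if a directed path start -> ... -> end exists."""
    if start == end:
        return True
    # Adjacency map built once so each expansion is a dictionary lookup.
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
    visited = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node == end:
            return True
        if node in visited:
            continue
        visited.add(node)
        stack.extend(children.get(node, []))
    return False


chain = [("A", "B"), ("B", "C"), ("C", "D")]
broken = [("A", "B"), ("C", "D")]  # the B -> C edge is missing
print(find_path(chain, "A", "D"))
print(find_path(broken, "A", "D"))
print(find_path(chain, "D", "A"))
```

The second call mirrors the "broken chains" adversarial case: removing a single edge flips the label from Yes to No.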
#### A.2.2 D-separation Algorithm
We implement Pearl's d-separation criterion [[2](https://arxiv.org/html/2605.05438#bib.bib2)] to determine conditional independence. The algorithm:
1. Path Finding: Identify all undirected paths $\mathcal{P}(v_i,v_j)$ between query nodes using breadth-first search with path length limit $L_{\max}=10$.
2. Blocking Rule Evaluation: For each path $p=(v_i,\ldots,v_j)\in\mathcal{P}(v_i,v_j)$ and each intermediate node $v_k$ with neighbors $(v_{k-1},v_{k+1})$:
   - Collider: If $(v_{k-1},v_k)\in E$ and $(v_{k+1},v_k)\in E$, path is blocked unless $v_k\in Z$ or $\exists v_d\in\text{descendants}(v_k):v_d\in Z$
   - Chain: If $(v_{k-1},v_k)\in E$ and $(v_k,v_{k+1})\in E$, path is blocked if $v_k\in Z$
   - Fork: If $(v_k,v_{k-1})\in E$ and $(v_k,v_{k+1})\in E$, path is blocked if $v_k\in Z$
3. D-separation Decision: Return True if all paths are blocked, False otherwise.

To handle descendant queries efficiently, we implement a memoized BFS traversal with visited set tracking.
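The three steps above can be sketched as a path-enumeration check. This is a minimal illustration, not the authors' implementation: all function names are hypothetical, the length limit is interpreted as a maximum number of edges per path (the text does not specify edges versus nodes), and the descendant lookup is left unmemoized for brevity.

```python
from collections import deque


def descendants(edges, node):
    """All nodes reachable from `node` along directed edges (BFS)."""
    children = {}
    for a, b in edges:
        children.setdefault(a, set()).add(b)
    seen, queue = set(), deque([node])
    while queue:
        for child in children.get(queue.popleft(), ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def undirected_paths(edges, start, end, max_len=10):
    """Simple paths between start and end, ignoring edge direction."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    paths, queue = [], deque([[start]])
    while queue:
        path = queue.popleft()
        if path[-1] == end:
            paths.append(path)
            continue
        if len(path) - 1 >= max_len:  # assumption: limit counts edges
            continue
        for nxt in neighbors.get(path[-1], ()):
            if nxt not in path:
                queue.append(path + [nxt])
    return paths


def path_is_blocked(edges, path, z):
    """Apply the collider / chain / fork rules to each intermediate node."""
    edge_set = set(edges)
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        if (prev, node) in edge_set and (nxt, node) in edge_set:
            # Collider: blocks unless it or one of its descendants is in Z.
            if node not in z and not (descendants(edges, node) & z):
                return True
        elif node in z:
            # Chain or fork: blocks when the middle node is conditioned on.
            return True
    return False


def d_separated(edges, x, y, z=(), max_len=10):
    """True if every undirected path between x and y is blocked by z."""
    z = set(z)
    return all(path_is_blocked(edges, p, z)
               for p in undirected_paths(edges, x, y, max_len))


collider = [("A", "B"), ("C", "B"), ("B", "D")]
print(d_separated(collider, "A", "C"))         # collider B blocks the only path
print(d_separated(collider, "A", "C", {"D"}))  # conditioning on descendant D opens it
```

Enumerating all simple paths is exponential in the worst case, which is presumably why the original pipeline bounds path length and caches results.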
### A.3 Multi-Stage Validation Framework
Each generated example undergoes rigorous validation to ensure logical consistency:
##### Premise Parsing
Causal edges are extracted using regex pattern matching:
```
pattern: r"(\w+) causes (\w+)"
```
Failed parses are rejected (acceptance rate: $>99\%$).
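Applied with Python's `re` module, the pattern above recovers the edge list from a generated premise. The helper name `parse_premise` is hypothetical; the pattern itself is the one given in the text.

```python
import re

EDGE_PATTERN = re.compile(r"(\w+) causes (\w+)")


def parse_premise(premise):
    """Return the list of (cause, effect) pairs found in a premise string."""
    return EDGE_PATTERN.findall(premise)


print(parse_premise("A causes B. B causes C."))  # [('A', 'B'), ('B', 'C')]
print(parse_premise("no edges here"))            # []
```

An empty result is one simple signal that a premise failed to parse and should be rejected.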
##### Hypothesis Parsing
Query nodes and conditioning sets are extracted via string decomposition with error handling for malformed queries\.
##### Label Verification
Ground truth labels are recomputed from parsed graphs and compared against generated labels\. Examples with mismatches are rejected\.
##### Graph Validity Checks
For d\-separation tasks, we reject graphs with:
- $|E|<|V|-1$ (insufficient connectivity)
- $|E|>3|V|$ (excessive density)
- Unreachable node pairs with empty path sets
### A.4 Optimization Strategies
#### A.4.1 Computational Optimizations
- Memoization: D-separation checks are cached using an LRU cache with a 1000-entry limit, reducing redundant path computations for isomorphic subgraphs.
- Early Rejection: Invalid graphs are filtered before expensive d-separation computation based on structural heuristics (edge count bounds, node reachability).
- Attempt Limits: Generation retries are capped at 10 attempts per example to prevent infinite loops on infeasible configurations.
- Path Length Limits: BFS path finding terminates at depth 10, trading completeness for tractability on large graphs.
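The memoization strategy can be illustrated with `functools.lru_cache`. The sketch below caches a directed reachability check as a stand-in for the full d-separation routine; the function name is hypothetical, and passing edges as a `frozenset` is an assumption made so that arguments are hashable. It caches exact repeats only and does not attempt the subgraph canonicalization that sharing results across isomorphic graphs would require.

```python
from functools import lru_cache


@lru_cache(maxsize=1000)  # 1000-entry limit, as described above
def has_directed_path(edges, start, end):
    """Cached reachability check; `edges` must be a frozenset of (parent, child)."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
    visited, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node == end:
            return True
        if node not in visited:
            visited.add(node)
            stack.extend(children.get(node, []))
    return False


graph = frozenset({("A", "B"), ("B", "C")})
has_directed_path(graph, "A", "C")  # computed
has_directed_path(graph, "A", "C")  # served from the cache
print(has_directed_path.cache_info())
```

`cache_info()` reports hits and misses, which gives a quick way to confirm that repeated queries are actually being reused.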
#### A.4.2 Acceptance Rate Analysis
Table [6](https://arxiv.org/html/2605.05438#A1.T6) shows generation acceptance rates across tasks:

Table 6: Generation Acceptance Rates

Lower acceptance for d-separation reflects stricter validation requirements and graph complexity constraints.
### A\.5 Dataset Composition
#### A\.5\.1 Training Datasets
- •Transitivity Training \(transitivity\_train\.jsonl\): 50,000 examples
  - –Chain length: $\ell\sim\mathcal{U}(3,6)$
  - –Node names: $n\sim\mathcal{U}(1,3)$
  - –Edge flipping: $p\in\{0.0,0.3,0.5\}$
- •D\-separation Training \(dsep\_train\.jsonl\): 50,000 examples
  - –Graph size: $|V|\sim\mathcal{U}(3,6)$
  - –Edge density: $\rho\sim\mathcal{U}(0.3,0.6)$
  - –Conditioning set size: $|Z|\sim\mathcal{U}(0,3)$
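To make the transitivity parameters concrete, the sketch below samples one chain. Several details are assumptions rather than facts from the paper: the uniform distributions are treated as discrete and inclusive, $n$ is read as the node-name length in characters, names are drawn from uppercase letters, and one flip probability is chosen per example. `TransitivityConfig` and `sample_chain` are hypothetical names.

```python
import random
import string
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TransitivityConfig:
    chain_length: Tuple[int, int] = (3, 6)       # number of nodes in the chain
    name_length: Tuple[int, int] = (1, 3)        # characters per node name (assumed)
    flip_probabilities: Tuple[float, ...] = (0.0, 0.3, 0.5)


def sample_chain(config: TransitivityConfig, rng: random.Random) -> List[Tuple[str, str]]:
    """Sample a chain of uniquely named nodes and flip each edge with probability p."""
    length = rng.randint(*config.chain_length)
    flip_p = rng.choice(config.flip_probabilities)
    names: List[str] = []
    while len(names) < length:
        size = rng.randint(*config.name_length)
        name = "".join(rng.choice(string.ascii_uppercase) for _ in range(size))
        if name not in names:  # keep node names unique within the example
            names.append(name)
    edges = []
    for a, b in zip(names, names[1:]):
        edges.append((b, a) if rng.random() < flip_p else (a, b))
    return edges
```

A chain of $\ell$ nodes always yields $\ell-1$ edges, so training examples sampled this way contain between 2 and 5 causal statements.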
#### A\.5\.2 Standard Evaluation Datasets
Five evaluation sets test different generalization capabilities \(10,000 examples each\):
1. 1\. Length Generalization \(length\_eval\.jsonl\): Chain length $\ell\sim\mathcal{U}(7,15)$
2. 2\. Structural Variation \(reversed\_eval\.jsonl\): All edges reversed, $E'=\{(b,a)\mid(a,b)\in E\}$
3. 3\. Order Invariance \(shuffled\_eval\.jsonl\): Premise statements randomly permuted with $p_{\text{flip}}=0.5$
4. 4\. Name Length Generalization \(long\_names\_eval\.jsonl\): Node names $n\sim\mathcal{U}(8,10)$
5. 5\. Branching Complexity \(branching\_eval\.jsonl\): DAGs with $\rho\sim\mathcal{U}(0.7,1.2)$ and $\ell\sim\mathcal{U}(7,15)$
#### A\.5\.3 Adversarial Evaluation Dataset
The adversarial evaluation set \(adversarial\_eval\.jsonl, 1,000 examples\) targets specific failure modes through carefully designed graph construction strategies:
##### Irrelevant Nodes \(30%\)
These examples test whether models can focus on relevant causal structure while ignoring disconnected components\. Generation procedure:
1. 1\. Generate a main chain of length $\ell_{\text{main}}\sim\mathcal{U}(3,5)$ with standard parameters
2. 2\. Add $k\sim\mathcal{U}(1,3)$ disconnected chains, each of length $\ell_{\text{irrel}}\sim\mathcal{U}(2,4)$
3. 3\. Ensure node name uniqueness across all chains through rejection sampling \(maximum 10 attempts\)
4. 4\. Query exclusively about nodes within the main chain: $v_i,v_j\in V_{\text{main}}$
5. 5\. Premise contains edges from all chains: $E=E_{\text{main}}\cup E_{\text{irrel},1}\cup\ldots\cup E_{\text{irrel},k}$
Example structure:
```
Premise: "A causes B. B causes C.
X causes Y. P causes Q. Q causes R."
[main] [---irrelevant chains---]
Hypothesis: "Does A cause C?"
Label: "Yes"
```
This tests whether models erroneously incorporate irrelevant nodes into reasoning or correctly isolate the queried subgraph\.
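The generation procedure for this category might look roughly as follows. This is a sketch under stated assumptions: node names are two random uppercase letters, the main chain has no flipped edges, and the query always runs forward along the main chain so the label is "Yes". The helper names `make_chain` and `irrelevant_nodes_example` are hypothetical.

```python
import random
import string
from typing import Dict, List, Optional, Set


def make_chain(length: int, used: Set[str], rng: random.Random,
               max_attempts: int = 10) -> Optional[List[str]]:
    """Rejection-sample `length` node names that are unique and unused so far."""
    for _ in range(max_attempts):
        names = ["".join(rng.choice(string.ascii_uppercase) for _ in range(2))
                 for _ in range(length)]
        if len(set(names)) == length and used.isdisjoint(names):
            used.update(names)
            return names
    return None  # give up after max_attempts, as in the procedure above


def irrelevant_nodes_example(rng: random.Random) -> Optional[Dict[str, str]]:
    used: Set[str] = set()
    main = make_chain(rng.randint(3, 5), used, rng)
    distractors = [make_chain(rng.randint(2, 4), used, rng)
                   for _ in range(rng.randint(1, 3))]
    chains = [main] + distractors
    if any(chain is None for chain in chains):
        return None
    statements = [f"{a} causes {b}." for chain in chains
                  for a, b in zip(chain, chain[1:])]
    i, j = sorted(rng.sample(range(len(main)), 2))  # query only the main chain
    return {
        "premise": " ".join(statements),
        "hypothesis": f"Does {main[i]} cause {main[j]}?",
        "label": "Yes",  # forward query on an unflipped chain (assumption)
    }
```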
##### Broken Chains \(30%\)
These examples test detection of non\-existent causal paths across disconnected graph components\. Generation procedure:
1. 1\. Generate $k\sim\mathcal{U}(2,3)$ completely disconnected chains
2. 2\. Each chain has length $\ell_i\sim\mathcal{U}(2,4)$
3. 3\. Enforce strict node name disjointness: $V_i\cap V_j=\emptyset$ for $i\neq j$
4. 4\. Query across different components: select $v_i\in V_a$ and $v_j\in V_b$ where $a\neq b$
5. 5\. Label is always “No” since no path exists between disconnected components
Example structure:
```
Premise: "A causes B. B causes C.
X causes Y. P causes Q."
[chain 1] [chain 2] [chain 3]
Hypothesis: "Does A cause Y?"
Label: "No"
```
This evaluates whether models incorrectly hallucinate transitive connections across graph boundaries or properly recognize component isolation\.
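A compact sketch of the broken-chain procedure is shown below. Disjointness is obtained by drawing all node names without replacement from single uppercase letters, which is a simplification of whatever naming scheme the authors use; `broken_chain_example` is a hypothetical name.

```python
import random
import string
from typing import Dict, List


def broken_chain_example(rng: random.Random) -> Dict[str, str]:
    """Build k disjoint chains and query a pair of nodes from two different chains."""
    k = rng.randint(2, 3)
    lengths = [rng.randint(2, 4) for _ in range(k)]
    # Sampling without replacement guarantees V_i and V_j never overlap.
    pool = rng.sample(string.ascii_uppercase, sum(lengths))
    chains: List[List[str]] = []
    start = 0
    for length in lengths:
        chains.append(pool[start:start + length])
        start += length
    a, b = rng.sample(range(k), 2)  # two distinct components
    source, target = rng.choice(chains[a]), rng.choice(chains[b])
    premise = " ".join(f"{x} causes {y}." for chain in chains
                       for x, y in zip(chain, chain[1:]))
    return {
        "premise": premise,
        "hypothesis": f"Does {source} cause {target}?",
        "label": "No",  # no path can cross component boundaries
    }
```

Since the queried nodes come from different components by construction, the label never needs a path search here, although recomputing it during validation remains a useful safeguard.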
##### Extended Transitivity \(40%\)
These examples test multi\-hop reasoning beyond the training distribution length\. Generation procedure:
1. 1\. Generate sequential chains with $\ell\sim\mathcal{U}(7,12)$, exceeding the training maximum of 6
2. 2\. Use standard edge generation without flipping: $E=\{(v_i,v_{i+1})\mid i\in[1,\ell-1]\}$
3. 3\. Query endpoint causation: “Does $v_1$ cause $v_\ell$?”
4. 4\. Label is always “Yes”, requiring $\ell-1$ transitive steps
Example structure:
```
Premise: "A causes B. B causes C.
C causes D. D causes E. E causes F.
F causes G. G causes H. H causes I.
I causes J."
[9-hop chain, exceeds training max]
Hypothesis: "Does A cause J?"
Label: "Yes"
```
This probes compositional generalization: whether models can chain reasoning beyond training\-time depth limits\.
##### Validation and Quality Control
All adversarial examples undergo the same validation as the training data:
- •Premise parsing verification \(regex extraction of all edges\)
- •Label recomputation using the FindPath algorithm
- •Graph connectivity checks \(appropriate for disconnected graph examples\)
- •Maximum 15 generation attempts per example \(higher than standard 10 due to structural constraints\)
Acceptance rates vary by adversarial type: irrelevant nodes \($\sim$70%\), broken chains \($\sim$60%\), extended transitivity \($\sim$65%\), yielding overall acceptance of $\sim$65% for the adversarial set\.
### A\.6 Implementation Details
The complete data generation pipeline is implemented in Python 3\.8\+ using:
- •dataclasses for configuration management
- •collections\.deque for efficient BFS implementation
- •functools\.lru\_cache for memoization
- •logging for progress tracking and debugging
All datasets are serialized as JSONL files with schema:
```
{
"premise": str, # Causal statements
"hypothesis": str, # Query question
"label": str # "Yes" or "No"
}
```
Generation time averages 0\.02s per transitivity example and 0\.15s per d\-separation example on a single CPU core, with total pipeline runtime under 3 hours for all 121,000 examples\.
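As an illustration of the schema above, the following sketch writes and reads JSONL records while checking the three required fields. The helper names `dump_jsonl` and `load_jsonl` are hypothetical, and the strict key and label checks are an assumption about how the schema would be enforced.

```python
import io
import json
from typing import Dict, Iterable, List, TextIO

REQUIRED_KEYS = {"premise", "hypothesis", "label"}
VALID_LABELS = {"Yes", "No"}


def dump_jsonl(examples: Iterable[Dict[str, str]], stream: TextIO) -> None:
    """Write one JSON object per line."""
    for example in examples:
        stream.write(json.dumps(example) + "\n")


def load_jsonl(stream: TextIO) -> List[Dict[str, str]]:
    """Read JSONL records, rejecting any that do not match the schema."""
    examples = []
    for line_number, line in enumerate(stream, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        if set(record) != REQUIRED_KEYS or record["label"] not in VALID_LABELS:
            raise ValueError(f"line {line_number}: record does not match the schema")
        examples.append(record)
    return examples


# Round trip through an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
dump_jsonl([{"premise": "A causes B. B causes C.",
             "hypothesis": "Does A cause C?",
             "label": "Yes"}], buffer)
buffer.seek(0)
round_trip = load_jsonl(buffer)
```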
### A\.7 Hardware and Runtime
All experiments were conducted on Google Colab Pro with an NVIDIA T4 GPU \(16 GB\)\. Training time per model:
- •Baseline \(no semantic\): 45 minutes
- •Semantic loss: 52 minutes \(\+15% overhead\)
## Appendix B Additional Experimental Results
### B\.1 Per\-Task Confusion Matrices
This section provides detailed confusion matrices for all model variants across standard and adversarial evaluation sets\. Confusion matrices reveal the true nature of model predictions beyond aggregate accuracy metrics, particularly exposing collapse patterns through extreme TP/FP or TN/FN distributions\.
#### B\.1\.1 Standard Evaluation \(10,000 samples per task\)
Tables [7](https://arxiv.org/html/2605.05438#A2.T7) through [11](https://arxiv.org/html/2605.05438#A2.T11) show confusion matrices across the five standard generalization tests\. Key patterns:
- •Transitivity V1 collapse: TP = 10,000 \(length task\), TN = 0 across all tasks $\rightarrow$ always predicts “Yes”
- •D\-separation V1 collapse: TP near\-zero, TN dominates $\rightarrow$ always predicts “No”
- •Semantic models: Balanced TP/TN/FP/FN distributions indicate context\-dependent predictions

Table 7: Standard Gemma: Confusion matrices across standard evaluation tasks

Table 8: Transitivity V1 \(Collapsed\): Confusion matrices showing complete collapse to “Yes” predictions

Table 9: D\-separation V1 \(Collapsed\): Confusion matrices showing collapse to “No” predictions

Table 10: Transitivity Semantic V4: Confusion matrices showing balanced predictions

Table 11: D\-separation Semantic V2: Confusion matrices showing balanced predictions
#### B\.1\.2 Adversarial Evaluation \(1,000 samples\)
Table [12](https://arxiv.org/html/2605.05438#A2.T12) shows confusion matrices on the adversarial structural robustness test\. This evaluation distinguishes structural understanding from heuristics through challenging examples with irrelevant nodes, broken chains, and extended transitivity\.
Key findings:
- •Transitivity V1: TP = 708, TN = 0, FN = 0 $\rightarrow$ Maintains collapse even on adversarial distribution
- •D\-separation V1: TP = 247, FN = 461 $\rightarrow$ Catastrophic recall failure \(34\.9%\)
- •Semantic models: Balanced confusion matrices with TP/TN/FP/FN all non\-zero $\rightarrow$ Context\-dependent reasoning

Table 12: Adversarial evaluation confusion matrices \(1,000 samples testing structural understanding\)
#### B\.1\.3 Interpretation Guidelines
The confusion matrices reveal three distinct behavioral patterns:
1\. Catastrophic Collapse \(V1 models\):
- •Transitivity V1: TN = 0 across all tasks, indicating exclusive “Yes” predictions
- •D\-separation V1: TP near\-zero with massive FN counts, indicating exclusive “No” predictions
- •These patterns are input\-independent, confirming prediction bias collapse
2\. Heuristic\-Based Predictions \(Standard Gemma\):
- •Task\-specific patterns \(e\.g\., 0% branching accuracy = all “No”\)
- •Moderate TP/TN values with significant FP/FN errors
- •Performance varies dramatically by task type
3\. Structural Reasoning \(Semantic models\):
- •All four values \(TP/TN/FP/FN\) non\-zero and substantial
- •TP and TN values proportional to label distributions
- •Consistent error patterns across tasks, not task\-specific collapse
Critical diagnostic insight: Accuracy alone cannot detect collapse\. For example, D\-separation V1 achieves 73\.9% average accuracy \(Table [2](https://arxiv.org/html/2605.05438#S5.T2)\) while exhibiting severe FN bias \(8,111 false negatives on the length task\)\. Only examination of the full confusion matrix reveals this catastrophic failure mode, highlighting the necessity of comprehensive metric reporting for causal reasoning evaluation\.
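The diagnostic can be made concrete with a small helper that reports accuracy next to the fraction of “Yes” predictions. The sketch below is illustrative: `Confusion`, `recall`, and `is_collapsed` are hypothetical names, the 1% tolerance is an arbitrary choice, and FP = 292 for Transitivity V1 is inferred from the 1,000-sample total together with the reported TP = 708, TN = 0, FN = 0.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    tp: int
    fp: int
    tn: int
    fn: int

    @property
    def total(self) -> int:
        return self.tp + self.fp + self.tn + self.fn

    @property
    def accuracy(self) -> float:
        return (self.tp + self.tn) / self.total

    @property
    def yes_rate(self) -> float:
        """Fraction of inputs on which the model predicts 'Yes'."""
        return (self.tp + self.fp) / self.total


def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)


def is_collapsed(cm: Confusion, tolerance: float = 0.01) -> bool:
    """Flag models that answer 'Yes' (or 'No') on almost every input."""
    return cm.yes_rate <= tolerance or cm.yes_rate >= 1.0 - tolerance


# Transitivity V1 on the adversarial set (FP inferred from the 1,000 total).
transitivity_v1 = Confusion(tp=708, fp=292, tn=0, fn=0)
# D-separation V1 on the adversarial set: recall from the reported TP and FN.
dsep_v1_recall = recall(tp=247, fn=461)
```

Here `transitivity_v1.accuracy` is 0.708 even though `transitivity_v1.yes_rate` is 1.0, and `dsep_v1_recall` is about 0.349, matching the 34.9% recall reported above.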
### B\.2 Prediction Distribution Histograms
#### B\.2\.1 Standard Evaluation
Figure 1: Prediction distribution of the pretrained Gemma\-3 270M model on the standard evaluation suite \(Length, Branching, Reversed, Shuffled, Long Names\)\.

Figure 2: Collapsed baseline models: Transitivity V1 \(left\) and D\-Separation V1 \(right\) on the standard evaluation tasks\.

Figure 3: Semantic\-loss fine\-tuned models: Transitivity V4 \(left\) and D\-Separation V2 \(right\) on the standard evaluation tasks\.
#### B\.2\.2 Adversarial Evaluation \(Structural Robustness\)
Figure 4: Pretrained Gemma\-3 270M model on adversarial structural robustness tests \(Irrelevant nodes, Broken chains, Long chains\)\.

Figure 5: Collapsed baseline models \(Transitivity V1, D\-Separation V1\) on adversarial examples\.

Figure 6: Semantic\-loss fine\-tuned models \(Transitivity V4, D\-Separation V2\) on adversarial structural robustness tests\.
## Code and Data Availability
All code, trained models, and evaluation datasets are publicly available to ensure full reproducibility\.
- •Code & Experiments: The GitHub repository contains data generation scripts for both transitivity and d\-separation tasks \(generating the 50,000\-sample training sets and adversarial evaluation sets\), along with the comprehensive Colab notebook \(gemma\_semantic\.ipynb\) documenting all experiments – baseline fine\-tuning, semantic loss versions V1 through V4, dynamic lambda scheduling implementation, and the full evaluation pipeline: [https://github\.com/inquisitour/semantic\-loss\-causal\-reasoning](https://github.com/inquisitour/semantic-loss-causal-reasoning)
- •Datasets: Training sets \(50,000 examples per task\), five standard evaluation sets \(10,000 samples each: length, branching, reversed, shuffled, long names\), and the adversarial structural robustness set \(1,000 samples\) are available at: [https://huggingface\.co/datasets/ludwigw/causal\-reasoning\-benchmarks](https://huggingface.co/datasets/ludwigw/causal-reasoning-benchmarks)