Disentangling Mathematical Reasoning in LLMs: A Methodological Investigation of Internal Mechanisms

arXiv cs.CL 04/20/26, 04:00 AM Papers

Summary

This paper investigates how large language models perform arithmetic operations by analyzing internal mechanisms through early decoding, revealing that proficient models exhibit a clear division of labor between attention and MLP modules in reasoning tasks.

arXiv:2604.15842v1 Announce Type: new Abstract: Large language models (LLMs) have demonstrated impressive capabilities, yet their internal mechanisms for handling reasoning-intensive tasks remain underexplored. To advance the understanding of model-internal processing mechanisms, we present an investigation of how LLMs perform arithmetic operations by examining internal mechanisms during task execution. Using early decoding, we trace how next-token predictions are constructed across layers. Our experiments reveal that while the models recognize arithmetic tasks early, correct result generation occurs only in the final layers. Notably, models proficient in arithmetic exhibit a clear division of labor between attention and MLP modules, where attention propagates input information and MLP modules aggregate it. This division is absent in less proficient models. Furthermore, successful models appear to process more challenging arithmetic tasks functionally, suggesting reasoning capabilities beyond factual recall.


Cached at: 04/20/26, 08:29 AM

# Disentangling Mathematical Reasoning in LLMs: A Methodological Investigation of Internal Mechanisms

Source: https://arxiv.org/html/2604.15842

Tanja Baeumel (1, 2, 3), Josef van Genabith (1, 2), Simon Ostermann (1, 2, 3)
(1) German Research Center for Artificial Intelligence (DFKI)
(2) Saarland University
(3) Center for European Research in Trusted AI (CERTAIN)
[email protected]

## Abstract

Large language models (LLMs) have demonstrated impressive capabilities, yet their internal mechanisms for handling reasoning-intensive tasks remain underexplored. To advance the understanding of model-internal processing mechanisms, we present an investigation of how LLMs perform arithmetic operations by examining internal mechanisms during task execution. Using early decoding, we trace how next-token predictions are constructed across layers. Our experiments reveal that while the models recognize arithmetic tasks early, correct result generation occurs only in the final layers. Notably, models proficient in arithmetic exhibit a clear division of labor between attention and MLP modules, where attention propagates input information and MLP modules aggregate it. This division is absent in less proficient models. Furthermore, successful models appear to process more challenging arithmetic tasks functionally, suggesting reasoning capabilities beyond factual recall.

## 1 Introduction

The increasingly impressive capabilities of large language models (LLMs) are generating a growing interest in the underlying mechanisms that facilitate their exceptional performance. Recent work in interpretability has allowed the community to gradually build a better understanding of how transformer-based models retrieve and use factual information that is implicitly stored in their parameters. However, our understanding of the mechanisms employed by LLMs to solve non-factual and reasoning-intensive tasks is less well developed.

In this study, we investigate the mechanisms by which LLMs tackle mathematical reasoning. We focus on a task that is straightforward both to evaluate and to interpret: we examine how LLMs perform basic arithmetic operations (e.g., "Please calculate 143 + 81 =") and analyze the differences in the internal mechanisms of models exhibiting different degrees of arithmetic proficiency.

We investigate the model-internal mechanisms via the interpretability method of early decoding: We observe how the model's next token prediction is constructed throughout the layers, by de-embedding the residual stream after each attention and MLP module update (i.e., mapping it to a word), which reveals the contributions of the individual modules to the result generation process.

We present three sets of experiments to understand (1) how and when the task is recognized and the result generated, (2) how and when information from the input is propagated through the layers of the network and (3) how operands are combined, by looking at settings with 2 and 3 operands. Our main findings are:

- We show that LLMs recognize the task at hand in early layers, but the generation of the correct output happens late.
- We find strong indications that LLMs solve arithmetic tasks in a function-like manner, where one operand serves as an argument and the other operand is altered based on the operator and the 'argument-operand', indicating processing capabilities beyond factual recall.
- We show that models with good arithmetic capabilities show a clear division of tasks between attention and MLP modules: Similar to previous work, we find that the MLP modules *aggregate* information, while the attention modules *propagate* information. This division of tasks is absent in models that struggle with arithmetic tasks.

## 2 Methodology

To investigate the mechanisms that decoder-only transformer language models use to approach and solve arithmetic tasks, we observe how the model's next-token prediction is constructed throughout the layers. Using early decoding, we project the residual stream after each attention and MLP module into the vocabulary space, which yields intermediate predictions. This lets us see which token the residual stream most strongly encodes after each module's update and thus reveals the contributions of individual modules to solving the task. We analyze intermediate predictions on simple arithmetic tasks for two decoder-only language models with drastically different arithmetic capabilities, to understand what mechanisms enable models to generate correct results for arithmetic tasks.

### 2.1 Early Decoding

The method we employ for investigating the mechanisms that LLMs use to solve arithmetic tasks was initially introduced as 'logit lens'. It allows insights into how transformer-based models update the next token prediction throughout the generation process, by projecting the residual stream of the predicted token into the vocabulary space at intermediate layers.

Each attention and MLP component in a transformer-based LLM takes as input the current representation of the final token, i.e., the residual stream. Each component then updates the residual stream by adding its output to that input representation. The representation of the next-token prediction at layer k thus includes all the preceding additive updates that have been made to the predicted token representation by the modules in previous layers.
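
The additive structure described above can be written compactly. The notation is our own shorthand for illustration and does not appear in the original text: $h^{(0)}$ is the initial representation of the final token, and $a^{(i)}$ and $m^{(i)}$ are the outputs that the attention and MLP modules of layer $i$ add to the stream.

```latex
\begin{aligned}
h^{(k)}_{\text{attn}} &= h^{(0)} + \sum_{i=1}^{k-1} \bigl( a^{(i)} + m^{(i)} \bigr) + a^{(k)}, \\
h^{(k)}_{\text{mlp}}  &= h^{(k)}_{\text{attn}} + m^{(k)}.
\end{aligned}
```

Under this reading, every intermediate state is a plain sum of module outputs, which is what makes it meaningful to decode the stream after each individual update.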

A language modeling head (LM head), i.e., a linear prediction layer, is used in LLMs to produce probability distributions for the next token based on the last token representation. This LM head can also be applied to intermediate representations from middle layers. This effectively implements a de-embedding mechanism that allows us to investigate the most likely intermediate prediction at each step of the generation process. Such intermediate predictions in turn allow us to retrace the changes made to the representation by each module, and to determine the contributions of individual modules to the result generation process.
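
The procedure can be illustrated with a minimal, pure-Python sketch. Everything in it is a toy assumption made for illustration: the three-token vocabulary, the identity unembedding matrix, the update vectors, and the function names (`early_decode`, `lm_head`) are hypothetical and not taken from the paper. Real models typically also apply a final layer normalization before the LM head, which this sketch omits.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def lm_head(hidden, unembed):
    """Linear LM head: one logit per vocabulary row (dot product with hidden)."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in unembed]

def early_decode(h0, updates, unembed, vocab):
    """Decode the residual stream after every additive module update.

    updates: list of (layer, module_name, vector) in execution order.
    Returns a list of (layer, module_name, top_token, top_probability).
    """
    hidden = list(h0)
    trace = []
    for layer, module, delta in updates:
        # Residual update: the module output is added to the stream.
        hidden = [h + d for h, d in zip(hidden, delta)]
        probs = softmax(lm_head(hidden, unembed))
        best = max(range(len(vocab)), key=probs.__getitem__)
        trace.append((layer, module, vocab[best], probs[best]))
    return trace

# Toy setup (all values hypothetical): 3-dim stream, 3-token vocabulary.
VOCAB = ["the", "7", "442"]
UNEMBED = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
H0 = [1.0, 0.0, 0.0]
UPDATES = [
    (1, "attn", [0.0, 2.0, 0.0]),
    (1, "mlp", [0.0, 0.0, 0.5]),
    (2, "attn", [0.0, 0.0, 1.0]),
    (2, "mlp", [0.0, 0.0, 3.0]),
]
TRACE = early_decode(H0, UPDATES, UNEMBED, VOCAB)
for layer, module, token, prob in TRACE:
    print(f"layer {layer} post-{module}: {token!r} (p={prob:.2f})")
```

In this toy trace the top intermediate prediction stays on an unrelated number until the last MLP update pushes the stream toward "442", mirroring the kind of per-module reading that early decoding provides.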

### 2.2 Task and Data

To create a controlled environment for observing LLM behavior on arithmetic tasks, we generate an artificial dataset with arithmetic tasks, which we control with respect to operators (summation, denoted as *add* henceforth, and subtraction, denoted as *sub*), operand size, number of operands, and result size.

We prompt the models with queries of the type "Please calculate operand 1 ○ operand 2 =", where operand 1, operand 2 ∈ ℕ and ○ ∈ {+, −}, for example "Please calculate 306 + 136 =", with the correct response being "442".

For each operator ○ ∈ {+, −}, we create one dataset with smaller numbers (operands and results), *add*_small and *sub*_small, and one with larger numbers, *add*_large and *sub*_large. For the *small* datasets, both operands and the result are ≤ 99. Across all datasets, we ensure during creation that all operands and results are integers ≤ 520. We choose this upper bound because it guarantees single-token number representations: the vocabularies of both the GPT-2 XL and GPT-NeoX-20B tokenizers contain individual tokens for all integers between 0 and 520, whereas higher numbers may be encoded as multiple tokens. Each dataset contains 500 unique queries. All experiments are zero-shot, without training or adaptation of the models.
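
A dataset with these constraints could be generated along the following lines. This is a sketch under stated assumptions, not the authors' generation script: the function name `make_dataset`, the uniform sampling, the fixed seed, the inclusion of 0 as a valid operand, and the use of an ASCII hyphen for subtraction are all our own choices. The text also does not say how the *large* datasets differ from the *small* ones beyond the upper bound, so the sketch simply samples from the full range.

```python
import random

def make_dataset(operator, max_value, n_queries=500, seed=0):
    """Sample unique queries whose operands and result all lie in [0, max_value]."""
    if operator not in ("+", "-"):
        raise ValueError("operator must be '+' or '-'")
    rng = random.Random(seed)
    seen = set()
    dataset = []
    while len(dataset) < n_queries:
        a = rng.randint(0, max_value)
        b = rng.randint(0, max_value)
        result = a + b if operator == "+" else a - b
        # Reject out-of-range results and duplicate operand pairs.
        if not 0 <= result <= max_value or (a, b) in seen:
            continue
        seen.add((a, b))
        dataset.append({
            "prompt": f"Please calculate {a} {operator} {b} =",
            "answer": str(result),
        })
    return dataset

# Bounds taken from the text: 99 for the small sets, 520 overall.
ADD_SMALL = make_dataset("+", 99)
ADD_LARGE = make_dataset("+", 520)
SUB_SMALL = make_dataset("-", 99)
SUB_LARGE = make_dataset("-", 520)

print(ADD_LARGE[0]["prompt"], ADD_LARGE[0]["answer"])
```

Rejection sampling keeps the sketch short; with 500 queries drawn from several thousand admissible operand pairs, the loop terminates quickly.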

In the remainder of the paper, we focus on evaluation of the *add* datasets, as we find structurally similar results for the datasets of both operators. We report differences between addition and subtraction where necessary. A full evaluation of the *sub* datasets is provided in the Appendix.

### 2.3 Models

We experiment with two decoder-only transformer language models: GPT-NeoX-20B and GPT-2 XL. We use the freely available EleutherAI/gpt-neox-20b (https://huggingface.co/EleutherAI/gpt-neox-20b/tree/main) and openai-community/gpt2-xl (https://huggingface.co/openai-community/gpt2-xl) variants on Hugging Face. We confirm previous findings on GPT-2 XL's inability to solve simple arithmetic tasks, while GPT-NeoX-20B performs well. Thus, in the remainder of the paper, we focus on GPT-NeoX-20B. Nevertheless, we present results on GPT-2 XL in the Appendix, as the differences in the internal mechanisms compared to GPT-NeoX-20B are of interest.

## 3 Experiment Set 1: Task Recognition and Result Generation

By investigating intermediate predictions (IPs) after MLP and attention modules, we first examine two fundamental questions: *When does the model recognize that it needs to perform an arithmetic task?* and *When does it start to generate the result?* We find answers to these questions primarily in the post-MLP predictions:

- The model recognizes that it has to solve a numerical task early, and considers unspecific numerical tokens until the mid layers, where the operands are loaded.
- The correct result is only predicted in the last layer.

### 3.1 When is the Task Recognized?

#### The model predicts numerical tokens early.

We observe the probability mass of numerical tokens in the post-MLP and post-attention IPs across different layers. The averaged probability mass assigned to numerical tokens in the *add*_large dataset shows a sharp increase in the proportion of numerical predictions around layer 9 in the post-MLP IPs, which could indicate that the model begins to recognize the numerical nature of the task. We find similar general trends in the *add*_small dataset.
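
The averaged probability mass on numerical tokens can be computed as sketched below. The helper names and the rule for what counts as a numerical token (ASCII digits only, after stripping whitespace) are assumptions for illustration; the text does not specify the exact criterion. The distributions are toy values, not measurements.

```python
def is_numeric(token):
    """Assumed criterion: the token is all ASCII digits once whitespace is stripped."""
    stripped = token.strip()
    return stripped.isascii() and stripped.isdigit()

def numeric_mass(probs):
    """Total probability assigned to numerical tokens in one distribution."""
    return sum(p for token, p in probs.items() if is_numeric(token))

def mean_numeric_mass_per_layer(per_query):
    """per_query[q][layer] is a dict mapping token -> probability.

    Returns one averaged numerical probability mass per layer.
    """
    n_layers = len(per_query[0])
    return [
        sum(numeric_mass(query[layer]) for query in per_query) / len(per_query)
        for layer in range(n_layers)
    ]

# Toy intermediate predictions for two queries and two layers (hypothetical).
PER_QUERY = [
    [{"the": 0.9, " 4": 0.1}, {"the": 0.2, " 4": 0.5, "442": 0.3}],
    [{"a": 0.7, "12": 0.3}, {"a": 0.4, "12": 0.6}],
]
MASS = mean_numeric_mass_per_layer(PER_QUERY)
print([round(m, 2) for m in MASS])
```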

#### Numerical tokens are only predicted with high confidence in the last layer.

The average proportion of numerical tokens within the top 1 and top 10 IPs on the *add*_large dataset shows that the proportion of numerical predictions among the top 10 predicted tokens post-MLP is between 10% and 40% for early-mid layers (layers 9 to 17) and for mid-late layers (layers 30 to 43). However, the model only predicts a numerical token as the top prediction in the very last layer, i.e., layer 44.
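
The top-k proportion used here differs from probability mass: it counts how many of the k highest-ranked tokens are numerical, regardless of how much probability they carry. A minimal sketch follows, with hypothetical helper names, an assumed digit-only criterion for numerical tokens, and a toy distribution.

```python
def is_numeric(token):
    """Assumed criterion: all ASCII digits once whitespace is stripped."""
    stripped = token.strip()
    return stripped.isascii() and stripped.isdigit()

def top_k_tokens(probs, k):
    """Tokens of a dict token -> probability, sorted by descending probability."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return ranked[:k]

def numeric_share_top_k(probs, k):
    """Fraction of the top-k predicted tokens that are numerical."""
    top = top_k_tokens(probs, k)
    if not top:
        raise ValueError("empty distribution")
    return sum(is_numeric(token) for token in top) / len(top)

# Toy intermediate prediction (hypothetical values).
PROBS = {"the": 0.4, " 12": 0.3, "and": 0.2, "7": 0.1}
SHARE_TOP1 = numeric_share_top_k(PROBS, 1)
SHARE_TOP4 = numeric_share_top_k(PROBS, 4)
print(SHARE_TOP1, SHARE_TOP4)
```

In this toy case half of the top 4 tokens are numerical while the top 1 token is not, which is the same qualitative pattern the paragraph describes for the layers before the last one.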

#### The behavior of attention layers differs significantly.

The patterns observed for the numerical predictions after the attention modules are very different. We observe significant spikes in numerical predictions at specific layers, particularly at layers 9, 12, 14, and 21, where 87% to 99% of the probability mass is on numerical tokens on average. Previous work has shown that attention modules are responsible for propagating information between positions; thus we conjecture that these spikes indicate that important numerical information may be attended to in the final token position or propagated to the final token position via the prominent attention modules. We investigate these findings in more detail in the following section.

### 3.2 When is the Correct Result Generated?

#### Numerical predictions are unrelated to the correct result until late layers.

To understand when the result is generated, we examine the similarity of numerical IPs to the correct result throughout the layers. We analyze the absolute error, defined as the absolute difference between the predicted number and the correct number, within the top 10 and top 1 predicted tokens. Two findings stand out. First, in the post-MLP IPs, the predicted numerical tokens are generally far from the correct result in early and mid layers; the error begins to shrink incrementally after layer 35, which corresponds to the layers where the correct result is assigned a higher probability. Second, there is a notable decrease in absolute error for the top 1 and top 10 post-attention predictions in layer 28, indicating that the model has a reasonable approximation of the magnitude of the correct result at that layer.
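
One way to compute this error is sketched below. The text does not state how errors are aggregated over the top 10 tokens or how non-numerical tokens are handled, so averaging over the numerical tokens in the top k and returning `None` when there are none are assumptions, as are the function names and the toy distribution.

```python
def numeric_values(tokens):
    """Integer values of the tokens that consist only of ASCII digits."""
    values = []
    for token in tokens:
        stripped = token.strip()
        if stripped.isascii() and stripped.isdigit():
            values.append(int(stripped))
    return values

def abs_errors_top_k(probs, correct, k):
    """|predicted - correct| for each numerical token among the top-k predictions."""
    ranked = sorted(probs, key=probs.get, reverse=True)[:k]
    return [abs(value - correct) for value in numeric_values(ranked)]

def mean_abs_error_top_k(probs, correct, k):
    """Mean absolute error over numerical top-k tokens, or None if there are none."""
    errors = abs_errors_top_k(probs, correct, k)
    return sum(errors) / len(errors) if errors else None

# Toy intermediate prediction for "Please calculate 306 + 136 =" (hypothetical values).
PROBS = {"the": 0.5, "400": 0.2, "442": 0.15, "7": 0.1, "and": 0.05}
ERR_TOP1 = mean_abs_error_top_k(PROBS, 442, 1)
ERR_TOP3 = mean_abs_error_top_k(PROBS, 442, 3)
ERR_TOP4 = mean_abs_error_top_k(PROBS, 442, 4)
print(ERR_TOP1, ERR_TOP3, ERR_TOP4)
```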

#### The correct result is produced only in very late layers.

We also analyze at which layer the correct result is produced, evaluating both the probability of the result token and its rank among the most probable tokens at each layer. The correct result typically emerges in the later layers of the model: the correct token's probability rises significantly while its rank moves toward the top. For large addition tasks, the correct result appears around layers 35-40, followed by a final sharp increase in probability and improvement in rank in the final layer 44. For smaller addition tasks, the correct result token emerges as early as layer 26 and is anchored as the top-ranked token around layers 32-34. The correct result thus appears earlier in the generation process for easier tasks than for more complex ones. This could indicate different internal mechanisms for solving easier compared to more challenging mathematical reasoning tasks.
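
Tracking the correct token across layers can be sketched as follows. The function names, the 1-based rank convention, the 0-based toy layer indices, and the reading of "anchored" as "top-ranked from this layer through the last one" are assumptions made for illustration; the distributions are toy values.

```python
def rank_and_prob(probs, target):
    """1-based rank and probability of `target` in a dict token -> probability."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return ranked.index(target) + 1, probs[target]

def anchored_from(layer_probs, target):
    """Earliest layer from which `target` stays rank 1 through the final layer.

    Returns None if the target is not top-ranked in the final layer.
    """
    anchored = None
    # Walk backwards from the last layer while the target is still top-ranked.
    for layer in range(len(layer_probs) - 1, -1, -1):
        rank, _ = rank_and_prob(layer_probs[layer], target)
        if rank != 1:
            break
        anchored = layer
    return anchored

# Toy per-layer intermediate predictions (hypothetical values).
LAYERS = [
    {"the": 0.7, "7": 0.2, "442": 0.1},
    {"the": 0.3, "7": 0.2, "442": 0.5},
    {"the": 0.5, "7": 0.1, "442": 0.4},
    {"the": 0.1, "7": 0.1, "442": 0.8},
]
RANKS = [rank_and_prob(p, "442")[0] for p in LAYERS]
ANCHOR = anchored_from(LAYERS, "442")
print(RANKS, ANCHOR)
```

The toy example separates two events the paragraph distinguishes: the correct token briefly reaches rank 1 at an earlier layer, but it only remains top-ranked from the last layer onward.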

## 4 Experiment Set 2: Input Propagation

Our findings in the previous section provide insights into where in the model the task is recognized and the result emerges. However, the specific mechanisms und
