# How Well Do LLMs Perform on the Simplest Long-Chain Reasoning Tasks: An Empirical Study on the Equivalence Class Problem
Source: [https://arxiv.org/html/2605.06882](https://arxiv.org/html/2605.06882)
Lianlong Wu²,∗ Bingqian Li¹ Lvting Liu¹ Yi Zhou¹

¹University of Science and Technology of China ²University of Oxford

zhengchun@mail.ustc.edu.cn, lianlong.wu@cs.ox.ac.uk, bqli315@gmail.com, yi_zhou@ustc.edu.cn

###### Abstract

Large Language Models (LLMs) have achieved great improvements in recent years. Nevertheless, it remains unclear how good LLMs are at reasoning tasks, especially long-chain ones. In this paper, we evaluate LLMs' performance on the simplest yet long-chain reasoning task, namely the Equivalence Class Problem (ECP), i.e., determining whether two variables are equal given a set of randomly generated equivalence relations. We consider both reasoning and non-reasoning representative LLMs over a large variety of problem instances, ranging over different numbers of variables, connectivity probabilities, prompts, and other factors. The experimental results show that non-reasoning LLMs fail ECP, while reasoning models are significantly better but still struggle to completely solve this problem. Interestingly, considering various connectivity probabilities with a fixed number of variables, we observe that, for non-reasoning models, the hardest problem instances coincide with the phase transition point of $\frac{\ln n}{n-1}$, suggesting the chaos of the problem; in contrast, for reasoning models, the hardest ones coincide with the biggest diameter, suggesting the reasoning difficulty of the problem.

∗ These authors contributed equally.

## 1 Introduction

The evolution of Large Language Models (LLMs) has recently reached a watershed moment. The field currently exhibits a bifurcation into two coexisting paradigms: one consists of standard non-reasoning models, such as DeepSeek-V3 ([Liu et al., 2024](https://arxiv.org/html/2605.06882#bib.bib15)) and Qwen3-Max ([Qwen Team, 2025](https://arxiv.org/html/2605.06882#bib.bib23)), which rely on massive parameter counts and extensive training data for rapid pattern matching and text generation. The other comprises the recently emerging reasoning models, including Qwen3-Max-thinking ([Qwen Team, 2025](https://arxiv.org/html/2605.06882#bib.bib23)), DeepSeek-V3.2-thinking ([Liu et al., 2025](https://arxiv.org/html/2605.06882#bib.bib16)), DeepSeek-R1 ([Guo et al., 2025](https://arxiv.org/html/2605.06882#bib.bib10)), and Claude 4.5 Sonnet Thinking ([Anthropic, 2025](https://arxiv.org/html/2605.06882#bib.bib1)). The distinguishing feature of the latter is the incorporation of Chain-of-Thought (CoT) mechanisms, enabling the generation of detailed reasoning traces before outputting a final answer. This "System 2" approach has demonstrated impressive performance gains on complex mathematical and programming benchmarks and is widely regarded as a crucial step toward more general artificial intelligence ([Wei et al., 2022](https://arxiv.org/html/2605.06882#bib.bib27); [Kojima et al., 2022](https://arxiv.org/html/2605.06882#bib.bib13)).

However, despite the superior performance of both model paradigms on general leaderboards, the academic community lacks a deep understanding of their internal mechanisms and capability boundaries in long-chain reasoning scenarios. Current mainstream evaluation paradigms rely primarily on existing benchmarks such as AIME ([Mathematical Association of America, 2026](https://arxiv.org/html/2605.06882#bib.bib19)) or MATH ([Hendrycks et al., 2021](https://arxiv.org/html/2605.06882#bib.bib11)). These paradigms suffer from two core limitations. First, data contamination and memorization effects: models may simply memorize the mapping between questions and answers rather than truly mastering logical rules ([McCoy et al., 2023](https://arxiv.org/html/2605.06882#bib.bib20)). Second, a lack of fine-grained control over complexity: existing mathematical problems often entangle calculation, common sense, and logic, making it difficult to isolate a single variable and controllably test the impact of reasoning depth on performance ([Shojaee et al., 2025](https://arxiv.org/html/2605.06882#bib.bib24); [Estermann et al., 2024](https://arxiv.org/html/2605.06882#bib.bib7)). A critical question remains unresolved: when semantic cues are stripped away, leaving only abstract logic, how do non-reasoning and reasoning models fundamentally differ? Are they performing genuine logical deduction, or merely engaging in complex probabilistic curve fitting?

To strictly evaluate logical capabilities, this study constructs a synthetic task environment based on the Equivalence Class Problem. This task is conceptually simple, requiring no external knowledge, yet it serves as a rigorous probe for long-chain reasoning by necessitating multi-step recursive deduction. Mathematically, the problem is equivalent to identifying connected components in a graph, where the model must infer global membership relationships solely through the transitivity of local equivalence links. This environment offers the unique advantage of isolating pure logical deduction from semantic interference, while being capable of generating problem instances that naturally span a diverse spectrum of inference depths.
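
For contrast with the LLM behavior studied in this paper, the classical solution is a few lines of union-find; the sketch below (ours, not from the paper, with illustrative names) decides every pairwise query exactly, regardless of chain length.

```python
# Union-find baseline for the ECP: each stated relation a_i = a_j is a
# union; each query compares the two class representatives.
def solve_ecp(n, relations, queries):
    parent = list(range(n))

    def find(x):
        # Path compression keeps representative lookups near O(1).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in relations:
        parent[find(i)] = find(j)  # merge the two equivalence classes

    return [find(u) == find(v) for u, v in queries]

# a1 = a2 and a2 = a3 imply a1 = a3 by transitivity, but a4 stays separate:
print(solve_ecp(4, [(0, 1), (1, 2)], [(0, 2), (0, 3)]))  # [True, False]
```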

Our empirical research is based on a comprehensive evaluation of representative models from both non-reasoning (e.g., DeepSeek-V3) and reasoning (e.g., DeepSeek-R1) paradigms. By systematically sweeping connectivity probabilities across varying problem scales (up to $n=144$) to cover the critical phase transition region, we reveal three key characteristics of current LLMs when processing abstract logic:

First, the error peaks diverge within the critical phase transition window (between $1/n$ and $\frac{\ln n}{n-1}$). DeepSeek-V3.2 (non-reasoning model) is overwhelmed by structural chaos, with its error rates peaking near the connectivity threshold $\frac{\ln n}{n-1}$. In contrast, DeepSeek-R1 (reasoning model) exhibits a depth-dependent failure: its error rates peak earlier in the transition, closer to $1/n$, precisely where the average inference depth reaches its maximum. This confirms that while standard models struggle with topological complexity, reasoning models are primarily constrained by the length of the deduction chain.

Second, non-reasoning models exhibit a severe limitation in handling multi-step dependencies. Our experiments with DeepSeek-V3.2 reveal that while the model performs robustly on single-hop deduction (Depth = 1), it fails to generalize to longer chains. Specifically, the error rate increases sharply at Depth = 2 and saturates to near-random levels by Depth = 3. This indicates that standard Transformer architectures struggle to maintain logical coherence beyond immediate connections, regardless of model scale.

Third, reasoning models remain sensitive to the length of the reasoning chain. Although CoT-enhanced models like DeepSeek-R1 achieve significantly lower overall error rates than non-reasoning baselines, they have not achieved complete generalization. We observe that as the inference depth increases linearly, the error rate of these models grows exponentially. This indicates that while Chain-of-Thought mechanisms effectively extend the range of solvable steps, they do not fundamentally eliminate the accumulation of errors in recursive reasoning.

Furthermore, extensive ablations reveal that explicit rules and few-shot examples offer negligible gains, pointing to an execution bottleneck. While graph-theoretic framing significantly outperforms abstract logic, neither context isolation nor multi-path sampling (Pass@5) can prevent the fundamental reasoning collapse at the critical phase transition.

The main contributions of this paper are as follows:

- We construct a controlled experimental platform based on the Equivalence Class Problem (ECP), providing a benchmark for assessing the pure logical reasoning boundaries of LLMs.
- We quantitatively characterize the failure of non-reasoning models in multi-step deduction, revealing their specific inability to perform reliable reasoning beyond single-hop tasks despite high accuracy on direct queries.
- We demonstrate that even in state-of-the-art reasoning models, long-chain capabilities remain subject to exponential error growth, refuting the optimistic assumption that "reasoning models have solved logical generalization".
- We provide empirical evidence that static prompt engineering is insufficient to overcome structural reasoning barriers, highlighting the necessity of System 2-like active computation.

## 2 Related Work

#### Evolution of LLMs and Reasoning Paradigms

The rapid development of Large Language Models (LLMs) is widely regarded as a critical path toward Artificial General Intelligence (AGI), where the emergence of intelligence and reasoning capabilities has remained a core focus of research ([McCoy et al., 2023](https://arxiv.org/html/2605.06882#bib.bib20); [Nezhurina et al., 2024](https://arxiv.org/html/2605.06882#bib.bib21); [Hu et al., 2024](https://arxiv.org/html/2605.06882#bib.bib12)). Early research on Chain-of-Thought (CoT) found that guiding models to output intermediate reasoning steps before generating the final answer ([Wei et al., 2022](https://arxiv.org/html/2605.06882#bib.bib27); [Zhou et al., 2022](https://arxiv.org/html/2605.06882#bib.bib30); [Kojima et al., 2022](https://arxiv.org/html/2605.06882#bib.bib13)) significantly improves accuracy on complex tasks. This discovery influenced the inception of "reasoning models": through reinforcement learning (RL) fine-tuning, models learn to explicitly generate implicit thinking processes within special `<think>` tokens ([Guo et al., 2025](https://arxiv.org/html/2605.06882#bib.bib10)). This paradigm shift has not only significantly boosted performance across various benchmarks but also endowed models with certain generalization capabilities ([Ma et al., 2025](https://arxiv.org/html/2605.06882#bib.bib17); [Xu et al., 2025](https://arxiv.org/html/2605.06882#bib.bib28)). In the latest frontier advancements (e.g., DeepSeek-V3.2 ([Liu et al., 2025](https://arxiv.org/html/2605.06882#bib.bib16)), Qwen3-Max ([Qwen Team, 2025](https://arxiv.org/html/2605.06882#bib.bib23))), this "thinking" capability is becoming increasingly modularized, allowing users to dynamically toggle it during the inference phase based on task requirements, marking a new trend toward the flexible allocation of reasoning computational resources.

#### Nature and Limitations of Reasoning Capabilities

As reasoning models become ubiquitous, the academic community has begun to deeply dissect their internal mechanisms and capability boundaries ([Chen et al., 2025](https://arxiv.org/html/2605.06882#bib.bib3); [Li et al., 2025](https://arxiv.org/html/2605.06882#bib.bib14)). On one hand, microscopic analysis of reasoning traces has revealed complex behavioral patterns, ranging from self-reflection to "overthinking" ([Chen et al., 2024](https://arxiv.org/html/2605.06882#bib.bib2); [Marjanović et al., 2025](https://arxiv.org/html/2605.06882#bib.bib18); [Sui et al., 2025](https://arxiv.org/html/2605.06882#bib.bib25)). On the other hand, controversy remains regarding whether RL truly endows models with "novel" reasoning capabilities. Some studies point out that when controlling for computational cost (e.g., pass@k testing), the performance difference between RL-trained reasoning models and base models is often negligible on many tasks, suggesting that RL may be eliciting existing capabilities rather than creating new ones ([Yue et al., 2025](https://arxiv.org/html/2605.06882#bib.bib29)). Furthermore, research focusing on classic algorithmic puzzles like the Tower of Hanoi and Checkers finds that as problem complexity increases, both reasoning and non-reasoning models suffer from performance collapse ([Shojaee et al., 2025](https://arxiv.org/html/2605.06882#bib.bib24)). Unlike these puzzles, which focus on testing "algorithmic execution" skills, this study constructs abstract Equivalence Class Problems aimed at stripping away semantic interference to purely evaluate the reasoning robustness of models in long-chain logical contexts.

![Refer to caption](https://arxiv.org/html/2605.06882v1/images/max_err.png)

Figure 1: Scaling Limit. Maximum error rate vs. problem size ($n$). The left panel reports non-reasoning models, whose peak error rises sharply as the variable set grows. The right panel reports reasoning models, which substantially reduce the error scale but still expose nonzero failures on larger instances. Each point is the maximum error observed over the swept probability range for that problem size.
#### Controllable Evaluation Environments

To overcome the limitations of traditional static benchmarks (such as data contamination and unquantifiable difficulty), researchers are increasingly turning to "Controllable Evaluation Environments" ([Estermann et al., 2024](https://arxiv.org/html/2605.06882#bib.bib7); [Valmeekam et al., 2022](https://arxiv.org/html/2605.06882#bib.bib26); [Gui et al., 2025](https://arxiv.org/html/2605.06882#bib.bib9); [Shojaee et al., 2025](https://arxiv.org/html/2605.06882#bib.bib24)). The core advantage of such environments lies in their ability to quantitatively map model performance boundaries by finely tuning problem complexity via parameterization, while maintaining consistent logical structures (such as rule definitions). Following this methodology, this study designs an experimental framework based on equivalence classes. This framework allows for the systematic control of inference depth and probability density, providing a controlled testbed for comparative analysis of behavioral differences between reasoning and non-reasoning models in long-chain reasoning tasks.

## 3 Methodology

In this section, we formalize the evaluation of long-chain reasoning capabilities as an Equivalence Class Partitioning problem rooted in graph theory. This controlled environment allows us to precisely modulate the reasoning width and depth by adjusting the problem size and connectivity probability.

### 3.1 Task Formalization

We model the equivalence class problem using a random graph framework. Given the number of variables $n$ and a connectivity probability $p$, the task is defined over a graph $G=(V,E)$:

#### Variables.

Let $V=\{v_1,v_2,\dots,v_n\}$ be a set of $n$ distinct abstract variables (e.g., $\{a_1,a_2,\dots,a_n\}$).

#### Equivalence Relations.

We employ the Erdős-Rényi model $G(n,p)$ to generate the underlying logical structure ([Gilbert, 1959](https://arxiv.org/html/2605.06882#bib.bib8); [Erdős and Rényi, 1959](https://arxiv.org/html/2605.06882#bib.bib5)). For every pair of distinct variables $(v_i,v_j)$, a direct equivalence relation (represented as an undirected edge) is established independently with probability $p$. Let $E$ denote the set of all generated direct relations. In this setting, $p$ also corresponds to the expected graph density, since the expected fraction of realized edges among all $\binom{n}{2}$ possible undirected edges is $p$.
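
A minimal sketch of this instance generator, assuming the standard $G(n,p)$ construction described above (function and variable names are ours):

```python
import itertools
import random

def sample_relations(n, p, seed=None):
    """Keep each of the C(n,2) undirected edges independently with prob. p."""
    rng = random.Random(seed)
    return [(i, j) for i, j in itertools.combinations(range(n), 2)
            if rng.random() < p]

# Each sampled edge (i, j) is rendered as one equivalence statement.
edges = sample_relations(n=10, p=0.2, seed=0)
statements = [f"a{i + 1} = a{j + 1}" for i, j in edges]
```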

### 3.2 Probing Mechanism: Pairwise Queries

While the theoretical objective is to compute the full partition $\mathcal{P}$, evaluating the generative output of full sets can be prone to parsing errors and format hallucinations. To rigorously quantify the model's reasoning accuracy, we simplify the output format into a binary classification task via Pairwise Queries.

For a generated graph $G=(V,E)$, we construct a query set $Q=\{(u_i,v_i)\}_{i=1}^{k}$. For each pair $(u,v)$, the model is presented with the edge list $E$ and asked:

“Based on the relations above, is $u$ equivalent to $v$?”
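
A sketch of how such a query could be rendered as a prompt (the wording here paraphrases the text above; the exact Prompt 1 template is specified in the ablation section):

```python
def make_prompt(edges, u, v):
    """Serialize the edge list E and append one pairwise query."""
    facts = "\n".join(f"a{i + 1} = a{j + 1}" for i, j in edges)
    return (f"{facts}\n\nBased on the relations above, "
            f"is a{u + 1} equivalent to a{v + 1}? Answer Yes or No.")
```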

The ground truth $y(u,v)$ is determined by the graph connectivity:

$$y(u,v)=\mathbb{I}\left[u,v\in\text{SameComponent}(G)\right] \quad (1)$$

where $\mathbb{I}[\cdot]$ is the indicator function. The model is required to output a binary response (Yes/No). This setup allows us to explicitly test whether the model has successfully captured the transitive closure of the equivalence relations.
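
The oracle in Eq. (1) can be computed by any connected-components routine; a sketch using networkx (an implementation choice of ours, not stated in the paper):

```python
import networkx as nx

def ground_truth(n, edges, u, v):
    """I[u, v in SameComponent(G)] via graph reachability."""
    G = nx.Graph()
    G.add_nodes_from(range(n))  # isolated variables form singleton classes
    G.add_edges_from(edges)
    return nx.has_path(G, u, v)
```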

![Refer to caption](https://arxiv.org/html/2605.06882v1/images/89_p_err.png)
(a) Error Rate vs. Probability ($n=89$)

![Refer to caption](https://arxiv.org/html/2605.06882v1/images/144_p_err.png)
(b) Error Rate vs. Probability ($n=144$)

Figure 2: Error Rate vs. Probability Analysis ($n=89$ and $n=144$). The x-axis is the edge probability $p$ used to sample equivalence relations; the y-axis is the query error rate. The two subfigures compare DeepSeek-V3.2 and DeepSeek-R1-250528 at $n=89$ and $n=144$. Vertical reference markers indicate the sparse-chain regime around $1/n$ and the connectivity threshold around $\ln n/(n-1)$, where the model failure modes concentrate.
### 3.3 Evaluation Protocol

To systematically map the capability boundaries of LLMs, we implement a dense sweeping strategy across the probability space:

1. Probability Sweep: We traverse the connectivity probability $p$ from 0 to 1. The sampling density is increased near the critical phase transition points ($1/n$ and $\ln n/(n-1)$) to capture fine-grained behavioral changes ([Erdős and Rényi, 1960](https://arxiv.org/html/2605.06882#bib.bib6)).
2. Instance Sampling: For each probability point $p$, we randomly generate $N_{graphs}$ distinct graph instances to ensure statistical significance.
3. Query Sampling: For each graph instance, we sample $k=10$ random pairs of variables $(u,v)$ as test queries.
4. Metric: Performance is evaluated using the Error Rate, defined as the fraction of queries on which the model's prediction differs from the ground truth $y(u,v)$.

This protocol ensures that our evaluation covers the full spectrum of complexity, from sparse, disconnected graphs to dense, fully connected networks.
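
Putting the four steps together, a sketch of the sweep harness, reusing the `sample_relations` and `ground_truth` sketches above; `query_model` is a hypothetical stand-in for the actual LLM call:

```python
import math
import random

def sweep(n, probs, n_graphs, k=10, seed=0):
    """Error rate per probability point: n_graphs instances x k queries."""
    rng = random.Random(seed)
    error_rate = {}
    for p in probs:
        errors, total = 0, 0
        for _ in range(n_graphs):
            edges = sample_relations(n, p, seed=rng.randrange(10**9))
            for _ in range(k):
                u, v = rng.sample(range(n), 2)
                pred = query_model(edges, u, v)  # hypothetical LLM call
                errors += int(pred != ground_truth(n, edges, u, v))
                total += 1
        error_rate[p] = errors / total
    return error_rate

# Sampling would be densified near the two critical points, e.g. for n = 144:
n = 144
critical = (1 / n, math.log(n) / (n - 1))
```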

## 4 Experiments & Results

### 4.1 Experimental Setup

We conduct experiments using a diverse set of Large Language Models to represent distinct paradigms. For the non-reasoning paradigm, we evaluate DeepSeek-V3 (including snapshots V3-250324 and V3.2) ([Liu et al., 2025](https://arxiv.org/html/2605.06882#bib.bib16), [2024](https://arxiv.org/html/2605.06882#bib.bib15)) alongside Gemini-3-Flash ([DeepMind, 2025](https://arxiv.org/html/2605.06882#bib.bib4)) as a strong baseline. For the reasoning paradigm, we employ DeepSeek-R1 and DeepSeek-R1-250528 ([Guo et al., 2025](https://arxiv.org/html/2605.06882#bib.bib10)), DeepSeek-V3.2-thinking ([Liu et al., 2025](https://arxiv.org/html/2605.06882#bib.bib16)), and GPT-5-mini ([OpenAI, 2025](https://arxiv.org/html/2605.06882#bib.bib22)). To accommodate the extensive edge-list descriptions in dense graphs, the context window is set to 60,000 tokens for all tests. Following the protocol in Section [3.3](https://arxiv.org/html/2605.06882#S3.SS3), we primarily report the Average Error Rate, calculated by aggregating results from 10 randomly sampled pairwise queries across multiple independent graph instances for each configuration. Unless otherwise stated, we employ Prompt 1 with temperature $T=0.01$ (pass@1). The detailed specifications for Prompts 1, 2, and 3 are provided in the ablation analysis (see [4.4.1](https://arxiv.org/html/2605.06882#S4.SS4.SSS0.Px1)).

### 4.2 Main Results: Scaling Limits and Reasoning Collapse

We analyze the fundamental reasoning capabilities of the models across varying problem scales ($n$) and inference depths. The experimental results, encompassing consecutive Fibonacci scales ($n=89,144$), reveal consistent failure patterns that persist regardless of problem size.

![Refer to caption](https://arxiv.org/html/2605.06882v1/images/89_depth_err.png)
(a) Error Rate vs. Inference Depth ($n=89$)

![Refer to caption](https://arxiv.org/html/2605.06882v1/images/144_depth_err.png)
(b) Error Rate vs. Inference Depth ($n=144$)

Figure 3: Inference Depth Analysis ($n=89$ and $n=144$). The x-axis is the shortest-path inference depth between the queried variables, and the y-axis is the corresponding average error rate. DeepSeek-V3.2 collapses immediately beyond single-hop inference (Depth ≥ 2), whereas DeepSeek-R1-250528 keeps errors low but still shows depth-dependent accumulation.

#### Scaling with Problem Size.

As shown in Figure [1](https://arxiv.org/html/2605.06882#S2.F1), the Maximum Error Rate (i.e., the peak error observed across the full probability spectrum $p$) exhibits a clear upward trend as the problem size increases from $n=2$ to $n=144$, but with vastly different magnitudes across paradigms.

- Non-Reasoning Collapse: For DeepSeek-V3 (Figure [1](https://arxiv.org/html/2605.06882#S2.F1), left), the performance degradation is catastrophic. The error rate rises sharply, exceeding 90% at $n=144$ for both the V3 and V3.2 variants. This indicates a near-total breakdown in logical consistency as the search space expands. Interestingly, Gemini-3-Flash exhibits significantly better robustness, capping at approximately 22% error at $n=144$, suggesting that while architecture matters, specific training recipes can mitigate some (but not all) logical deficits in non-reasoning models.
- Reasoning Robustness: Reasoning models (Figure [1](https://arxiv.org/html/2605.06882#S2.F1), right) universally suppress error scales, though with evolving stability profiles. The latest architectures, DeepSeek-V3.2-thinking and GPT-5-mini, demonstrate near-perfect scalability, both maintaining an exceptional error rate of ~1.0% even at $n=144$. In comparison, the pioneering DeepSeek-R1-250528, while still highly robust, exhibits a slight linear rise in errors after $n=34$, reaching ~3.4%. This progression indicates that recent optimizations in "thinking" models have effectively mitigated complexity leakage, further compressing the error margin beyond the initial capabilities of R1.

#### Phase Transitions and Probability ($n=89$ vs. $n=144$).

Figure [2](https://arxiv.org/html/2605.06882#S3.F2) compares the error distributions across the probability spectrum $p$. A universal phenomenon is observed across both problem sizes:

- Error rates are not uniformly distributed but peak precisely near the theoretical phase transition region, specifically aligning with the emergence of the giant component (between $1/n$ and $\ln n/(n-1)$).
- For DeepSeek-V3.2, the error distribution is broad and catastrophic, covering the entire "super-critical" phase. As shown in Figure [2](https://arxiv.org/html/2605.06882#S3.F2) (left), the error rate rapidly saturates to >80% immediately after the critical threshold ($\ln n/(n-1)$), indicating a total failure to navigate complex logical topologies.
- For DeepSeek-R1-250528, the error landscape is fundamentally different. While distinct error spikes still appear near the critical transition point (where logical complexity is maximized), the magnitude is suppressed by orders of magnitude (peaking at <3.5%, compared to V3.2's >85%). This confirms that R1 effectively handles the vast majority of transitive deductions, faltering only at the chaotic edge of connectivity.

![Refer to caption](https://arxiv.org/html/2605.06882v1/images/longcontext.png)

Figure 4: Contextual Robustness of DeepSeek-R1 ($n=144$ and $n=200$). Comparison of error rates under different batching strategies: $100\times 1$ (1 query per graph), $20\times 5$, and $10\times 10$. Left ($n=144$): the $10\times 10$ setting (orange) shows higher error variance than the stable $100\times 1$ setting (blue). Right ($n=200$): as complexity increases, the gap widens. Notably, even the cleanest $100\times 1$ setting fails to solve the problem near the phase transition, indicating fundamental reasoning limitations beyond mere context-window constraints.
#### The Inference Horizon Barrier.

The relationship between average error rate and inference depth (Figure [3](https://arxiv.org/html/2605.06882#S4.F3)) delineates a sharp boundary between the two model paradigms, particularly at the scale of $n=144$ (a sketch of how the depth statistic can be computed follows the list below).

- DeepSeek-V3.2 (Non-Reasoning): The updated results reveal a striking "One-Hop Limit". As illustrated in Figure [3](https://arxiv.org/html/2605.06882#S4.F3) (middle left), V3.2 achieves a surprisingly low error rate of ~4.4% at Depth = 1, suggesting robust performance on direct premise retrieval. However, logical coherence collapses immediately upon extending the chain: the error rate skyrockets to ~58% at Depth = 2 and hits a saturation plateau of ~85% by Depth = 3. This steep "Reasoning Wall" confirms that despite model updates, the non-reasoning paradigm remains fundamentally incapable of reliable multi-step deduction, treating $A\to B\to C$ as disjointed facts rather than a connected chain.
- DeepSeek-R1-250528 (Reasoning): In contrast, R1 demonstrates remarkable robustness. The error rate remains negligible (~0.01%) at shallow depths and maintains a strictly controlled profile even at deeper levels. However, the exponential trend persists: errors grow geometrically from ~0.04% at Depth 3 to ~0.78% at Depth 7. While the absolute error remains low (<1%), the trajectory indicates that even sophisticated reasoning models are subject to multiplicative error accumulation as the reasoning chain lengthens.
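
As referenced above, the depth statistic can be computed as a shortest-path length between the queried variables; a sketch (our reading of the measurement, consistent with the axis label of Figure 3):

```python
import networkx as nx

def inference_depth(n, edges, u, v):
    """Minimum number of transitive hops linking u and v (None if none)."""
    G = nx.Graph()
    G.add_nodes_from(range(n))
    G.add_edges_from(edges)
    try:
        return nx.shortest_path_length(G, u, v)
    except nx.NetworkXNoPath:
        return None  # queried pair lies in different components
```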

### 4.3 Contextual Robustness in Reasoning Models

While the reasoning model DeepSeek-R1 demonstrates superior performance compared to non-reasoning baselines, a critical question remains: is its reasoning capability robust to context saturation? To investigate this, we designed a "Contextual Interference" experiment. We kept the total number of evaluation queries constant (100 queries per probability point) but varied the distribution of queries per graph instance. Specifically, we compared three batching strategies (a sketch of the layouts follows the list):

- Low-Interference ($100\times 1$): 100 graphs, each followed by a single query. This represents the purest reasoning environment with minimal context history.
- Medium-Interference ($20\times 5$): 20 graphs, each followed by 5 consecutive queries.
- High-Interference ($10\times 10$): 10 graphs, each followed by 10 consecutive queries.
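
A minimal sketch of these layouts under the fixed budget of 100 queries per probability point (the pairing of queries with consecutive conversation turns is our reading of the setup):

```python
BATCHING = {"100x1": (100, 1), "20x5": (20, 5), "10x10": (10, 10)}

def layout(setting):
    """Split the fixed 100-query budget across graph instances."""
    n_graphs, queries_per_graph = BATCHING[setting]
    assert n_graphs * queries_per_graph == 100  # constant total budget
    # Each graph's queries are issued as consecutive turns in one context.
    return [(g, q) for g in range(n_graphs) for q in range(queries_per_graph)]
```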

As illustrated in Figure [4](https://arxiv.org/html/2605.06882#S4.F4), we observe two distinct phenomena across both $n=144$ and $n=200$:

#### 1. The Attention Dilution Effect.

There is a clear performance hierarchy: $100\times 1$ (blue) outperforms $20\times 5$ (green), which in turn outperforms $10\times 10$ (orange). The error rate increases significantly when the model is required to resolve multiple queries within a single context window. This suggests that despite the Chain-of-Thought mechanism, reasoning models suffer from attention dilution or state-tracking degradation when maintaining logical state over an extended interaction turn.

#### 2. Intrinsic Logical Limits.

Crucially, even in the most favorable setting ($100\times 1$), the error rate does not converge to zero. As shown in the $n=200$ subplot (right), the $100\times 1$ configuration still exhibits substantial error spikes (reaching >10%) near the phase transition region. This confirms that the observed failures are not solely artifacts of context-length limitations or distraction, but stem from intrinsic deficits in the model's ability to generalize equivalence class rules at scale.

### 4.4 Ablation Studies on Non-Reasoning Models

Given the structural limitations observed in DeepSeek-V3.2 (Section [4.2](https://arxiv.org/html/2605.06882#S4.SS2)), we performed extensive ablation studies to determine whether external prompting strategies or decoding adjustments could mitigate these deficits. All experiments were conducted at the critical scale of $n=144$.

#### 1. Impact of Rule Specification in Prompts.

We evaluated whether the "Reasoning Wall" could be breached by providing explicit logical scaffolding.

- Prompt 1 (Baseline): Full instructions, including explicit transitivity/reflexivity definitions and role-playing (mathematician).
- Prompt 2: Removed explicit rule definitions (relying on the model's internal concept of "equivalence").
- Prompt 3: Removed both rule definitions and role-playing.

![Refer to caption](https://arxiv.org/html/2605.06882v1/images/prompt123.png)
(a) Rule Specification (P1 vs. P2 vs. P3)

![Refer to caption](https://arxiv.org/html/2605.06882v1/images/prompt14.png)
(b) Domain Framing (Logic P1 vs. Graph P4)

![Refer to caption](https://arxiv.org/html/2605.06882v1/images/casePrompt.png)
(c) In-Context Learning (Zero-shot P1 vs. Few-shot P5)

![Refer to caption](https://arxiv.org/html/2605.06882v1/images/passk.png)
(d) Decoding Strategies (Pass@K & Temperature)

Figure 5: Ablation Studies ($n=144$). (a) Explicitly stating logical rules (P1) yields performance nearly identical to implicit prompts (P3). (b) Framing the problem as a graph-theory task (P4) surprisingly outperforms the abstract logic framing (P1). (c) Few-shot examples (P5) perform comparably to zero-shot (P1), indicating that ICL offers limited benefits for global logical reasoning. (d) Higher temperature ($T=0.8$) with Pass@5 (red) improves robustness compared to greedy decoding (blue), but cannot eliminate errors at the phase transition.

As illustrated in Figure [5](https://arxiv.org/html/2605.06882#S4.F5)a, contrary to standard expectations, we observe no significant performance divergence among the three settings (P1 ≈ P2 ≈ P3). The error curves remain tightly coupled across the entire probability spectrum, responding identically to the phase transition. This yields a critical insight: DeepSeek-V3.2's failure in long-chain reasoning is not informational but structural. The model clearly "knows" the rules of equivalence (as evidenced by its high accuracy at Depth = 1), but extra definitions in the context window offer no aid when the fundamental bottleneck is the inability to execute recursive state-tracking across multiple hops. This confirms that static prompt engineering is insufficient to overcome the architectural "One-Hop Limit".

#### 2. Domain Framing: Logic vs. Graph Theory (Prompt 1 vs. 4).

We hypothesized that the model's training corpus might be richer in algorithmic graph problems than in abstract set theory. We rephrased the task from "equivalence classes" (Prompt 1) to "connected components in a graph" (Prompt 4). Surprisingly, the graph-theory framing significantly outperformed the logical framing (Figure [5](https://arxiv.org/html/2605.06882#S4.F5)b). The error rate for Prompt 4 (brown) is consistently lower than that for Prompt 1 (blue), suggesting that activating the model's "graph algorithm" subspace is more effective than its "logical deduction" subspace.

#### 3. The Limited Efficacy of In-Context Learning (Prompt 1 vs. 5).

We investigated whether providing concrete examples could guide the model's reasoning process. We compared the zero-shot baseline (Prompt 1) with a few-shot setup (Prompt 5) that included explicit transitive demonstrations (e.g., "a1=a2, a2=a3 implies a1=a3"). As illustrated in Figure [5](https://arxiv.org/html/2605.06882#S4.F5)c, the few-shot intervention yielded results statistically comparable to, or only marginally better than, the zero-shot baseline. The error curves for Prompt 5 (pink) and Prompt 1 (blue) are closely entangled across the entire probability spectrum. While providing examples prevents some extreme error spikes found in the zero-shot setting, it fails to fundamentally alter the "reasoning collapse" behavior. This suggests that while in-context learning (ICL) helps clarify task formatting or local rules, it is insufficient to induce the deep, recursive algorithms required for global equivalence partitioning.

#### 4. Decoding Strategies (Pass@K).

Finally, we explored whether the correct reasoning path exists in the model's latent space but is suppressed by greedy decoding. We compared Pass@1 and Pass@5 at temperatures $T=0.01$ and $T=0.8$. Results in Figure [5](https://arxiv.org/html/2605.06882#S4.F5)d show that increasing diversity ($T=0.8$) combined with multiple samples (Pass@5) yields the best performance (red triangle curve). However, even with Pass@5, the error rate remains high near the phase transition, indicating that sampling alone cannot fully compensate for the reasoning deficit.
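
A sketch of the Pass@k criterion as we read it here: a query counts as solved if any of k sampled responses matches the ground truth (`sample_response` is a hypothetical stub for one LLM call at the chosen temperature):

```python
def pass_at_k(sample_response, truth, k=5):
    """True if any of k independent samples agrees with the ground truth."""
    return any(sample_response() == truth for _ in range(k))

# Illustrative usage: pass_at_k(lambda: query_model(edges, u, v), truth, k=5)
```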

## 5 Conclusion

This work establishes the Equivalence Class Problem as a controlled diagnostic probe for long-chain reasoning, revealing sharp capability boundaries in current LLMs. Three findings emerge. First, non-reasoning models collapse immediately beyond single-hop deductions, confirming that standard Transformers cannot reliably execute recursive state-tracking. Second, reasoning models exhibit a depth-dependent decay; error rates grow geometrically with chain length, indicating that CoT mechanisms expand the effective reasoning window but do not achieve true logical generalization. Third, the prompt ablations suggest that these deficits stem from intrinsic architectural limitations rather than insufficient instruction.

These findings carry practical implications: deploying LLMs in domains requiring guaranteed multi-step reasoning (e.g., formal verification, legal inference) demands caution, as even state-of-the-art reasoning models exhibit systematic failure modes at scale.

Our study has limitations. ECP is deliberately homogeneous: it isolates transitive equivalence reasoning and does not cover heterogeneous real-world tasks that mix language understanding, external knowledge, planning, and tool use. Our analysis also remains behavioral rather than mechanistic. Future work should probe attention dynamics during reasoning collapse and test whether targeted fine-tuning can extend the effective reasoning depth. Ultimately, despite recent advancements, the Equivalence Class Problem remains an open challenge for Large Language Models.

## References

- Anthropic [2025] Anthropic. Claude Sonnet 4.5, September 2025. [https://www.anthropic.com/news/claude-sonnet-4-5](https://www.anthropic.com/news/claude-sonnet-4-5).
- Chen et al. [2024] Xingyu Chen, Jiahao Xu, Tian Liang, et al. Do not think that much for 2+3=? On the overthinking of o1-like LLMs. arXiv preprint arXiv:2412.21187, 2024.
- Chen et al. [2025] Yanda Chen, Joe Benton, Ansh Radhakrishnan, et al. Reasoning models don't always say what they think. arXiv preprint arXiv:2505.05410, 2025.
- DeepMind [2025] DeepMind. Gemini 3 Flash, September 2025. [https://deepmind.google/models/gemini/](https://deepmind.google/models/gemini/).
- Erdős and Rényi [1959] P. Erdős and A. Rényi. On random graphs I. Publ. Math. Debrecen, 6:290–297, 1959.
- Erdős and Rényi [1960] Paul Erdős and Alfréd Rényi. On the evolution of random graphs. Publ. Math. Inst. Hungar. Acad. Sci., 5:17–61, 1960.
- Estermann et al. [2024] Benjamin Estermann, Luca A. Lanzendörfer, Yannick Niedermayr, and Roger Wattenhofer. PUZZLES: A benchmark for neural algorithmic reasoning. Advances in Neural Information Processing Systems, 37:127059–127098, 2024.
- Gilbert [1959] Edgar N. Gilbert. Random graphs. The Annals of Mathematical Statistics, 30(4):1141–1144, 1959.
- Gui et al. [2025] Jiayi Gui, Yiming Liu, Jiale Cheng, et al. LogicGame: Benchmarking rule-based reasoning abilities of large language models. In Findings of the Association for Computational Linguistics: ACL 2025, pages 1474–1491, 2025.
- Guo et al. [2025] Daya Guo, Dejian Yang, Haowei Zhang, et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature, 645(8081):633–638, 2025.
- Hendrycks et al. [2021] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021.
- Hu et al. [2024] Jiewen Hu, Thomas Zhu, and Sean Welleck. miniCTX: Neural theorem proving with (long-)contexts. arXiv preprint arXiv:2408.03350, 2024.
- Kojima et al. [2022] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, et al. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199–22213, 2022.
- Li et al. [2025] Dacheng Li, Shiyi Cao, Tyler Griggs, et al. LLMs can easily learn to reason from demonstrations: Structure, not content, is what matters! arXiv preprint arXiv:2502.07374, 2025.
- Liu et al. [2024] Aixin Liu, Bei Feng, Bing Xue, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024. [https://arxiv.org/abs/2412.19437](https://arxiv.org/abs/2412.19437).
- Liu et al. [2025] Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, et al. DeepSeek-V3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025. [https://arxiv.org/abs/2512.02556](https://arxiv.org/abs/2512.02556).
- Ma et al. [2025] Xueguang Ma, Qian Liu, Dongfu Jiang, et al. General-Reasoner: Advancing LLM reasoning across all domains. arXiv preprint arXiv:2505.14652, 2025.
- Marjanović et al. [2025] Sara Vera Marjanović, Arkil Patel, Vaibhav Adlakha, et al. DeepSeek-R1 thoughtology: Let's think about LLM reasoning. arXiv preprint arXiv:2504.07128, 2025.
- Mathematical Association of America [2026] Mathematical Association of America. American Invitational Mathematics Examination (AIME), 2026. [https://maa.org/maa-invitational-competitions/](https://maa.org/maa-invitational-competitions/).
- McCoy et al. [2023] R. Thomas McCoy, Shunyu Yao, Dan Friedman, et al. Embers of autoregression: Understanding large language models through the problem they are trained to solve. arXiv preprint arXiv:2309.13638, 2023.
- Nezhurina et al. [2024] Marianna Nezhurina, Lucia Cipolina-Kun, Mehdi Cherti, and Jenia Jitsev. Alice in Wonderland: Simple tasks showing complete reasoning breakdown in state-of-the-art large language models. arXiv preprint arXiv:2406.02061, 2024.
- OpenAI [2025] OpenAI. GPT-5 mini, September 2025. [https://openai.com/zh-Hans-CN/index/introducing-gpt-5/](https://openai.com/zh-Hans-CN/index/introducing-gpt-5/).
- Qwen Team [2025] Qwen Team. Qwen3-Max: Just scale it, September 2025. [https://qwenlm.github.io/](https://qwenlm.github.io/).
- Shojaee et al. [2025] Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, et al. The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity. arXiv preprint arXiv:2506.06941, 2025.
- Sui et al. [2025] Yang Sui, Yu-Neng Chuang, Guanchu Wang, et al. Stop overthinking: A survey on efficient reasoning for large language models. arXiv preprint arXiv:2503.16419, 2025.
- Valmeekam et al. [2022] Karthik Valmeekam, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. Large language models still can't plan (a benchmark for LLMs on planning and reasoning about change). In NeurIPS 2022 Foundation Models for Decision Making Workshop, 2022.
- Wei et al. [2022] Jason Wei, Xuezhi Wang, Dale Schuurmans, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
- Xu et al. [2025] Fengli Xu, Qianyue Hao, Zefang Zong, et al. Towards large reasoning models: A survey of reinforced reasoning with large language models. arXiv preprint arXiv:2501.09686, 2025.
- Yue et al. [2025] Yang Yue, Zhiqi Chen, Rui Lu, et al. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? arXiv preprint arXiv:2504.13837, 2025.
- Zhou et al. [2022] Hattie Zhou, Azade Nova, Hugo Larochelle, et al. Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066, 2022.
