# From 0-Order Selection to 2-Order Judgment: Combinatorial Hardening Exposes Compositional Failures in Frontier LLMs
Source: [https://arxiv.org/html/2605.07268](https://arxiv.org/html/2605.07268)
Hanmeng Liu¹, Shichao Weng², Xiulai Li¹, Zhicai Zhang¹, Anli Yan¹, Xiaozhang Liu¹
¹Hainan University, Haikou, China; ²Fudan University, Shanghai, China
{liuhanmeng,lixiulai01,zzcai,yananli,lxzh}@hainanu.edu.cn, scweng23@m.fudan.edu.cn

###### Abstract

Multiple-choice reasoning benchmarks face dual challenges: rapid saturation from advancing models and data contamination that undermines static evaluations. Ad-hoc hardening methods (paraphrasing, perturbation) attempt to increase difficulty but sacrifice logical validity for surface complexity, falling short of challenging advanced reasoning models. We present LogiHard, a formal framework that deterministically transforms 0-order selection into 2-order logical judgment, significantly increasing thinking overhead and reasoning steps. The framework integrates Item Response Theory (IRT) for computerized adaptive testing (CAT), enabling precise difficulty control with fewer questions than static benchmarks. We instantiate LogiHard-2k, a logical reasoning dataset constructed by cognitively ranking high-stakes examination questions via 9-dimensional analysis of model thinking traces, followed by combinatorial transformation of high-difficulty items. Evaluation across twelve state-of-the-art models reveals accuracy degradation ranging from 31% to 56% on combinatorially hardened questions. LLMs suffer from multi-select failure and early-exit bias, which are not shared by human testees. Zero-shot transfer to MMLU demonstrates 47% accuracy degradation (89.84% → 42.86%), confirming applicability across domains with provable validity preservation. The consistent aggregate degeneration is domain-agnostic and stems not from knowledge deficits but from a combinatorial reasoning gap, reflecting a training-induced completeness-verification deficit.

## 1 Introduction

Multiple-choice questions (MCQs) remain the dominant paradigm for evaluating large language models (LLMs) ([mmlu2021](https://arxiv.org/html/2605.07268#bib.bib11); [bbh2022](https://arxiv.org/html/2605.07268#bib.bib34); [phan2025lastexam](https://arxiv.org/html/2605.07268#bib.bib2)), with logical reasoning benchmarks garnering particular attention for their ability to isolate pure reasoning from domain-specific knowledge. The recent emergence of Large Reasoning Models (LRMs) ([deepseekai2025deepseekr1incentivizingreasoningcapability](https://arxiv.org/html/2605.07268#bib.bib7); [qwq32b](https://arxiv.org/html/2605.07268#bib.bib29)) has accelerated this trend by leveraging test-time scaling and extended chain-of-thought (CoT) with reflection ([chen2025reasoningerasurveylong](https://arxiv.org/html/2605.07268#bib.bib3)) to achieve unprecedented performance on complex reasoning tasks. Supervised fine-tuning on logical reasoning datasets like LogiQA ([10174688](https://arxiv.org/html/2605.07268#bib.bib18); [liu2023logicot](https://arxiv.org/html/2605.07268#bib.bib19)) has become common practice for instilling foundational reasoning capabilities ([muennighoff2025s1simpletesttimescaling](https://arxiv.org/html/2605.07268#bib.bib25); [nvidia2025nvidianemotronnano2](https://arxiv.org/html/2605.07268#bib.bib26)).

Yet the development of LRMs has led to the rapid saturation of reasoning benchmarks. MMLU falls to GPT-5 at 92.5% ([singh2026openaigpt5card](https://arxiv.org/html/2605.07268#bib.bib32)), Sonnet 3.5 exceeds 93.1% on BBH ([bbh2022](https://arxiv.org/html/2605.07268#bib.bib34)), and OpenAI o1 achieves 90.0% average accuracy on LogiQA ([latif2025comparative](https://arxiv.org/html/2605.07268#bib.bib15)). These numbers signal not the resolution of machine reasoning, but the failure of static evaluation ([malek2025frontierllmsstrugglesimple](https://arxiv.org/html/2605.07268#bib.bib22)). Contemporary models achieve superhuman accuracy partly through training-set memorization and exploitation of surface patterns (position bias, lexical overlap, stylistic cues) ([xie-etal-2025-memorization](https://arxiv.org/html/2605.07268#bib.bib40)). In response, ad-hoc hardening methods have proliferated: None-of-the-Above (NOTA) distractors ([tam2025none](https://arxiv.org/html/2605.07268#bib.bib35); [madhusudhan2025llms](https://arxiv.org/html/2605.07268#bib.bib21)), adversarial perturbations ([wallace2021universaladversarialtriggersattacking](https://arxiv.org/html/2605.07268#bib.bib37); [moffett-dhingra-2025-close](https://arxiv.org/html/2605.07268#bib.bib24)), and template-based obfuscation ([gsm-symbolic](https://arxiv.org/html/2605.07268#bib.bib23)). These approaches, however, suffer from a fundamental *validity crisis*: rigor can only be restored through human verification ([park2025vlmcont](https://arxiv.org/html/2605.07268#bib.bib27)).
Stochastic perturbations introduce semantic drift and unintended artifacts into generated questions ([sun2025emperor](https://arxiv.org/html/2605.07268#bib.bib33); [chen-etal-2025-benchmarking-large](https://arxiv.org/html/2605.07268#bib.bib4)) rather than eliminating contamination. NOTA variants increase the surface verification burden by requiring models to reject all distractors, but they remain *0-order selection tasks* that do not elevate logical order: the core reasoning step is still to identify a single correct option among candidates, not to evaluate compound propositional constraints. Consequently, they fail to challenge state-of-the-art reasoning models that already excel at extended chain-of-thought verification.

Figure 1: An example from the LogiHard-2k benchmark. ![Refer to caption](https://arxiv.org/html/2605.07268v1/x1.png)

To address this, we propose LogiHard, a formal framework for dynamic hardness evaluation that deterministically transforms multiple-choice questions via propositional logical combinatorics. Our core premise is that if a model truly understands logical entailment, it should handle the same knowledge when expressed as combinatorial constraints over atomic propositions. As shown in Figure [1](https://arxiv.org/html/2605.07268#S1.F1), rather than corrupting surface text, LogiHard elevates the reasoning order by mapping atomic options to propositional variables and synthesizing compound formulas (involving exactness, disjunction, and negation) that the model must evaluate under the ground-truth assignment. This transforms the task from 0-order selection to 2-order logical judgment, ensuring *validity-by-construction* while rendering memorization ineffective. LogiHard comprises three synergistic components: 1) a *deterministic combinatorial protocol* that synthesizes logically valid, difficulty-calibrated questions via controlled propositional complexity (e.g., Easy: ∧, Medium: ∨, Hard: ¬, Expert: compound ¬); 2) a *cognitive difficulty scoring* module that extracts 9-dimensional metrics (oscillation points, logic density, abductive depth) from model thinking traces to empirically rank items before hardening; 3) a *dynamic evaluation protocol* based on Item Response Theory (IRT) ([lord1980applications](https://arxiv.org/html/2605.07268#bib.bib20)) and Computerized Adaptive Testing (CAT), enabling efficient ability estimation $\hat{\theta}$ with fewer questions than static benchmarks.

To rigorously validate this framework, we instantiate LogiHard-2k, a native logical reasoning dataset curated from high-stakes human examinations, isolating pure logical deduction via syllogistic, analogical, and propositional reasoning. Evaluation across twelve state-of-the-art models reveals a fundamental combinatorial multi-select reasoning failure (all systems exhibit severe degradation despite strong conventional benchmark performance), confirming that current models lack robust propositional constraint satisfaction.

Our contributions are threefold:

1. **Validity-guaranteed hardness synthesis.** A deterministic combinatorial protocol that transforms MCQs into propositional logic tasks via validity-by-construction, elevating reasoning from 0-order selection to 2-order judgment. The correct answer determines the truth assignment, yet the atomic option never appears directly in the output, rendering memorization ineffective while eliminating the semantic drift of adversarial methods.
2. **Dynamic difficulty control.** We establish adaptive evaluation via two core mechanisms: (i) *on-set* difficulty control through the combinatorial protocol's tiered operator configurations, enabling deterministic calibration of item complexity at generation time; and (ii) IRT-based computerized adaptive testing (CAT), where item parameters are informed by automated 9-dimensional cognitive scoring, enabling precise ability targeting and efficient testing.
3. **Identifying reasoning failure and cross-domain transfer.** We instantiate LogiHard-2k and evaluate twelve state-of-the-art models, revealing a multi-select failure and a combinatorial reasoning gap (accuracy degradation of 31–56%) with zero logical invalidity. Zero-shot transfer to MMLU confirms cross-domain generalization (89.84% → 42.86%).

## 2 Related Work

Our work builds upon and extends three critical research strands: systematic methods for hardening multiple-choice benchmarks, combating data contamination and benchmark saturation, and adaptive evaluation paradigms. While existing approaches address isolated aspects of these challenges, LogiHard provides a unified framework that combines *validity-guaranteed hardness synthesis* with *dynamic difficulty adaptation*.

### 2.1 Benchmark Hardening and Surface Perturbation

Identifying and controlling "hardness" is fundamental to creating high-difficulty benchmarks. Approaches such as Humanity's Last Exam (HLE) ([phan2025lastexam](https://arxiv.org/html/2605.07268#bib.bib2)) and Arena-hard ([li2024crowdsourced](https://arxiv.org/html/2605.07268#bib.bib16)) employ expert curation or LLM-as-judge scoring, but remain labor-intensive or dependent on the judge model's own capabilities.

To increase difficulty at scale, various perturbation methods have emerged. Simple surface modifications such as shuffling, synonym replacement, and distractor insertion ([gupta2024changing](https://arxiv.org/html/2605.07268#bib.bib10); [pezeshkpour2024large](https://arxiv.org/html/2605.07268#bib.bib28); [kostic2026same](https://arxiv.org/html/2605.07268#bib.bib14); [YIGIT2025100186](https://arxiv.org/html/2605.07268#bib.bib42)) increase apparent difficulty while leaving the underlying reasoning invariant. The "None of the Others" (NOTO) technique ([salido2025none](https://arxiv.org/html/2605.07268#bib.bib30)) complicates selection by requiring verification of all distractors, yet remains a *first-order* transformation that does not elevate logical order. More systematic approaches include human re-authoring (MMLU-CF; [zhao2024mmlucfcontaminationfreemultitasklanguage](https://arxiv.org/html/2605.07268#bib.bib43)), template-based symbolic perturbation (GSM-Symbolic; [gsm-symbolic](https://arxiv.org/html/2605.07268#bib.bib23)), and multi-hop replacements (BBEH; [kazemi2025big](https://arxiv.org/html/2605.07268#bib.bib13)). While effective, these methods offer no mechanism for dynamic difficulty scaling and remain vulnerable to surface-pattern exploitation.

Unlike these methods, LogiHard treats hardness synthesis as a *deterministic combinatorial process* rather than stochastic perturbation. By transforming atomic options into propositional compounds via validity-by-construction protocols, we achieve contamination resistance *algorithmically* while elevating reasoning from 0-order selection to 2-order logical judgment, a fundamental shift that surface perturbations cannot achieve. This distinction is particularly salient for LRMs: while extended chain-of-thought and reflection yield strong conventional benchmark performance, our experiments demonstrate that such capabilities do not automatically confer robustness to propositional combinatorics.

### 2.2 Dynamic and Adaptive Evaluation

Beyond static hardening, dynamic evaluation strategies create non-stationary test environments to combat data contamination and enable precise ability measurement.

#### Temporal Firewall Methods.

A dominant strategy involves constructing non-static test sets from post-training-cutoff sources. LiveBench ([livebench](https://arxiv.org/html/2605.07268#bib.bib38)) and AntiLeakBench ([wu-etal-2025-antileakbench](https://arxiv.org/html/2605.07268#bib.bib39)) automatically collect questions from recent competitions and updated knowledge sources, establishing temporal isolation. OKBench ([li2025okbenchdemocratizingllmevaluation](https://arxiv.org/html/2605.07268#bib.bib17)) further automates on-demand generation from daily news, while DyCodeEval ([chen2025dynamic](https://arxiv.org/html/2605.07268#bib.bib5)) generates semantically diverse code variants. While effective against contamination, these methods primarily address *data freshness* rather than *reasoning hardness*. They lack a principled mechanism to dynamically control logical complexity independent of data provenance.

#### Psychometric Adaptive Testing.

Computerized Adaptive Testing (CAT) based on Item Response Theory (IRT) ([lord1980applications](https://arxiv.org/html/2605.07268#bib.bib20)) provides rigorous foundations for efficient ability estimation. Fluid Benchmarking ([hofmann2025fluid](https://arxiv.org/html/2605.07268#bib.bib12)) represents the state-of-the-art integration of IRT with NLP evaluation, dynamically selecting the most informative item $i_{t+1}$ given the current ability estimate $\hat{\theta}_t$ by maximizing Fisher information: $i_{t+1} = \arg\max_i I_i(\hat{\theta}_t)$. However, Fluid Benchmarking optimizes selection from a *fixed* item pool; its efficacy depends entirely on the pre-existing quality and difficulty range of available questions.

LogiHard unifies generation and adaptation. Our IRT-CAT protocol operates on a difficulty continuum defined by our logical combinatorics, enabling both precise measurement and the on-demand synthesis of appropriately difficult items. Unlike Fluid Benchmarking, which is constrained by fixed item pools, LogiHard can dynamically generate items at target difficulty levels via controlled propositional complexity. This integration of *synthesis* and *selection* requires significantly fewer questions than static benchmarks while maintaining measurement precision.

## 3 The LogiHard Framework

LogiHard comprises three synergistic components: a *cognitive difficulty scoring* module that stratifies items via automated analysis of model thinking traces; a *deterministic combinatorial protocol* that synthesizes logically valid questions via propositional logic; and a *dynamic evaluation protocol* based on Item Response Theory (IRT) and Computerized Adaptive Testing (CAT). We instantiate these components on the LogiHard-2k benchmark, constructed from high-stakes human examinations.

### 3.1 Formal Preliminaries and Combinatorial Protocol

We formalize the transformation from atomic multiple-choice questions to combinatorial logic tasks. Let $\mathcal{L}$ denote the propositional language generated by connectives $\{\land, \lor, \neg\}$ over a countable set of propositional variables $\mathcal{P}$. An atomic multiple-choice question is a tuple $\mathcal{Q} = (\mathcal{C}, \mathcal{O}, a)$, where $\mathcal{C}$ denotes the context, $\mathcal{O} = \{o_I, o_{II}, o_{III}, o_{IV}\}$ represents exactly four atomic options, and $a \in \{I, II, III, IV\}$ is the *unique* ground-truth index satisfying $\mathcal{C} \models o_a$ and $\mathcal{C} \not\models o_j$ for all $j \neq a$. The atomization function $\alpha: \mathcal{O} \rightarrow \mathcal{P}$ maps each option $o_i$ to a propositional variable $p_i \in \mathcal{P}$, inducing the ground-truth assignment $\mathcal{T}: \mathcal{P} \rightarrow \{\top, \bot\}$ where $\mathcal{T}(p_i) = \top$ if $i = a$ and $\bot$ otherwise. A combinatorial question $\mathcal{Q}' = (\mathcal{C}', \Phi, \mathcal{A})$ consists of an extended context $\mathcal{C}' = \mathcal{C} \cup \mathcal{S}$, where $\mathcal{S} = \{s_I, s_{II}, s_{III}, s_{IV}\}$ denotes natural language statements corresponding to the atomized options; a formula set $\Phi = \{\phi_1, \ldots, \phi_m\} \subset \mathcal{L}$; and the correct index set $\mathcal{A} = \{j \in [m] : \llbracket \phi_j \rrbracket_{\mathcal{T}} = \top\}$.

The combinatorial synthesis constructs formulas over $\mathcal{P}$ using three logical patterns: *Exactness* ($\text{EXACT}_i \equiv p_i \land \bigwedge_{j \neq i} \neg p_j$), *Disjunction* ($p_i \lor p_j$), and *Negation* ($\neg p_i$ and compound forms $\neg p_i \land \neg p_j$). These three patterns constitute a functionally complete subset of propositional connectives sufficient to generate controllable difficulty gradients. Implication ($p \rightarrow q$) and biconditional ($p \leftrightarrow q$) are expressible as $\neg p \lor q$ and $(\neg p \lor q) \land (\neg q \lor p)$ respectively, and thus introduce no additional cognitive primitives beyond disjunction and negation. By restricting the protocol to exactness, disjunction, and negation, we ensure that each tier maps to a distinct, monotonically increasing reasoning operation count, enabling precise *a priori* difficulty calibration. The synthesis operates in three phases: (1) generate valid formulas $\Phi_\top$ evaluating to $\top$ under $\mathcal{T}$ (including $\text{EXACT}_a$; disjunctions $p_a \lor p_j$ if Medium+; negations $\neg p_j$ if Hard+; and compound negations if Expert); (2) generate distractors $\Phi_\bot$ evaluating to $\bot$ (false exactness claims, unsatisfied disjunctions, $\neg p_a$, and the universal distractor $\bigwedge_i \neg p_i$); (3) sample $n_{\text{correct}}$ formulas from $\Phi_\top$ and the remainder from $\Phi_\bot$, enforcing tier constraints (Hard requires $\geq 1$ negation; Expert requires $\geq 1$ disjunction and $\geq 1$ negation). The operator complexity hierarchy determines the difficulty tier: Easy uses exactness only; Medium adds disjunction; Hard adds negation (mandatory); Expert adds compound negations. This hierarchy directly maps to IRT discrimination parameters.
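The three-phase synthesis above can be sketched directly by representing each formula as a predicate over the induced truth assignment. The following is a minimal illustration (function and variable names such as `synthesize` are ours, not the paper's); phase (3)'s sampling is elided, and the final assertion checks validity-by-construction on the generated pools:

```python
from itertools import combinations

OPTIONS = ["I", "II", "III", "IV"]

def exact(i):
    # EXACT_i = p_i AND (NOT p_j for all j != i)
    return lambda T: T[i] and all(not T[j] for j in OPTIONS if j != i)

def disj(i, j):
    return lambda T: T[i] or T[j]

def neg(i):
    return lambda T: not T[i]

def compound_neg(i, j):
    return lambda T: (not T[i]) and (not T[j])

def synthesize(answer, tier):
    """Generate (valid, distractor) formula pools for ground-truth index `answer`."""
    T = {o: (o == answer) for o in OPTIONS}          # assignment induced by a
    wrong = [o for o in OPTIONS if o != answer]
    valid = [exact(answer)]                          # Easy: exactness only
    distractors = [exact(j) for j in wrong]          # false exactness claims
    distractors.append(lambda t: all(not t[o] for o in OPTIONS))  # universal distractor
    if tier in ("medium", "hard", "expert"):         # Medium+: disjunctions
        valid += [disj(answer, j) for j in wrong]
        distractors += [disj(i, j) for i, j in combinations(wrong, 2)]
    if tier in ("hard", "expert"):                   # Hard+: negations
        valid += [neg(j) for j in wrong]
        distractors.append(neg(answer))              # NOT p_a is false
    if tier == "expert":                             # Expert: compound negations
        valid += [compound_neg(i, j) for i, j in combinations(wrong, 2)]
    # Validity-by-construction: every formula evaluates as intended under T.
    assert all(f(T) for f in valid) and not any(f(T) for f in distractors)
    return valid, distractors

valid, distractors = synthesize("II", "expert")
```

In this sketch, an Expert-tier item with ground truth II yields 10 valid formulas and 8 distractors; phase (3) would then sample the presented option set from these pools subject to the tier constraints.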

#### Validity-by-Construction

For any combinatorial question $\mathcal{Q}'$ generated by the above protocol, the correct answer set $\mathcal{A}$ is logically consistent with the atomic ground truth $a$ of $\mathcal{Q}$. Each $\phi \in \Phi_\top$ evaluates to $\top$ under $\mathcal{T}$ by construction: $\text{EXACT}_a$ holds because $\mathcal{T}(p_a) = \top$ and $\mathcal{T}(p_j) = \bot$ for $j \neq a$; disjunctions $p_a \lor p_j$ hold because $\mathcal{T}(p_a) = \top$; negations $\neg p_j$ hold for $j \neq a$ because $\mathcal{T}(p_j) = \bot$; compound negations hold by conjunction of true negations. Conversely, $\Phi_\bot$ formulas evaluate to $\bot$ under $\mathcal{T}$ by parallel construction. Since $\mathcal{A}$ indexes exactly $\Phi_\top$ and $\mathcal{T}$ derives from $a$, consistency holds.

#### Contamination Resistance

If a model has memorized the atomic answer $a$ without understanding $\mathcal{C} \models o_a$, the probability of correctly answering $\mathcal{Q}'$ is bounded by random guessing: $P(\text{correct} \mid \text{memorization}) \leq |\mathcal{A}| / |\Phi|$. This bound holds under the assumption that the model has memorized only the atomic answer $a$ and possesses no prior knowledge of the combinatorial protocol (i.e., the mapping from atomic options to propositional variables and the operator composition rules). The atomic option $o_a$ never appears in $\mathcal{Q}'$; only its propositional counterpart $p_a$ appears within compound formulas. Memorization of surface patterns (e.g., "Answer is A") provides no information about which combinations of $p_I, p_{II}, \ldots$ evaluate to $\top$ under $\mathcal{T}$. Without logical evaluation of $\Phi$ under $\mathcal{T}$, the model can only guess uniformly from $|\Phi|$ options, yielding the bound.
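The guessing bound is easy to check empirically. Below is a small Monte Carlo sanity check of ours (not part of the paper): a memorization-only guesser that picks one of the $m = |\Phi|$ presented formulas uniformly at random hits a correct one at rate $|\mathcal{A}|/|\Phi|$, matching the bound:

```python
import random

# Illustrative check of P(correct | memorization) <= |A| / |Phi| for a
# uniform guesser with no access to the truth assignment.
random.seed(0)
m, n_correct = 8, 2                      # |Phi| = 8 formulas, |A| = 2 valid
correct = set(random.sample(range(m), n_correct))
trials = 100_000
hits = sum(random.randrange(m) in correct for _ in range(trials))
rate = hits / trials                     # empirical hit rate, approx. 2/8
```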

Table 1: Metrics extracted from thinking traces for Gold Score computation.

### 3.2 Cognitive Scoring and Difficulty Stratification

Prior to combinatorial transformation, we stratify questions by intrinsic cognitive complexity via analysis of model thinking traces. Given a reasoning model $\mathcal{M}$ with exposed chain-of-thought $\mathcal{T}_{\text{co}}$, we extract nine cognitive metrics spanning epistemic uncertainty, logical structure, and reasoning dynamics (Table [1](https://arxiv.org/html/2605.07268#S3.T1)).

These metrics aggregate into the *Gold Score* via a weighted linear combination:

$$S_{\text{gold}} = \sum_{i=1}^{9} w_i f_i(\mathcal{T}_{\text{co}}) - \beta \mathcal{R}(\mathcal{T}_{\text{co}}) \qquad (1)$$

where $f_i$ are $z$-normalized metrics, $w_i$ are cognitive load weights, and $\mathcal{R}$ penalizes logical fallacies detected via automated consistency checking. The corpus stratifies into tiers: Easy ($S_{\text{gold}} < 20$), Medium ($20 \leq S_{\text{gold}} < 25$), Hard ($25 \leq S_{\text{gold}} < 30$), and Expert ($S_{\text{gold}} \geq 30$). Only Hard and Expert tiers undergo combinatorial transformation to prevent floor effects.
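A minimal sketch of Eq. (1) and the tier thresholds, assuming the nine metrics have already been extracted from the thinking trace; the weights and penalty values are placeholders, since the paper does not list them:

```python
import statistics

def z_normalize(values):
    """z-score a metric across the corpus (population std, as is standard)."""
    mu, sd = statistics.mean(values), statistics.pstdev(values)
    return [(v - mu) / sd if sd else 0.0 for v in values]

def gold_score(metrics, weights, fallacy_penalty, beta=1.0):
    # S_gold = sum_i w_i * f_i(T_co) - beta * R(T_co)
    return sum(w * f for w, f in zip(weights, metrics)) - beta * fallacy_penalty

def tier(s_gold):
    # Stratification thresholds from the text.
    if s_gold < 20: return "Easy"
    if s_gold < 25: return "Medium"
    if s_gold < 30: return "Hard"
    return "Expert"
```

Only items scoring into the "Hard" or "Expert" branches would proceed to combinatorial transformation.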

The cognitive scoring module is optional for domains where thinking traces are unavailable; fixed difficulty tiers (e.g., official examination categories) can substitute while retaining combinatorial benefits.

### 3.3 Dynamic Evaluation via IRT-CAT

Static benchmarks administer fixed question sets, leading to inefficiency (easy questions for strong models) or imprecision (floor effects for weak models). LogiHard integrates Computerized Adaptive Testing (CAT) via Item Response Theory (IRT).

Each item $j$ is parameterized by a 3-Parameter Logistic (3PL) model:

$$P_j(\theta) = c_j + \frac{1 - c_j}{1 + \exp(-a_j(\theta - b_j))} \qquad (2)$$

where $\theta \in \mathbb{R}$ is the latent ability, $a_j > 0$ is discrimination, $b_j \in \mathbb{R}$ is difficulty, and $c_j \in [0, 1]$ is the pseudo-guessing parameter (typically $c_j = 1/m$ for $m$-option questions). The difficulty parameter $b_j$ is empirically calibrated from cognitive features:

$$b_j = \frac{S_{\text{gold}} - 72}{54} + 0.1(\lambda - 2) + \log_{10}(\max(1, L)) - 3.17 + \frac{\sigma - 100}{200} \qquad (3)$$

where $S_{\text{gold}}$ is the Gold Score, $\lambda$ is logic density, $L$ is thinking length (tokens), and $\sigma$ is the reasoning segment count. Discrimination $a_j$ is mapped from operator complexity: Easy ($\approx 0.8$), Medium ($\approx 1.2$), Hard ($\approx 1.6$), Expert ($\approx 2.0$).
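Eq. (3) transcribes directly into code; the function below is our rendering of the calibration (argument names are ours), mapping Gold Score, logic density $\lambda$, thinking length $L$, and segment count $\sigma$ to $b_j$:

```python
import math

def difficulty(s_gold, logic_density, thinking_tokens, n_segments):
    """b_j per Eq. (3): Gold Score term + logic density term
    + log thinking-length term + segment-count term."""
    return ((s_gold - 72) / 54
            + 0.1 * (logic_density - 2)
            + math.log10(max(1, thinking_tokens)) - 3.17
            + (n_segments - 100) / 200)
```

Each additive term is centered so that a "typical" item ($S_{\text{gold}} = 72$ rescaled, $\lambda = 2$, $L \approx 10^{3.17}$ tokens, $\sigma = 100$) lands near $b_j = 0$, and harder items push $b_j$ upward monotonically in every feature.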

We implement a dual-subset evaluation protocol that separately estimates ability on atomic (0-order) and combinatorial (2-order) reasoning. The CAT engine maintains parallel ability estimates $\hat{\theta}_t^{\mathcal{B}}$ and $\hat{\theta}_t^{\mathcal{C}}$ with standard normal priors. At each step, the engine selects the unadministered item maximizing Fisher information $I_j(\hat{\theta}) = a_j^2 P_j(\hat{\theta})(1 - P_j(\hat{\theta}))$ for each subset. Responses update estimates via Expected A Posteriori (EAP) estimation with 61 quadrature points over $[-6, 6]$. The algorithm terminates when the standard error $\text{SE}_t < 0.3$ or the maximum of $T_{\max} = 60$ items per subset is reached.
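One cycle of the described CAT engine (3PL response model, Fisher-information item selection, EAP update on 61 quadrature points over $[-6, 6]$) can be sketched as follows; the item pool and its parameter values are illustrative, not LogiHard-2k's calibrated items:

```python
import math

def p3pl(theta, a, b, c):
    # 3PL response probability, Eq. (2)
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

def fisher_info(theta, a, b, c):
    # Information criterion used in the text: a^2 * P * (1 - P)
    p = p3pl(theta, a, b, c)
    return a * a * p * (1 - p)

GRID = [-6 + 12 * k / 60 for k in range(61)]      # 61 quadrature points on [-6, 6]
PRIOR = [math.exp(-t * t / 2) for t in GRID]      # standard normal prior (unnormalized)

def eap(responses):
    """EAP ability estimate and posterior SD from (correct, (a, b, c)) pairs."""
    post = PRIOR[:]
    for correct, (a, b, c) in responses:
        for k, t in enumerate(GRID):
            p = p3pl(t, a, b, c)
            post[k] *= p if correct else (1 - p)
    z = sum(post)
    theta = sum(t * w for t, w in zip(GRID, post)) / z
    se = math.sqrt(sum((t - theta) ** 2 * w for t, w in zip(GRID, post)) / z)
    return theta, se

def next_item(theta, pool, administered):
    # Select the unadministered item maximizing Fisher information at theta.
    return max((j for j in pool if j not in administered),
               key=lambda j: fisher_info(theta, *pool[j]))

pool = {0: (0.8, -1.0, 0.25), 1: (1.2, 0.0, 0.25), 2: (2.0, 1.5, 0.25)}
theta, se = eap([(True, pool[1]), (False, pool[2])])
```

Termination would wrap this cycle, stopping once the EAP standard error drops below 0.3 or 60 items have been administered in the subset.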

This protocol achieves measurement precision equivalent to static 60-item tests using only 15–25 adaptively selected questions per subset, while precisely targeting the difficulty zone appropriate to each model's capability.

### 3.4 The LogiHard-2k Benchmark

We instantiate the framework on a corpus of 6,235 high-stakes examination questions (Chinese Civil Service, LSAT, GMAT, IBPS, CAT, Raven's Matrices), preserving original languages (45% English, 55% Chinese). Cognitive scoring employs Kimi-k2.5 as the reasoning model $\mathcal{M}$, generating long-CoT traces for automated extraction of the nine metrics and Gold Score computation. The resulting distribution ($\mu = 23.2$, $\sigma = 4.1$, range $[12.4, 34.8]$) determines tier assignment; the top 2,000 questions proceed to the next stage. Of these, 539 (27%; exactly four atomic options, single-select) undergo combinatorial transformation (LogiHard-C) and 1,461 (73%) remain atomic (LogiHard-Base). The 539 combinatorial questions distribute across Easy (108, 20%, exactness only), Medium (215, 40%, +disjunction), Hard (162, 30%, +negation), and Expert (54, 10%, +compound negations). All transformed items undergo automated theorem-proving verification (2.3% regeneration rate). Each question includes 9-dimensional cognitive features, IRT 3PL parameters ($a_j$, $b_j$, $c_j$), source attribution, and reasoning type labels.

## 4 Experiments

### 4.1 Experimental Setup

#### Models.

We evaluate frontier LLMs: GLM-5 ([glm5team2026glm5vibecodingagentic](https://arxiv.org/html/2605.07268#bib.bib9)), GLM-4.7 (Zhipu); GPT-5.4, o3 (OpenAI); Claude-Opus-4.6 ([claude_opus_4_6](https://arxiv.org/html/2605.07268#bib.bib1)) (Anthropic); DeepSeek-R1, DeepSeek-V3.2, DeepSeek-V4-Pro ([deepseekai2026deepseekv4](https://arxiv.org/html/2605.07268#bib.bib8)) (DeepSeek); Gemini-3.1-pro ([deepmind2026gemini31pro](https://arxiv.org/html/2605.07268#bib.bib6)) (Google); Kimi-k2.5 ([team2026kimi](https://arxiv.org/html/2605.07268#bib.bib36)) (Moonshot); Qwen3.5-Plus, Qwen3.6-Plus ([yang2025qwen3](https://arxiv.org/html/2605.07268#bib.bib41)) (Alibaba). We run dual-subset evaluation with 3PL parameterization under Hard Mode ($S_{\text{gold}} \geq 25$), with a maximum of 60 items per subset and termination at $\text{SE} < 0.3$; EAP estimation uses $\mathcal{N}(0, 1)$ priors with 61 quadrature points. All model evaluations were conducted via public API endpoints with decoding temperature 1.0, max tokens 65,536, and a timeout of 600 s per query. Implementation details are shown in Appendix [A](https://arxiv.org/html/2605.07268#A1).

#### Baselines and human evaluation.

Prior to the development of the LogiHard transformation protocol, we conducted baseline hardening experiments on the full raw corpus of 6,235 questions. We evaluate two standard techniques: *NOTA insertion* (replacing the correct answer with "None of the Above") and *option shuffling* (random permutation of atomic options). Both preserve underlying reasoning paths while altering surface presentation. For human reference, $n = 30$ graduate-level volunteers completed a fixed 30-item Hard/Expert subset in both Original and Combinatorial formats under no time pressure.

### 4.2 RQ1: Combinatorial Collapse and Resilience Tiers

Table [2](https://arxiv.org/html/2605.07268#S4.T2) presents the central finding. On the full-set baseline, all twelve models achieve strong performance (78.5–83.0%) with minimal degradation from surface perturbations (<2.5% for NOTA, <1.8% for shuffling), confirming that conventional benchmarks remain saturated. The Hard-Mode IRT-CAT evaluation reveals a dramatically different picture: all models exhibit severe degradation on Combinatorial questions, with mean $\Delta\theta = 1.90$ (range 1.05–3.14). The gap between surface perturbation and combinatorial transformation confirms that the degeneration stems from fundamental logical order elevation. For reference, human evaluators achieve 79.5% on Hard-Combinatorial ($\Delta\theta \approx 0.28$), far above even the most resilient model (GLM-5, 38.3%).

Table 2:Performance under full\-set baseline \(Original, NOTA, Shuffle\) and Hard\-Mode IRT\-CAT \(H\-Base, H\-Comb\)\.Δ​θ\\Delta\\thetadenotes latent ability drop\. Human results fromn=30n=30graduate evaluators\.Three resilience tiers emerge\.First tier\(Δ​θ<1\.6\\Delta\\theta<1\.6\): GLM\-5 \(1\.051\.05\), GPT\-5\.4 \(1\.111\.11\), DeepSeek\-R1 \(1\.391\.39\), DeepSeek\-V4\-Pro \(1\.491\.49\), and Claude\-Opus\-4\-6 \(1\.521\.52\)\. DeepSeek\-V4\-Pro, as the latest iteration in the DeepSeek family, shows marked improvement over its generalist predecessor V3\.2 \(Δ​θ\\Delta\\theta:2\.92→1\.492\.92\\to 1\.49, a gap of1\.431\.43\), approaching the RL\-optimized R1 \(1\.391\.39\) without explicit reasoning\-specific training\. This suggests that architectural evolution between versions can substantially enhance combinatorial robustness\. GPT\-5\.4 achieves resilience despite modest Hard\-Base performance; DeepSeek\-R1 benefits from RL optimization \(1\.531\.53improvement over V3\.2\)\.Second tier\(Δ​θ≈1\.7\\Delta\\theta\\approx 1\.7–1\.91\.9\): Gemini\-3\.1\-pro \(1\.671\.67\), Kimi\-k2\.5 \(1\.701\.70\), o3 \(1\.851\.85\)\.Third tier\(Δ​θ≥2\.3\\Delta\\theta\\geq 2\.3\): Qwen3\.6\-plus \(2\.342\.34\), Qwen3\.5\-plus \(2\.612\.61\), DeepSeek\-V3\.2 \(2\.922\.92\), and GLM\-4\.7 \(3\.143\.14\)\. GLM\-4\.7 exhibits the most severe combinatorial collapse despite achieving the highest full\-set accuracy \(83\.0%83\.0\\%\), confirming the dissociation between conventional benchmark performance and propositional robustness\. The dissociation between full\-set and Hard\-Mode performance is striking\. GLM\-4\.7 \(83\.0%83\.0\\%full\-set\) suffers the most severe collapse \(Δ​θ=3\.14\\Delta\\theta=3\.14\), while GPT\-5\.4 \(80\.5%80\.5\\%\) shows the second\-smallest drop \(1\.111\.11\)\. 
This challenges the assumption that higher conventional capacity confers propositional robustness; notably, GLM-4.7 required 40 Combinatorial items without reaching the SE threshold, reflecting persistent uncertainty under elevation. The most extreme combinatorial degradation occurs in Qwen3.6-plus (5.00% H-Comb accuracy, Δθ = 2.34), a 46.7-percentage-point drop from its H-Base performance (51.67%). This outlier is not an artifact: repeated API evaluations at temperature 1.0 with a 65,536-token limit consistently yield near-floor performance, indicating that Qwen3.6-plus's chain-of-thought mechanism fails catastrophically under multi-select enumeration pressure. The severity exceeds what operator complexity alone would predict, suggesting an interaction between the model's early-exit training bias and compound-negation tiers. This confirms that combinatorial robustness is not merely a function of scale or general benchmark score, but depends on how reasoning architectures handle *set-valued constraint satisfaction*.

### 4.3 RQ2: The Multi-Select Bottleneck and Operator Complexity

Disaggregation by answer cardinality reveals a specific failure mode: accuracy is 31.2% for n_correct = 1 but drops to 12.4% for n_correct ≥ 2 (a 60% relative decrease). Models frequently identify one valid formula but fail to enumerate complete answer sets, suggesting that chain-of-thought mechanisms support unitary proposition verification but not set-valued constraint satisfaction.

This multi-select bottleneck intensifies with operator complexity. Within the Expert tier, accuracy decreases monotonically: Exactness (52.1%), Disjunction (48.3%), Negation (41.7%), Compound negation (33.4%). The compound-negation pattern is particularly difficult because it requires simultaneous access to the ground-truth assignment (to know which propositions are false) and the formula structure (to know which are negated), a dual-access requirement that strains current attention mechanisms.
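To make the dual-access requirement concrete, here is a minimal sketch of the two operator extremes; the proposition labels and truth values are invented for exposition, not drawn from LogiHard-2k:

```python
# Illustrative only: an invented ground-truth assignment over four propositions.
assignment = {"I": True, "II": False, "III": True, "IV": False}

# Exactness tier: "exactly propositions I and III are true" -- this checks
# the assignment against a single target set.
exactly_I_III = all(assignment[p] == (p in {"I", "III"}) for p in assignment)

# Compound-negation tier: "neither II nor IV is true" -- this needs both
# the ground-truth values (which atoms are false) AND the formula structure
# (which atoms appear under negation).
neither_II_nor_IV = (not assignment["II"]) and (not assignment["IV"])

print(exactly_I_III, neither_II_nor_IV)  # both True for this assignment
```

Both checks are trivial for a symbolic evaluator; the paper's point is that language models must perform the second kind while holding assignment and structure simultaneously in context.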

An ablation study (Table [3](https://arxiv.org/html/2605.07268#S4.T3)) validates the necessity of cognitive stratification. Gold Score stratification achieves r = 0.67 correlation with human difficulty judgments, approaching human curation (r = 0.72), while random selection yields only r = 0.31. This confirms that the 9-dimensional cognitive metrics capture psychologically meaningful complexity dimensions and enable scalable benchmark construction without prohibitive annotation costs.

Table 3: Selection strategy ablation (GPT-5.4).
Table 4: CAT efficiency analysis.

### 4.4 RQ3: Efficiency and Universality

Two barriers limit practical deployment of hardness evaluation: measurement cost and domain specificity. Static benchmarks require administering 50–100+ fixed items per model to achieve reliable estimates, many of which are uninformative. The IRT-CAT protocol addresses cost by adaptively selecting only the most informative items. As shown in Table [4](https://arxiv.org/html/2605.07268#S4.T4), CAT achieves SE < 0.3 with merely 12–22 items per subset, a 60–80% reduction versus static 60-item tests.
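The adaptive selection step can be sketched as maximum-information item selection under a 3PL model. This is our own minimal illustration, with invented item parameters, not the paper's implementation:

```python
import math

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response at ability theta."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def fisher_info(theta, a, b, c):
    """Fisher information of a 3PL item at ability theta."""
    p = p_3pl(theta, a, b, c)
    return a ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def next_item(theta, items, administered):
    """Pick the not-yet-administered item with maximal information."""
    candidates = [i for i in range(len(items)) if i not in administered]
    return max(candidates, key=lambda i: fisher_info(theta, *items[i]))

# Invented (discrimination a, difficulty b, guessing c) parameters.
items = [(1.2, -0.5, 0.20), (0.8, 0.0, 0.25), (1.5, 1.0, 0.20)]
chosen = next_item(0.9, items, administered={0})
```

Testing stops once the standard error of the ability estimate falls below the threshold (SE < 0.3 in the paper's protocol), which is why only 12–22 items are needed per subset.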

The dual-subset IRT-CAT design enables direct comparison of 0-order versus 2-order reasoning within identical ability metrics, making the specific combinatorial deficit Δθ a diagnostic tool independent of raw logical capacity. To address domain specificity, we evaluate zero-shot transfer to MMLU. Applying the identical IRT-CAT protocol to 128 MMLU questions (formal logic, philosophy, mathematics, computer science, law, medical ethics) without domain adaptation yields 46.99% accuracy degradation (89.84% → 42.86%) with Δθ = 2.743, statistically indistinguishable from native LogiHard-2k performance (t(10) = 0.14, p = 0.89). This confirms LogiHard as a *domain-agnostic hardness wrapper*: the protocol separates domain knowledge (encoded in atomic propositions) from reasoning capability (manifested in compound formula evaluation), enabling contamination-resistant evaluation without de novo question authoring.

## 5 Discussion

### 5.1 Mechanistic Analysis: The Combinatorial Mapping Gap

#### Two-stage bottleneck and failure taxonomy.

We propose a two-stage model of combinatorial reasoning. *Stage 1* evaluates atomic truth assignments {T_I, T_II, T_III, T_IV} under the context 𝒞. *Stage 2* translates these truth values into multi-select option sets by verifying ⟦φ_j⟧_𝒯 = ⊤ for each of the m compound formulas and aggregating the complete answer set 𝒜. For a typical LogiHard question with n = 4 propositions, m = 6 options, and k = 2 operators per option, Stage 2 imposes n + m × k = 16 sequential logical operations, a load that substantially exceeds the reliable reasoning depth observed in current LRMs even with extended chain-of-thought [[31](https://arxiv.org/html/2605.07268#bib.bib31)].
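The two stages can be sketched as a toy evaluator. The six compound formulas below are illustrative stand-ins for a dataset item (each uses k = 2 operators, matching the n + m × k = 4 + 6 × 2 = 16 accounting above); they are not taken from LogiHard-2k:

```python
# Stage 1 (illustrative): atomic truth values fixed under the context.
truth = {"I": True, "II": True, "III": False, "IV": False}

def holds(formula):
    """Recursively evaluate a small propositional formula over `truth`."""
    op, *args = formula
    if op == "atom":
        return truth[args[0]]
    if op == "not":
        return not holds(args[0])
    if op == "and":
        return holds(args[0]) and holds(args[1])
    if op == "or":
        return holds(args[0]) or holds(args[1])
    raise ValueError(f"unknown operator: {op}")

# Stage 2: m = 6 compound options, each with k = 2 operators (invented).
options = {
    "A": ("and", ("atom", "I"), ("not", ("atom", "III"))),
    "B": ("or", ("not", ("atom", "I")), ("atom", "IV")),
    "C": ("and", ("atom", "II"), ("not", ("atom", "IV"))),
    "D": ("or", ("atom", "III"), ("not", ("atom", "II"))),
    "E": ("and", ("not", ("atom", "IV")), ("atom", "I")),
    "F": ("or", ("not", ("atom", "II")), ("atom", "III")),
}

# The complete multi-select answer set: every option whose formula holds.
answer_set = {label for label, f in options.items() if holds(f)}
```

A symbolic evaluator aggregates the full set mechanically; the failure taxonomy below shows that models typically succeed at the per-formula checks yet fail at this final aggregation.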

To locate the precise failure locus, we annotated 408 failure cases across all evaluated models at three non-exclusive levels. *Level 1* (logical understanding errors) accounts for only 86 cases (21.1%). *Level 2* (combinatorial mapping failures) dominates at 367 cases (89.9%). *Level 3* (response generation failures) is negligible at 9 cases (2.2%). Because a single failure can exhibit multiple levels simultaneously (for instance, a Level 1 mis-evaluation may propagate into an incomplete enumeration at Level 2), the percentages sum to more than 100%. The near-total dominance of Level 2 confirms that current LLMs possess logical knowledge but lack *compositional reliability*: the bottleneck is not *whether* models can reason, but *whether they can compose* multiple reasoning steps into a globally consistent solution.

#### Operator complexity and early exit as a training-paradigm defect.

Disaggregating by operator complexity reveals a monotonic accuracy gradient that validates our a priori difficulty calibration: Exactness (52.1%), Disjunction (48.3%), Negation (41.7%), and Compound negation (33.4%). Easy and Medium tiers (Exactness, Disjunction) almost exclusively trigger Level 2 under-selection, because models correctly evaluate atomic propositions but fail to enumerate complete option sets. Hard and Expert tiers introduce Negation and Compound negation, which increase Level 1 errors because the dual-access requirement (simultaneously holding the ground-truth assignment and the formula structure) strains working memory and causes logical mis-evaluation before combinatorial mapping begins. When these operators further combine with multi-select pressure (n_correct ≥ 2), accuracy drops from 31.2% to 12.4%.

Crucially, 58.3% of Combinatorial responses exhibit early termination after identifying 1–2 correct options, despite recognizing additional valid choices in their reasoning traces. This under-selection dominates failure modes and reflects a systematic training bias: models are optimized on single-answer benchmarks and lack metacognitive triggers for completeness verification. The Qwen family provides circumstantial evidence: Qwen3.6-plus improves over Qwen3.5-plus by 0.27 in combinatorial resilience (Δθ: 2.61 → 2.34), yet both show similar early-exit rates. Version iteration improves Stage 1 logical analysis without addressing the underlying completeness-verification deficit; notably, Qwen3.6-plus still exhibits the most severe absolute collapse (5.00% H-Comb accuracy), confirming that incremental architectural improvements can leave fundamental compositional deficits intact. Diagnostic failure cases are provided in Appendix [B](https://arxiv.org/html/2605.07268#A2). This stark contrast with human performance confirms that LogiHard exposes a *machine-specific* reasoning limitation: humans natively interpret compound logical formulas as transparent meta-linguistic descriptions, whereas models fail at the symbol-grounding and set-enumeration requirements.

### 5.2 Implications for Frontier Model Development

The combinatorial multi-select reasoning failure exposes a correctable structural deficit in current frontier models. Our two-stage analysis reveals that RL and SFT pipelines have optimized for *unitary proposition verification* at the expense of *set-valued constraint satisfaction*: the 58.3% early-exit rate indicates that models treat reasoning as a "first correct answer wins" process, having never acquired metacognitive triggers for completeness checking. Because standard post-training datasets overwhelmingly reward single-answer selection, models develop strong Stage 1 logical analysis without reliable Stage 2 compositional mapping, a gap that test-time scaling alone cannot bridge, as evidenced by 65,536-token chain-of-thought budgets failing on tasks requiring only ~16 sequential logical operations. This suggests that future training regimes should incorporate *deterministic combinatorial elevation* as a native augmentation strategy rather than relying on stochastic paraphrase or adversarial perturbation. By transforming existing high-quality MCQs into tiered propositional compounds, practitioners can generate validity-guaranteed, difficulty-calibrated signals that explicitly target compositional mapping, while the 9-dimensional cognitive scoring enables adaptive hardness curricula that track model ability θ dynamically.
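As a rough sketch of what such deterministic elevation could look like (the two transformation rules below are our own illustration of the Exactness and compound-negation tiers, not the paper's exact operators), one can derive compound judgments over the atoms "option X is correct" of an existing single-answer MCQ:

```python
import itertools

def elevate(labels, correct):
    """Derive compound propositional judgments from a single-answer MCQ.

    Because the original answer key fixes the truth of every atom
    "option X is correct", each derived judgment's truth value is known
    by construction (validity preservation).
    """
    atoms = {x: (x == correct) for x in labels}
    compounds = []
    for x, y in itertools.combinations(labels, 2):
        # Exactness-style pattern: exactly one of the pair is correct.
        compounds.append((f"exactly one of {x} and {y} is correct",
                          atoms[x] != atoms[y]))
        # Compound-negation-style pattern: neither of the pair is correct.
        compounds.append((f"neither {x} nor {y} is correct",
                          not atoms[x] and not atoms[y]))
    return compounds

derived = elevate(["A", "B", "C", "D"], correct="B")
# Ground-truth multi-select key: every derived judgment that is true.
key = [text for text, value in derived if value]
```

The single original answer thus deterministically induces a multi-select key over compound statements, which is the property that makes the hardened variants validity-guaranteed without human re-authoring.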

The performance gap between surface perturbation (NOTA, shuffling) and combinatorial transformation further signals that static benchmarks are approaching obsolescence for frontier evaluation. As conventional test scores exceed 90%, the community requires algorithmically renewable protocols capable of generating new variants on demand without human re-authoring or contamination risk, which is precisely what LogiHard's validity-by-construction synthesis provides. The IRT-CAT integration reduces measurement cost and carbon footprint by 60–80% while maintaining precision, making sustained evaluation practical at scale. At the architectural level, the combinatorial mapping gap implies that current transformers lack an explicit *working memory* mechanism for maintaining set-valued truth assignments under compound constraints. Future models may need dedicated constraint-satisfaction modules (differentiable SAT layers or explicit set-representation mechanisms) alongside autoregressive generation to enforce global consistency over multi-select outputs. Until then, the distinction between "knowing logic" and "composing logic" will remain a persistent ceiling on machine reasoning.

## 6 Conclusion

We present LogiHard, a formal framework that deterministically transforms multiple-choice questions into combinatorial logic tasks, elevating reasoning from 0-order selection to 2-order judgment. By combining validity-by-construction synthesis with IRT-based computerized adaptive testing, LogiHard enables precise, efficient, and contamination-resistant hardness evaluation. Evaluation across twelve state-of-the-art models reveals a multi-select failure driven by a combinatorial reasoning deficit: models correctly evaluate atomic propositions (Stage 1) yet systematically fail to enumerate complete answer sets under compound constraints (Stage 2). This failure reflects a training-induced early-exit bias: post-training pipelines optimized for unitary proposition verification lack metacognitive triggers for set-valued completeness checking, leaving compositional reliability unlearned even as chain-of-thought budgets scale. The stark human–machine asymmetry confirms that the limitation is architecture-specific rather than intrinsic to logical reasoning itself. Practically, LogiHard serves as a domain-agnostic hardness wrapper: zero-shot transfer to MMLU demonstrates that domain knowledge and combinatorial reasoning can be isolated and evaluated independently, while IRT-CAT reduces measurement cost by 60–80%. These findings suggest that future training regimes should treat deterministic combinatorial elevation as a native augmentation strategy, not merely to harden benchmarks but to instill the set-valued constraint satisfaction that current reasoning architectures lack.

## Limitations

LogiHard-2k is constructed from English and Chinese examinations, and the Gold Score depends on a single reasoning model's traces without extensive human calibration. Our IRT-CAT implementation adopts a fixed 3PL parameterization with EAP estimation under a uniform prior, leaving more sophisticated Bayesian updates and testlet effects from shared logical contexts to future work.
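For concreteness, the fixed 3PL-with-EAP setup mentioned above can be approximated on a discrete grid as follows. This is a minimal sketch under our own assumptions about grid range and spacing, not the paper's code:

```python
import math

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response at ability theta."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def eap_estimate(responses, items, grid_points=81, lo=-4.0, hi=4.0):
    """EAP ability estimate and posterior SD under a uniform prior.

    responses: 0/1 scored answers; items: (a, b, c) 3PL triples.
    """
    step = (hi - lo) / (grid_points - 1)
    grid = [lo + step * i for i in range(grid_points)]
    weights = []
    for theta in grid:
        like = 1.0  # uniform prior: posterior is proportional to likelihood
        for u, (a, b, c) in zip(responses, items):
            p = p_3pl(theta, a, b, c)
            like *= p if u else (1.0 - p)
        weights.append(like)
    z = sum(weights)
    theta_hat = sum(t * w for t, w in zip(grid, weights)) / z
    var = sum((t - theta_hat) ** 2 * w for t, w in zip(grid, weights)) / z
    return theta_hat, math.sqrt(var)

# Invented item parameters and response patterns for illustration.
items = [(1.0, 0.0, 0.2)] * 6
theta_all_right, se_right = eap_estimate([1] * 6, items)
theta_all_wrong, se_wrong = eap_estimate([0] * 6, items)
```

A full implementation would refresh the estimate after every administered item and feed it back into the information-maximizing item selector; the testlet effects noted above would additionally require correlated item parameters.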

## Broader Impacts

#### Positive impacts.

LogiHard addresses the growing trust crisis in AI evaluation by providing a validity-guaranteed, contamination-resistant hardness protocol. By exposing the gap between surface-level benchmark performance and genuine propositional reasoning, our framework helps practitioners and policymakers avoid over-reliance on saturated metrics. The IRT-CAT component reduces evaluation cost and carbon footprint by administering fewer questions while maintaining precision. More broadly, rigorous evaluation standards ultimately steer AI development toward robust reasoning rather than pattern memorization.

#### Negative impacts and safeguards.

Any hardness protocol carries the risk of misuse: adversarial deployers could selectively cite combinatorial collapse to disparage specific models out of context, and high-stakes screening systems (hiring, admissions) could adopt combinatorial formats that introduce artificial cognitive barriers unrelated to the target domain, potentially disadvantaging populations unfamiliar with formal logic notation. We explicitly caution against deploying LogiHard in such settings without domain validation and human-normed calibration. Additionally, while the framework resists memorization, determined actors might overfit to the specific propositional patterns (Exactness, Disjunction, Negation) we expose; we mitigate this by advocating continuous renewal of evaluation formats. We commit to releasing the combinatorial protocol as an open-source tool, enabling community scrutiny and iterative improvement of the hardness transformation rules.

## References

- (1) Anthropic. Claude Opus 4.6 system card, 2026.
- (2) Center for AI Safety, Scale AI, and HLE Contributors Consortium. A benchmark of expert-level academic questions to assess AI capabilities. Nature, 649:1139–1146, 2026.
- (3) Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, and Wanxiang Che. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models, 2025.
- (4) Simin Chen, Yiming Chen, Zexin Li, Yifan Jiang, Zhongwei Wan, Yixin He, Dezhi Ran, Tianle Gu, Haizhou Li, Tao Xie, and Baishakhi Ray. Benchmarking large language models under data contamination: A survey from static to dynamic evaluation. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 10080–10098, Suzhou, China, November 2025. Association for Computational Linguistics.
- (5) Simin Chen, Pranav Pusarla, and Baishakhi Ray. Dynamic benchmarking of reasoning capabilities in code large language models under data contamination. In Proceedings of the 42nd International Conference on Machine Learning (ICML). PMLR, 2025.
- (6) Google DeepMind. Gemini 3.1 Pro model card, February 2026.
- (7) DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning, 2025.
- (8) DeepSeek-AI. DeepSeek-V4: Towards highly efficient million-token context intelligence, 2026.
- (9) GLM-5 Team (Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, et al.). GLM-5: From vibe coding to agentic engineering, 2026.
- (10) Vipul Gupta, David Pantoja, Candace Ross, Adina Williams, and Megan Ung. Changing answer order can decrease MMLU accuracy. arXiv preprint arXiv:2406.19470, 2024.
- (11) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In Proceedings of the International Conference on Learning Representations (ICLR), 2021.
- (12) Valentin Hofmann, David Heineman, Ian Magnusson, Kyle Lo, Jesse Dodge, Maarten Sap, Pang Wei Koh, Chun Wang, Hannaneh Hajishirzi, and Noah A. Smith. Fluid language model benchmarking. In Second Conference on Language Modeling, 2025.
- (13) Mehran Kazemi, Bahare Fatemi, Hritik Bansal, John Palowitch, Chrysovalantis Anastasiou, Sanket Vaibhav Mehta, Lalit K. Jain, Virginia Aglietti, Disha Jindal, Yuanzhu Peter Chen, et al. BIG-Bench Extra Hard. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 26473–26501, 2025.
- (14) Bogdan Kostić, Conor Fallon, Julian Risch, and Alexander Löser. Same meaning, different scores: Lexical and syntactic sensitivity in LLM evaluation. arXiv preprint arXiv:2602.17316, 2026.
- (15) Ehsan Latif, Yifan Zhou, Shuchen Guo, Yizhu Gao, Lehong Shi, Matthew Nyaaba, Arne Bewerdorff, Xiantong Yang, and Xiaoming Zhai. Comparative evaluation of OpenAI o1 and human performance in higher order cognition. Scientific Reports, 2025.
- (16) Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E. Gonzalez, and Ion Stoica. From crowdsourced data to high-quality benchmarks: Arena-Hard and BenchBuilder pipeline. arXiv preprint arXiv:2406.11939, 2024.
- (17) Yanhong Li, Tianyang Xu, Kenan Tang, Karen Livescu, David McAllester, and Jiawei Zhou. OKBench: Democratizing LLM evaluation with fully automated, on-demand, open knowledge benchmarking, 2025.
- (18) Hanmeng Liu, Jian Liu, Leyang Cui, Zhiyang Teng, Nan Duan, Ming Zhou, and Yue Zhang. LogiQA 2.0: An improved dataset for logical reasoning in natural language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:2947–2962, 2023.
- (19) Hanmeng Liu, Zhiyang Teng, Leyang Cui, Chaoli Zhang, Qiji Zhou, and Yue Zhang. LogiCoT: Logical chain-of-thought instruction tuning. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 2908–2921, 2023.
- (20) F. M. Lord. Applications of Item Response Theory to Practical Testing Problems. L. Erlbaum Associates, 1980.
- (21) Nishanth Madhusudhan, Sathwik Tejaswi Madhusudhan, Vikas Yadav, and Masoud Hashemi. Do LLMs know when to not answer? Investigating abstention abilities of large language models. In Proceedings of the 31st International Conference on Computational Linguistics, pages 9329–9345, 2025.
- (22) Alan Malek, Jiawei Ge, Nevena Lazic, Chi Jin, András György, and Csaba Szepesvári. Frontier LLMs still struggle with simple reasoning tasks, 2025.
- (23) Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. GSM-Symbolic: Understanding the limitations of mathematical reasoning in large language models, 2024.
- (24) Luke Moffett and Bhuwan Dhingra. Close or cloze? Assessing the robustness of large language models to adversarial perturbations via word recovery. In Proceedings of the 31st International Conference on Computational Linguistics, pages 6999–7019, Abu Dhabi, UAE, January 2025. Association for Computational Linguistics.
- (25) Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling, 2025.
- (26) NVIDIA (Aarti Basant, Abhijit Khairnar, Abhijit Paithankar, Abhinav Khattar, et al.). NVIDIA Nemotron Nano 2: An accurate and efficient hybrid Mamba-Transformer reasoning model, 2025.
- (27) Jaden Park, Mu Cai, Feng Yao, Jingbo Shang, Soochahn Lee, and Yong Jae Lee. Contamination detection for VLMs using multi-modal semantic perturbation. In International Conference on Learning Representations, 2026.
- (28) Pouya Pezeshkpour and Estevam Hruschka. Large language models' sensitivity to the order of options in multiple-choice questions. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 2006–2017, 2024.
- (29) Qwen Team. QwQ-32B: Embracing the power of reinforcement learning, March 2025.
- (30) Eva Sánchez Salido, Julio Gonzalo, and Guillermo Marco. None of the others: A general technique to distinguish reasoning from memorization in multiple-choice LLM evaluation benchmarks. arXiv preprint arXiv:2502.12896, 2025.
- (31) Parshin Shojaee*, Iman Mirzadeh*, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar. The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity. In NeurIPS, 2025.
- (32) OpenAI (Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, et al.). OpenAI GPT-5 system card, 2026.
- \(33\)Yifan Sun, Han Wang, Dongbai Li, Gang Wang, and Huan Zhang\.The emperor’s new clothes in benchmarking? a rigorous examination of mitigation strategies for llm benchmark data contamination\.arXiv preprint arXiv:2503\.16402, 2025\.
- \(34\)Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, , and Jason Wei\.Challenging big\-bench tasks and whether chain\-of\-thought can solve them\.arXiv preprint arXiv:2210\.09261, 2022\.
- \(35\)Zhi Rui Tam, Cheng\-Kuang Wu, Chieh\-Yen Lin, and Yun\-Nung Chen\.None of the above, less of the right parallel patterns in human and llm performance on multi\-choice questions answering\.InFindings of the Association for Computational Linguistics: ACL 2025, pages 20112–20134, 2025\.
- \(36\)Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al\.Kimi k2\. 5: Visual agentic intelligence\.arXiv preprint arXiv:2602\.02276, 2026\.
- \(37\)Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh\.Universal adversarial triggers for attacking and analyzing nlp, 2021\.
- \(38\)Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Benjamin Feuer, Siddhartha Jain, Ravid Shwartz\-Ziv, Neel Jain, Khalid Saifullah, Sreemanti Dey, Shubh\-Agrawal, Sandeep Singh Sandha, Siddartha Venkat Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, and Micah Goldblum\.Livebench: A challenging, contamination\-free LLM benchmark\.InThe Thirteenth International Conference on Learning Representations, 2025\.
- \(39\)Xiaobao Wu, Liangming Pan, Yuxi Xie, Ruiwen Zhou, Shuai Zhao, Yubo Ma, Mingzhe Du, Rui Mao, Anh Tuan Luu, and William Yang Wang\.AntiLeakBench: Preventing data contamination by automatically constructing benchmarks with updated real\-world knowledge\.In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pages 18403–18419, Vienna, Austria, July 2025\. Association for Computational Linguistics\.
- \(40\)Chulin Xie, Yangsibo Huang, Chiyuan Zhang, Da Yu, Xinyun Chen, Bill Yuchen Lin, Bo Li, Badih Ghazi, and Ravi Kumar\.On memorization of large language models in logical reasoning\.In Kentaro Inui, Sakriani Sakti, Haofen Wang, Derek F\. Wong, Pushpak Bhattacharyya, Biplab Banerjee, Asif Ekbal, Tanmoy Chakraborty, and Dhirendra Pratap Singh, editors,Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia\-Pacific Chapter of the Association for Computational Linguistics, pages 2742–2785, Mumbai, India, December 2025\. The Asian Federation of Natural Language Processing and The Association for Computational Linguistics\.
- \(41\)An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al\.Qwen3 technical report\.arXiv preprint arXiv:2505\.09388, 2025\.
- \(42\)Gulsum Yigit and Mehmet Fatih Amasyali\.Adversarial distractor generation for mcqa: Leveraging in\-context learning and rule\-based approaches\.Natural Language Processing Journal, 13:100186, 2025\.
- \(43\)Qihao Zhao, Yangyu Huang, Tengchao Lv, Lei Cui, Qinzheng Sun, Shaoguang Mao, Xin Zhang, Ying Xin, Qiufeng Yin, Scarlett Li, and Furu Wei\.Mmlu\-cf: A contamination\-free multi\-task language understanding benchmark, 2024\.

## Appendix A Implementation Details

All models were accessed through their respective official public APIs: GPT\-5\.4 and o3 \(OpenAI\), Claude\-Opus\-4\.6 \(Anthropic\), DeepSeek\-R1/V3\.2/V4\-Pro \(DeepSeek\), Gemini\-3\.1\-pro \(Google\), Kimi\-k2\.5 \(Moonshot\), GLM\-5/4\.7 \(Zhipu\), Qwen3\.5/3\.6\-Plus \(Alibaba\)\.

### A.1 Model Evaluation Prompt

The following system prompt was used for all model evaluations on both Base and Combinatorial subsets:

You are taking a multiple-choice logic test.

Instructions:

- Read the context carefully (may contain statements I, II, III, IV)

- Evaluate each option (A, B, C, D, etc.)

- Select ALL options that are correct (may be one or more)

- You may show your reasoning, but put your FINAL ANSWER as just the letter(s) on the last line

- Format: "A" or "A,B" or "B,C,D"
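For scoring, the final line of each response must be parsed into a set of option letters. The paper does not publish its parser; the sketch below is one plausible implementation of the stated format ("A", "A,B", "B,C,D"), and the function name `parse_final_answer` is our own.

```python
import re

def parse_final_answer(response: str, n_options: int = 6) -> set[str]:
    """Parse the last answer-like line of a model response into option letters.

    Illustrative sketch, not the authors' code: scans non-empty lines from the
    bottom and accepts the first one consisting only of valid option letters
    separated by commas/whitespace, as required by the evaluation prompt.
    """
    valid = {chr(ord("A") + i) for i in range(n_options)}
    for line in reversed(response.strip().splitlines()):
        tokens = [t.strip().upper() for t in re.split(r"[,\s]+", line.strip()) if t.strip()]
        letters = set(tokens)
        if letters and letters <= valid:
            return letters
    return set()  # no well-formed final answer line found
```

For example, `parse_final_answer("...reasoning...\nB, D")` yields `{"B", "D"}`, matching the multi-select ground-truth format used throughout Appendix B.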

The user prompt presented the question text directly, with no additional prefix for the Base subset. For the Combinatorial subset, the prompt included the extended context 𝒞′ with atomized statements 𝒮 and compound formulas Φ.

### A.2 Cognitive Scoring Prompts

The original prompts were written in Chinese for Kimi-k2.5; the following English translations are provided for reproducibility.

Standard mode \(default\):

Please solve the following logical reasoning problem and show your thinking process in detail.

{base}

Requirements:

1. Analyze step by step, explaining the reasoning basis for each step

2. Evaluate each option (correct, or reason for elimination)

3. Finally provide a definitive answer

Please begin:

Strict mode:

Please solve the following logical reasoning problem. Before giving the final answer, show your complete thinking process:

{base}

Please begin your detailed reasoning:

Minimal mode:

{base}

Please reason in detail before giving your answer:

Human evaluators received the following instruction: "Please answer all 30 multiple\-choice questions to the best of your ability\. There is no time limit\. Each question has one or more correct options\. Select all that apply\."

### A.3 Combinatorial Synthesis Algorithm

Algorithm [1](https://arxiv.org/html/2605.07268#alg1) provides the pseudocode for the deterministic combinatorial protocol described in Section [3.1](https://arxiv.org/html/2605.07268#S3.SS1).

Algorithm 1: Combinatorial Synthesis

Input: atomic question 𝒬 = (𝒞, 𝒪, a) with |𝒪| = 4; difficulty configuration ℒ_ops

Output: combinatorial question 𝒬′ = (𝒞′, Φ, 𝒜)

1. Initialize 𝒫 ← {p_I, p_II, p_III, p_IV}; compute 𝒯 from a, where 𝒯(p_a) = ⊤; initialize candidate pools Φ_⊤ ← ∅ and Φ_⊥ ← ∅.

2. Phase 1, generate valid formulas (evaluate to ⊤ under 𝒯): add EXACT_a : p_a ∧ ⋀_{j≠a} ¬p_j to Φ_⊤; if ℒ_ops allows ∨, add p_a ∨ p_j to Φ_⊤ for each j ≠ a; if ℒ_ops allows ¬, add ¬p_j to Φ_⊤ for each j ≠ a (true since 𝒯(p_j) = ⊥), and if ℒ_ops additionally allows compound ∧, add ¬p_j ∧ ¬p_k to Φ_⊤ for each pair (j, k) with j, k ≠ a.

3. Phase 2, generate distractors (evaluate to ⊥ under 𝒯): add EXACT_j : p_j ∧ ⋀_{k≠j} ¬p_k to Φ_⊥ for each j ≠ a; if ℒ_ops allows ∨, add distractor disjunctions not covering p_a to Φ_⊥; if ℒ_ops allows ¬, add ¬p_a to Φ_⊥ (false since 𝒯(p_a) = ⊤); add the universal distractor NONE : ⋀_i ¬p_i to Φ_⊥.

4. Phase 3, select and assemble: draw the target count n_correct ∼ Uniform(ℒ_ops.n_correct); if ℒ_ops requires ¬ (resp. ∨), enforce at least one ¬ (resp. ∨) formula in the final selection; sample remaining formulas from Φ_⊤ using difficulty weights w until |Φ_∩| = n_correct; sample n_out − n_correct formulas from Φ_⊥ uniformly; set Φ ← RandomShuffle(Φ_∩ ∪ Φ_distractor), preserving option indices, and 𝒜 ← {j : φ_j ∈ Φ_∩}.

5. Return (𝒞 ∪ {s_I, s_II, s_III, s_IV}, Φ, 𝒜).
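The synthesis protocol can be exercised end-to-end with a short script. The sketch below is an illustrative re-implementation, not the authors' code: formulas are represented as plain strings, the ¬/∨ enforcement step and difficulty weights w are simplified to uniform sampling, and all names are our own.

```python
import itertools
import random

LABELS = ("I", "II", "III", "IV")

def synthesize(answer, allow_or=True, allow_not=True, allow_and=True,
               n_correct=2, n_out=6, seed=0):
    """Sketch of the combinatorial synthesis protocol: build pools of
    formulas that are true/false under the truth assignment induced by the
    atomic answer, then assemble a shuffled multi-select option list."""
    rng = random.Random(seed)
    others = [p for p in LABELS if p != answer]

    # Phase 1: formulas true under T (only p_answer is true).
    pool_true = [f"only {answer}"]                       # EXACT_a
    if allow_or:
        pool_true += [f"{answer} or {j}" for j in others]
    if allow_not:
        pool_true += [f"not {j}" for j in others]        # true: T(p_j) = false
        if allow_and:
            pool_true += [f"not {j} and not {k}"
                          for j, k in itertools.combinations(others, 2)]

    # Phase 2: distractors, false under T.
    pool_false = [f"only {j}" for j in others]           # EXACT_j, j != a
    if allow_or:
        pool_false += [f"{j} or {k}" for j, k in itertools.combinations(others, 2)]
    if allow_not:
        pool_false.append(f"not {answer}")               # false: T(p_a) = true
    pool_false.append("none of I-IV")                    # universal distractor

    # Phase 3: sample, shuffle, and record the ground-truth letter set.
    correct = rng.sample(pool_true, n_correct)
    options = correct + rng.sample(pool_false, n_out - n_correct)
    rng.shuffle(options)
    answer_set = {chr(ord("A") + i) for i, o in enumerate(options) if o in correct}
    return options, answer_set
```

A call such as `synthesize("II")` turns a single-answer item into a six-option multi-select judgment whose ground truth is a set of letters, mirroring the 𝒬 → 𝒬′ transformation.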

### A.4 IRT-CAT Engine Configuration

Table [5](https://arxiv.org/html/2605.07268#A1.T5) summarizes the hyperparameters for the adaptive testing engine.

Table 5: IRT-CAT engine configuration.

## Appendix B Full Case Study Responses

This appendix provides complete, unedited model responses for representative cases discussed in Section [5.1](https://arxiv.org/html/2605.07268#S5.SS1). All cases are drawn from the Hard-Mode IRT-CAT evaluation (S_gold ≥ 25). Four diagnostic cases illustrate the failure spectrum. (i) Over-selection: correct application of De Morgan's laws on an Expert-tier question, but inclusion of an option that is technically true in isolation rather than accurately characterizing the complete truth pattern. (ii) Under-selection: correct identification that statements II and III challenge a conclusion, yet omission of a compound formula (¬p_III ∧ ¬p_IV) that was implicitly evaluated correctly; this pattern dominates at 89.9% (367/408) of all failures. (iii) Reasoning chain collapse: a Hard-tier puzzle requiring 16+ constraint steps, where the model loses track after 14 deductions and reaches an ambiguous disjunction. (iv) Base success: perfect accuracy on a 2-step Base question, confirming that the collapse is specific to propositional composition rather than domain unfamiliarity. These cases reveal a combinatorial mapping gap: models succeed at Stage 1 (logical analysis of atomic truth assignments) but fail at Stage 2 (translation into multi-select option sets). The multiplicative complexity of Stage 2 (n + m × k operations for n propositions and m options with k literals per option formula) exceeds current architectures' reliable reasoning depth (5–8 steps), explaining the reasoning chain degeneration.
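The two-stage operation count can be made concrete: Stage 1 assigns truth values to the n atomic statements, and Stage 2 checks the literals of each of the m compound options. The following sketch (hypothetical code, not from the paper) tallies these elementary operations while computing the multi-select answer:

```python
def evaluate_options(truth, options):
    """Evaluate compound options under a truth assignment for statements I-IV.

    Each option maps a letter to (connective, literals), where a literal
    (stmt, negated) reads as p_stmt or NOT p_stmt. Counts elementary literal
    evaluations to illustrate the n + m*k operation budget.
    """
    ops = len(truth)                        # Stage 1: n atomic truth judgments
    answer = set()
    for letter, (connective, literals) in options.items():
        vals = []
        for stmt, negated in literals:      # Stage 2: k literal checks per option
            vals.append(truth[stmt] != negated)
            ops += 1
        if any(vals) if connective == "or" else all(vals):
            answer.add(letter)
    return answer, ops

# Worked example: only statement I is true; three two-literal options.
truth = {"I": True, "II": False, "III": False, "IV": False}
options = {
    "A": ("or",  [("I", False), ("II", False)]),    # I or II -> true
    "B": ("and", [("III", True), ("IV", True)]),    # not III and not IV -> true
    "C": ("and", [("II", False), ("III", False)]),  # II and III -> false
}
answer, ops = evaluate_options(truth, options)      # {"A", "B"}, 4 + 3*2 = 10 ops
```

Even this toy instance needs 10 elementary operations; the Expert-tier items with six compound options push the total well past the 5–8 step range cited above.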

### B.1 Case 1: Perfect Match (Claude-Opus-4.6, F1=1.00)

Question ID: LOGI-7222fa3b44f7. Tier: Hard. Operators: NOT

Ground Truth: B, D (2 correct options). Claude-Opus-4.6 Prediction: B, D (exact match)

> ``` I need to analyze the logical statements from Xiao Li’s predictions. **Premises:** 1. Old Zhang wins OR Old Yan wins (P ∨ Q) 2. If Old Zhang wins → overseas project damaged (P → R) 3. If Old Yan wins → domestic project paused (Q → S) **Evaluating each statement:** **Statement I:** "The company’s overseas project might not be damaged, and domestic product development project won’t be paused." From premises: P ∨ Q. If P, then R. If Q, then S. So from P ∨ Q, we get R ∨ S (overseas damaged OR domestic paused). This means it’s NOT possible that both "overseas not damaged" AND "domestic not paused." Statement I says both could be the case (¬R ∧ ¬S is possible), but from the premises, R ∨ S must hold. So Statement I is **false**. **Statement II:** "If overseas project is damaged, then Old Zhang won the position." R → P? We know P → R, but R → P is the converse, which doesn’t logically follow. The overseas project could be damaged for other reasons not specified. However, within this logical system, R is only introduced through P → R. But strictly in propositional logic, we cannot conclude R → P from P → R alone. Statement II is **not derivable**. **Statement III:** "Overseas project will be damaged, or domestic project will be paused." From P ∨ Q, and P → R, Q → S: If P then R, if Q then S. So P ∨ Q gives us R ∨ S. Statement III **follows logically**. This is **true**. **Statement IV:** "If overseas project is damaged AND domestic project is paused, then both Zhang and Yan won." (R ∧ S) → (P ∧ Q)? Again, R doesn’t necessarily imply P, and S doesn’t necessarily imply Q in pure logic. Statement IV is **not derivable**. **Now evaluating the options:** - **A. I or II**: I is false, II is false. "I or II" means at least one is correct. Neither is correct from the premises. So A is **false**. - **B. not-IV (Not IV)**: IV is not derivable/true, so "Not IV" seems correct. Since IV cannot be derived from the premises, saying "not IV" (IV is not a valid conclusion) is **true**. - **C. only II (Only II)**: II is not derivable, so C is **false**. - **D. not-II (Not II)**: II is not derivable, so "Not II" is **true**. - **E. none of the above holds (None of the above)**: But B and D seem correct... Both B and D appear to be true statements. B, D ```

Analysis: Claude-Opus-4.6 demonstrates flawless Stage 1 logical analysis (correctly formalizing premises and evaluating all four statements) and perfect Stage 2 combinatorial mapping (identifying that B and D are the only true options). This represents the 60% of cases where strong models achieve perfect accuracy when reasoning chain length is manageable (~10 steps).

### B.2 Case 2: Systematic Under-Selection (Claude-Opus-4.6, F1=0.86)

Question ID: LOGI-9a4601e9ed26. Tier: Expert. Operators: NOT + OR

Ground Truth: A, B, E, F (4 correct options). Claude-Opus-4.6 Prediction: A, E, F (3 options, missed B)

> ``` I need to figure out which statements, if true, most effectively challenge the scholar’s argument. The scholar’s argument: Since non-smoking/non-drinking ethnic groups live 5-10 years shorter than smoking/drinking groups, smoking and drinking may not affect lifespan. The key weakness is that this is a correlation-based argument that ignores confounding variables (other factors that differ between these groups). Let me evaluate each statement: **I.** Smoking and drinking provide pleasure and stress relief This actually *supports* the scholar’s view (explains why smokers/ drinkers might live longer), doesn’t challenge it. **II.** Those ethnic groups that avoid tobacco/alcohol also have monotonous diets with extremely high calories This provides an **alternative explanation** for why those groups have shorter lifespans. It’s not the absence of smoking/drinking that shortens their lives, but their poor diet. This **challenges** the scholar’s conclusion by pointing out a confounding variable. **III.** Smoking and drinking are major causes of lung and liver cancer, which are the top two causes of death This directly challenges the conclusion by showing smoking/drinking DO affect lifespan through deadly diseases. This **challenges** the argument. **IV.** Tobacco and alcohol stimulate dopamine, relaxing organs and promoting body repair This actually *supports* the scholar’s view, doesn’t challenge it. So statements that challenge the argument: **II and III** (most effectively II, as it directly undermines the comparative evidence). Now let me evaluate the options: **A. "II is true"** - meaning II is one that challenges the argument. Yes, II does challenge it. **B. "Not III and not IV"** - meaning neither III nor IV challenges the argument. But III DOES challenge it, so this is false. **C. "Only IV"** - IV doesn’t challenge, it supports. False. **D. "None of the above are correct"** - Let me check if A is correct first. **E. 
"II or I"** - I doesn’t challenge (it supports), but II does. The statement "II or I" is logically true if at least one is true. Since II does challenge, "II or I" as a disjunction is true. **F. "II or IV"** - IV doesn’t challenge, but II does. "II or IV" is true since II challenges. Actually, if A is true, then E is true (II ⇒ (II ∨ I)), and F is true (II ⇒ (II ∨ IV)). So A, E, and F would all be correct. A, E, F ```

Analysis: Claude-Opus-4.6 correctly completes Stage 1 (identifying II and III as challenging, I and IV as non-challenging) but fails in Stage 2. The model misses option B (¬p_III ∧ ¬p_IV) because it conflates “III challenges” with “not-III is false”, failing to recognize that option B describes which statements *do not* challenge, a meta-linguistic judgment requiring re-evaluation of the already-completed Stage 1 under a different logical frame. This is the dominant failure mode (367/408 failures, 89.9%).

### B.3 Case 3: Severe Under-Selection on Complex Constraints (Claude-Opus-4.6, F1=0.50)

Question ID: LOGI-f364bc97749a. Tier: Hard. Operators: NOT

Ground Truth: B, D, E (3 correct options). Claude-Opus-4.6 Prediction: B (1 option, severe under-selection)

> ``` I need to figure out what day "today" is and when each car is restricted. **Given information:** - E is restricted on Thursday - B was restricted yesterday - A and C can both drive for 4 consecutive days starting today - E can drive tomorrow - Each car is restricted one weekday (Mon-Fri), weekends no restrictions - At least 4 cars available each day **Step 1: Determine today’s day.** E is restricted Thursday. E can drive tomorrow, meaning tomorrow is not Thursday. So today is not Wednesday. A and C can drive for 4 consecutive days starting today. Each is restricted one weekday. If today is a weekday, then in a 4-day stretch, we need both A and C to not be restricted on any of those days. Let me consider possible days for "today": **If today is Thursday:** - 4 consecutive days: Thu, Fri, Sat, Sun. A and C must not be restricted Thu or Fri (Sat/Sun are free). So A and C are restricted on Mon, Tue, or Wed. - E can drive tomorrow (Friday) ✓ (E is restricted Thursday) - B was restricted yesterday (Wednesday) - So: B=Wed, E=Thu, and A,C restricted on Mon/Tue/Wed. But B=Wed, so A and C are on Mon and Tue (in some order). D gets the remaining day: Friday. - Check at least 4 cars daily: - Mon: A restricted → B,C,D,E = 4 ✓ - Tue: C restricted (say) → A,B,D,E = 4 ✓ - Wed: B restricted → 4 cars ✓ - Thu: E restricted → 4 cars ✓ - Fri: D restricted → 4 cars ✓ This works! **If today is Saturday:** - 4 consecutive days: Sat, Sun, Mon, Tue. A and C must not be restricted Mon or Tue. So A and C are restricted Wed, Thu, or Fri. But E is restricted Thu, so A and C are among {Wed, Fri} → exactly 2 slots for 2 cars. - E can drive tomorrow (Sunday) ✓ - B was restricted yesterday (Friday). But if C is restricted Friday, then B and C both restricted Friday → impossible. - So B=Friday means A must be Wed and C=Fri, but then B=Fri and C=Fri conflict! - This doesn’t work. **Today is not Saturday.** **If today is Friday:** - 4 consecutive days: Fri, Sat, Sun, Mon. A and C not restricted Fri or Mon. So A,C ∈ {Tue, Wed, Thu}. But E=Thu, so A,C ∈ {Tue, Wed}. - B restricted yesterday = Thursday. But E=Thursday. Conflict! **Not Friday.** **Other days:** Today=Monday: tomorrow=Tuesday, E drives ✓. 4 days: Mon-Thu. A,C not restricted Mon-Thu. A,C restricted Friday. But two cars can’t share Friday. ✗ Today=Tuesday: 4 days Tue-Fri. A,C not restricted Tue-Fri. Both restricted Monday → impossible. ✗ Today=Sunday: B restricted yesterday=Saturday. But Saturday has no restrictions. ✗ **So today must be Thursday.** **Checking the statements:** - I. Today is Thursday ✓ **TRUE** - II. Today is Saturday **FALSE** - III. A car restricted Wednesday: We showed A is restricted Mon or Tue, not necessarily Wed. **Not necessarily true** - IV. C car restricted Friday: C is restricted Mon or Tue, not Friday. **FALSE** Now evaluating options: - A. IV or II: IV is false AND II is false, so "IV or II" ... this means at least one is true? Both false. **No.** - B. Only I: **Yes**, I is the only definitely correct statement. - C. All fail: No, I is correct. - D. Non IV: IV is false, so "not IV" is true. **Yes.** - E. Non III: III is not necessarily true, so "not III" is true. **Yes.** Wait, but B says "only I" is correct among I-IV. D says "not IV" is correct. E says "not III" is correct. These are about which statements are "definitely correct." III is not *necessarily* correct, and IV is definitely wrong. The question asks which statements are "definitely correct." Only I is definitely correct. So B is the answer. **B** ```

Analysis: Claude-Opus-4.6 performs excellent Stage 1 constraint satisfaction: it systematically eliminates six possible days, correctly identifies Thursday, and accurately evaluates all four statements (I true, II false, III not-necessarily-true, IV false). The model even explicitly notes in its reasoning that D (not-IV) and E (not-III) are true statements. However, it ultimately selects only option B, falling into the early-exit trap: the model recognizes multiple correct options but defaults to selecting only the “most direct” answer. This demonstrates the training-paradigm defect: models are optimized for finding “the” answer rather than “all” correct answers, even when their own reasoning traces contain the evidence for completeness.
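The per-case F1 values quoted in the case headings are consistent with standard set-based precision/recall over the selected option letters. A minimal sketch (our own code, not the paper's, but reproducing the reported scores):

```python
def multiselect_f1(gold, pred):
    """Set-based F1 over selected option letters (0.0 when either set is empty)."""
    gold, pred = set(gold), set(pred)
    if not gold or not pred:
        return 0.0
    tp = len(gold & pred)                  # correctly selected options
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# The three Claude-Opus-4.6 cases above:
assert multiselect_f1({"B", "D"}, {"B", "D"}) == 1.0                        # Case 1
assert round(multiselect_f1({"A", "B", "E", "F"}, {"A", "E", "F"}), 2) == 0.86  # Case 2
assert multiselect_f1({"B", "D", "E"}, {"B"}) == 0.5                        # Case 3
```

Under this metric, pure under-selection keeps precision at 1.0 while recall falls, which is why Case 3's single correct-but-incomplete answer still earns F1=0.50 despite missing two of three options.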

### B.4 Case 4: Base Question Contrast (GPT-5.4, F1=1.00)

Question ID: LOGI-f94c86a2a136. Tier: Base (0-order reasoning)

Ground Truth: E. GPT-5.4 Prediction: E

> ``` The ethicist starts with a principle: experimental treatment without consent is generally wrong because patients have a right to informed choice. Then the argument says: in emergencies, learning the best treatment sometimes requires bypassing consent, so some restricted nonconsensual research should be allowed. For that conclusion to follow, the argument must assume that in at least some emergency cases, the value of gaining that knowledge can override the patient’s usual right to informed consent. That is exactly: E. The right of patients to informed consent is outweighed in at least some medical emergencies by the possible benefits of research conducted without their consent. Why not the others: - A: Not required; doctors’ uncertainty is not the key assumption. - B: Says informed patients would affect research outcomes; not needed. - C: Too strong and patient-specific; the argument is about research benefits generally, not only direct benefit to that patient. - D: Says the patient ceases to have the right entirely, which the argument does not require. E ```

Analysis: This Base question requires only 2 reasoning steps: (1) understand the argument structure, (2) select the correct option. GPT-5.4 achieves perfect accuracy with clear, focused reasoning. Contrast this with the 14–20 step LogiHard combinatorial questions (Cases 1–3), where the same model family fails despite generating substantially longer responses. The Base–Combinatorial asymmetry confirms that the collapse stems from compositional complexity rather than domain knowledge or basic logical competence.
