Cognitive Episodes in LLM Reasoning Traces Enable Interpretable Human Item Difficulty Prediction
Summary
Introduces Epi2Diff, a framework that maps LLM reasoning traces into cognitive episodes to predict human item difficulty, outperforming baselines and providing interpretable process evidence.
Cached at: 06/29/26, 05:25 AM
# Cognitive Episodes in LLM Reasoning Traces Enable Interpretable Human Item Difficulty Prediction

Source: [https://arxiv.org/abs/2606.28186](https://arxiv.org/abs/2606.28186) [View PDF](https://arxiv.org/pdf/2606.28186)

> Abstract: Predicting human item difficulty is central to educational assessment, where reliable estimates support fairness and effective test construction. Existing methods often depend on costly human calibration or item-level textual representations, providing limited evidence about the cognitive processes that make items difficult. We argue that difficulty should be viewed not only as a property of item text, but also as an observable consequence of the problem-solving burden an item induces. Large Reasoning Models (LRMs) offer scalable process evidence through reasoning traces, but such evidence must be structured to support interpretable modeling. To this end, we introduce Epi2Diff (Episode to Difficulty), a framework that maps LRM reasoning traces into cognitively grounded episode sequences. These episodes group trace segments into functional problem-solving states, enabling difficulty to be modeled through reasoning scale, effort allocation, and state transitions. Epi2Diff extracts compact episode-dynamic features and combines them with semantic item representations for human difficulty prediction. Experiments on four real-world human difficulty datasets show that Epi2Diff consistently outperforms strong baselines, including fine-tuned small language models, LLM in-context learning, and supervised LLM adaptation. On SAT-derived classification benchmarks, Epi2Diff achieves an 8.1% average relative gain over supervised LLM fine-tuning baselines. Further analyses show that harder items induce more effortful, iterative, and implementation-centered episode dynamics, rather than merely longer responses. These results demonstrate that cognitive episodes in LRM reasoning traces provide a predictive and interpretable process representation for human item difficulty, offering a new lens for educational measurement with reasoning models.

## Submission history

From: Chenguang Wang \[[view email](https://arxiv.org/show-email/eee6397a/2606.28186)\]

**\[v1\]** Fri, 26 Jun 2026 15:32:17 UTC (3,478 KB)
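The abstract names three families of episode-dynamic signals: reasoning scale, effort allocation, and state transitions. The sketch below shows one way such features might be computed from an already-segmented trace. It is an illustration only, not the paper's implementation: the episode labels (`read`, `plan`, `implement`, `verify`), the `(label, n_tokens)` input format, the `revisits` count, and the function name `episode_features` are all assumptions, since the abstract does not specify the taxonomy or the exact feature set.

```python
from collections import Counter


def episode_features(episodes):
    """Summarize a sequence of (label, n_tokens) episodes.

    Labels and feature definitions are hypothetical stand-ins for the
    paper's episode taxonomy, which the abstract does not list.
    """
    if not episodes:
        raise ValueError("need at least one episode")

    labels = [label for label, _ in episodes]
    total_tokens = sum(n for _, n in episodes)
    if total_tokens <= 0:
        raise ValueError("total token count must be positive")

    # Effort allocation: share of all tokens spent in each episode type.
    tokens_by_label = Counter()
    for label, n in episodes:
        tokens_by_label[label] += n
    effort = {label: n / total_tokens for label, n in tokens_by_label.items()}

    # State transitions: counts of consecutive (from, to) label pairs.
    transitions = Counter(zip(labels, labels[1:]))

    return {
        # Reasoning scale: how many episodes and tokens the trace contains.
        "n_episodes": len(episodes),
        "total_tokens": total_tokens,
        "effort": effort,
        "transitions": dict(transitions),
        # Crude iteration signal: episodes whose label already appeared.
        "revisits": len(labels) - len(set(labels)),
    }


# Toy trace with made-up token counts.
trace = [
    ("read", 40),
    ("plan", 60),
    ("implement", 200),
    ("verify", 50),
    ("implement", 150),
]
features = episode_features(trace)
```

In a pipeline like the one the abstract describes, a feature vector of this kind would be concatenated with a semantic embedding of the item text before being passed to a difficulty predictor. How Epi2Diff actually performs that combination is not stated in the abstract.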
Similar Articles
HyperLens: Quantifying Cognitive Effort in LLMs with Fine-grained Confidence Trajectory
This paper introduces HyperLens, a high-resolution probe to quantify cognitive effort in LLMs by tracing fine-grained confidence trajectories across layers. It reveals that complex tasks require higher cognitive effort and demonstrates how Supervised Fine-Tuning can reduce this effort, potentially degrading performance.
When Do LLMs Reason? A Dynamical Systems View via Entropy Phase Transitions
This paper investigates when chain-of-thought reasoning is beneficial for LLMs, showing that early-stage entropy dynamics reliably indicate reasoning utility, and introduces EDRM, a lightweight, training-free framework that adaptively selects inference strategies to achieve significant token savings while maintaining or improving accuracy.
Shattering the Autoregressive Curse: Dynamic Epistemic Entropy Orchestrated Erasable Reinforcement Learning for LLMs
This paper proposes E³RL, a reinforcement learning method that uses dynamic epistemic entropy thresholds to enable LLMs to excise local logical defects during generation, overcoming the autoregressive curse in long-horizon reasoning and achieving state-of-the-art results on mathematical reasoning benchmarks like AIME.
Humans Disengage, Reasoning Models Persist: Separating Difficulty Registration from Deliberation Allocation
This paper dissociates difficulty registration from deliberation allocation in large reasoning models (LRMs) and humans, finding that LRMs spend more tokens on problems they get wrong while humans spend less time on failures, revealing opposite within-item patterns despite similar cross-item difficulty correlations.
Decoding Hidden Deception in Reasoning LLMs: Activation Explainers for Deception Auditing
Introduces STATEWITNESS, an activation explainer for auditing deception in reasoning LLMs, achieving significant improvements over existing monitors and providing human-inspectable evidence.