Tag
The article discusses a shift in LLM reasoning research from making reasoning explicit via chain-of-thought to exploring latent reasoning that doesn't require language traces, questioning whether visibility is necessary for effective reasoning.
Introduces ReasoningFlow, a framework to capture discourse structures of large language model reasoning traces as directed acyclic graphs, enabling fine-grained analysis of reasoning behaviors like self-reflection and backtracking. Based on manual and automatic annotation of thousands of traces, it reveals structural similarities across models and that most erroneous steps do not contribute to final answers.
This paper argues that consensus-seeking in multi-agent LLM systems is insufficient for value-laden tasks, proposing a knowledge-representation layer that classifies agent reasoning-trace disagreements into four symbolic states to enable strategic routing in systems like content moderation.
ReasonOps introduces an unsupervised method for annotating chain-of-thought traces from large reasoning models, identifying 7 recurring reasoning operators. The method enables analysis of reasoning structure, model identification, and correctness prediction across 12 models and 8 benchmarks.
This paper reveals that aggregating complete reasoning traces from multiple LLM agents, rather than just their final answers, can correct errors even when agents unanimously agree, introducing the 'aggregation paradox' and the Self-Consistent Mixture of Agents method.
User tested Gemma 4 2B running locally via LM Studio and Spring AI for structured JSON output, tool calling, and reasoning traces, finding it correctly identified a Java bug in code review and performed comparably to larger models.
This paper introduces the concept of 'minimal cores' in overcomplete reasoning traces, showing that on average 46% of steps can be removed while preserving the final answer, and that minimal cores improve trace separation and reduce intrinsic dimensionality.
This paper investigates intrinsic data metrics to predict the utility of reasoning supervision before costly fine-tuning, finding that smaller models benefit from alignment-focused metrics while larger models gain from verbose traces, thus establishing a scale-aware framework for validating reasoning datasets.
This paper introduces a controlled-invariance methodology and two oracle tests (Force and Remove) to determine if LLM hallucination detectors rely on reasoning traces or final answer artifacts. It proposes TRACT, a lightweight scorer using lexical features, which demonstrates robust performance independent of answer-level cues.
This research paper analyzes LLM reasoning traces in the game four-in-a-row, finding that LLMs exhibit myopic planning where performance is driven by shallow search breadth rather than deep lookahead, unlike human experts.
RadAgent is a tool-using AI agent that generates chest CT reports through interpretable step-by-step reasoning, improving clinical accuracy by 36.4% relative and achieving 37% faithfulness—a capability absent in existing 3D vision-language models. The system provides fully inspectable reasoning traces allowing clinicians to validate and refine diagnostic outputs.