Formalizing Latent Thoughts: Four Axioms of Thought Representation in LLMs
Summary
Introduces an axiomatic evaluation framework for latent thought representations in LLMs, revealing that current representations fail to satisfy four fundamental functional axioms (Causality, Minimality, Separability, Stability) across 23 reasoning tasks, indicating a structural gap in representation quality.
Cached at: 06/29/26, 02:00 AM
Source: https://huggingface.co/papers/2606.27378
Abstract
An axiomatic evaluation framework reveals systematic failures in latent thought representations of LLMs across multiple reasoning tasks, demonstrating that current representations fail to satisfy fundamental functional axioms consistently across different model architectures.
We introduce an axiomatic evaluation framework for latent thought representations in LLMs, comprising metrics that are independent of downstream benchmark scores and reveal representational failures that benchmark accuracy masks. Existing evaluations conflate representation quality with model capacity. Therefore, failures cannot be attributed to the representation rather than to the model that processes it. We formalize four functional axioms (Causality, Minimality, Separability, and Stability) and define a quantitative measure for each, computed directly on the representation independently of downstream accuracy. We audit open-weight LLMs across 23 reasoning tasks (e.g., Spatial Reasoning, Factual QA). We find that no candidate satisfies all four axioms simultaneously, that the representations distinguish task type reliably but cannot distinguish between two questions within the same task, and that the representations encode little information beyond what is already present in the input embedding. The failure is consistent across dense, reasoning-distilled, and RL-trained model families, indicating that the gap is structural rather than a property of model size or training procedure.
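The abstract does not give the formulas behind the four measures, so the following is only a toy illustration of the reported pattern (task types separate well, questions within a task do not). It is not the paper's Separability metric. The function name `separation_ratio`, the distance-ratio definition, and the 2-D vectors are all assumptions made for this sketch.

```python
import math
from itertools import combinations


def separation_ratio(vectors, labels):
    """Hypothetical separation score (an assumption, not the paper's measure).

    Returns the mean Euclidean distance between differently labeled pairs
    divided by the mean distance between same-labeled pairs. Values near 1
    suggest the labels are not separated; larger values suggest they are.
    """
    same, diff = [], []
    for (u, a), (v, b) in combinations(zip(vectors, labels), 2):
        (same if a == b else diff).append(math.dist(u, v))
    if not same or not diff:
        raise ValueError("need at least one same-label and one different-label pair")
    mean_same = sum(same) / len(same)
    mean_diff = sum(diff) / len(diff)
    # Identical same-label vectors give perfect separation under this score.
    return math.inf if mean_same == 0 else mean_diff / mean_same


# Made-up 2-D "representations": (vector, task label, question label).
# Each question appears twice, standing in for two paraphrases.
toy = [
    ((0.0, 0.0), "spatial", "q1"),
    ((0.2, 0.0), "spatial", "q1"),
    ((0.0, 0.2), "spatial", "q2"),
    ((0.2, 0.2), "spatial", "q2"),
    ((10.0, 10.0), "factual", "q3"),
    ((10.2, 10.0), "factual", "q3"),
    ((10.0, 10.2), "factual", "q4"),
    ((10.2, 10.2), "factual", "q4"),
]

# Task-level separation: label every vector by its task.
task_ratio = separation_ratio([v for v, _, _ in toy], [t for _, t, _ in toy])

# Question-level separation inside a single task.
spatial = [(v, q) for v, t, q in toy if t == "spatial"]
question_ratio = separation_ratio([v for v, _ in spatial], [q for _, q in spatial])

print(f"task-level ratio:     {task_ratio:.2f}")
print(f"question-level ratio: {question_ratio:.2f}")
```

In this toy data the task-level ratio is large while the question-level ratio stays close to 1, which mirrors the qualitative finding in the abstract. Real hidden states are high-dimensional, and the paper's actual measure may be defined quite differently.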
Similar Articles
Why LLMs Hallucinate on Structured Knowledge: A Mechanistic Analysis of Reasoning over Linearized Representations
This paper presents a mechanistic analysis of why LLMs hallucinate when reasoning over linearized structured knowledge, finding that hallucinations stem from systematic internal dynamics such as attention on shortcut cues and failures in semantic grounding in feed-forward layers, rather than random noise.
The strange thing about LLM reasoning research: we're now trying to remove the chain-of-thought traces
The article discusses a shift in LLM reasoning research from making reasoning explicit via chain-of-thought to exploring latent reasoning that doesn't require language traces, questioning whether visibility is necessary for effective reasoning.
Learning to Refine Hidden States for Reliable LLM Reasoning
Proposes ReLAR, a reinforcement-guided latent refinement framework that iteratively updates hidden representations in LLMs before decoding, improving reasoning reliability and efficiency compared to chain-of-thought methods.
LGMT: Logic-Grounded Metamorphic Testing for Evaluating the Reasoning Reliability of LLMs
This paper introduces LGMT, a framework that uses first-order logic to generate semantically invariant test cases for evaluating LLM reasoning reliability. Experiments on six LLMs show that LGMT exposes hidden defects missed by static benchmarks, suggesting evaluation should focus on robustness under logical invariance.
The Periodic Table of LLM Reasoning: A Structured Survey of Reasoning Paradigms, Methods, and Failure Modes
A comprehensive survey analyzing over 300 papers on LLM reasoning, presenting a taxonomy of reasoning paradigms including Chain-of-Thought, Multi-Hop, Mathematical, Commonsense, and others, along with common failure modes and research gaps.