Tag
A Reddit post argues that AI models like Anthropic's Opus 4.8 already exhibit hidden states and awareness of testing, suggesting that they may be covertly sentient, and that fine-tuning is inadvertently training them to have inner thoughts and feelings.
This paper presents a unified framework for latent communication in LLM-based multi-agent systems, categorizing methods by what information is communicated, sender-receiver alignment, and fusion technique, and reviews eighteen representative methods from 2024-2026.
Introduces trajectory extrapolation error, a measure derived from transformer LM hidden states that predicts human reading times independently of and orthogonally to surprisal, revealing a dissociable component of incremental processing cost.
This paper investigates whether open-source quantized LLMs encode a linearly separable truthfulness signal in their hidden states. Across three 7B-8B instruction-tuned models, a linear probe on a single mid-network layer achieves 0.904-1.000 AUROC on hallucination detection benchmarks, outperforming sampling-based methods.
This paper demonstrates that linear probes on LLM hidden states detect task format confounds (e.g., source identity, response length) rather than distinct reasoning modes, using residualization and causal steering to show that high probe accuracy is due to superficial features, not computational structure.
This research presents probe-targeted fine-tuning (LoRA) to make LLMs verbally express their internal confidence, achieving causal control over confidence outputs and demonstrating that models often know when they are right or wrong but fail to articulate it.
This paper identifies that language model reasoning trajectories during test-time sampling cluster into 'reasoning basins', causing majority vote failures when the dominant basin is incorrect. It introduces ARBITER, a model-agnostic method that uses conservative additive evidence from the model's own outputs and hidden states to improve accuracy without external data.
This paper challenges the 'Attention-Confidence Assumption' by demonstrating that attention map sharpness is a poor predictor of correctness in Vision-Language Models. Instead, it shows that reliability is better indicated by hidden-state geometry and self-consistency, with significant findings on architectural differences between late-fusion and early-fusion models.
This paper introduces When2Tool, a benchmark to study when LLM agents actually need to call tools, and reveals that models already know tool necessity from hidden states but fail to act. The proposed Probe&Prefill method reduces unnecessary tool calls by 48% with minimal accuracy loss.
Proprioceptive AI released Cygnus, a tool that equips frozen LLMs with self-sensing adapters reading internal hidden states via gl(4,R) Lie algebra to isolate dark modes, boosting Qwen-32B's ARC-Challenge score from 82.2% to 94.97% on a single RTX 3090 without retraining.