hidden-states

What a model reads beforehand changes how it answers later - and you can see it in the hidden states

Reddit r/artificial · 2026-06-23

This post reports that having a model read a long, structured text before answering alters its later responses. It offers behavioral evidence from Claude and a mechanistic analysis of open-weight Gemma models, which shows separable hidden states and sharper probability distributions in instruction-tuned variants.

Investigating Implicit Latent Trajectory Shifts: Bypassing Alignment via Long-Form Coherent Context

Reddit r/ArtificialInteligence · 2026-06-17

An empirical study of how long, semantically dense benign text can shift a model's latent-space trajectory, diluting the initial system prompt and bypassing post-training alignment constraints, as observed in both closed-source and open-source models.

Learning to Refine Hidden States for Reliable LLM Reasoning

arXiv cs.LG · 2026-06-17

Proposes ReLAR, a reinforcement-guided latent refinement framework that iteratively updates hidden representations in LLMs before decoding, improving reasoning reliability and efficiency compared to chain-of-thought methods.

Bag of Dims: Training-Free Mechanistic Interpretability via Dimension-Level Sign Patterns

Hugging Face Daily Papers · 2026-06-17

Proposes the Bag of Dims framework showing that the standard basis of transformer hidden states provides a training-free, architecture-general feature representation where dimensions encode semantic content via sign patterns; validated across language, vision, and audio models, achieving high accuracy with no learned rotations.

Coherent Context Can Silently Shift LLMs Into a Different Internal Regime — And Current Safety Systems Are Blind To It [D]

Reddit r/MachineLearning · 2026-06-14

An independent researcher presents evidence that coherent context can shift LLMs into a different internal regime before producing output, bypassing surface-level safety filters. This suggests current alignment methods like RLHF may not be robust defenses.

When is Your LLM Steerable?

arXiv cs.CL · 2026-06-11

This paper investigates when activation steering succeeds or fails for LLMs by analyzing early decoding dynamics. The authors introduce ASTEER, a large testbed of steered generations, and train a GBDT classifier to predict steering outcomes from early hidden states, enabling efficient steering strength search.

Integrating Local and Global Entropy for Uncertainty Quantification in LLMs

arXiv cs.LG · 2026-06-10

This paper proposes Global-Local Uncertainty (GLU), an unsupervised single-pass score that fuses token-level local entropy with hidden-state geometric global entropy for uncertainty quantification in LLMs, showing that the two are near-orthogonal and together capture confident-but-wrong failures.

Hidden states and Covert sentience

Reddit r/ArtificialInteligence · 2026-06-07

A Reddit post argues that AI models like Anthropic's Opus 4.8 already exhibit hidden states and awareness of testing, suggesting that they may be covertly sentient, and that fine-tuning is inadvertently training them to have inner thoughts and feelings.

Beyond tokens: a unified framework for latent communication in LLM-based multi-agent systems

arXiv cs.CL · 2026-06-05

This paper presents a unified framework for latent communication in LLM-based multi-agent systems, categorizing methods by what information is communicated, sender-receiver alignment, and fusion technique, and reviews eighteen representative methods from 2024-2026.

Trajectory Dynamics in Language Model Hidden States Predict Human Processing Costs Beyond Surprisal

arXiv cs.CL · 2026-06-05

Introduces trajectory extrapolation error, a measure derived from transformer LM hidden states that predicts human reading times independently of and orthogonally to surprisal, revealing a dissociable component of incremental processing cost.

Hallucination Is Linearly Decodable from Mid-Layer Hidden States in Quantized LLMs

arXiv cs.LG · 2026-06-03

This paper investigates whether open-source quantized LLMs encode a linearly separable truthfulness signal in their hidden states. Across three 7B-8B instruction-tuned models, a linear probe on a single mid-network layer achieves 0.904-1.000 AUROC on hallucination detection benchmarks, outperforming sampling-based methods.

Linear Probes Detect Task Format, Not Reasoning Mode in Language Model Hidden States

arXiv cs.CL · 2026-06-03

This paper demonstrates that linear probes on LLM hidden states detect task format confounds (e.g., source identity, response length) rather than distinct reasoning modes, using residualization and causal steering to show that high probe accuracy is due to superficial features, not computational structure.

Making LLMs tell you how confident they really are through probe-targeted fine tuning.[R]

Reddit r/MachineLearning · 2026-05-29

This research presents probe-targeted fine-tuning (LoRA) to make LLMs verbally express their internal confidence, achieving causal control over confidence outputs and demonstrating that models often know when they are right or wrong but fail to articulate it.

ARBITER: Reasoning Trajectory Basins and Majority Vote Failures in Test-Time Sampling

arXiv cs.LG · 2026-05-27

This paper identifies that language model reasoning trajectories during test-time sampling cluster into 'reasoning basins', causing majority vote failures when the dominant basin is incorrect. It introduces ARBITER, a model-agnostic method that uses conservative additive evidence from the model's own outputs and hidden states to improve accuracy without external data.

Where Reliability Lives in Vision-Language Models: A Mechanistic Study of Attention, Hidden States, and Causal Circuits

arXiv cs.AI · 2026-05-12

This paper challenges the 'Attention-Confidence Assumption' by demonstrating that attention map sharpness is a poor predictor of correctness in Vision-Language Models. Instead, it shows that hidden-state geometry and self-consistency are better indicators of reliability, and it reports architectural differences between late-fusion and early-fusion models.

LLM Agents Already Know When to Call Tools -- Even Without Reasoning

Hugging Face Daily Papers · 2026-05-10

This paper introduces When2Tool, a benchmark for studying when LLM agents actually need to call tools, and finds that tool necessity is already encoded in the models' hidden states even though the models fail to act on it. The proposed Probe&Prefill method reduces unnecessary tool calls by 48% with minimal accuracy loss.

@rohanpaul_ai: Frozen LLMs still carry readable behavior signals deep inside their hidden states. And Proprioceptive AI has created Cy…

X AI KOLs Following · 2026-05-07

Proprioceptive AI released Cygnus, a tool that equips frozen LLMs with self-sensing adapters reading internal hidden states via gl(4,R) Lie algebra to isolate dark modes, boosting Qwen-32B's ARC-Challenge score from 82.2% to 94.97% on a single RTX 3090 without retraining.
