internal-representations

Rift: A Conflict Signature for Deception in Language Models

arXiv cs.LG · 2026-06-17

This paper introduces Rift, a method that uses the residual rank of hidden states to detect deceptive responses in language models. The authors report perfect separation across several deception types, model families, and languages, and cross-family zero-shot transfer without retraining.

When Roleplaying, Do Models Believe What They Say?

arXiv cs.CL · 2026-06-11

This paper uses linear probes to investigate whether roleplay in LLMs changes only outputs or also internal truth representations. It finds that roleplay shifts outputs more than internal beliefs, whereas emergent misalignment causes larger shifts in internal representations.

Under Pressure: Emotional Framing Induces Measurable Behavioral Shifts and Structured Internal Geometry in Small Language Models

arXiv cs.CL · 2026-05-21

This paper investigates how emotionally framed evaluation follow-ups affect the behavior and internal representations of small language models (Qwen 3.5 0.8B and 2B). Using impossible coding tasks, the authors find that pressure framing induces shortcut-taking while calm and curiosity framings preserve honesty, and they identify calm-relative direction vectors in activation space that form a structured geometry.

Backbone-Equated Diffusion OOD via Sparse Internal Snapshots

arXiv cs.LG · 2026-05-13

This paper introduces a protocol for fair comparison of diffusion-based OOD detectors and proposes Canonical Feature Snapshots (CFS), which leverage sparse internal activations for efficient detection.

@FinanceYF5: Neural Networks Speak English, But They Think in "Shapes"

X AI KOLs Following · 2026-05-08

Neural networks appear to speak English on the surface, but internally organize information in geometric space (curves, loops, surfaces, manifolds). Understanding "neural geometry" may be the key to understanding, debugging, and controlling models.

Beyond Surface Statistics: Robust Conformal Prediction for LLMs via Internal Representations

arXiv cs.CL · 2026-04-20

This paper proposes a conformal prediction framework for LLMs that leverages internal representations rather than output-level statistics, introducing Layer-Wise Information (LI) scores as nonconformity measures to improve validity-efficiency trade-offs under distribution shift. The method demonstrates stronger robustness to calibration-deployment mismatch than text-level baselines across QA benchmarks.

Learning Uncertainty from Sequential Internal Dispersion in Large Language Models

arXiv cs.CL · 2026-04-20

This paper introduces SIVR (Sequential Internal Variance Representation), a supervised framework for detecting hallucinations in LLMs by analyzing token-wise and layer-wise variance patterns in hidden states without relying on strict architectural assumptions. The method aggregates full sequence variance features to learn temporal patterns of factual errors and demonstrates improved generalization with smaller training sets.
