Tag
This paper introduces Rift, a method that uses the residual rank of hidden states to detect deceptive responses in language models. It achieves perfect separation across various deception types, model families, and languages, and demonstrates cross-family zero-shot transfer without retraining.
This paper investigates whether role-playing in LLMs changes only outputs or also internal truth representations, using linear probes. It finds that roleplay shifts outputs more than internal beliefs, while emergent misalignment causes larger shifts in internal representations.
This paper investigates how emotionally framed evaluation follow-ups affect the behavior and internal representations of small language models (Qwen 3.5 0.8B and 2B). Using impossible coding tasks, it finds that pressure framing induces shortcut-taking, while calm and curiosity framings preserve honesty. It also identifies calm-relative direction vectors in activation space that form a structured geometry.
This paper introduces a protocol for fair comparison of diffusion-based OOD detectors and proposes Canonical Feature Snapshots (CFS), which leverage sparse internal activations for efficient detection.
Neural networks appear to speak English on the surface, but internally they organize information geometrically, as curves, loops, surfaces, and manifolds. Understanding this "neural geometry" may be the key to understanding, debugging, and controlling models.
This paper proposes a conformal prediction framework for LLMs that leverages internal representations rather than output-level statistics, introducing Layer-Wise Information (LI) scores as nonconformity measures to improve validity-efficiency trade-offs under distribution shift. The method is more robust to calibration-deployment mismatch than text-level baselines across QA benchmarks.
This paper introduces SIVR (Sequential Internal Variance Representation), a supervised framework for detecting hallucinations in LLMs by analyzing token-wise and layer-wise variance patterns in hidden states without relying on strict architectural assumptions. The method aggregates full sequence variance features to learn temporal patterns of factual errors and demonstrates improved generalization with smaller training sets.
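To make "token-wise and layer-wise variance" in the last summary concrete, here is a minimal sketch of how such features might be computed from a stack of hidden states. It is an illustration only, not the SIVR implementation: the function names, the `hidden[layer][token][dim]` layout, and the choice to average variance over dimensions are all assumptions, and the toy values are made up.

```python
from statistics import fmean, pvariance

# Hypothetical layout: hidden[layer][token][dim], as plain nested lists.

def layer_variance_per_token(hidden):
    """For each token: variance across layers, averaged over dimensions."""
    n_layers, n_tokens, n_dims = len(hidden), len(hidden[0]), len(hidden[0][0])
    return [
        fmean([
            pvariance([hidden[l][t][d] for l in range(n_layers)])
            for d in range(n_dims)
        ])
        for t in range(n_tokens)
    ]

def token_variance_per_layer(hidden):
    """For each layer: variance across tokens, averaged over dimensions."""
    n_tokens, n_dims = len(hidden[0]), len(hidden[0][0])
    return [
        fmean([
            pvariance([layer[t][d] for t in range(n_tokens)])
            for d in range(n_dims)
        ])
        for layer in hidden
    ]

# Toy input (made-up numbers): 2 layers, 2 tokens, 2 dimensions.
hidden = [
    [[0.0, 0.0], [1.0, 1.0]],  # layer 0
    [[2.0, 0.0], [1.0, 1.0]],  # layer 1
]

per_token = layer_variance_per_token(hidden)   # one value per token
per_layer = token_variance_per_layer(hidden)   # one value per layer
print(per_token, per_layer)
```

In a supervised setup like the one the summary describes, sequences of such per-token values would presumably be fed to a classifier. The paper's actual feature definition and model may differ.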