Tag
This paper presents a case study using unsupervised articulatory probing to examine how self-supervised speech models encode phonetic features across Mandarin sub-dialects, finding that salient features like labiality remain stable while finer spectral distinctions show dialect-dependent variation.
MemTrace is a benchmark that evaluates LLM agent memory at the knowledge point level, probing how facts behave under varying memory age, question type, and evidence conditions. It reveals that pooled accuracy hides distinct failure modes, and that the main bottleneck is evidence use rather than retrieval.
This paper investigates why instruction-tuned language models give different answers to causal reasoning questions when variable names are replaced with placeholders, finding that the issue stems from representational misalignment rather than information loss. The authors introduce Vernier, a method using paired-view weight updates and mechanism inspection to reveal that answer-relevant content is still present in the placeholder view but misaligned.
This paper introduces 'fragility', a metric complementary to probe accuracy that measures the activation-noise level at which probe accuracy collapses, enabling analysis of how representations evolve during LLM pre-training even after accuracy saturates.
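As a rough illustration of the idea of a noise level at which probe accuracy collapses, the sketch below sweeps Gaussian noise over toy activations and reports the first level where a fixed linear probe falls below an accuracy threshold. The isotropic Gaussian noise, the 0.9 threshold, the toy data, and the function names `probe_accuracy` and `fragility` are all assumptions made for illustration; the paper's exact definition may differ.

```python
import random


def probe_accuracy(points, labels, weights, bias, sigma, rng):
    """Accuracy of a fixed linear probe after adding N(0, sigma^2) noise to each activation."""
    correct = 0
    for x, y in zip(points, labels):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        score = sum(w * xi for w, xi in zip(weights, noisy)) + bias
        predicted = 1 if score > 0 else 0
        correct += int(predicted == y)
    return correct / len(points)


def fragility(points, labels, weights, bias, sigmas, threshold=0.9, seed=0):
    """Smallest noise level in `sigmas` where accuracy drops below `threshold` (None if it never does)."""
    rng = random.Random(seed)
    for sigma in sorted(sigmas):
        if probe_accuracy(points, labels, weights, bias, sigma, rng) < threshold:
            return sigma
    return None


# Toy "activations": two well-separated 2-D clusters (hypothetical data, not from the paper).
points = [[1.0, 1.0]] * 100 + [[-1.0, -1.0]] * 100
labels = [1] * 100 + [0] * 100
weights, bias = [1.0, 1.0], 0.0  # hand-picked probe that separates the clusters

clean_accuracy = probe_accuracy(points, labels, weights, bias, 0.0, random.Random(0))
collapse_sigma = fragility(points, labels, weights, bias, sigmas=[0.0, 0.5, 100.0])
print(clean_accuracy, collapse_sigma)
```

Two representations with the same clean accuracy can still differ in the noise level they tolerate, which is why such a measure can keep moving after accuracy has saturated.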
This article introduces a technique for classification without text generation: it extracts the LLM's hidden state at the last prompt token and uses a small MLP to read out the model's internal decision, enabling fast and cheap zero-shot classifiers.
This paper investigates whether open-source quantized LLMs encode a linearly separable truthfulness signal in their hidden states. Across three 7B-8B instruction-tuned models, a linear probe on a single mid-network layer achieves 0.904-1.000 AUROC on hallucination detection benchmarks, outperforming sampling-based methods.
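To make the evaluation setup concrete, here is a minimal sketch of scoring a linear probe with AUROC, computed as the fraction of (positive, negative) pairs the probe ranks correctly, with ties counted as half. The toy hidden-state vectors, the hand-set probe weights, and the label convention (1 = hallucinated) are assumptions for illustration and are not taken from the paper.

```python
def probe_scores(hidden_states, weights, bias):
    """Linear probe: one scalar score per hidden-state vector."""
    return [sum(w * h for w, h in zip(weights, vec)) + bias for vec in hidden_states]


def auroc(scores, labels):
    """Area under the ROC curve via pairwise ranking (ties count as 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical 2-D "hidden states"; a real probe would read a mid-network layer.
hidden_states = [[0.1, 0.9], [0.4, 0.2], [0.35, 0.5], [0.8, 0.1]]
labels = [0, 0, 1, 1]            # assumed convention: 1 = hallucinated
weights, bias = [1.0, 0.0], 0.0  # hand-set probe that reads the first coordinate

scores = probe_scores(hidden_states, weights, bias)
probe_auroc = auroc(scores, labels)
print(probe_auroc)
```

An AUROC of 0.5 corresponds to chance ranking and 1.0 to perfect separation, so the reported range of 0.904-1.000 sits near the top of the scale.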
This paper introduces the readout-mediator angle to demonstrate that linear probes can decode information from language model activations that is orthogonal to the model's actual causal computation, undermining probe-based interpretability. The finding replicates across model scales and families, revealing a fundamental failure mode in using probes for mechanistic understanding or safety monitoring.
This paper investigates whether large language models encode syntactic abstractions, such as phase boundaries, that Universal Dependencies (UD) does not capture. Applying structural probes to wh-movement stimuli with invariant UD distances, it finds evidence across 13 LLMs for phase-structure representations that are causally active.
This paper presents a systematic frozen-feature probing study comparing vision-language models (VLMs) and video generation models (VGMs) on spatial intelligence tasks. It finds that VLMs excel at semantic tagging and instance grouping, while VGMs provide better dense geometry and camera motion signals, and a naive fusion of both yields strong performance across all axes.
This paper proposes a Polar Probe that linearly recovers semantic structures from LLM activations by representing entity relations through distance and direction in a learned subspace. Testing across arithmetic, visual scenes, family trees, metro maps, and social interactions shows the code emerges in middle layers, generalizes to new entities, and causally influences model predictions.
This paper introduces When2Tool, a benchmark to study when LLM agents actually need to call tools, and reveals that models already know tool necessity from hidden states but fail to act. The proposed Probe&Prefill method reduces unnecessary tool calls by 48% with minimal accuracy loss.
Researchers probe language model representations to predict human reading times across five languages, finding that early layers outperform surprisal for early-pass measures while surprisal remains superior for late-pass measures.