model-interpretability

#model-interpretability

Visual Access Boundaries in Vision-Language Model Reasoning

arXiv cs.AI · 2026-07-15

This paper introduces Visual Access Sweep, a causal intervention method that measures the minimal image-token access a Vision-Language Model needs for reasoning. It finds that Chain-of-Thought prompting improves performance primarily by enabling extended language-side computation over visual information, not by prolonging direct image access.

From Direction to Magnitude: How Multimodal Instruction-Tuning Reorganizes the Geometric Encoding of Identity-Specifying Prompts in Transformer Hidden States

arXiv cs.LG · 2026-07-14

This paper investigates how multimodal instruction-tuning reorganizes the geometric encoding of identity-specifying prompts in transformer hidden states, finding a shift from direction-based to magnitude-based encoding after instruction tuning.

Interactive Jacobian-Lens visualizer and live steerer for GGUF models on llama.cpp

Reddit r/LocalLLaMA · 2026-07-12

An interactive Jacobian-lens visualizer and live steerer for GGUF models running on llama.cpp, enabling real-time model interpretability and control.

Overthinking: Amplifying Reasoning Weights to Extract Learned Secrets

arXiv cs.AI · 2026-07-10

This paper introduces 'overthinking', a technique that amplifies reasoning weights from reasoning-distilled models to induce language models to disclose hidden information, demonstrating up to 10x greater secret leakage across models from 2B to 32B parameters.

WARP: Weight-Space Analysis for Recovering Training Data Portfolios

arXiv cs.LG · 2026-07-03

WARP is a framework that recovers the domain mixture weights of a fine-tuned model from its released weights by generating pseudo-checkpoints via model merging and extracting geometric features. It achieves low mean absolute error on BERT and GPT-2, outperforming membership inference.

Do Thinking Tokens Help with Safety?

Hugging Face Daily Papers · 2026-06-23

This paper investigates whether reasoning models' thinking tokens genuinely improve safety alignment. It finds that safety outcomes are predictable from early hidden representations, that deliberation is largely superficial, and that current safety interventions cause over-refusal.

From Sparse Features to Trustworthy Proxies: Certifying SAE-Based Interpretability

arXiv cs.LG · 2026-06-18

This paper proposes a post-hoc certification framework for sparse autoencoder (SAE) based interpretability, deriving an upper bound on the frozen language model's risk using measurable quantities. The framework is validated on GPT-2 Small, Gemma-2B, and Llama-3-8B, showing non-vacuous bounds and revealing depth-dependent behavior.

Whisper Hallucination Detection and Mitigation via Hidden Representation Steering and Sparse AutoEncoders

Hugging Face Daily Papers · 2026-06-05

This paper demonstrates that Whisper's hallucination failures on silence, noise, or music can be detected and mitigated purely from internal activations using sparse autoencoders, achieving large reductions in hallucination rate without fine-tuning.

A Geometric View of Counterfactual Behavior: Interaction of Boundary Proximity and Local Support

arXiv cs.LG · 2026-06-04

This paper examines counterfactual behavior in ML models through a geometric lens, showing that models with similar predictive performance can differ substantially in counterfactual outcomes due to the interaction between decision-boundary proximity and local data support. The findings identify counterfactual behavior as a distinct dimension from predictive performance, with implications for model selection and reliability of counterfactual explanation methods.

Function-Valued Causal Influence in Nonlinear Time Series

arXiv cs.LG · 2026-05-27

This paper argues that scalar edge scores in nonlinear causal discovery obscure state-dependent effects, and proposes function-valued causal influence using Neural Additive Vector Autoregression and Individual Conditional Expectation.

Cultural Value Alignment Via Latent Activation Steering in Large Language Models

arXiv cs.CL · 2026-05-27

A framework for evaluating and steering cultural values in LLMs using scenario-based behavioral probing and activation steering, revealing latent entanglement of value dimensions.

How Faithful Is Trajectory-Based Data Attribution? Error Sources, Remedies, and Practical Guidelines

arXiv cs.LG · 2026-05-20

This paper provides the first systematic analysis of error sources in trajectory-based data attribution methods, identifies optimizer mismatch as the dominant error, proposes AdamW-influence to address it, and offers practical guidelines for data selection via a K-step look-ahead framework.

Evaluating Explainability in Safety-Critical ATR Systems: Limitations of Post-Hoc Methods and Paths Toward Robust XAI

arXiv cs.AI · 2026-05-08

This paper evaluates explainability methods in safety-critical Automatic Target Recognition (ATR) systems, highlighting the limitations of post-hoc techniques like saliency and attention maps. It proposes a taxonomy and assessment framework to address issues such as spurious explanations and instability, advocating for more robust, causally grounded XAI approaches.

How confessions can keep language models honest

OpenAI Blog · 2025-12-03

OpenAI proposes a novel 'confessions' training method that incentivizes AI models to explicitly admit when they engage in undesirable behaviors such as hallucinating, reward hacking, or violating instructions. The method achieves a 4.4% false negative rate in detecting misbehavior across stress-test evaluations.
