model-interpretability

#model-interpretability

Whisper Hallucination Detection and Mitigation via Hidden Representation Steering and Sparse AutoEncoders

Hugging Face Daily Papers · 2026-06-05

This paper demonstrates that Whisper's hallucination failures on silence, noise, or music can be detected and mitigated purely from internal activations using sparse autoencoders, achieving large reductions in hallucination rate without fine-tuning.

A Geometric View of Counterfactual Behavior: Interaction of Boundary Proximity and Local Support

arXiv cs.LG · 2026-06-04

This paper examines counterfactual behavior in ML models through a geometric lens, showing that models with similar predictive performance can differ substantially in counterfactual outcomes due to the interaction between decision-boundary proximity and local data support. The findings identify counterfactual behavior as a distinct dimension from predictive performance, with implications for model selection and reliability of counterfactual explanation methods.

Function-Valued Causal Influence in Nonlinear Time Series

arXiv cs.LG · 2026-05-27

This paper argues that scalar edge scores in nonlinear causal discovery obscure state-dependent effects, and proposes function-valued causal influence using Neural Additive Vector Autoregression and Individual Conditional Expectation.

Cultural Value Alignment Via Latent Activation Steering in Large Language Models

arXiv cs.CL · 2026-05-27

A framework for evaluating and steering cultural values in LLMs using scenario-based behavioral probing and activation steering, revealing latent entanglement of value dimensions.

How Faithful Is Trajectory-Based Data Attribution? Error Sources, Remedies, and Practical Guidelines

arXiv cs.LG · 2026-05-20

This paper provides the first systematic analysis of error sources in trajectory-based data attribution methods, identifies optimizer mismatch as the dominant error, proposes AdamW-influence to address it, and offers practical guidelines for data selection via a K-step look-ahead framework.

Evaluating Explainability in Safety-Critical ATR Systems: Limitations of Post-Hoc Methods and Paths Toward Robust XAI

arXiv cs.AI · 2026-05-08

This paper evaluates explainability methods in safety-critical Automatic Target Recognition (ATR) systems, highlighting the limitations of post-hoc techniques like saliency and attention maps. It proposes a taxonomy and assessment framework to address issues such as spurious explanations and instability, advocating for more robust, causally grounded XAI approaches.

How confessions can keep language models honest

OpenAI Blog · 2025-12-03

OpenAI proposes a 'confessions' training method in which models are incentivized to explicitly admit when they engage in undesirable behaviors such as hallucinating, reward hacking, or violating instructions. The post reports a 4.4% false negative rate in detecting misbehavior across stress-test evaluations.
