circuit-analysis

#circuit-analysis

How do you analyze the relative "strength" of probes? [R]

Reddit r/MachineLearning · 2026-06-17

The author asks how to analyze the relative 'strength' of probes in neural networks, discussing challenges such as limited vocabulary size and model capacity, and using an example from Google Gemini to illustrate failure cases.

Localizing Anchoring Pathways in Language Models

arXiv cs.CL · 2026-06-12

This paper investigates how irrelevant numbers in prompts cause anchoring effects in language models and localizes the internal pathways carrying this signal using attribution-based circuit methods on Qwen and Llama models.

Toxic HallucinAItions: Perturbing Prompts and Tracing LLM Circuits

arXiv cs.CL · 2026-06-01

This paper investigates how toxic lexical perturbations in prompts reduce the factual accuracy of LLMs and increase their uncertainty, and uses attribution-graph analyses to trace the internal changes. It finds that increasing toxicity amplifies perturbation-sensitive variant nodes while core reasoning nodes remain invariant.

Architecture, Not Scale: Circuit Localization in Large Language Models

arXiv cs.CL · 2026-05-12

This paper challenges the assumption that mechanistic interpretability becomes harder as models scale, showing that architecture (specifically Grouped Query Attention vs. Multi-Head Attention) matters more than parameter count for circuit localization and stability.

Navigating by Old Maps: The Pitfalls of Static Mechanistic Localization in LLM Post-Training

arXiv cs.CL · 2026-05-08

This paper challenges the 'Locate-then-Update' paradigm in LLM post-training by demonstrating that static mechanistic localization is insufficient because neural circuits evolve dynamically during fine-tuning. It introduces new metrics for analyzing circuit stability and argues that mechanistic localization needs predictive frameworks.

ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack

Hugging Face Daily Papers · 2026-04-14

ASGuard is a mechanistically informed defense framework that mitigates jailbreaking attacks on LLMs. It identifies vulnerable attention heads through circuit analysis, then applies targeted activation scaling and fine-tuning to make refusal behavior more robust while preserving model capabilities.

Understanding neural networks through sparse circuits

OpenAI Blog · 2025-11-13

OpenAI researchers present methods for training sparse neural networks that are easier to interpret by forcing most weights to zero, enabling the discovery of small, disentangled circuits that can explain model behavior while maintaining performance. This work aims to advance mechanistic interpretability as a complement to post-hoc analysis of dense networks and support AI safety goals.
