This paper challenges the 'Locate-then-Update' paradigm in LLM post-training by showing that static mechanistic localization is insufficient: neural circuits evolve dynamically during fine-tuning. It introduces metrics for analyzing circuit stability and argues that mechanistic localization needs predictive frameworks that account for this drift.
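One natural instantiation of such a stability metric (a sketch of the general idea, not necessarily the paper's definition) is the Jaccard overlap between the sets of circuit components located at two fine-tuning checkpoints:

```python
# Sketch: circuit stability as Jaccard overlap between the component sets
# (e.g., attention heads) attributed to a behavior at two checkpoints.
# The "layer.head" naming and the attribution step are assumptions.

def circuit_overlap(components_t0: set[str], components_t1: set[str]) -> float:
    """Jaccard similarity of two circuits; 1.0 means a fully stable circuit."""
    if not components_t0 and not components_t1:
        return 1.0
    return len(components_t0 & components_t1) / len(components_t0 | components_t1)

# Hypothetical usage with heads found by some attribution method:
before = {"3.2", "5.7", "11.0"}   # circuit located before fine-tuning
after = {"3.2", "9.4", "11.0"}    # circuit located after fine-tuning
print(circuit_overlap(before, after))  # 0.5 -> the located circuit drifted
```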
ASGuard is a mechanistically informed defense framework that mitigates jailbreak attacks on LLMs. It uses circuit analysis to identify vulnerable attention heads, then applies targeted activation scaling and fine-tuning to make refusal behavior more robust while preserving model capabilities.
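As a rough illustration of the activation-scaling step, the PyTorch sketch below rescales one attention head's slice of a module's output via a forward hook. The module path, head index, scale factor, and the assumption that the hooked module emits concatenated per-head activations (pre output-projection) are all illustrative, not ASGuard's actual intervention:

```python
# Sketch: dampen one attention head's contribution with a forward hook.
# Assumes the hooked module's output concatenates per-head activations
# along the last dimension; real attention modules may differ.
import torch

def scale_head(head_idx: int, scale: float, num_heads: int):
    """Build a hook that rescales one head's slice of the module output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        d_head = hidden.shape[-1] // num_heads
        hidden = hidden.clone()  # avoid mutating a tensor needed elsewhere
        hidden[..., head_idx * d_head:(head_idx + 1) * d_head] *= scale
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Hypothetical usage on a HuggingFace-style decoder (layer/head are made up):
# handle = model.model.layers[12].self_attn.register_forward_hook(
#     scale_head(head_idx=7, scale=0.1, num_heads=32))
```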
OpenAI researchers present methods for training sparse neural networks that are easier to interpret: by forcing most weights to zero, they can recover small, disentangled circuits that explain model behavior while maintaining performance. This work aims to advance mechanistic interpretability as a complement to post-hoc analysis of dense networks and to support AI safety goals.
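To make the sparsity idea concrete, here is a minimal magnitude-masking sketch that zeroes all but a small fraction of a weight matrix. Note this is only an analogy for the end state: the paper trains sparsity in from the start rather than pruning a dense network post hoc.

```python
# Sketch: force most weights in a matrix to zero, keeping only the
# largest-magnitude fraction. Illustrates the target weight pattern,
# not the paper's training procedure.
import torch

def sparsify_(weight: torch.Tensor, keep_fraction: float = 0.05) -> None:
    """In-place: keep the largest `keep_fraction` of weights, zero the rest."""
    k = max(1, int(weight.numel() * keep_fraction))
    # k-th smallest absolute value such that exactly k weights are >= it
    threshold = weight.abs().flatten().kthvalue(weight.numel() - k + 1).values
    weight.mul_((weight.abs() >= threshold).to(weight.dtype))

w = torch.randn(256, 256)
sparsify_(w, keep_fraction=0.05)
print((w != 0).float().mean())  # ~0.05 of weights remain nonzero
```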