@NousResearch: To check that CNA isolates only the intended behavior, we evaluate steered models on MMLU across a range of steering st…
Summary
Nous Research released Contrastive Neuron Attribution (CNA), a method that steers LLM behavior by identifying and ablating sparse circuits of MLP neurons. It requires no sparse autoencoder training or weight modification, does not degrade general capability benchmarks, and was validated on the refusal circuit across 8 instruct-tuned models.
Cached at: 05/19/26, 04:50 PM
Today we release Contrastive Neuron Attribution (CNA), a method for steering LLM behavior by identifying and ablating sparse circuits in the MLP basis without training a sparse autoencoder, modifying weights, or degrading general capability benchmarks.
Given a small set of contrastive prompt pairs that elicit a target behavior and its opposite, CNA isolates the top 0.1% of MLP neurons whose activations differ most between the two sets. Ablating that small circuit removes the behavior while leaving the rest of the model intact, and the intervention remains robust at high strengths where residual-stream methods like Contrastive Activation Addition (CAA) start to degrade.
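The selection-and-ablation idea described above can be sketched on toy data. This is a minimal illustration, not the released implementation: the scoring rule (absolute difference of mean activations), zeroing as the form of ablation, and the names `select_circuit`, `ablate`, and `fake_acts` are all assumptions made for the example. In practice the activation vectors would be captured from a model's MLP layers rather than generated randomly.

```python
import random


def mean_activation(acts):
    """Column-wise mean over a list of per-prompt activation vectors."""
    n = len(acts)
    return [sum(col) / n for col in zip(*acts)]


def select_circuit(pos_acts, neg_acts, fraction=0.001):
    """Return sorted indices of the top `fraction` of neurons, ranked by the
    absolute difference in mean activation between the two prompt sets.
    (Assumed scoring rule; the announcement only says "differ most".)"""
    pos_mean = mean_activation(pos_acts)
    neg_mean = mean_activation(neg_acts)
    diffs = [abs(p - q) for p, q in zip(pos_mean, neg_mean)]
    k = max(1, round(len(diffs) * fraction))
    ranked = sorted(range(len(diffs)), key=lambda i: diffs[i], reverse=True)
    return sorted(ranked[:k])


def ablate(activation, circuit):
    """Return a copy of one activation vector with the circuit neurons zeroed.
    Only activations change; no weights are modified."""
    out = list(activation)
    for i in circuit:
        out[i] = 0.0
    return out


# Toy data: 4000 hypothetical neurons, 4 of which respond to the target behavior.
random.seed(0)
n_neurons = 4000
planted = [17, 1234, 2500, 3999]


def fake_acts(n_prompts, shift):
    """Gaussian noise everywhere, plus `shift` on the planted neurons."""
    rows = []
    for _ in range(n_prompts):
        row = [random.gauss(0.0, 1.0) for _ in range(n_neurons)]
        for i in planted:
            row[i] += shift
        rows.append(row)
    return rows


pos_acts = fake_acts(16, shift=5.0)  # prompts that elicit the behavior
neg_acts = fake_acts(16, shift=0.0)  # contrastive prompts that do not

circuit = select_circuit(pos_acts, neg_acts)  # 0.1% of 4000 neurons = 4
ablated = ablate(pos_acts[0], circuit)
print(circuit)
```

With 0.1% of 4000 neurons, the sketch keeps 4 indices; on this toy data the planted neurons dominate the ranking because their mean shift is far larger than the noise in the remaining neurons.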
CNA was validated on the refusal circuit across 8 instruct-tuned models, including Llama-3.1-70B, Llama-3.2-3B, Qwen2.5-72B, and Qwen2.5-14B.
The work on CNA was led by @yaboilyrical, with support from @qorprate and @karan4d.
Similar Articles
@NousResearch: Today we release Contrastive Neuron Attribution (CNA), a method for steering LLM behavior by identifying and ablating s…
Nous Research releases Contrastive Neuron Attribution (CNA), a method to steer LLM behavior by ablating sparse MLP circuits without training autoencoders or degrading benchmarks, validated on refusal circuits across models up to 72B parameters.
Targeted Neuron Modulation via Contrastive Pair Search
Contrastive neuron attribution (CNA) identifies a sparse set of MLP neurons that distinguish harmful from benign prompts, enabling effective behavioral steering in instruction-tuned LLMs without degrading output quality. The method reduces refusal rates by over 50% on jailbreak benchmarks while preserving fluency.
@AnthropicAI: To support other researchers getting hands-on experience with NLAs, we’ve partnered with Neuronpedia to release NLAs on…
Anthropic and Neuronpedia have partnered to release Natural Language Autoencoders (NLAs) on open models, allowing researchers to gain hands-on experience with this interpretability tool.
Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks
Researchers apply contrastive LRP-based attribution to analyze why LLMs fail on realistic benchmarks, finding the method gives useful signals in some cases but is not universally reliable.
A Single Neuron Is Sufficient to Bypass Safety Alignment in Large Language Models
This research demonstrates that safety alignment in large language models can be bypassed by targeting single neurons responsible for refusal, revealing that safety mechanisms are not robustly distributed but mediated by individual neurons.