@NousResearch: To check that CNA isolates only the intended behavior, we evaluate steered models on MMLU across a range of steering st…

X AI KOLs Following 05/19/26, 04:47 PM Papers

steering neuron-attribution ablation llm interpretability mlp

Summary

Nous Research released Contrastive Neuron Attribution (CNA), a method to steer LLM behavior by identifying and ablating sparse circuits in MLP neurons without training sparse autoencoders or degrading general benchmarks, validated on multiple large language models.

Today we release Contrastive Neuron Attribution (CNA), a method for steering LLM behavior by identifying and ablating sparse circuits in the MLP basis without training a sparse autoencoder, modifying weights, or degrading general capability benchmarks. Given a small set of contrastive prompt pairs that elicit a target behavior and its opposite, CNA isolates the top 0.1% of MLP neurons whose activations differ most between the two sets. Ablating that small circuit removes the behavior while leaving the rest of the model intact, and the intervention remains robust at high strengths where residual-stream methods like Contrastive Activation Addition (CAA) start to degrade. Validated on the refusal circuit across 8 instruct-tuned models, including Llama-3.1-70B, Llama-3.2-3B, Qwen2.5-72B, and Qwen2.5-14B. The work on CNA was led by @yaboilyrical, with support from @qorprate and @karan4d.

Original Article

View Cached Full Text

Cached at: 05/19/26, 04:50 PM

Given a small set of contrastive prompt pairs that elicit a target behavior and its opposite, CNA isolates the top 0.1% of MLP neurons whose activations differ most between the two sets. Ablating that small circuit removes the behavior while leaving the rest of the model intact, and the intervention remains robust at high strengths where residual-stream methods like Contrastive Activation Addition (CAA) start to degrade.

Validated on the refusal circuit across 8 instruct-tuned models, including Llama-3.1-70B, Llama-3.2-3B, Qwen2.5-72B, and Qwen2.5-14B.

The work on CNA was led by @yaboilyrical, with support from @qorprate and @karan4d.

@NousResearch: To check that CNA isolates only the intended behavior, we evaluate steered models on MMLU across a range of steering st…

Similar Articles

@NousResearch: Today we release Contrastive Neuron Attribution (CNA), a method for steering LLM behavior by identifying and ablating s…

Targeted Neuron Modulation via Contrastive Pair Search

Multi-Attribute Steering of Language Models via Targeted Intervention

@AnthropicAI: To support other researchers getting hands-on experience with NLAs, we’ve partnered with Neuronpedia to release NLAs on…

Natively Unlearnable Large Language Models

Submit Feedback

Similar Articles

@NousResearch: Today we release Contrastive Neuron Attribution (CNA), a method for steering LLM behavior by identifying and ablating s…

Targeted Neuron Modulation via Contrastive Pair Search

Multi-Attribute Steering of Language Models via Targeted Intervention

@AnthropicAI: To support other researchers getting hands-on experience with NLAs, we’ve partnered with Neuronpedia to release NLAs on…

Natively Unlearnable Large Language Models