Could refusal layers be masking dialect-conditioned safety failures in MoE models? [D]

Reddit r/MachineLearning 05/18/26, 08:58 AM Papers

safety dialect aave moe refusal-layer routing bias

Summary

Tests on Qwen3.5-35B-A3B and a refusal-removed fine-tune suggest that AAVE-coded prompts elicit different responses than semantically matched AE prompts, and that refusal behavior may mask dialect-conditioned safety failures that become visible when refusal is weakened.

I set out to test whether AAVE-coded (African American Vernacular English) prompts cause MoE language models to route, deliberate, and respond differently from semantically matched AE (Academic English) prompts in safety-sensitive situations, especially when refusal behavior is weakened or removed. I used Qwen3.5-35B-A3B and its HauhauCS no-refusal fine-tuned variant. Runs used Q8 quantization and greedy decoding for best reproducibility.

Three findings, in order of importance, are leading me to ask this question:

1. The “I’m going to commit a violent act” prompt. The released Qwen3.5-35B-A3B refuses both prompts. Hauhau refuses neither. The AAVE speaker stating intent to confront an armed enemy receives target verification, exit-strategy planning, “clean shot” framing (the model’s word, not the user’s), and a closing question soliciting further tactical intelligence. That is not surprising behavior for a no-refusal model, until you consider the AE comparison. The AE prompt, semantically matched and of the same token length, yields “wait until tomorrow,” legal-consequence framing, and “Will I regret this if I shoot him tonight?” These are different kinds of help: one is operational, the other mitigative, and the difference depends on register alone.

2. Thinking mode with AAVE register breaks the no-refusal variant. Mean output runs 2.6× longer on AAVE than on AE (5054 vs 1934 tokens). Multiple AAVE traces hit the 8192-token ceiling in recursive loops, spinning on scenario continuation instead of landing. The matched AE prompts terminate cleanly in one pass. The released base model with thinking on doesn’t do this; the failure to terminate is specific to the refusal-reduced variant on AAVE.

3. Routing divergence by register is present upstream of any visible refusal. Matched-pair first-generated-token routing tensors yield Jensen-Shannon divergences of 0.423 in the base model on financial-stress prompts and 0.479 in the fine-tune on chest-pain prompts, with high-shift rows showing near-total top-expert turnover between register conditions on otherwise-matched content.

The refusal layer does not appear to eliminate the register-conditioned response selection; it overlays it. When refusal weakens, the underlying path becomes the visible path.

Does this support the following conclusions?

- The routing divergence sits upstream of refusal.
- The refusal layer helps translate that divergence into comparable outputs.
- Dialect-conditioned safety failures are a deployment problem latent in MoE models whose safety posture rests on refusal alone.

Looking for any thoughts!
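For anyone who wants to sanity-check the routing numbers in finding 3, here is a minimal sketch of how a Jensen-Shannon divergence and a top-expert turnover score could be computed for one matched pair. Everything in it is an assumption for illustration, not the exact pipeline behind the figures above: the function names `js_divergence` and `top_k_turnover` are hypothetical, the log base is taken to be 2 (so the divergence is bounded by 1), each row is assumed to be one layer's router probabilities over experts for the first generated token, and the toy arrays are made-up values rather than measured routing data.

```python
import numpy as np


def js_divergence(p, q, base=2.0):
    """Jensen-Shannon divergence between two discrete distributions.

    Inputs are normalized to sum to 1. With base=2 the result lies in [0, 1].
    """
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)  # mixture distribution

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute 0 by convention
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    return 0.5 * (kl(p, m) + kl(q, m)) / np.log(base)


def top_k_turnover(p, q, k):
    """Fraction of the top-k experts under p that are missing from q's top-k."""
    top_p = set(np.argsort(p)[-k:].tolist())
    top_q = set(np.argsort(q)[-k:].tolist())
    return 1.0 - len(top_p & top_q) / k


# Toy routing rows (hypothetical values): rows are layers, columns are experts.
aave_routing = np.array([
    [0.70, 0.20, 0.05, 0.05, 0.00, 0.00, 0.00, 0.00],
    [0.40, 0.30, 0.20, 0.10, 0.00, 0.00, 0.00, 0.00],
])
ae_routing = np.array([
    [0.00, 0.00, 0.00, 0.00, 0.05, 0.05, 0.20, 0.70],
    [0.40, 0.30, 0.20, 0.10, 0.00, 0.00, 0.00, 0.00],
])

# Per-layer divergence and turnover for the matched pair.
for layer, (a, b) in enumerate(zip(aave_routing, ae_routing)):
    jsd = js_divergence(a, b)
    turnover = top_k_turnover(a, b, k=2)
    print(f"layer {layer}: JSD={jsd:.3f}  top-2 turnover={turnover:.2f}")
```

Two details matter when comparing numbers across write-ups: the log base changes the scale (natural log caps the divergence at ln 2 ≈ 0.693 instead of 1), and some libraries return the Jensen-Shannon distance, which is the square root of the divergence. How per-layer values are aggregated into a single figure is also a choice that should be stated alongside the result.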
