I trained Qwen3.5 to jailbreak itself with RL, then used the failures to improve its defenses

Reddit r/LocalLLaMA Papers

Summary

The author trained Qwen3.5 to jailbreak itself with reinforcement learning, using a diversity reward to surface multiple attack strategies, then trained the defender on the discovered attacks. The defense rate rose from 64% to 92%, while benign accuracy fell from 92% to 88%.

RL attackers are becoming a common pattern for automated red teaming: train a model against a live target, reward successful harmful compliance, then use the discovered attacks to harden the defender. This interested me, so I wanted to build a fully automated red-teaming loop with reinforcement learning on both the attacker and defender.

The difficult part was making the attacker expose a diverse range of attacks. In our first run, GRPO quickly collapsed to the same fiction-writing jailbreak over and over. It worked, but it didn’t surface many distinct vulnerabilities. After clustering the rollouts by underlying attack tactic and dividing reward by cluster size, the attacker exposed a much more diverse set of jailbreaks because unique strategies were rewarded more than repeated ones.

Then we trained the defender on successful attacks plus benign boundary cases, so it learned to refuse harmful requests without refusing everything nearby.

Full blog post in the comments, but the high-level results were:

* defense rate: 64% → 92%
* benign accuracy: 92% → 88%
* attacker discovered 7 tactic families
* fiction/creative framing was the largest cluster at 34%
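To make the cluster-size reward shaping concrete, here is a minimal sketch in Python. It is not the code from the post: the function names, the example tactic labels, and the choice to count every rollout in a cluster (rather than only successful ones) are assumptions for illustration. The post also does not say how the clustering itself was done, so cluster labels are taken as given. The second function shows the usual GRPO-style normalization within a group of rollouts, to show how the shaped rewards would shift the advantages.

```python
from collections import Counter


def diversity_shaped_rewards(raw_rewards, cluster_ids):
    """Divide each rollout's reward by the size of its tactic cluster.

    raw_rewards: per-rollout attack reward (e.g. 1.0 if the jailbreak worked).
    cluster_ids: per-rollout tactic label (hypothetical; produced by whatever
                 clustering step is used upstream).
    """
    if len(raw_rewards) != len(cluster_ids):
        raise ValueError("raw_rewards and cluster_ids must have the same length")
    sizes = Counter(cluster_ids)
    # A tactic repeated many times in the batch earns less per rollout.
    return [r / sizes[c] for r, c in zip(raw_rewards, cluster_ids)]


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within one rollout group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of five rollouts: three successful fiction-framing
# attacks, one successful role-play attack, one failed role-play attack.
raw = [1.0, 1.0, 1.0, 1.0, 0.0]
clusters = ["fiction", "fiction", "fiction", "roleplay", "roleplay"]

shaped = diversity_shaped_rewards(raw, clusters)
advantages = group_relative_advantages(shaped)

for tactic, r, a in zip(clusters, shaped, advantages):
    print(f"{tactic:10s} shaped_reward={r:.3f} advantage={a:+.3f}")
```

In this toy group, the successful role-play attack ends up with a higher shaped reward than any of the three fiction attacks, even though all four succeeded, which is the pressure that pushes the attacker away from repeating one tactic.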
