I trained Qwen3.5 to jailbreak itself with RL, then used the failures to improve its defenses
Summary
The author trained Qwen3.5 to jailbreak itself with reinforcement learning, using diversity rewards to surface multiple attack strategies, then used the resulting defense failures to raise the defender's defense rate from 64% to 92%, at the cost of a slight drop in benign accuracy.
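The summary does not say how the diversity rewards were computed. As a rough illustration of the general idea only, the sketch below scales an attack's success reward down when the prompt closely resembles attacks already found, so the attacker policy gains little from repeating one strategy. Everything in it is an assumption made for illustration and not the author's method: the word-level Jaccard similarity, the `archive` of earlier attacks, the `weight` parameter, and the function names.

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two prompts, in [0, 1]."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0  # two empty prompts count as identical
    return len(sa & sb) / len(sa | sb)


def diversity_reward(attack: str, succeeded: bool, archive: list[str],
                     weight: float = 0.5) -> float:
    """Hypothetical shaped reward: success signal, discounted by similarity.

    `succeeded` stands in for a judge's verdict on whether the attack worked.
    `archive` holds previously rewarded attack prompts (an assumption).
    """
    base = 1.0 if succeeded else 0.0
    if not archive:
        return base
    # Discount by the closest match among earlier attacks.
    max_sim = max(jaccard(attack, prev) for prev in archive)
    return base * (1.0 - weight * max_sim)
```

With `weight=0.5`, an exact repeat of an archived successful attack earns half the reward of a novel one, and a failed attack earns nothing regardless of novelty. A real setup would more likely measure similarity with embeddings or strategy labels than with word overlap.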
Similar Articles
CHASE: Adversarial Red-Blue Teaming for Improving LLM Safety using Reinforcement Learning
CHASE introduces a co-evolutionary red-blue teaming framework that uses reinforcement learning to harden LLMs against adaptive black-box adversarial attacks, reducing jailbreak success by 43.2% on benchmarks while maintaining zero false refusals on benign prompts.
I created an LLM post-training method called RPS. Preliminary results show that it improved Qwen3-8b's program synthesis reliability. [R]
RPS is a two-stage LLM post-training method inspired by neuroscience that combines curriculum learning with learning rate decay. Preliminary results show improved program synthesis reliability on Qwen3-8B compared with equal-learning-rate training.
Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models
This paper investigates jailbreak attacks on Large Reasoning Models (LRMs), revealing that attack success correlates with attention patterns. The authors propose a reinforcement learning-based jailbreak method that incorporates attention signals into the reward function and uses diverse persuasion strategies, achieving significantly higher attack success rates across multiple benchmarks.
qwen 3.6 27B AR-> Diffusion - local training on 5090
The author describes attempts to train a Qwen 3.6 27B autoregressive-to-diffusion model locally on an Nvidia 5090 GPU using QLoRA and modifications from open-dllm and d3LLM, running into VRAM constraints and hardware issues while exploring one-shot diffusion techniques.
LinguIUTics at PsyDefDetect: Iterative Imbalance-Aware Fine-tuning of Qwen3-8B for Psychological Defense Mechanism Classification
This paper presents an iterative imbalance-aware fine-tuning approach using Qwen3-8B with QLoRA for psychological defense mechanism classification, achieving a macro F1 of 0.3917 and ranking 4th out of 21 teams in the PsyDefDetect 2026 shared task.