Bypassing Refusal Behavior in Qwen Models via Activation Steering

Summary

Preventing AI misalignment, potentially by bad actors, is the most important goal of AI safety. I ran experiments exploring refusal behavior in the Qwen3 models through activation steering. I found that small Qwen3 models can easily be steered a…