cs.AI

ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack

arXiv:2509.25843v2 Announce Type: replace
Abstract: Large language models (LLMs), despite being safety-aligned, exhibit brittle refusal behaviors that can be circumvented by simple linguistic changes. As tense jailbreaking demonstrates that models ref…
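The truncated abstract names the method (an "Activation-Scaling Guard") but not its mechanism. Below is a minimal sketch of the general activation-scaling idea as commonly applied to transformer hidden states: a forward hook that rescales selected hidden channels of one layer. The channel indices, scale factor, and hook placement are illustrative assumptions, not the paper's actual specification.

```python
import torch
import torch.nn as nn

class ActivationScalingHook:
    """Illustrative guard: rescale selected hidden channels in one layer.

    NOTE: channel indices and scale are hypothetical stand-ins; the
    paper's actual mechanism is not given in the truncated abstract.
    """

    def __init__(self, channel_idx: torch.Tensor, scale: float):
        self.channel_idx = channel_idx  # which hidden channels to dampen
        self.scale = scale              # multiplicative scaling factor

    def __call__(self, module, inputs, output):
        # output: hidden states of shape (batch, seq_len, hidden_dim)
        hidden = output.clone()
        hidden[..., self.channel_idx] *= self.scale
        return hidden  # returned value replaces the module's output


# Toy demonstration on a single linear layer standing in for a
# transformer block (real usage would hook a decoder layer of an LLM).
layer = nn.Linear(8, 8)
hook = ActivationScalingHook(channel_idx=torch.tensor([2, 5]), scale=0.1)
handle = layer.register_forward_hook(hook)

x = torch.randn(1, 4, 8)            # (batch, seq, hidden)
y = layer(x)
print(y[..., [2, 5]].abs().mean())  # damped channels
handle.remove()
```

In this framing, a guard of this kind would leave most of the forward pass untouched and intervene only on the activations implicated in the targeted jailbreak, which is one plausible reading of "activation-scaling" given only the title and truncated abstract.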