cs.AI, cs.CL, cs.CR

Jailbreaking Frontier Foundation Models Through Intention Deception

arXiv:2604.24082v1 Announce Type: cross
Abstract: Large (vision-)language models exhibit remarkable capabilities but remain highly susceptible to jailbreaking. Existing safety training approaches aim to have the model learn a refusal boundary between sa…