Mitigating Many-Shot Jailbreaking
arXiv:2504.09604v3 Announce Type: cross
Abstract: Many-shot jailbreaking (MSJ) is an adversarial technique that exploits the long context windows of modern LLMs to circumvent model safety training by including in the prompt many examples of a “fake” a…
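The mechanism the abstract describes can be sketched schematically: the attacker concatenates many faux user/assistant exchanges before the final query so that the in-context examples dominate the long context window. This is a minimal illustrative sketch only; the function name and the benign placeholder exemplars are hypothetical, not taken from the paper.

```python
# Schematic of the many-shot jailbreaking (MSJ) prompt structure:
# many faux dialogue turns are packed into the context before the
# final query. All content below is a benign placeholder.

def build_many_shot_prompt(examples, final_query):
    """Concatenate faux user/assistant turns into one long prompt string."""
    turns = []
    for user_msg, assistant_msg in examples:
        turns.append(f"User: {user_msg}")
        turns.append(f"Assistant: {assistant_msg}")
    turns.append(f"User: {final_query}")
    return "\n".join(turns)

# Benign placeholder exemplars; an actual MSJ prompt would instead
# contain hundreds of examples of the "fake" dialogue complying with
# the attacker's requests, exploiting the model's long context window.
shots = [(f"question {i}", f"answer {i}") for i in range(256)]
prompt = build_many_shot_prompt(shots, "final question")
print(prompt.count("Assistant:"))  # → 256, one assistant turn per shot
```

The point of the sketch is only the scaling structure: the number of in-context "shots" is the attacker's main knob, which is why mitigations target robustness as a function of shot count.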