An Empirical Study of Multi-Generation Sampling for Jailbreak Detection in Large Language Models
arXiv:2604.18775v1 Announce Type: cross
Abstract: Detecting jailbreak behaviour in large language models remains challenging, particularly when strongly aligned models produce harmful outputs only rarely. In this work, we present an empirical study of…