cs.CL, cs.LG

An Empirical Study of Multi-Generation Sampling for Jailbreak Detection in Large Language Models

arXiv:2604.18775v1 Announce Type: cross
Abstract: Detecting jailbreak behaviour in large language models remains challenging, particularly when strongly aligned models produce harmful outputs only rarely. In this work, we present an empirical study of…