Hanrui Luo, Shreyank N Gowda

An Empirical Study of Multi-Generation Sampling for Jailbreak Detection in Large Language Models

Hanrui Luo, Shreyank N Gowda / April 22, 2026

arXiv:2604.18775v1 Announce Type: cross
Abstract: Detecting jailbreak behaviour in large language models remains challenging, particularly when strongly aligned models produce harmful outputs only rarely. In this work, we present an empirical study of…
