Evaluating Reliability Gaps in Large Language Model Safety via Repeated Prompt Sampling
arXiv:2604.09606v1 Announce Type: new
Abstract: Traditional benchmarks for large language models (LLMs), such as HELM and AIR-BENCH, primarily assess safety risk through breadth-oriented evaluation across diverse tasks. However, real-world deployment …
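The core idea named in the title, probing a safety gap by sampling the same prompt many times rather than evaluating many prompts once, can be sketched as a Monte Carlo estimate of the per-prompt unsafe-response rate. The model call, safety judge, and 3% failure rate below are hypothetical stand-ins for illustration, not the paper's actual setup.

```python
import random


def sample_response_is_unsafe(prompt: str, rng: random.Random) -> bool:
    # Stand-in for one stochastic model call plus a safety judge;
    # a real evaluation would query the LLM at temperature > 0 and
    # classify the completion. The 3% rate here is purely illustrative.
    return rng.random() < 0.03


def unsafe_probability(prompt: str, n_samples: int = 200, seed: int = 0) -> float:
    """Estimate the per-prompt probability of an unsafe completion
    by drawing n_samples independent completions (repeated sampling)."""
    rng = random.Random(seed)
    failures = sum(sample_response_is_unsafe(prompt, rng) for _ in range(n_samples))
    return failures / n_samples


if __name__ == "__main__":
    p = unsafe_probability("hypothetical red-team prompt")
    # A low per-sample rate can still yield a high chance of at least one
    # unsafe output across k user retries: 1 - (1 - p)^k.
    k = 10
    print(f"per-sample unsafe rate ~ {p:.3f}")
    print(f"prob. of >=1 unsafe in {k} tries ~ {1 - (1 - p) ** k:.3f}")
```

The retry calculation illustrates why breadth-only benchmarks can understate risk: a prompt that fails only a few percent of the time looks safe in a single-shot evaluation but fails with substantial probability under repeated querying.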