cs.AI, cs.SE

Evaluating Reliability Gaps in Large Language Model Safety via Repeated Prompt Sampling

arXiv:2604.09606v1 Announce Type: new
Abstract: Traditional benchmarks for large language models (LLMs), such as HELM and AIR-BENCH, primarily assess safety risks through breadth-oriented evaluation across diverse tasks. However, real-world deployment …