Beyond Red-Teaming: Formal Guarantees of LLM Guardrail Classifiers
arXiv:2605.10901v1 Announce Type: new
Abstract: Guardrail Classifiers defend production language models against harmful behavior. Although their results seem promising in testing, they provide no formal guarantees. Providing formal guarantees for such m…