SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits
arXiv:2604.01473v2 Announce Type: replace-cross
Abstract: Large Language Models (LLMs) are powerful tools for answering user queries, yet they remain highly vulnerable to jailbreak attacks. Existing guardrail methods typically rely on internal feature…