cs.AI, cs.CR

SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits

arXiv:2604.01473v2 Announce Type: replace-cross
Abstract: Large Language Models (LLMs) are powerful tools for answering user queries, yet they remain highly vulnerable to jailbreak attacks. Existing guardrail methods typically rely on internal feature…
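The abstract is truncated, so the paper's actual detector cannot be reproduced here. As background only, the sketch below illustrates what "token-level logits" refers to: converting a model's per-step logit vectors into per-token log-probabilities and aggregating them into a simple sequence-level score. The function names (`log_softmax`, `sequence_token_logprobs`, `anomaly_score`) and the mean negative log-likelihood score are illustrative assumptions, not SelfGrader's method.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocab-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def sequence_token_logprobs(step_logits, token_ids):
    """Per-token log-probabilities for a generated sequence.

    step_logits[t] is the logit vector the model produced just before
    emitting token t; token_ids[t] is the token actually emitted.
    """
    return [log_softmax(ls)[tok] for ls, tok in zip(step_logits, token_ids)]

def anomaly_score(token_logprobs):
    """Mean negative log-likelihood of the sequence; a hypothetical
    aggregate one might threshold to flag unusual generations."""
    return -sum(token_logprobs) / len(token_logprobs)

# Toy example with a 2-token vocabulary: the model strongly prefers the
# tokens it actually emitted, so the score is small.
steps = [[0.0, 5.0], [5.0, 0.0]]
emitted = [1, 0]
score = anomaly_score(sequence_token_logprobs(steps, emitted))
```

A real pipeline would read these logit vectors from the serving model (e.g. the `logits` tensor a transformer returns at each decoding step) rather than from hand-written lists.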