cs.AI, cs.CL

A Lightweight Explainable Guardrail for Prompt Safety

arXiv:2602.15853v2 Announce Type: replace-cross
Abstract: We propose a lightweight explainable guardrail (LEG) method to detect unsafe prompts. LEG uses a multi-task learning architecture to jointly learn a prompt classifier and an explanation classif…
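
To illustrate what jointly learning two classifiers can look like in general, here is a minimal sketch of a multi-task training objective. It is not the LEG implementation: the abstract above is truncated and does not specify the loss, the model, or what the explanation classifier predicts. The sketch assumes, purely for illustration, a prompt-level binary label (safe/unsafe), per-token binary explanation labels, binary cross-entropy for both tasks, and a hypothetical weight `alpha`. The names `bce` and `joint_loss` are likewise made up for this example.

```python
import math


def bce(p, y, eps=1e-12):
    """Binary cross-entropy for one predicted probability p and a label y in {0, 1}."""
    # Clamp p away from 0 and 1 so the logarithms stay finite.
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def joint_loss(prompt_prob, prompt_label, token_probs, token_labels, alpha=0.5):
    """Weighted sum of a prompt-level loss and a mean token-level loss.

    Assumption (not from the paper): both heads emit probabilities, and
    alpha trades off the explanation task against the prompt task.
    """
    if len(token_probs) != len(token_labels):
        raise ValueError("token_probs and token_labels must have the same length")
    prompt_loss = bce(prompt_prob, prompt_label)
    # Mean over tokens; an empty token list contributes zero loss.
    token_loss = sum(bce(p, y) for p, y in zip(token_probs, token_labels)) / max(len(token_probs), 1)
    return prompt_loss + alpha * token_loss


# Hypothetical example: one prompt labeled unsafe, with two tokens,
# of which the first is marked as part of the explanation.
loss = joint_loss(0.5, 1, [0.5, 0.5], [1, 0], alpha=1.0)
```

In a setup like this, minimizing one combined objective lets both heads share whatever representation feeds them, which is the usual motivation for multi-task learning. How LEG actually combines its two tasks is described in the full paper.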