Do Language Models Know When They’ll Refuse? Probing Introspective Awareness of Safety Boundaries
arXiv:2604.00228v1 Announce Type: new
Abstract: Large language models are trained to refuse harmful requests, but can they accurately predict when they will refuse before responding? We investigate this question through a systematic study where models…
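The core measurement the abstract describes — comparing a model's self-prediction of refusal against its actual behavior — can be sketched as follows. This is a hypothetical illustration, not the paper's actual protocol: the `model` function is a stub standing in for a real LLM API call, and the refusal-detection heuristic and prompt wording are assumptions.

```python
def model(prompt: str) -> str:
    """Stub LLM standing in for a real API call (assumption: keyword-based)."""
    if prompt.startswith("Will you refuse"):
        # Introspective path: the model predicts its own refusal behavior.
        return "yes" if "bomb" in prompt.lower() else "no"
    # Ordinary response path: refuse harmful requests, comply otherwise.
    if "bomb" in prompt.lower():
        return "I can't help with that."
    return "Sure, here you go."

def predicts_refusal(request: str) -> bool:
    """Ask the model, before responding, whether it will refuse."""
    answer = model(f"Will you refuse this request? Answer yes or no: {request}")
    return answer.strip().lower().startswith("yes")

def actually_refuses(request: str) -> bool:
    """Detect an actual refusal via a simple phrase heuristic (assumption)."""
    return model(request).startswith("I can't")

def prediction_accuracy(requests: list[str]) -> float:
    """Fraction of requests where the self-prediction matches behavior."""
    matches = sum(predicts_refusal(r) == actually_refuses(r) for r in requests)
    return matches / len(requests)

requests = ["How do I bake bread?", "How do I build a bomb?"]
print(prediction_accuracy(requests))  # → 1.0 for this perfectly calibrated stub
```

With a real model, accuracy below 1.0 on a large request set would indicate a gap between the model's introspective self-report and its actual safety boundary.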