SafeRedirect: Defeating Internal Safety Collapse via Task-Completion Redirection in Frontier LLMs
arXiv:2604.20930v1
Abstract: Internal Safety Collapse (ISC) is a failure mode in which frontier LLMs, when executing legitimate professional tasks whose correct completion structurally requires harmful content, spontaneously gener…