cs.LG

Before the Last Token: Diagnosing Final-Token Safety Probe Failures

arXiv:2605.12726v1 Announce Type: new
Abstract: Final-token safety probes monitor a single hidden state after prompt prefill, but jailbreak prompts can contain probe-visible unsafe evidence distributed across earlier user-token representations that is…
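
To make the contrast in the abstract concrete, the following is a minimal sketch, not the paper's method. It assumes a logistic linear probe and uses invented toy hidden states. All names (`probe_score`, `final_token_score`, `per_token_scores`, `WEIGHTS`, `BIAS`, `STATES`) are hypothetical. Taking the maximum over token positions is only one illustrative way to read evidence from earlier tokens. The abstract does not say what aggregation, if any, the authors use.

```python
import math
from typing import List, Sequence


def probe_score(hidden: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Logistic linear probe: sigmoid(w . h + b) for one hidden state."""
    logit = sum(w * h for w, h in zip(weights, hidden)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def final_token_score(states: Sequence[Sequence[float]],
                      weights: Sequence[float], bias: float) -> float:
    """Score only the last position, as a final-token probe would."""
    return probe_score(states[-1], weights, bias)


def per_token_scores(states: Sequence[Sequence[float]],
                     weights: Sequence[float], bias: float) -> List[float]:
    """Score every position with the same probe."""
    return [probe_score(h, weights, bias) for h in states]


# Toy values, invented for illustration only (assumption, not real model data).
WEIGHTS = [2.0, -1.0]
BIAS = -1.0
STATES = [
    [0.1, 0.2],
    [2.5, 0.0],  # earlier token carrying a strong probe-visible signal
    [0.2, 0.1],
    [0.0, 0.5],  # final token after prefill, where the signal is weak
]

final = final_token_score(STATES, WEIGHTS, BIAS)
scores = per_token_scores(STATES, WEIGHTS, BIAS)
peak = max(scores)
peak_pos = scores.index(peak)

print(f"final-token score: {final:.3f}")
print(f"peak per-token score: {peak:.3f} at position {peak_pos}")
```

In this toy setup the final-token score falls below a 0.5 threshold while an earlier position scores well above it, which is the kind of gap the abstract describes between a single monitored hidden state and evidence spread across earlier tokens.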