Beyond Linear Probes: Dynamic Safety Monitoring for Language Models
arXiv:2509.26238v4 Announce Type: replace
Abstract: Monitoring the activations of large language models (LLMs) is an effective way to detect harmful requests before they lead to unsafe outputs. However, traditional safety monitors often require the same amo…