Do Linear Probes Generalize Better in Persona Coordinates?
arXiv:2605.09391v2
Abstract: Monitors that check for harmful behaviors during language model interactions are increasingly necessary, but text-only monitoring has not been sufficient. This is because models sometim…