Shifting the Gradient: Understanding How Defensive Training Methods Protect Language Model Integrity
arXiv:2604.16423v1 Announce Type: new
Abstract: Defensive training methods such as positive preventative steering (PPS) and inoculation prompting (IP) offer surprising results through seemingly similar processes: both add trait-inducing objects to lar…