Satchel Grant, Victor Gillioz, Jake Ward, Thomas McGrath

Shifting the Gradient: Understanding How Defensive Training Methods Protect Language Model Integrity

Satchel Grant, Victor Gillioz, Jake Ward, Thomas McGrath / April 21, 2026

arXiv:2604.16423v1
Abstract: Defensive training methods such as positive preventative steering (PPS) and inoculation prompting (IP) offer surprising results through seemingly similar processes: both add trait-inducing objects to lar…
