Test-Time Safety Alignment
arXiv:2604.26167v1
Abstract: Recent work has shown that a model’s input word embeddings can serve as effective control variables for steering its behavior toward outputs that satisfy desired properties. However, this has only been d…
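To make the opening claim concrete, here is a minimal sketch of what treating an input embedding as a control variable might look like. It is not the paper's method, which the visible part of the abstract does not describe. The frozen toy model (`W`, `forward`), the `steer` function, and the step count and learning rate are all hypothetical, chosen only for illustration. The model's weights are never updated; only the input embedding is adjusted by gradient descent so that the output favours a chosen target class.

```python
import math

# Hypothetical frozen "model": a fixed linear layer followed by softmax.
# A real language model would take the place of W and forward().
W = [[1.0, -0.5, 0.2],
     [-0.3, 0.8, 0.1]]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [x / total for x in exps]


def forward(embedding):
    """Return class probabilities for one input embedding."""
    logits = [sum(w * x for w, x in zip(row, embedding)) for row in W]
    return softmax(logits)


def steer(embedding, target, steps=200, lr=0.5):
    """Gradient descent on the input embedding only; W stays frozen."""
    e = list(embedding)
    for _ in range(steps):
        p = forward(e)
        # d(-log p[target]) / d(logits) = p - onehot(target)
        dlogits = [p_k - (1.0 if k == target else 0.0)
                   for k, p_k in enumerate(p)]
        # Chain rule through the linear layer: grad_e = W^T @ dlogits
        grad = [sum(W[k][j] * dlogits[k] for k in range(len(W)))
                for j in range(len(e))]
        e = [x - lr * g for x, g in zip(e, grad)]
    return e


start = [1.0, 0.0, 0.0]            # this embedding initially favours class 0
steered = steer(start, target=1)   # push the output toward class 1
p_before = forward(start)
p_after = forward(steered)
print("before:", p_before)
print("after: ", p_after)
```

In this toy setting the desired property is simply "class 1 is the most probable output". Any differentiable score of the output could be substituted for the negative log-probability used here, under the same assumption that gradients can flow back to the input embedding.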