AntiPaSTO: Self-Supervised Honesty Steering via Anti-Parallel Representations
arXiv:2601.07473v4 Announce Type: replace
Abstract: As models grow more capable, humans cannot reliably verify what they say. Scalable steering requires methods that are internal, self-supervised, and transfer out-of-distribution; existing methods sat…