cs.CL

Subliminal Steering: Stronger Encoding of Hidden Signals

arXiv:2604.25783v1 Announce Type: new
Abstract: Subliminal learning is the phenomenon in which a student language model inherits a behavioral bias when it is fine-tuned on seemingly innocuous data generated by a biased teacher model. Prior work has begun to characterize th…