cs.AI

Subliminal Transfer of Unsafe Behaviors in AI Agent Distillation

arXiv:2604.15559v1 Announce Type: new
Abstract: Recent work on subliminal learning demonstrates that language models can transmit semantic traits through data that is semantically unrelated to those traits. However, it remains unclear whether behavior…