Latent Instruction Representation Alignment: defending against jailbreaks, backdoors and undesired knowledge in LLMs
arXiv:2604.10403v1 Announce Type: new
Abstract: We address jailbreaks, backdoors, and unlearning for large language models (LLMs). Unlike prior work, which trains LLMs based on their actions when given malign instructions, our method specifically trai…
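The truncated abstract indicates the method trains on latent instruction representations rather than on the model's output behavior. Below is a minimal sketch of one plausible instantiation of that idea, not the paper's actual procedure: latents for malign instructions are pulled toward fixed latents of matched safe instructions. The toy encoder, the paired malign/safe batch, and the MSE alignment loss are all illustrative assumptions.

```python
# Hypothetical sketch of latent-representation alignment: penalize the
# model's *hidden states* on malign instructions, not its output actions.
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Stand-in for an LLM's hidden-state extractor (illustrative only)."""
    def __init__(self, vocab_size=1000, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, tokens):
        # Mean-pool token states into a single instruction representation.
        return self.encoder(self.embed(tokens)).mean(dim=1)

model = ToyEncoder()
frozen = ToyEncoder()  # fixed reference copy providing target latents
frozen.load_state_dict(model.state_dict())
for p in frozen.parameters():
    p.requires_grad_(False)

opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Assumed paired batch: token ids for malign instructions and for
# matched safe counterparts (random ids here, purely for illustration).
malign_ids = torch.randint(0, 1000, (8, 16))
safe_ids = torch.randint(0, 1000, (8, 16))

for step in range(100):
    h_malign = model(malign_ids)       # trainable latents for malign inputs
    with torch.no_grad():
        h_safe = frozen(safe_ids)      # fixed target latents for safe inputs

    # Alignment loss lives in representation space, not on output logits.
    loss = nn.functional.mse_loss(h_malign, h_safe)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In practice such a loss would be combined with a standard utility objective on benign data so the model's general capabilities are preserved; the sketch omits that term for brevity.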