Latent Instruction Representation Alignment: defending against jailbreaks, backdoors and undesired knowledge in LLMs
arXiv:2604.10403v1 Announce Type: new
Abstract: We address jailbreaks, backdoors, and unlearning for large language models (LLMs). Unlike prior work, which trains LLMs based on their actions when given malign instructions, our method specifically trai…
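The truncated abstract indicates the method trains on latent instruction representations rather than on the model's output behavior. Below is a minimal sketch of one plausible instantiation of that idea, not the paper's actual procedure: latents for malign instructions are pulled toward fixed latents of matched safe instructions. The toy encoder, the paired malign/safe batch, and the MSE alignment loss are all illustrative assumptions.

```python
# Hypothetical sketch of latent-representation alignment: penalize the
# model's *hidden states* on malign instructions, not its output actions.
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Stand-in for an LLM's hidden-state extractor (illustrative only)."""
    def __init__(self, vocab_size=1000, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, tokens):
        # Mean-pool token states into a single instruction representation.
        return self.encoder(self.embed(tokens)).mean(dim=1)

model = ToyEncoder()
frozen = ToyEncoder()  # fixed reference copy providing target latents
frozen.load_state_dict(model.state_dict())
for p in frozen.parameters():
    p.requires_grad_(False)

opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Assumed paired batch: token ids for malign instructions and for
# matched safe counterparts (random ids here, purely for illustration).
malign_ids = torch.randint(0, 1000, (8, 16))
safe_ids = torch.randint(0, 1000, (8, 16))

for step in range(100):
    h_malign = model(malign_ids)       # trainable latents for malign inputs
    with torch.no_grad():
        h_safe = frozen(safe_ids)      # fixed target latents for safe inputs

    # Alignment loss lives in representation space, not on output logits.
    loss = nn.functional.mse_loss(h_malign, h_safe)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In practice such a loss would be combined with a standard utility objective on benign data so the model's general capabilities are preserved; the sketch omits that term for brevity.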