Methodology for inferring propensities of LLMs
Our team at UK AISI has released a paper on inferring LLM propensities for undesired behaviour. I view this primarily as a methodology paper, and in this post I will talk about that:[1] First, I distinguish the aim of providing evidence on theoretical a…