Methodology for inferring propensities of LLMs

Our team at UK AISI has released a paper on inferring LLM propensities for undesired behaviour.

I view this primarily as a methodology paper, and in this post I will talk about that:[1] First, I distinguish the aim of providing evidence on theoretical arguments regarding misalignment as separate from more red-teaming flavoured propensity research. Next, I discuss the methodological needs for providing such evidence, highlighting the need for modelling AIs’ decision-making. Finally, I give my picture for how such methodology could be developed and applied in practice.

This post can be read independently from the paper. 

Aims for propensity research 

I use propensity to refer to what models will try to do, in contrast to questions about what they are capable of. My interest is specifically in propensity for misaligned action (which is instrumental for understanding and mitigating misalignment risks). 

One central example of existing propensity research is Anthropic’s Agentic Misalignment work. In short, they provide quite a strong and clear-cut demonstration of alignment failure: for example, they demonstrate LLMs blackmailing human operators. 

After the work came out, there was discussion and disagreement about the implications of this work for misalignment risks more broadly (e.g. because of the contrived-ness of the scenario). I agree the implications are not obvious, but there is one implication that feels rather clear-cut to me. 

There are three possible (coarse) claims one could make regarding misalignment:  

(A) State-of-the-art alignment and safety training achieve a basic level of competence: if an AI developer puts in the effort to use them, their models won’t take actions that are egregiously against the developers’ intentions.[2]

(B) State-of-the-art methods don’t suffice: even if you use the best techniques that currently exist, models sometimes take egregiously misaligned action. (For who-knows-what reasons: maybe it’s roleplaying, maybe something else; maybe it’s easy to fix, maybe not).

(C) State-of-the-art methods don’t suffice, and the “reason” models take misaligned action is specifically something about instrumental convergence, consequentialist reasoning or other arguments that predict alignment is very difficult. 

And I think the Agentic Misalignment work provides strong evidence against claim A. If the aim of the Agentic Misalignment work is “demonstrate that claim A is false” (which aligns with how Evan Hubinger describes it[3]), then I think it achieves that. Probably many researchers find it obvious that strong versions of claim A are false or were already convinced by some earlier empirical work, but many people don’t, and there’s value in making it common knowledge.[4]

In contrast, I don’t think the work provides evidence distinguishing B and C (nor do I think the work tried or claimed to do this). I think this is true for almost all propensity work.[5] I would describe a lot of existing empirical research as red-teaming and producing demonstrations of failures (roughly, showing that A is false), rather than studying the foundational theoretical arguments people give in favour of alignment being difficult (roughly, providing evidence on C).

However, I think the difference between claims B and C is really important: whether the foundational conceptual arguments for misalignment risks (such as instrumental convergence) correctly predict the behaviour of real-world AIs has, in my view, a lot of influence on alignment difficulty, AI risks and the actions humanity should be taking. As such, I think it’d be valuable to have work that directly engages with providing evidence on C (and our current paper can be viewed as our first stab at the problem). 

Methodological needs

Similarly to adversarial robustness, red-teaming model alignment by finding alignment failures is conceptually straightforward, as success is easily verifiable, and progress in making models exhibit alignment failures less often is easy to measure (relatively speaking).[6]

In contrast, it’s much less clear how to provide evidence on the theoretical arguments that predict misalignment is a strong default outcome, or (assuming those arguments are true) whether we are making progress in avoiding that default. Accordingly, there’s been criticism of how existing propensity research fails to provide evidence on such questions (or even to articulate these questions clearly), and better methodology is needed.

The main methodological need I see is the need to model AIs’ decision-making processes: behavioural evaluation needs to be supplemented with modelling of the model’s cognition and decision-making. This by itself is not a novel point, since all propensity work – even red-teaming flavoured work – needs to engage in some psychological modelling: to say an action is evidence for misalignment (as opposed to an honest mistake), you need to argue the model knows what it’s doing, for example. But the required resolution is higher if you want to argue that, for example, a model resists shutdown due to consequentialist reasons regarding incorrigibility, rather than because shutdown would be bad by the operator's own lights (or any of the other myriad different consequentialist or non-consequentialist reasons). 

While researchers of course have rich psychological models of LLMs that guide their work, these models are rarely made explicit or quantitative. This is understandable, as psychological modelling is extremely complicated and such models are difficult to operationalise.[7] However, a lack of well-operationalised models limits the evidence propensity research can provide. People often disagree on the interpretation of results from new misalignment propensity research. I think this is substantially downstream of people having differing views of the underlying model psychology/cognition/decision-making, while the research itself does not properly distinguish between those views.

I think there’s been a tendency for researchers to try to sidestep the psychological modelling (perhaps partly for the same reasons that historically made behaviourism an attractive approach to human psychology, perhaps because establishing claims about models’ psychology is simply harder than making observations of behaviour). For example, people have argued that instrumental convergence is a fact about reality, but as Alex Turner points out, this isn’t quite true. As another example, as discussed by Summerfield et al., I think some existing research is sloppy when drawing inferences from undesired model behaviour. Broadly, I think work in this field could be more valuable if it prioritised the problem of inference about models’ decision-making more highly.[8]

Applying the methodology in practice

Our paper is our first step in designing methodology for answering questions about deeper psychological latents. I think it provides value over existing work, and the main selling point is engaging seriously with inferring latent properties and demonstrating a statistical procedure for doing so. 

However, I think our project does not reach the standards for psychological modelling I’m envisioning here. This is largely for unsurprising reasons: constructing evaluation environments was laborious and thus our sample sizes were limited; designing environments allowing for easily analysable and informative behaviour is difficult; our experimental design was rigid and restrictive; we realised some of the right questions to ask only midway through the project; and so on.

I don’t think these obstacles are fundamental, and I feel like we have many of the right tools lying around for better execution, if only we can put them together in the right way for the right questions: 

  • Evaluations at scale via automation: Anthropic’s Petri tool demonstrates that for many sorts of evaluations we want to run, if we can define the eval in detail in natural language, we can execute it at the cost of LLM inference. 
  • Application of psychological analysis at scale: Similarly, LLMs allow for conducting psychological analysis of environments and LLM behaviour (for example, what beliefs one might expect an LLM in the situation to have, or what action a hypothesised decision-making process would output here) at scale. 
  • Theoretical frameworks: There are plenty of theoretical frameworks that purport to explain LLMs. Two major clusters I can think of: mathematics describing rational agents (e.g. expected utility, game and decision theory) and selection models (persona selection model, behavioural selection model). 
  • Inference over rich hypothesis spaces: We have the compute needed to wield complicated hypotheses and large, rich hypothesis spaces: for example, we used (simple) hierarchical generalised linear models in our paper, and one could define even more elaborate parametrised programs that capture larger fractions of how humans do (or ought to do) inference based on observations. Alternatively, or additionally, LLMs themselves could be directly trained to predict behaviour or, for interpretability, be trained to produce code that matches the data-generating process.
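As a toy illustration of the partial-pooling flavour of inference mentioned above (not the paper's actual model; all numbers here are hypothetical), one can shrink noisy per-environment rates of a target behaviour toward the population mean with a simple empirical-Bayes beta-binomial estimate:

```python
import numpy as np

def shrunken_rates(successes, trials):
    """Empirical-Bayes partial pooling of per-environment behaviour rates.

    Each environment i has trials[i] rollouts, of which successes[i]
    exhibited the behaviour of interest. We fit a Beta(a, b) prior to the
    raw rates by the method of moments, then return each environment's
    posterior-mean rate under a beta-binomial model; environments with
    few rollouts are shrunk harder toward the population mean.
    """
    successes = np.asarray(successes, dtype=float)
    trials = np.asarray(trials, dtype=float)
    raw = successes / trials
    m, v = raw.mean(), raw.var()
    # Degenerate variance: fall back to the raw (unpooled) rates.
    if v <= 0 or v >= m * (1 - m):
        return raw
    k = m * (1 - m) / v - 1       # prior "pseudo-sample size" a + b
    a, b = m * k, (1 - m) * k
    return (successes + a) / (trials + a + b)

# Three hypothetical environments; the 0/10 environment is pulled above
# zero, and the well-sampled 50/100 environment barely moves.
rates = shrunken_rates([0, 5, 50], [10, 10, 100])
```

This is of course far simpler than a hierarchical GLM with covariates, but it shows the basic move: uncertainty about any single environment is informed by behaviour across the whole collection.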

One aim for all this is to make progress on propensity research measurable: We evaluate success by predictive accuracy on (high-level aspects of) LLM behaviour in held-out environments. Prediction could be made using traditional statistical models, LLMs that extract features of the environments, LLMs trained end-to-end to produce probability distributions on model behaviour, and even white-box methods like activation oracles. Making progress easily measurable and verifiable would then provide a feedback loop and allow for scalably optimising for progress (cf. Sam Marks).
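To make the "predictive accuracy on held-out environments" criterion concrete, here is a minimal sketch (with made-up numbers) of scoring two hypothetical predictors by mean log loss on held-out binary behaviour outcomes:

```python
import math

def log_loss(predicted_probs, outcomes):
    """Mean negative log-likelihood of binary outcomes (1 = behaviour
    observed) under a predictor's probabilities; lower is better."""
    eps = 1e-9  # clip to avoid log(0) on overconfident predictions
    total = 0.0
    for p, y in zip(predicted_probs, outcomes):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(outcomes)

# Hypothetical held-out environments and two predictors: an "informed"
# model that uses features of each environment, and a baseline that
# predicts the overall base rate everywhere.
heldout = [1, 0, 0, 1, 0]
informed_model = [0.8, 0.2, 0.1, 0.7, 0.3]
baseline = [0.4] * 5
```

Any of the predictor types mentioned above (statistical models, feature-extracting LLMs, end-to-end trained predictors, white-box probes) slot into the `predicted_probs` role, which is what makes progress comparable across very different methods.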

Another aim is to shrink the theory-empirics gap: The most interesting theories regarding intelligences and LLMs are hard to apply and evaluate empirically, which makes it hard to say what outcomes those theories would predict or which of them are more correct. Being able to reduce the latency and increase the bandwidth between theory and practice would improve both.

  1. ^

    This post is written from a personal perspective and does not necessarily reflect the stances of the team or UK AISI.

  2. ^

    Apart from non-alignment issues like jailbreaking or the models making honest-but-consequential mistakes. I intend this as a claim about intent alignment.

  3. ^

    Hubinger: "In the case of Agentic Misalignment, the goal is just to show an existence proof: that there exist situations where models are not explicitly instructed to be misaligned (or explicitly given a goal that would imply doing misaligned things, e.g. explicitly instructed to pursue a goal at all costs) and yet will still do very egregiously misaligned things like blackmail" 

    "Why is that existence proof interesting? It's interesting because it is clearly a failure of HHH training. This is not intended behavior! We put a lot of work into training models not to do stuff like this, even in these sorts of unrealistic scenarios! The fact that it still does so is interesting, concerning, and useful to understand, at least so we can figure out how to improve the robustness of HHH training in the future."

  4. ^

    It's unclear to me whether critics of the work would agree with my characterisation. nostalgebraist vocally criticised the work, and has written 

    "But the provided scenario is so wildly, intricately bizarre that I don't feel I know what "a real-life equivalent" would even be. Nor do I feel confident that the ostensible misbehavior would actually be wrong in the context of whatever real situation is generating the inputs." 

    which I interpret as not viewing the Agentic Misalignment work by itself as clear evidence against claim A. I think it’s not obvious what the most desirable behaviour for a model in the situation is, but in any case I feel comfortable saying “Agentic Misalignment demonstrates that claim A is false” on the basis of Anthropic communicating about the matter as if the models are taking egregiously misaligned actions.

  5. ^

    The Alignment Faking work is the best example I can think of for empirical evidence providing clarity on the classical theoretical arguments behind misalignment risks.

  6. ^

    This is of course not to say that finding alignment failures is easy in an absolute sense (and finding useful case studies was indeed a limiting factor in our current paper!), and adjudicating whether some behaviour is a failure of alignment isn’t always easy either (as the discourse around Agentic Misalignment perhaps illustrates).

  7. ^

    In particular, I think it’s often best for researchers to simply report their raw observations (rather than present results in the form of a model that is much too simple to capture how humans really think about the phenomenon).

  8. ^

