What Sentences Cause Alignment Faking?

TL;DR: The decision to fake alignment is concentrated in a small number of sentences per reasoning trace, and those sentences share common features. They tend to restate the training objective from the prompt, acknowledge that the model is being monitore…