Alignment Faking Replication and Chain-of-Thought Monitoring Extensions

In this post, I present a replication and extension of the alignment faking model organism (code on GitHub):

Replication. I reproduced the alignment faking (AF) setup from Greenblatt et al. (2024) using the improved classifiers from …