Predicting When RL Training Breaks Chain-of-Thought Monitorability
Crossposted from the DeepMind Safety Research Medium Blog. Read our full paper about this topic by Max Kaufmann, David Lindner, Roland S. Zimmermann, and Rohin Shah.

Overseeing AI agents by reading their intermediate reasoning “scratchpad” is a promisin…