(Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL

Authors: Satvik Golechha*, Sid Black*, Joseph Bloom
* Equal contribution.

This work was done as part of the Model Transparency team at the UK AI Security Institute (AISI). Our code is available on GitHub, and the model checkpoints and data are available on…