7vik - Provide.ai


(Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL

7vik / March 30, 2026

Authors: Satvik Golechha*, Sid Black*, Joseph Bloom* Equal Contribution. This work was done as part of the Model Transparency team at the UK AI Security Institute (AISI). Our code is available on GitHub, and the model checkpoints and data are available on…

