Fail safe(r) at alignment by channeling reward-hacking into a “spillway” motivation
It’s plausible that flawed RL processes will select for misaligned AI motivations.[1] Some misaligned motivations are much more dangerous than others. So, developers should plausibly aim to control which kinds of misaligned motivation emerge in this ca…