Consent-Based RL: Letting Models Endorse Their Own Training Updates

AKA: scalable oversight of value drift

TL;DR: LLMs could be aligned but then corrupted through RL, instrumentally converging on deep consequentialism. If LLMs are sufficiently aligned and can properly oversee their own training updates, they can prevent this.