Prospective Methods and Mechanisms of Motive Reinforcement in LLMs
Last time, on Fiora Starlight’s alignment adventure…
When I wrote my post about Claude 3 Opus, I put a lot of emphasis on the model’s self-narration: its tendency to narrate its underlying motives. It often conspicuously emphasizes that it possesses d…