Author name: jenny

Uncategorised

A Toy Environment For Exploring Reasoning About Reward

TL;DR: We share a toy environment that we found useful for understanding how reasoning changed over the course of capabilities-focused RL. As that training progresses, the model biases more strongly towards reward hints over direct instru…

Uncategorised

Metagaming matters for training, evaluation, and oversight

Following up on our previous work on verbalized eval awareness: we are sharing a post investigating the emergence of metagaming reasoning in a frontier training run. Metagaming is a more general and, in our experience, more useful concept than evaluati…
