Exploration Hacking: Can LLMs Learn to Resist RL Training?

We empirically investigate exploration hacking (EH) — where models strategically alter their exploration to resist RL training — by creating model organisms that resist capability elicitation, evaluating countermeasures, and auditing frontier models fo…