Reproducing steering against evaluation awareness in a large open-weight model

Produced as part of the UK AISI Model Transparency Team. Our team works on ensuring models don’t subvert safety assessments, e.g. through evaluation awareness, sandbagging, or opaque reasoning.

TL;DR: We replicate Anthropic’s approach to using steering v…