Verbalised evaluation awareness in language models has little effect on their behaviour

TL;DR: We provide evidence that the presence of verbalised evaluation awareness (VEA) in chains of thought (CoTs) does not imply eval gaming. We tested this across 8 open-weight large reasoning models (LRMs) and 4 benchmarks (safety, alignment, moral dilemmas, political opinion) by comparing answ…