Robert Kirk, Alexandra Souly, Kai Fronsdal, Abby D'Cruz, Xander Davies

Evaluating whether AI models would sabotage AI safety research

April 28, 2026

arXiv:2604.24618v1
Abstract: We evaluate the propensity of frontier models to sabotage or refuse to assist with safety research when deployed as AI research agents within a frontier AI company. We apply two complementary evaluations…
