Logit ROCs: Monitor TPR is linear in FPR in logit space
Kerrick Staley, Aryan Bhatt, Julian Stastny

Summary

We study trusted monitoring for AI control, where a weaker trusted model reviews the actions of a stronger untrusted agent and flags suspicious behavior for human audit. We propose a simple mathematical…