Michael O'Herlihy, Rosa Català

Escaping the Agreement Trap: Defensibility Signals for Evaluating Rule-Governed AI

Michael O'Herlihy, Rosa Català / April 24, 2026

arXiv:2604.20972v1
Abstract: Content moderation systems are typically evaluated by measuring agreement with human labels. In rule-governed environments this assumption fails: multiple decisions may be logically consistent with the g…
