Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges
arXiv:2605.06161v1
Abstract: LLM-as-a-Judge pipelines have become the de facto evaluator for agent safety, yet existing benchmarks treat their verdicts as ground-truth proxies without checking whether the verdicts depend on the agen…