Do LLMs Follow Their Own Rules? A Reflexive Audit of Self-Stated Safety Policies
arXiv:2604.09189v1 Announce Type: cross
Abstract: LLMs internalize safety policies through RLHF, yet these policies are never formally specified and remain difficult to inspect. Existing benchmarks evaluate models against external standards but do not…