Hidden Measurement Error in LLM Pipelines Distorts Annotation, Evaluation, and Benchmarking
arXiv:2604.11581v3 Announce Type: replace
Abstract: LLM evaluations drive which models get deployed, which safety standards get adopted, and which research conclusions get published. Yet standard confidence intervals ignore variability from prompt phr…